summaryrefslogtreecommitdiff
path: root/doc/unigbrk.texi
diff options
context:
space:
mode:
Diffstat (limited to 'doc/unigbrk.texi')
-rw-r--r--doc/unigbrk.texi126
1 files changed, 126 insertions, 0 deletions
diff --git a/doc/unigbrk.texi b/doc/unigbrk.texi
new file mode 100644
index 0000000..196bd9f
--- /dev/null
+++ b/doc/unigbrk.texi
@@ -0,0 +1,126 @@
+@node unigbrk.h
+@chapter Grapheme cluster breaks in strings @code{<unigbrk.h>}
+
+@cindex grapheme cluster breaks
+@cindex grapheme cluster boundaries
+@cindex breaks, grapheme cluster
+@cindex boundaries, between grapheme clusters
+This include file declares functions for determining where in a string
+``grapheme clusters'' start and end. A ``grapheme cluster'' is an
+approximation to a user-perceived character, which sometimes
+corresponds to multiple Unicode characters. Editing operations such as
+mouse selection, cursor movement, and backspacing often operate on
+grapheme clusters as units, not on individual characters.
+
+Some grapheme clusters are built from a base character and a combining
+character. The letter @samp{@'e},
+for example, is most commonly represented in Unicode as a single
+character U+00E8 @sc{LATIN SMALL LETTER E WITH ACUTE}. It is,
+however, equally valid to use the pair of characters U+0065 @sc{LATIN
+SMALL LETTER E} followed by U+0301 @sc{COMBINING ACUTE ACCENT}. Since
+the user would perceive this pair of characters as a single character,
+they would be grouped into a single grapheme cluster.
+
+But there are also grapheme clusters that consist of several base characters.
+For example, a Devanagari letter and a Devanagari vowel sign that follows it
+may form a grapheme cluster. Similarly, some pairs of Thai characters and
+Hangul syllables (formed by two or three Hangul characters) are grapheme
+clusters.
+
+@menu
+* Grapheme cluster breaks in a string::
+* Grapheme cluster break property::
+@end menu
+
+@node Grapheme cluster breaks in a string
+@section Grapheme cluster breaks in a string
+
+The following functions find a single boundary between grapheme
+clusters in a string.
+
+@deftypefun void u8_grapheme_next (const uint8_t *@var{s}, const uint8_t *@var{end})
+@deftypefunx void u16_grapheme_next (const uint16_t *@var{s}, const uint16_t *@var{end})
+@deftypefunx void u32_grapheme_next (const uint32_t *@var{s}, const uint32_t *@var{end})
+Returns the start of the next grapheme cluster following @var{s},
+or @var{end} if no grapheme cluster break is encountered before it.
+Returns NULL if and only if @code{@var{s} == @var{end}}.
+@end deftypefun
+
+@deftypefun void u8_grapheme_prev (const uint8_t *@var{s}, const uint8_t *@var{start})
+@deftypefunx void u16_grapheme_prev (const uint16_t *@var{s}, const uint16_t *@var{start})
+@deftypefunx void u32_grapheme_prev (const uint32_t *@var{s}, const uint32_t *@var{start})
+Returns the start of the grapheme cluster preceding @var{s}, or
+@var{start} if no grapheme cluster break is encountered before it.
+Returns NULL if and only if @code{@var{s} == @var{start}}.
+@end deftypefun
+
+The following functions determine all of the grapheme cluster
+boundaries in a string.
+
+@deftypefun void u8_grapheme_breaks (const uint8_t *@var{s}, size_t @var{n}, char *@var{p})
+@deftypefunx void u16_grapheme_breaks (const uint16_t *@var{s}, size_t @var{n}, char *@var{p})
+@deftypefunx void u32_grapheme_breaks (const uint32_t *@var{s}, size_t @var{n}, char *@var{p})
+@deftypefunx void ulc_grapheme_breaks (const char *@var{s}, size_t @var{n}, char *@var{p})
+Determines the grapheme cluster break points in @var{s}, an array of
+@var{n} units, and stores the result at @code{@var{p}[0..@var{n}-1]}.
+@table @asis
+@item @code{@var{p}[i] = 1}
+means that there is a grapheme cluster boundary between
+@code{@var{s}[i-1]} and @code{@var{s}[i]}.
+@item @code{@var{p}[i] = 0}
+means that @code{@var{s}[i-1]} and @code{@var{s}[i]} are part of the
+same grapheme cluster.
+@end table
+@code{@var{p}[0]} is always set to 1, because there is always a
+grapheme cluster break at start of text.
+@end deftypefun
+
+@node Grapheme cluster break property
+@section Grapheme cluster break property
+
+This is a more low-level API. The grapheme cluster break property is a
+property defined in Unicode Standard Annex #29, section ``Grapheme Cluster
+Boundaries'', see
+@url{http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries}.@texnl{}
+It is used for determining the grapheme cluster breaks in a string.
+
+The following are the possible values of the grapheme cluster break
+property. More values may be added in the future.
+
+@deftypevr Constant int GBP_OTHER
+@deftypevrx Constant int GBP_CR
+@deftypevrx Constant int GBP_LF
+@deftypevrx Constant int GBP_CONTROL
+@deftypevrx Constant int GBP_EXTEND
+@deftypevrx Constant int GBP_PREPEND
+@deftypevrx Constant int GBP_SPACINGMARK
+@deftypevrx Constant int GBP_L
+@deftypevrx Constant int GBP_V
+@deftypevrx Constant int GBP_T
+@deftypevrx Constant int GBP_LV
+@deftypevrx Constant int GBP_LVT
+@end deftypevr
+
+The following function looks up the grapheme cluster break property of a
+character.
+
+@deftypefun int uc_graphemeclusterbreak_property (ucs4_t @var{uc})
+Returns the Grapheme_Cluster_Break property of a Unicode character.
+@end deftypefun
+
+The following function determines whether there is a grapheme cluster
+break between two Unicode characters. It is the primitive upon which
+the higher-level functions in the previous section are directly based.
+
+@deftypefun bool uc_is_grapheme_break (ucs4_t @var{a}, ucs4_t @var{b})
+Returns true if there is an grapheme cluster boundary between Unicode
+characters @var{a} and @var{b}.
+
+There is always a grapheme cluster break at the start or end of text.
+You can specify zero for @var{a} or @var{b} to indicate start of text or end
+of text, respectively.
+
+This implements the extended (not legacy) grapheme cluster rules
+described in the Unicode standard, because the standard says that they
+are preferred.
+@end deftypefun