<unigbrk.h>
This include file declares functions for determining where in a string “grapheme clusters” start and end. A “grapheme cluster” is an approximation to a user-perceived character, which sometimes corresponds to multiple Unicode characters. Editing operations such as mouse selection, cursor movement, and backspacing often operate on grapheme clusters as units, not on individual characters.
Some grapheme clusters are built from a base character and a combining character. The letter ‘é’, for example, is most commonly represented in Unicode as a single character U+00E9 LATIN SMALL LETTER E WITH ACUTE. It is, however, equally valid to use the pair of characters U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. Since the user would perceive this pair of characters as a single character, they would be grouped into a single grapheme cluster.
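The difference between the two representations can be made concrete with a small sketch. It does not use the library at all: the helper `utf8_count_characters` and the two byte arrays are hypothetical names introduced only for this illustration, and the input is assumed to be well-formed UTF-8. The precomposed ‘é’ occupies two bytes and counts as one character, while the decomposed form occupies three bytes and counts as two characters, although both would be perceived as one grapheme cluster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the Unicode characters in a well-formed UTF-8 string of n bytes
   by counting every byte that is not a continuation byte (continuation
   bytes have the bit pattern 10xxxxxx).  Hypothetical helper, not part
   of the library.  */
size_t
utf8_count_characters (const uint8_t *s, size_t n)
{
  size_t count = 0;
  assert (s != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if ((s[i] & 0xC0) != 0x80)
      count++;
  return count;
}

/* U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed: 2 bytes.  */
const uint8_t e_acute_precomposed[2] = { 0xC3, 0xA9 };

/* U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT:
   3 bytes.  */
const uint8_t e_acute_decomposed[3] = { 0x65, 0xCC, 0x81 };
```

Counting characters therefore gives 1 for the first array and 2 for the second, which is why cursor movement or backspacing based on character counts alone can split what the user sees as a single letter.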
But there are also grapheme clusters that consist of several base characters. For example, a Devanagari letter and a Devanagari vowel sign that follows it may form a grapheme cluster. Similarly, some pairs of Thai characters and Hangul syllables (formed by two or three Hangul characters) are grapheme clusters.
The following functions find a single boundary between grapheme clusters in a string.
Returns the start of the next grapheme cluster following s, or end if no grapheme cluster break is encountered before it. Returns NULL if and only if s == end.
Returns the start of the grapheme cluster preceding s, or start if no grapheme cluster break is encountered before it. Returns NULL if and only if s == start.
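As a rough illustration of the forward contract described above, the following sketch walks a string from boundary to boundary and counts the clusters. To keep it compilable without the library, the stepping function is passed in as a pointer. The type `grapheme_next_fn`, the function `count_grapheme_clusters`, and the ASCII-only stand-in `ascii_grapheme_next` are all hypothetical names invented for this example. In real code one would presumably pass the library's own function for UTF-8 strings (`u8_grapheme_next` in libunistring, assuming its signature matches the typedef).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A "next grapheme cluster" function with the contract described above:
   it returns the start of the next cluster after s, or end if no break is
   found before end, and NULL if and only if s == end.  */
typedef const uint8_t *(*grapheme_next_fn) (const uint8_t *s,
                                            const uint8_t *end);

/* Hypothetical stand-in, valid for pure ASCII input only: every byte is
   its own cluster, except that a CR LF pair is kept together.  */
const uint8_t *
ascii_grapheme_next (const uint8_t *s, const uint8_t *end)
{
  if (s == end)
    return NULL;
  if (s[0] == '\r' && s + 1 < end && s[1] == '\n')
    return s + 2;
  return s + 1;
}

/* Count the grapheme clusters in s[0..n-1] by stepping from boundary to
   boundary until the end of the string is reached.  */
size_t
count_grapheme_clusters (const uint8_t *s, size_t n, grapheme_next_fn next)
{
  const uint8_t *end = s + n;
  size_t count = 0;
  assert (next != NULL);
  while (s < end)
    {
      s = next (s, end);   /* not NULL here, because s != end */
      count++;
    }
  return count;
}
```

The loop tests `s < end` before each call, so the NULL return for `s == end` is never dereferenced. A backward walk with the "preceding cluster" function would follow the same pattern, stopping when the position reaches start.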
The following functions determine all of the grapheme cluster boundaries in a string.
Determines the grapheme cluster break points in s, an array of n units, and stores the result at p[0..n-1]. p[i] = 1 means that there is a grapheme cluster boundary between s[i-1] and s[i]. p[i] = 0 means that s[i-1] and s[i] are part of the same grapheme cluster. p[0] is always set to 1, because there is always a grapheme cluster break at start of text.
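Once the break array has been filled in, typical editing operations reduce to simple scans over it. The sketch below assumes that p is an array of char flags laid out exactly as described above. The helper names and the sample array are hypothetical, and the flags are written out by hand for the 4-byte UTF-8 string "e\xCC\x81x" (‘e’, U+0301, ‘x’) rather than produced by the library.

```c
#include <assert.h>
#include <stddef.h>

/* Count grapheme clusters, given break flags p[0..n-1] in which a nonzero
   p[i] marks a boundary immediately before unit i.  */
size_t
count_clusters_from_breaks (const char *p, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (p[i])
      count++;
  return count;
}

/* Return the index at which the cluster ending just before position pos
   begins.  This is where a backspace at pos would start deleting.
   Requires 0 < pos <= n.  */
size_t
cluster_start_before (const char *p, size_t n, size_t pos)
{
  assert (pos > 0 && pos <= n);
  size_t i = pos - 1;
  while (i > 0 && !p[i])
    i--;
  return i;
}

/* Hand-written flags for the bytes 0x65 0xCC 0x81 0x78: the combining
   accent (two bytes) stays attached to the 'e', and 'x' starts a new
   cluster.  */
const char example_breaks[4] = { 1, 0, 0, 1 };
```

With these flags, the string has two clusters, and a backspace with the cursor at offset 3 (just before ‘x’) would delete from offset 0, removing the letter and its accent together.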
The following is a lower-level API. The grapheme cluster break property is a property defined in Unicode Standard Annex #29, section “Grapheme Cluster Boundaries”, see http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries. It is used for determining the grapheme cluster breaks in a string.
The following are the possible values of the grapheme cluster break property. More values may be added in the future.
The following function looks up the grapheme cluster break property of a character.
Returns the Grapheme_Cluster_Break property of a Unicode character.
The following function determines whether there is a grapheme cluster break between two Unicode characters. It is the primitive upon which the higher-level functions in the previous section are directly based.
Returns true if there is a grapheme cluster boundary between Unicode characters a and b.
There is always a grapheme cluster break at the start or end of text. You can specify zero for a or b to indicate start of text or end of text, respectively.
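To show how a pairwise test of this kind can drive a whole-string analysis, here is a minimal sketch that fills a break array from a predicate on adjacent characters. Everything in it is an assumption made for illustration: `is_break_fn`, `breaks_from_pairs`, and `toy_is_grapheme_break` are hypothetical names, and the toy predicate is deliberately incomplete. It knows only about start or end of text, control characters, the CR LF pair, and the combining marks U+0300..U+036F. In real code the library's own predicate would take its place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A pairwise break test: true if there is a boundary between a and b.
   Zero for a or b stands for start or end of text, respectively.  */
typedef bool (*is_break_fn) (uint32_t a, uint32_t b);

/* Hypothetical, deliberately incomplete predicate for demonstration.  */
bool
toy_is_grapheme_break (uint32_t a, uint32_t b)
{
  if (a == 0 || b == 0)
    return true;                 /* start or end of text */
  if (a == 0x000D && b == 0x000A)
    return false;                /* CR LF stays together */
  if (a < 0x20 || (a >= 0x7F && a <= 0x9F))
    return true;                 /* always break after a control */
  if (b >= 0x0300 && b <= 0x036F)
    return false;                /* combining mark extends the cluster */
  return true;
}

/* Fill p[0..n-1] for the characters s[0..n-1]: p[0] is always 1, and
   p[i] is 1 exactly when the predicate reports a boundary between
   s[i-1] and s[i].  */
void
breaks_from_pairs (const uint32_t *s, size_t n, char *p, is_break_fn is_break)
{
  assert (is_break != NULL);
  for (size_t i = 0; i < n; i++)
    p[i] = (i == 0 || is_break (s[i - 1], s[i])) ? 1 : 0;
}

/* 'e', U+0301 COMBINING ACUTE ACCENT, 'x'.  */
const uint32_t example_text[3] = { 0x0065, 0x0301, 0x0078 };
```

For `example_text` the resulting flags are 1, 0, 1: the accent stays with the ‘e’, and ‘x’ starts a new cluster.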
This implements the extended (not legacy) grapheme cluster rules described in the Unicode standard, because the standard says that they are preferred.
This document was generated by Daiki Ueno on July 8, 2015 using texi2html 1.78a.