<uninorm.h>
This include file defines functions for transforming Unicode strings to one of the four normal forms, known as NFC, NFD, NFKC, NFKD. These transformations involve decomposition and, for NFC and NFKC, composition of Unicode characters.
The following enumerated values are the possible types of decomposition of a Unicode character.
Denotes canonical decomposition.
UCD marker: <font>. Denotes a font variant (e.g. a blackletter form).
UCD marker: <noBreak>. Denotes a no-break version of a space or hyphen.
UCD marker: <initial>. Denotes an initial presentation form (Arabic).
UCD marker: <medial>. Denotes a medial presentation form (Arabic).
UCD marker: <final>. Denotes a final presentation form (Arabic).
UCD marker: <isolated>. Denotes an isolated presentation form (Arabic).
UCD marker: <circle>. Denotes an encircled form.
UCD marker: <super>. Denotes a superscript form.
UCD marker: <sub>. Denotes a subscript form.
UCD marker: <vertical>. Denotes a vertical layout presentation form.
UCD marker: <wide>. Denotes a wide (or zenkaku) compatibility character.
UCD marker: <narrow>. Denotes a narrow (or hankaku) compatibility character.
UCD marker: <small>. Denotes a small variant form (CNS compatibility).
UCD marker: <square>. Denotes a CJK squared font variant.
UCD marker: <fraction>. Denotes a vulgar fraction form.
UCD marker: <compat>. Denotes an otherwise unspecified compatibility character.
The following constant denotes the maximum size of decomposition of a single Unicode character.
This macro expands to a constant that is the required size of the buffer passed to the uc_decomposition and uc_canonical_decomposition functions.
The following functions decompose a Unicode character.
Returns the character decomposition mapping of the Unicode character uc. decomposition must point to an array of at least UC_DECOMPOSITION_MAX_LENGTH ucs4_t elements.
When a decomposition exists, decomposition[0..n-1] and *decomp_tag are filled and n is returned. Otherwise -1 is returned.
Returns the canonical character decomposition mapping of the Unicode character uc. decomposition must point to an array of at least UC_DECOMPOSITION_MAX_LENGTH ucs4_t elements.
When a decomposition exists, decomposition[0..n-1] is filled and n is returned. Otherwise -1 is returned.
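The calling convention described above (caller-supplied buffer, element count as the return value, -1 when there is no mapping) can be illustrated with a small self-contained sketch. It does not call the library: every `sketch_` name, the four-entry table, and the local `ucs4_t` typedef are assumptions made for illustration, and the mappings are the ones listed for these characters in UnicodeData.txt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;      /* assumption: stands in for the library's type */

#define SKETCH_MAX_LENGTH 4   /* hypothetical; the real bound is UC_DECOMPOSITION_MAX_LENGTH */

/* Hypothetical tags standing in for the decomposition-type values. */
enum sketch_tag { SKETCH_CANONICAL, SKETCH_SUPER, SKETCH_FRACTION, SKETCH_COMPAT };

struct sketch_entry
{
  ucs4_t uc;
  enum sketch_tag tag;
  int n;
  ucs4_t mapping[SKETCH_MAX_LENGTH];
};

/* Four mappings from UnicodeData.txt; a real table has thousands. */
static const struct sketch_entry sketch_table[] =
{
  { 0x00C5, SKETCH_CANONICAL, 2, { 0x0041, 0x030A } },         /* A WITH RING ABOVE */
  { 0x00B2, SKETCH_SUPER,     1, { 0x0032 } },                 /* SUPERSCRIPT TWO */
  { 0x00BD, SKETCH_FRACTION,  3, { 0x0031, 0x2044, 0x0032 } }, /* VULGAR FRACTION ONE HALF */
  { 0xFB01, SKETCH_COMPAT,    2, { 0x0066, 0x0069 } },         /* LATIN SMALL LIGATURE FI */
};

/* Fills decomposition[0..n-1] and *decomp_tag and returns n, or returns -1. */
int
sketch_decomposition (ucs4_t uc, int *decomp_tag, ucs4_t *decomposition)
{
  assert (decomp_tag != NULL && decomposition != NULL);
  for (size_t i = 0; i < sizeof sketch_table / sizeof sketch_table[0]; i++)
    if (sketch_table[i].uc == uc)
      {
        for (int k = 0; k < sketch_table[i].n; k++)
          decomposition[k] = sketch_table[i].mapping[k];
        *decomp_tag = sketch_table[i].tag;
        return sketch_table[i].n;
      }
  return -1;
}

/* Same lookup, but only canonical mappings count as a decomposition. */
int
sketch_canonical_decomposition (ucs4_t uc, ucs4_t *decomposition)
{
  ucs4_t tmp[SKETCH_MAX_LENGTH];
  int tag;
  int n = sketch_decomposition (uc, &tag, tmp);
  if (n < 0 || tag != SKETCH_CANONICAL)
    return -1;
  for (int k = 0; k < n; k++)
    decomposition[k] = tmp[k];
  return n;
}
```

In this sketch, U+00BD yields three elements with the fraction tag, while the canonical-only variant reports -1 for it because its mapping is a compatibility one.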
The following function composes a Unicode character from two Unicode characters.
Attempts to combine the Unicode characters uc1, uc2. uc1 is known to have canonical combining class 0.
Returns the combination of uc1 and uc2, if it exists. Returns 0 otherwise.
Not all decompositions can be recombined using this function. See the Unicode file ‘CompositionExclusions.txt’ for details.
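To make the "returns 0 when no combination exists" convention and the effect of composition exclusions concrete, here is a minimal table-driven sketch. It is not the library's implementation: `sketch_composition`, the two-entry table, and the local `ucs4_t` typedef are assumptions. U+0958 is used as the excluded case because it decomposes to U+0915 U+093C and is listed in CompositionExclusions.txt.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;   /* assumption: stands in for the library's type */

struct sketch_pair
{
  ucs4_t first;      /* starter, canonical combining class 0 */
  ucs4_t second;     /* character to combine with it */
  ucs4_t composed;   /* resulting precomposed character */
};

/* Two composable pairs.  The pair U+0915 U+093C is deliberately absent:
   U+0958 decomposes to it but is a composition exclusion.  */
static const struct sketch_pair sketch_pairs[] =
{
  { 0x0041, 0x030A, 0x00C5 },   /* A + COMBINING RING ABOVE */
  { 0x0065, 0x0301, 0x00E9 },   /* e + COMBINING ACUTE ACCENT */
};

/* Returns the combination of uc1 and uc2 if the table has one, 0 otherwise. */
ucs4_t
sketch_composition (ucs4_t uc1, ucs4_t uc2)
{
  for (size_t i = 0; i < sizeof sketch_pairs / sizeof sketch_pairs[0]; i++)
    if (sketch_pairs[i].first == uc1 && sketch_pairs[i].second == uc2)
      return sketch_pairs[i].composed;
  return 0;
}
```

Using 0 as the "no combination" value works because U+0000 is never the result of a composition.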
The Unicode standard defines four normalization forms for Unicode strings. The following type is used to denote a normalization form.
An object of type uninorm_t denotes a Unicode normalization form. This is a scalar type; its values can be compared with ==.
The following constants denote the four normalization forms.
Denotes Normalization form D: canonical decomposition.
Denotes Normalization form C: canonical decomposition, then canonical composition.
Denotes Normalization form KD: compatibility decomposition.
Denotes Normalization form KC: compatibility decomposition, then canonical composition.
The following functions operate on uninorm_t objects.
Tests whether the normalization form nf does compatibility decomposition.
Tests whether the normalization form nf includes canonical composition.
Returns the decomposing variant of the normalization form nf. This maps NFC, NFD → NFD and NFKC, NFKD → NFKD.
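The three queries above treat a normalization form as two independent properties: the kind of decomposition, and whether composition follows. The sketch below models that with a two-bit enum. The encoding and all `sketch_` names are assumptions made for illustration; the text says nothing about how uninorm_t is actually represented, only that it is a scalar type.

```c
#include <stdbool.h>

/* Hypothetical stand-in for uninorm_t.
   Bit 0: canonical composition follows the decomposition.
   Bit 1: the decomposition is the compatibility one.  */
enum sketch_form
{
  SKETCH_NFD  = 0,
  SKETCH_NFC  = 1,
  SKETCH_NFKD = 2,
  SKETCH_NFKC = 3
};

/* True for NFKD and NFKC. */
bool
sketch_is_compat_decomposing (enum sketch_form nf)
{
  return (nf & 2) != 0;
}

/* True for NFC and NFKC. */
bool
sketch_is_composing (enum sketch_form nf)
{
  return (nf & 1) != 0;
}

/* Clears the composition bit: NFC, NFD -> NFD and NFKC, NFKD -> NFKD. */
enum sketch_form
sketch_decomposing_form (enum sketch_form nf)
{
  return (enum sketch_form) (nf & 2);
}
```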
The following functions apply a Unicode normalization form to a Unicode string.
Returns the specified normalization form of a string.
The following functions compare Unicode strings, ignoring differences in normalization.
Compares s1 and s2, ignoring differences in normalization.
nf must be either UNINORM_NFD or UNINORM_NFKD.
If successful, sets *resultp to -1 if s1 < s2, 0 if s1 = s2, 1 if s1 > s2, and returns 0. Upon failure, returns -1 with errno set.
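The result convention above (ordering delivered through *resultp, success or failure through the return value) is sketched below on arrays of code points rather than UTF-8 strings. This is a simplified illustration, not the library's algorithm: `sketch_decompose` knows a single mapping (U+00E9 to U+0065 U+0301), there is no nf argument, and all `sketch_` names and the local `ucs4_t` typedef are assumptions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t ucs4_t;   /* assumption: stands in for the library's type */

/* Toy decomposition: expands U+00E9 to U+0065 U+0301, copies the rest.
   Returns a malloc'ed array of *lengthp elements, or NULL with errno set.  */
static ucs4_t *
sketch_decompose (const ucs4_t *s, size_t n, size_t *lengthp)
{
  ucs4_t *out = malloc ((2 * n + 1) * sizeof *out);   /* +1 avoids malloc (0) */
  size_t len = 0;
  if (out == NULL)
    {
      errno = ENOMEM;
      return NULL;
    }
  for (size_t i = 0; i < n; i++)
    if (s[i] == 0x00E9)
      {
        out[len++] = 0x0065;
        out[len++] = 0x0301;
      }
    else
      out[len++] = s[i];
  *lengthp = len;
  return out;
}

/* Sets *resultp to -1, 0 or 1 and returns 0; returns -1 with errno set
   if memory allocation fails.  */
int
sketch_normcmp (const ucs4_t *s1, size_t n1, const ucs4_t *s2, size_t n2,
                int *resultp)
{
  size_t len1, len2, i;
  ucs4_t *d1, *d2;

  d1 = sketch_decompose (s1, n1, &len1);
  if (d1 == NULL)
    return -1;
  d2 = sketch_decompose (s2, n2, &len2);
  if (d2 == NULL)
    {
      int saved_errno = errno;   /* free may modify errno */
      free (d1);
      errno = saved_errno;
      return -1;
    }
  /* Find the first position where the decomposed strings differ. */
  for (i = 0; i < len1 && i < len2 && d1[i] == d2[i]; i++)
    ;
  if (i < len1 && i < len2)
    *resultp = (d1[i] < d2[i] ? -1 : 1);
  else
    *resultp = (len1 > len2) - (len1 < len2);   /* shorter string sorts first */
  free (d1);
  free (d2);
  return 0;
}
```

With this sketch, a string that ends in U+00E9 and a string that ends in U+0065 U+0301 compare as equal, which is the point of comparing after decomposition.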
Converts the string s of length n to a NUL-terminated byte sequence, in such a way that comparing u8_normxfrm (s1) and u8_normxfrm (s2) with the u8_cmp2 function is equivalent to comparing s1 and s2 with the u8_normcoll function.
nf must be either UNINORM_NFC or UNINORM_NFKC.
Compares s1 and s2, ignoring differences in normalization, using the collation rules of the current locale.
nf must be either UNINORM_NFC or UNINORM_NFKC.
If successful, sets *resultp to -1 if s1 < s2, 0 if s1 = s2, 1 if s1 > s2, and returns 0. Upon failure, returns -1 with errno set.
A “stream of Unicode characters” is essentially a function that accepts a ucs4_t argument repeatedly, optionally combined with a function that “flushes” the stream.
This is the data type of a stream of Unicode characters that normalizes its input according to a given normalization form and passes the normalized character sequence to the encapsulated stream of Unicode characters.
Creates and returns a normalization filter for Unicode characters.
The pair (stream_func, stream_data) is the encapsulated stream. stream_func (stream_data, uc) receives the Unicode character uc and returns 0 if successful, or -1 with errno set upon failure.
Returns the new filter, or NULL with errno set upon failure.
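As an illustration of what an encapsulated stream could look like, the sketch below pairs a callback with a fixed-size buffer. The callback follows the contract stated above (0 on success, -1 with errno set on failure). The names `sketch_sink` and `sketch_sink_put`, the capacity of 8, the choice of ENOMEM, and the local `ucs4_t` typedef are all assumptions for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;   /* assumption: stands in for the library's type */

/* Hypothetical stream_data: a fixed-capacity record of received characters. */
struct sketch_sink
{
  ucs4_t chars[8];
  size_t count;
};

/* A possible stream_func: stores uc in the sink.
   Returns 0 if successful, or -1 with errno set when the sink is full.  */
int
sketch_sink_put (void *stream_data, ucs4_t uc)
{
  struct sketch_sink *sink = stream_data;
  if (sink->count == sizeof sink->chars / sizeof sink->chars[0])
    {
      errno = ENOMEM;
      return -1;
    }
  sink->chars[sink->count++] = uc;
  return 0;
}
```

A pair such as (sketch_sink_put, &sink) has the shape of the (stream_func, stream_data) arguments described above.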
Stuffs a Unicode character into a normalizing filter.
Returns 0 if successful, or -1 with errno set upon failure.
Brings data buffered in the filter to its destination, the encapsulated stream.
Returns 0 if successful, or -1 with errno set upon failure.
Note! If after calling this function, additional characters are written into the filter, the resulting character sequence in the encapsulated stream will not necessarily be normalized.
Brings data buffered in the filter to its destination, the encapsulated stream, then closes and frees the filter.
Returns 0 if successful, or -1 with errno set upon failure.
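The warning about writing after a flush is easier to see with a toy model. The sketch below is a drastically simplified filter, not the library's: it holds back one character so that a following mark can combine with it, and it knows a single composition (U+0065 followed by U+0301 gives U+00E9). All `sketch_` names and the local `ucs4_t` typedef are assumptions for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t ucs4_t;   /* assumption: stands in for the library's type */

/* Toy filter state: the encapsulated stream plus one held-back character. */
struct sketch_filter
{
  int (*stream_func) (void *stream_data, ucs4_t uc);
  void *stream_data;
  ucs4_t pending;
  int has_pending;
};

/* Toy composition: only e + COMBINING ACUTE ACCENT; 0 means no combination. */
static ucs4_t
sketch_compose (ucs4_t a, ucs4_t b)
{
  return (a == 0x0065 && b == 0x0301) ? 0x00E9 : 0;
}

/* Returns the new filter, or NULL with errno set upon failure. */
struct sketch_filter *
sketch_filter_create (int (*stream_func) (void *, ucs4_t), void *stream_data)
{
  struct sketch_filter *f = malloc (sizeof *f);
  if (f == NULL)
    {
      errno = ENOMEM;
      return NULL;
    }
  f->stream_func = stream_func;
  f->stream_data = stream_data;
  f->pending = 0;
  f->has_pending = 0;
  return f;
}

/* Pushes the held-back character, if any, to the encapsulated stream. */
int
sketch_filter_flush (struct sketch_filter *f)
{
  if (f->has_pending)
    {
      if (f->stream_func (f->stream_data, f->pending) < 0)
        return -1;
      f->has_pending = 0;
    }
  return 0;
}

/* Combines uc with the held-back character when possible; otherwise
   forwards the held-back character and holds back uc instead.  */
int
sketch_filter_write (struct sketch_filter *f, ucs4_t uc)
{
  if (f->has_pending)
    {
      ucs4_t combined = sketch_compose (f->pending, uc);
      if (combined != 0)
        {
          f->pending = combined;
          return 0;
        }
      if (sketch_filter_flush (f) < 0)
        return -1;
    }
  f->pending = uc;
  f->has_pending = 1;
  return 0;
}

/* Flushes, then frees the filter. */
int
sketch_filter_free (struct sketch_filter *f)
{
  int ret = sketch_filter_flush (f);
  free (f);
  return ret;
}

/* Hypothetical encapsulated stream that records what it receives. */
struct sketch_log
{
  ucs4_t chars[8];
  size_t count;
};

int
sketch_log_put (void *stream_data, ucs4_t uc)
{
  struct sketch_log *log = stream_data;
  if (log->count == sizeof log->chars / sizeof log->chars[0])
    {
      errno = ENOMEM;
      return -1;
    }
  log->chars[log->count++] = uc;
  return 0;
}
```

Writing U+0065 and then U+0301 before freeing the filter delivers the single character U+00E9 to the log. Flushing between the two writes delivers U+0065 and U+0301 separately, a sequence that is no longer in composed form.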
This document was generated by Bruno Haible on August, 17 2009 using texi2html 1.78a.