<unictype.h>
This include file declares functions that classify Unicode characters and that test whether Unicode characters have specific properties.
The classification assigns a “general category” to every Unicode character. This is similar to the classification provided by ISO C in <wctype.h>.
Properties are the data that guide various text processing algorithms in the presence of specific Unicode characters.
Every Unicode character or code point has a general category assigned to it. This classification is important for most algorithms that work on Unicode text.
The GNU libunistring library provides two kinds of API for working with general categories. The object oriented API uses a variable to denote every predefined general category value or combination thereof. The low-level API uses a bit mask instead. The advantage of the object oriented API is that if only a few predefined general category values are used, the data tables are relatively small. When you combine general category values (using uc_general_category_or, uc_general_category_and, or uc_general_category_and_not), or when you use the low-level bit masks, a big table is used that holds the complete general category information for all Unicode characters.
This data type denotes a general category value. It is an immediate type that can be copied by simple assignment, without involving memory allocation. It is not an array type.
The following are the predefined general category values. Additional general categories may be added in the future.
The following are alias names for predefined general category values.
This is another name for UC_CATEGORY_L.
This is another name for UC_CATEGORY_Lu.
This is another name for UC_CATEGORY_Ll.
This is another name for UC_CATEGORY_Lt.
This is another name for UC_CATEGORY_Lm.
This is another name for UC_CATEGORY_Lo.
This is another name for UC_CATEGORY_M.
This is another name for UC_CATEGORY_Mn.
This is another name for UC_CATEGORY_Mc.
This is another name for UC_CATEGORY_Me.
This is another name for UC_CATEGORY_N.
This is another name for UC_CATEGORY_Nd.
This is another name for UC_CATEGORY_Nl.
This is another name for UC_CATEGORY_No.
This is another name for UC_CATEGORY_P.
This is another name for UC_CATEGORY_Pc.
This is another name for UC_CATEGORY_Pd.
This is another name for UC_CATEGORY_Ps (“start punctuation”).
This is another name for UC_CATEGORY_Pe (“end punctuation”).
This is another name for UC_CATEGORY_Pi.
This is another name for UC_CATEGORY_Pf.
This is another name for UC_CATEGORY_Po.
This is another name for UC_CATEGORY_S.
This is another name for UC_CATEGORY_Sm.
This is another name for UC_CATEGORY_Sc.
This is another name for UC_CATEGORY_Sk.
This is another name for UC_CATEGORY_So.
This is another name for UC_CATEGORY_Z.
This is another name for UC_CATEGORY_Zs.
This is another name for UC_CATEGORY_Zl.
This is another name for UC_CATEGORY_Zp.
This is another name for UC_CATEGORY_C.
This is another name for UC_CATEGORY_Cc.
This is another name for UC_CATEGORY_Cf.
This is another name for UC_CATEGORY_Cs. All code points in this category are invalid characters.
This is another name for UC_CATEGORY_Co.
This is another name for UC_CATEGORY_Cn. Some code points in this category are invalid characters.
The following functions combine general categories, like in a boolean algebra, except that there is no ‘not’ operation.
Returns the union of two general categories. This corresponds to the union of the two sets of characters.
Returns the intersection of two general categories as bit masks. This does not correspond to the intersection of the two sets of characters.
Returns the intersection of a general category with the complement of a second general category, as bit masks. This does not correspond to the intersection with complement, when viewing the categories as sets of characters.
The following functions associate general categories with their name.
Returns the name of a general category. Returns NULL if the general category corresponds to a bit mask that does not have a name.
Returns the general category given by name, e.g. "Lu".
The following functions view general categories as sets of Unicode characters.
Returns the general category of a Unicode character.
This function uses a big table.
Tests whether a Unicode character belongs to a given category. The category argument can be a predefined general category or the combination of several predefined general categories.
The following are the predefined general category values as bit masks. Additional general categories may be added in the future.
The following function views general categories as sets of Unicode characters.
Tests whether a Unicode character belongs to a given category. The bitmask argument can be a predefined general category bitmask or the combination of several predefined general category bitmasks.
This function uses a big table comprising all general categories.
Every Unicode character or code point has a canonical combining class assigned to it.
What is the meaning of the canonical combining class? Essentially, it indicates the priority with which a combining character is attached to its base character. The characters for which the canonical combining class is 0 are the base characters, and the characters for which it is greater than 0 are the combining characters. Combining characters are rendered near, attached to, or around their base character, and combining characters with small combining classes are attached "first" or "closer" to the base character.
The canonical combining class of a character is a number in the range 0..255. The possible values are described in the Unicode Character Database http://www.unicode.org/Public/UNIDATA/UCD.html. The list here is not definitive; more values can be added in future versions.
The canonical combining class value for “Not Reordered” characters. The value is 0.
The canonical combining class value for “Overlay” characters.
The canonical combining class value for “Nukta” characters.
The canonical combining class value for “Kana Voicing” characters.
The canonical combining class value for “Virama” characters.
The canonical combining class value for “Attached Below Left” characters.
The canonical combining class value for “Attached Below” characters.
The canonical combining class value for “Attached Above Right” characters.
The canonical combining class value for “Below Left” characters.
The canonical combining class value for “Below” characters.
The canonical combining class value for “Below Right” characters.
The canonical combining class value for “Left” characters.
The canonical combining class value for “Right” characters.
The canonical combining class value for “Above Left” characters.
The canonical combining class value for “Above” characters.
The canonical combining class value for “Above Right” characters.
The canonical combining class value for “Double Below” characters.
The canonical combining class value for “Double Above” characters.
The canonical combining class value for “Iota Subscript” characters.
The following function looks up the canonical combining class of a character.
Returns the canonical combining class of a Unicode character.
Every Unicode character or code point has a bidirectional category assigned to it.
The bidirectional category guides the bidirectional algorithm (http://www.unicode.org/reports/tr9/). The possible values are the following.
The bidirectional category for “Left-to-Right” characters.
The bidirectional category for “Left-to-Right Embedding” characters.
The bidirectional category for “Left-to-Right Override” characters.
The bidirectional category for “Right-to-Left” characters.
The bidirectional category for “Right-to-Left Arabic” characters.
The bidirectional category for “Right-to-Left Embedding” characters.
The bidirectional category for “Right-to-Left Override” characters.
The bidirectional category for “Pop Directional Format” characters.
The bidirectional category for “European Number” characters.
The bidirectional category for “European Number Separator” characters.
The bidirectional category for “European Number Terminator” characters.
The bidirectional category for “Arabic Number” characters.
The bidirectional category for “Common Number Separator” characters.
The bidirectional category for “Non-Spacing Mark” characters.
The bidirectional category for “Boundary Neutral” characters.
The bidirectional category for “Paragraph Separator” characters.
The bidirectional category for “Segment Separator” characters.
The bidirectional category for “Whitespace” characters.
The bidirectional category for “Other Neutral” characters.
The following functions implement the association between a bidirectional category and its name.
Returns the name of a bidirectional category.
Returns the bidirectional category given by name, e.g. "LRE".
The following functions view bidirectional categories as sets of Unicode characters.
Returns the bidirectional category of a Unicode character.
Tests whether a Unicode character belongs to a given bidirectional category.
Decimal digits (like the digits from ‘0’ to ‘9’) exist in many scripts. The following function converts a decimal digit character to its numerical value.
Returns the decimal digit value of a Unicode character. The return value is an integer in the range 0..9, or -1 for characters that do not represent a decimal digit.
Digit characters are like decimal digit characters, possibly in special forms, such as superscript, subscript, or circled. The following function converts a digit character to its numerical value.
Returns the digit value of a Unicode character. The return value is an integer in the range 0..9, or -1 for characters that do not represent a digit.
There are also characters that represent numbers without a digit system, like the Roman numerals, and fractional numbers, like 1/4 or 3/4.
The following type represents the numeric value of a Unicode character.
This is a structure type with the following fields:
int numerator;
int denominator;
An integer n is represented by numerator = n, denominator = 1.
The following function converts a number character to its numerical value.
Returns the numeric value of a Unicode character.
The return value is a fraction, or the pseudo-fraction { 0, 0 } for characters that do not represent a number.
Character mirroring is used to associate the closing parenthesis character with the opening parenthesis character, the closing brace character with the opening brace character, and so on.
The following function looks up the mirrored character of a Unicode character.
Stores the mirrored character of a Unicode character uc in *puc and returns true, if it exists. Otherwise it stores uc unmodified in *puc and returns false.
This section defines boolean properties of Unicode characters. That is, a character either has the given property or does not have it. In other words, the property can be viewed as a subset of the set of Unicode characters.
The GNU libunistring library provides two kinds of API for working with
properties. The object oriented API uses a type uc_property_t
to designate a property. In the function-based API, which is a bit more
low level, a property is merely a function.
The following type designates a property on Unicode characters.
This data type denotes a boolean property on Unicode characters. It is an immediate type that can be copied by simple assignment, without involving memory allocation. It is not an array type.
Many Unicode properties are predefined.
The following are general properties.
The following properties are related to case folding.
The following properties are related to identifiers.
The following properties have an influence on shaping and rendering.
The following properties relate to bidirectional reordering.
The following properties deal with number representations.
The following properties deal with CJK.
Other miscellaneous properties are:
The following function looks up a property by its name.
Returns the property given by name, e.g. "White space". If a property with the given name exists, the result will satisfy the uc_property_is_valid predicate. Otherwise the result will not satisfy this predicate and must not be passed to functions that expect a uc_property_t argument.
This function references a big table of all predefined properties. Its use can significantly increase the size of your application.
Returns true when the given property is valid, or false otherwise.
The following function views a property as a set of Unicode characters.
Tests whether the Unicode character uc has the given property.
The following are general properties.
The following properties are related to case folding.
The following properties are related to identifiers.
The following properties have an influence on shaping and rendering.
The following properties relate to bidirectional reordering.
The following properties deal with number representations.
The following properties deal with CJK.
Other miscellaneous properties are:
The Unicode characters are subdivided into scripts.
The following type is used to represent a script:
This data type is a structure type that refers to statically allocated read-only data. It contains the following fields:
const char *name;
The name field contains the name of the script.
The following functions look up a script.
Returns the script of a Unicode character. Returns NULL if uc does not belong to any script.
Returns the script given by its name, e.g. "HAN". Returns NULL if a script with the given name does not exist.
The following function views a script as a set of Unicode characters.
Tests whether a Unicode character belongs to a given script.
The following gives a global picture of all scripts.
Get the list of all scripts. Stores a pointer to an array of all scripts in *scripts and the length of this array in *count.
The Unicode characters are subdivided into blocks. A block is an interval of Unicode code points.
The following type is used to represent a block.
This data type is a structure type that refers to statically allocated data. It contains the following fields:
ucs4_t start;
ucs4_t end;
const char *name;
The start field is the first Unicode code point in the block.
The end field is the last Unicode code point in the block.
The name field is the name of the block.
The following function looks up a block.
Returns the block a character belongs to.
The following function views a block as a set of Unicode characters.
Tests whether a Unicode character belongs to a given block.
The following gives a global picture of all blocks.
Get the list of all blocks. Stores a pointer to an array of all blocks in *blocks and the length of this array in *count.
The following properties are taken from language standards. The supported language standards are ISO C 99 and Java.
Tests whether a Unicode character is considered whitespace in ISO C 99.
Tests whether a Unicode character is considered whitespace in Java.
The following enumerated values are the possible return values of the functions uc_c_ident_category and uc_java_ident_category.
This return value means that the given character is valid as first or subsequent character in an identifier.
This return value means that the given character is valid as subsequent character only.
This return value means that the given character is not valid in an identifier.
This return value (only for Java) means that the given character is ignorable.
The following functions determine whether a given character can be a constituent of an identifier in the given programming language.
Returns the categorization of a Unicode character with respect to the ISO C 99 identifier syntax.
Returns the categorization of a Unicode character with respect to the Java identifier syntax.
The following character classifications mimic those declared in the ISO C header files <ctype.h> and <wctype.h>. These functions are deprecated, because this set of functions was designed with ASCII in mind and cannot reflect the more diverse reality of the Unicode character set. But they can be a quick-and-dirty porting aid when migrating from wchar_t APIs to Unicode strings.
Tests for any character for which uc_is_alpha or uc_is_digit is true.
Tests for any character for which uc_is_upper or uc_is_lower is true, or any character that is one of a locale-specific set of characters for which none of uc_is_cntrl, uc_is_digit, uc_is_punct, or uc_is_space is true.
Tests for any control character.
Tests for any character that corresponds to a decimal-digit character.
Tests for any character for which uc_is_print is true and uc_is_space is false.
Tests for any character that corresponds to a lowercase letter or is one of a locale-specific set of characters for which none of uc_is_cntrl, uc_is_digit, uc_is_punct, or uc_is_space is true.
Tests for any printing character.
Tests for any printing character that is one of a locale-specific set of characters for which neither uc_is_space nor uc_is_alnum is true.
Tests for any character that corresponds to a locale-specific set of characters for which none of uc_is_alnum, uc_is_graph, or uc_is_punct is true.
Tests for any character that corresponds to an uppercase letter or is one of a locale-specific set of characters for which none of uc_is_cntrl, uc_is_digit, uc_is_punct, or uc_is_space is true.
Tests for any character that corresponds to a hexadecimal-digit character.
Tests for any character that corresponds to a standard blank character or a locale-specific set of characters for which uc_is_alnum is false.
This document was generated by Bruno Haible on July, 1 2009 using texi2html 1.78a.