4. Elementary Unicode string functions <unistr.h>

This include file declares elementary functions for Unicode strings. It is essentially the equivalent of what <string.h> is for C strings.

4.1 Elementary string checks

The following function is available to verify the integrity of a Unicode string.

Function: const uint8_t * u8_check (const uint8_t *s, size_t n)

Function: const uint16_t * u16_check (const uint16_t *s, size_t n)

Function: const uint32_t * u32_check (const uint32_t *s, size_t n)

This function checks whether a Unicode string is well-formed. It returns NULL if valid, or a pointer to the first invalid unit otherwise.

4.2 Elementary string conversions

The following functions perform conversions between the different forms of Unicode strings.

Function: uint16_t * u8_to_u16 (const uint8_t *s, size_t n, uint16_t *resultbuf, size_t *lengthp)

Converts a UTF-8 string to a UTF-16 string.

The resultbuf and lengthp arguments are as described in chapter Conventions.

Function: uint32_t * u8_to_u32 (const uint8_t *s, size_t n, uint32_t *resultbuf, size_t *lengthp)

Converts a UTF-8 string to a UTF-32 string.

The resultbuf and lengthp arguments are as described in chapter Conventions.

Function: uint8_t * u16_to_u8 (const uint16_t *s, size_t n, uint8_t *resultbuf, size_t *lengthp)

Converts a UTF-16 string to a UTF-8 string.

The resultbuf and lengthp arguments are as described in chapter Conventions.

Function: uint32_t * u16_to_u32 (const uint16_t *s, size_t n, uint32_t *resultbuf, size_t *lengthp)

Converts a UTF-16 string to a UTF-32 string.

The resultbuf and lengthp arguments are as described in chapter Conventions.

Function: uint8_t * u32_to_u8 (const uint32_t *s, size_t n, uint8_t *resultbuf, size_t *lengthp)

Converts a UTF-32 string to a UTF-8 string.

The resultbuf and lengthp arguments are as described in chapter Conventions.

Function: uint16_t * u32_to_u16 (const uint32_t *s, size_t n, uint16_t *resultbuf, size_t *lengthp)

Converts a UTF-32 string to a UTF-16 string.

The resultbuf and lengthp arguments are as described in chapter Conventions.

4.3 Elementary string functions

4.3.1 Iterating over a Unicode string

The following functions inspect and return details about the first character in a Unicode string.

Function: int u8_mblen (const uint8_t *s, size_t n)

Function: int u16_mblen (const uint16_t *s, size_t n)

Function: int u32_mblen (const uint32_t *s, size_t n)

Returns the length (number of units) of the first character in s, where s consists of at most n units. Returns 0 if it is the NUL character. Returns -1 upon failure.

This function is similar to mblen, except that it operates on a Unicode string and that s must not be NULL.

Function: int u8_mbtouc (ucs4_t *puc, const uint8_t *s, size_t n)

Function: int u16_mbtouc (ucs4_t *puc, const uint16_t *s, size_t n)

Function: int u32_mbtouc (ucs4_t *puc, const uint32_t *s, size_t n)

Returns the length (number of units) of the first character in s, putting its ucs4_t representation in *puc. Upon failure, *puc is set to 0xfffd, and an appropriate number of units is returned.

The number of available units, n, must be > 0.

This function fails if an invalid sequence of units is encountered at the beginning of s, or if additional units (after the n provided units) would be needed to form a character.

This function is similar to mbtowc, except that it operates on a Unicode string, puc and s must not be NULL, n must be > 0, and the NUL character is not treated specially.

Function: int u8_mbtouc_unsafe (ucs4_t *puc, const uint8_t *s, size_t n)

Function: int u16_mbtouc_unsafe (ucs4_t *puc, const uint16_t *s, size_t n)

Function: int u32_mbtouc_unsafe (ucs4_t *puc, const uint32_t *s, size_t n)

This function is identical to u8_mbtouc/u16_mbtouc/u32_mbtouc. Earlier versions of this function performed fewer range-checks on the sequence of units.

Function: int u8_mbtoucr (ucs4_t *puc, const uint8_t *s, size_t n)

Function: int u16_mbtoucr (ucs4_t *puc, const uint16_t *s, size_t n)

Function: int u32_mbtoucr (ucs4_t *puc, const uint32_t *s, size_t n)

Returns the length (number of units) of the first character in s, putting its ucs4_t representation in *puc. Upon failure, *puc is set to 0xfffd, and -1 is returned for an invalid sequence of units, -2 is returned for an incomplete sequence of units.

The number of available units, n, must be > 0.

This function is similar to u8_mbtouc, except that the return value gives more details about the failure, similar to mbrtowc.

4.3.2 Creating Unicode strings one character at a time

The following function stores a Unicode character as a Unicode string in memory.

Function: int u8_uctomb (uint8_t *s, ucs4_t uc, ptrdiff_t n)

Function: int u16_uctomb (uint16_t *s, ucs4_t uc, ptrdiff_t n)

Function: int u32_uctomb (uint32_t *s, ucs4_t uc, ptrdiff_t n)

Puts the multibyte character represented by uc in s, returning its length. Returns -1 upon failure, -2 if the number of available units, n, is too small. The latter case cannot occur if n >= 6/2/1, respectively.

This function is similar to wctomb, except that it operates on Unicode strings, s must not be NULL, and the argument n must be specified.

4.3.3 Copying Unicode strings

The following functions copy Unicode strings in memory.

Function: uint8_t * u8_cpy (uint8_t *dest, const uint8_t *src, size_t n)

Function: uint16_t * u16_cpy (uint16_t *dest, const uint16_t *src, size_t n)

Function: uint32_t * u32_cpy (uint32_t *dest, const uint32_t *src, size_t n)

Copies n units from src to dest.

This function is similar to memcpy, except that it operates on Unicode strings.

Function: uint8_t * u8_move (uint8_t *dest, const uint8_t *src, size_t n)

Function: uint16_t * u16_move (uint16_t *dest, const uint16_t *src, size_t n)

Function: uint32_t * u32_move (uint32_t *dest, const uint32_t *src, size_t n)

Copies n units from src to dest, guaranteeing correct behavior for overlapping memory areas.

This function is similar to memmove, except that it operates on Unicode strings.

The following function fills a Unicode string.

Function: uint8_t * u8_set (uint8_t *s, ucs4_t uc, size_t n)

Function: uint16_t * u16_set (uint16_t *s, ucs4_t uc, size_t n)

Function: uint32_t * u32_set (uint32_t *s, ucs4_t uc, size_t n)

Sets the first n characters of s to uc. uc should be a character that occupies only 1 unit.

This function is similar to memset, except that it operates on Unicode strings.

4.3.4 Comparing Unicode strings

The following function compares two Unicode strings of the same length.

Function: int u8_cmp (const uint8_t *s1, const uint8_t *s2, size_t n)

Function: int u16_cmp (const uint16_t *s1, const uint16_t *s2, size_t n)

Function: int u32_cmp (const uint32_t *s1, const uint32_t *s2, size_t n)

Compares s1 and s2, each of length n, lexicographically. Returns a negative value if s1 compares smaller than s2, a positive value if s1 compares larger than s2, or 0 if they compare equal.

This function is similar to memcmp, except that it operates on Unicode strings.

The following function compares two Unicode strings of possibly different lengths.

Function: int u8_cmp2 (const uint8_t *s1, size_t n1, const uint8_t *s2, size_t n2)

Function: int u16_cmp2 (const uint16_t *s1, size_t n1, const uint16_t *s2, size_t n2)

Function: int u32_cmp2 (const uint32_t *s1, size_t n1, const uint32_t *s2, size_t n2)

Compares s1 and s2, lexicographically. Returns a negative value if s1 compares smaller than s2, a positive value if s1 compares larger than s2, or 0 if they compare equal.

This function is similar to the gnulib function memcmp2, except that it operates on Unicode strings.

4.3.5 Searching for a character in a Unicode string

The following function searches for a given Unicode character.

Function: uint8_t * u8_chr (const uint8_t *s, size_t n, ucs4_t uc)

Function: uint16_t * u16_chr (const uint16_t *s, size_t n, ucs4_t uc)

Function: uint32_t * u32_chr (const uint32_t *s, size_t n, ucs4_t uc)

Searches the string at s for uc. Returns a pointer to the first occurrence of uc in s, or NULL if uc does not occur in s.

This function is similar to memchr, except that it operates on Unicode strings.

4.3.6 Counting the characters in a Unicode string

The following function counts the number of Unicode characters.

Function: size_t u8_mbsnlen (const uint8_t *s, size_t n)

Function: size_t u16_mbsnlen (const uint16_t *s, size_t n)

Function: size_t u32_mbsnlen (const uint32_t *s, size_t n)

Counts and returns the number of Unicode characters in the n units from s.

This function is similar to the gnulib function mbsnlen, except that it operates on Unicode strings.

4.4 Elementary string functions with memory allocation

The following function copies a Unicode string.

Function: uint8_t * u8_cpy_alloc (const uint8_t *s, size_t n)

Function: uint16_t * u16_cpy_alloc (const uint16_t *s, size_t n)

Function: uint32_t * u32_cpy_alloc (const uint32_t *s, size_t n)

Makes a freshly allocated copy of s, of length n.

4.5 Elementary string functions on NUL terminated strings

4.5.1 Iterating over a NUL terminated Unicode string

The following functions inspect and return details about the first character in a Unicode string.

Function: int u8_strmblen (const uint8_t *s)

Function: int u16_strmblen (const uint16_t *s)

Function: int u32_strmblen (const uint32_t *s)

Returns the length (number of units) of the first character in s. Returns 0 if it is the NUL character. Returns -1 upon failure.

Function: int u8_strmbtouc (ucs4_t *puc, const uint8_t *s)

Function: int u16_strmbtouc (ucs4_t *puc, const uint16_t *s)

Function: int u32_strmbtouc (ucs4_t *puc, const uint32_t *s)

Returns the length (number of units) of the first character in s, putting its ucs4_t representation in *puc. Returns 0 if it is the NUL character. Returns -1 upon failure.

Function: const uint8_t * u8_next (ucs4_t *puc, const uint8_t *s)

Function: const uint16_t * u16_next (ucs4_t *puc, const uint16_t *s)

Function: const uint32_t * u32_next (ucs4_t *puc, const uint32_t *s)

Forward iteration step. Advances the pointer past the next character, or returns NULL if the end of the string has been reached. Puts the character's ucs4_t representation in *puc.

The following function inspects and returns details about the previous character in a Unicode string.

Function: const uint8_t * u8_prev (ucs4_t *puc, const uint8_t *s, const uint8_t *start)

Function: const uint16_t * u16_prev (ucs4_t *puc, const uint16_t *s, const uint16_t *start)

Function: const uint32_t * u32_prev (ucs4_t *puc, const uint32_t *s, const uint32_t *start)

Backward iteration step. Moves the pointer back to the previous character (the one that ends at s), or returns NULL if the beginning of the string (specified by start) has been reached. Puts the character's ucs4_t representation in *puc. Note that this function works only on well-formed Unicode strings.

4.5.2 Length of a NUL terminated Unicode string

The following functions determine the length of a Unicode string.

Function: size_t u8_strlen (const uint8_t *s)

Function: size_t u16_strlen (const uint16_t *s)

Function: size_t u32_strlen (const uint32_t *s)

Returns the number of units in s.

This function is similar to strlen and wcslen, except that it operates on Unicode strings.

Function: size_t u8_strnlen (const uint8_t *s, size_t maxlen)

Function: size_t u16_strnlen (const uint16_t *s, size_t maxlen)

Function: size_t u32_strnlen (const uint32_t *s, size_t maxlen)

Returns the number of units in s, but at most maxlen.

This function is similar to strnlen and wcsnlen, except that it operates on Unicode strings.

4.5.3 Copying a NUL terminated Unicode string

The following functions copy portions of Unicode strings in memory.

Function: uint8_t * u8_strcpy (uint8_t *dest, const uint8_t *src)

Function: uint16_t * u16_strcpy (uint16_t *dest, const uint16_t *src)

Function: uint32_t * u32_strcpy (uint32_t *dest, const uint32_t *src)

Copies src to dest.

This function is similar to strcpy and wcscpy, except that it operates on Unicode strings.

Function: uint8_t * u8_stpcpy (uint8_t *dest, const uint8_t *src)

Function: uint16_t * u16_stpcpy (uint16_t *dest, const uint16_t *src)

Function: uint32_t * u32_stpcpy (uint32_t *dest, const uint32_t *src)

Copies src to dest, returning the address of the terminating NUL in dest.

This function is similar to stpcpy, except that it operates on Unicode strings.

Function: uint8_t * u8_strncpy (uint8_t *dest, const uint8_t *src, size_t n)

Function: uint16_t * u16_strncpy (uint16_t *dest, const uint16_t *src, size_t n)

Function: uint32_t * u32_strncpy (uint32_t *dest, const uint32_t *src, size_t n)

Copies no more than n units of src to dest.

This function is similar to strncpy and wcsncpy, except that it operates on Unicode strings.

Function: uint8_t * u8_stpncpy (uint8_t *dest, const uint8_t *src, size_t n)

Function: uint16_t * u16_stpncpy (uint16_t *dest, const uint16_t *src, size_t n)

Function: uint32_t * u32_stpncpy (uint32_t *dest, const uint32_t *src, size_t n)

Copies no more than n units of src to dest. Returns a pointer past the last non-NUL unit written into dest. In other words, if the units written into dest include a NUL, the return value is the address of the first such NUL unit, otherwise it is dest + n.

This function is similar to stpncpy, except that it operates on Unicode strings.

Function: uint8_t * u8_strcat (uint8_t *dest, const uint8_t *src)

Function: uint16_t * u16_strcat (uint16_t *dest, const uint16_t *src)

Function: uint32_t * u32_strcat (uint32_t *dest, const uint32_t *src)

Appends src onto dest.

This function is similar to strcat and wcscat, except that it operates on Unicode strings.

Function: uint8_t * u8_strncat (uint8_t *dest, const uint8_t *src, size_t n)

Function: uint16_t * u16_strncat (uint16_t *dest, const uint16_t *src, size_t n)

Function: uint32_t * u32_strncat (uint32_t *dest, const uint32_t *src, size_t n)

Appends no more than n units of src onto dest.

This function is similar to strncat and wcsncat, except that it operates on Unicode strings.

4.5.4 Comparing NUL terminated Unicode strings

The following functions compare two Unicode strings.

Function: int u8_strcmp (const uint8_t *s1, const uint8_t *s2)

Function: int u16_strcmp (const uint16_t *s1, const uint16_t *s2)

Function: int u32_strcmp (const uint32_t *s1, const uint32_t *s2)

Compares s1 and s2, lexicographically. Returns a negative value if s1 compares smaller than s2, a positive value if s1 compares larger than s2, or 0 if they compare equal.

This function is similar to strcmp and wcscmp, except that it operates on Unicode strings.

Function: int u8_strcoll (const uint8_t *s1, const uint8_t *s2)

Function: int u16_strcoll (const uint16_t *s1, const uint16_t *s2)

Function: int u32_strcoll (const uint32_t *s1, const uint32_t *s2)

Compares s1 and s2 using the collation rules of the current locale. Returns -1 if s1 < s2, 0 if s1 = s2, 1 if s1 > s2. Upon failure, sets errno and returns any value.

This function is similar to strcoll and wcscoll, except that it operates on Unicode strings.

Note that this function may consider different canonical normalizations of the same string as having a large distance. It is therefore better to use the function u8_normcoll instead of this one; see Normalization forms (composition and decomposition) <uninorm.h>.

Function: int u8_strncmp (const uint8_t *s1, const uint8_t *s2, size_t n)

Function: int u16_strncmp (const uint16_t *s1, const uint16_t *s2, size_t n)

Function: int u32_strncmp (const uint32_t *s1, const uint32_t *s2, size_t n)

Compares no more than n units of s1 and s2.

This function is similar to strncmp and wcsncmp, except that it operates on Unicode strings.

4.5.5 Duplicating a NUL terminated Unicode string

The following function allocates a duplicate of a Unicode string.

Function: uint8_t * u8_strdup (const uint8_t *s)

Function: uint16_t * u16_strdup (const uint16_t *s)

Function: uint32_t * u32_strdup (const uint32_t *s)

Duplicates s, returning an identical malloc'd string.

This function is similar to strdup and wcsdup, except that it operates on Unicode strings.

4.5.6 Searching for a character in a NUL terminated Unicode string

The following functions search for a given Unicode character.

Function: uint8_t * u8_strchr (const uint8_t *str, ucs4_t uc)

Function: uint16_t * u16_strchr (const uint16_t *str, ucs4_t uc)

Function: uint32_t * u32_strchr (const uint32_t *str, ucs4_t uc)

Finds the first occurrence of uc in str.

This function is similar to strchr and wcschr, except that it operates on Unicode strings.

Function: uint8_t * u8_strrchr (const uint8_t *str, ucs4_t uc)

Function: uint16_t * u16_strrchr (const uint16_t *str, ucs4_t uc)

Function: uint32_t * u32_strrchr (const uint32_t *str, ucs4_t uc)

Finds the last occurrence of uc in str.

This function is similar to strrchr and wcsrchr, except that it operates on Unicode strings.

The following functions search for the first occurrence of some Unicode character in or outside a given set of Unicode characters.

Function: size_t u8_strcspn (const uint8_t *str, const uint8_t *reject)

Function: size_t u16_strcspn (const uint16_t *str, const uint16_t *reject)

Function: size_t u32_strcspn (const uint32_t *str, const uint32_t *reject)

Returns the length of the initial segment of str which consists entirely of Unicode characters not in reject.

This function is similar to strcspn and wcscspn, except that it operates on Unicode strings.

Function: size_t u8_strspn (const uint8_t *str, const uint8_t *accept)

Function: size_t u16_strspn (const uint16_t *str, const uint16_t *accept)

Function: size_t u32_strspn (const uint32_t *str, const uint32_t *accept)

Returns the length of the initial segment of str which consists entirely of Unicode characters in accept.

This function is similar to strspn and wcsspn, except that it operates on Unicode strings.

Function: uint8_t * u8_strpbrk (const uint8_t *str, const uint8_t *accept)

Function: uint16_t * u16_strpbrk (const uint16_t *str, const uint16_t *accept)

Function: uint32_t * u32_strpbrk (const uint32_t *str, const uint32_t *accept)

Finds the first occurrence in str of any character in accept.

This function is similar to strpbrk and wcspbrk, except that it operates on Unicode strings.

4.5.7 Searching for a substring in a NUL terminated Unicode string

The following functions search whether a given Unicode string is a substring of another Unicode string.

Function: uint8_t * u8_strstr (const uint8_t *haystack, const uint8_t *needle)

Function: uint16_t * u16_strstr (const uint16_t *haystack, const uint16_t *needle)

Function: uint32_t * u32_strstr (const uint32_t *haystack, const uint32_t *needle)

Finds the first occurrence of needle in haystack.

This function is similar to strstr and wcsstr, except that it operates on Unicode strings.

Function: bool u8_startswith (const uint8_t *str, const uint8_t *prefix)

Function: bool u16_startswith (const uint16_t *str, const uint16_t *prefix)

Function: bool u32_startswith (const uint32_t *str, const uint32_t *prefix)

Tests whether str starts with prefix.

Function: bool u8_endswith (const uint8_t *str, const uint8_t *suffix)

Function: bool u16_endswith (const uint16_t *str, const uint16_t *suffix)

Function: bool u32_endswith (const uint32_t *str, const uint32_t *suffix)

Tests whether str ends with suffix.

4.5.8 Tokenizing a NUL terminated Unicode string

The following function does one step in tokenizing a Unicode string.

Function: uint8_t * u8_strtok (uint8_t *str, const uint8_t *delim, uint8_t **ptr)

Function: uint16_t * u16_strtok (uint16_t *str, const uint16_t *delim, uint16_t **ptr)

Function: uint32_t * u32_strtok (uint32_t *str, const uint32_t *delim, uint32_t **ptr)

Divides str into tokens separated by characters in delim.

This function is similar to strtok_r and wcstok, except that it operates on Unicode strings. Its interface is actually more similar to wcstok than to strtok.

This document was generated by Bruno Haible on October, 16 2022 using texi2html 1.78a.