The authors of the ISO C and POSIX standards attempted to fix the first problem mentioned in the section ‘char *’ strings. They introduced

a type ‘wchar_t’, designed to encapsulate an entire character,
a “wide string” type ‘wchar_t *’, and
functions declared in <wctype.h> that were meant to supplant the ones in <ctype.h>.
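As a rough illustration of how this API is meant to be used, the following sketch counts the alphabetic characters in a wide string with iswalpha from <wctype.h>. The helper name count_alpha is hypothetical, and the result depends on the current locale; the sketch assumes the default "C" locale and ASCII input.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the characters of a wide string that
   iswalpha classifies as alphabetic in the current locale. */
size_t count_alpha(const wchar_t *ws)
{
    size_t count = 0;
    for (; *ws != L'\0'; ws++) {
        if (iswalpha((wint_t) *ws))
            count++;
    }
    return count;
}
```

Calling count_alpha(L"ab1") visits three wide characters and counts the two letters.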

Unfortunately, this API and its implementations have numerous problems:

On AIX and Windows platforms, wchar_t is a 16-bit type, so a single wchar_t cannot represent every Unicode character. Either the wchar_t * strings are limited to characters in UCS-2 (the “Basic Multilingual Plane” of Unicode), or, if wchar_t * strings are encoded in UTF-16, a wchar_t represents only half of a character in the worst case, making the <wctype.h> functions pointless.
On Solaris and FreeBSD, the wchar_t encoding is locale dependent and undocumented. This means that if you want to know any property of a wchar_t character other than those defined by <wctype.h> (for example, whether it is a dash, a currency symbol, or a paragraph separator), you first have to convert it to char * encoding with the function wctomb.
When you read a stream of wide characters through the functions fgetwc and fgetws, and the input stream or file is not in the expected encoding, you have no way to locate the invalid byte sequence and take corrective action. If you use these functions, your program becomes “garbage in - more garbage out” or “garbage in - abort”.
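The 16-bit problem can be made concrete with a small sketch of UTF-16 encoding. The helper encode_utf16 is a hypothetical name, not part of any standard library, and uint16_t is used here as a stand-in for a 16-bit wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the Unicode code point cp as UTF-16.
   Stores 1 or 2 code units in out and returns how many were stored,
   or 0 if cp is not a valid Unicode scalar value. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;               /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t) cp; /* fits in one 16-bit unit */
        return 1;
    }
    cp -= 0x10000;              /* 20 bits remain, split 10 + 10 */
    out[0] = (uint16_t) (0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For U+1F600 this produces the two units 0xD83D and 0xDE00. Neither unit identifies a character on its own, so passing either one to a <wctype.h> function cannot give a meaningful answer.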

As a consequence, it is better to use multibyte strings, as explained in the section ‘char *’ strings. Such multibyte strings can bypass limitations of the wchar_t type, if you use functions defined in gnulib and libunistring for text processing. They can also faithfully transport malformed characters that were present in the input, without requiring the program to produce garbage or abort.
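To illustrate why a multibyte representation allows corrective action, here is a minimal sketch that scans a byte buffer assumed to be UTF-8 and reports where the first malformed sequence starts. The function first_invalid_utf8 is a hypothetical, simplified stand-in written for this illustration; it is not the gnulib or libunistring implementation, which should be preferred in real code.

```c
#include <stddef.h>

/* Hypothetical helper: return the offset of the first byte that does
   not start a well-formed UTF-8 sequence, or n if the whole buffer
   is well-formed.  The buffer itself is never modified. */
size_t first_invalid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        unsigned long cp;
        size_t len;

        if (b < 0x80) {                 /* ASCII byte */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            len = 2; cp = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) {
            len = 3; cp = b & 0x0F;
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4; cp = b & 0x07;
        } else {
            return i;                   /* invalid lead byte */
        }
        if (n - i < len)
            return i;                   /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return i;               /* missing continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and out-of-range values. */
        if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000)
            || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return i;
        i += len;
    }
    return n;
}
```

Because the caller gets an offset and the original bytes are untouched, the program can choose to skip, replace, or pass through the malformed bytes, which is the choice that fgetwc and fgetws do not offer.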


This document was generated by Bruno Haible on October, 16 2022 using texi2html 1.78a.

A. The wchar_t mess
