Merge branch 'feature/upstream' into develop

author: Jörg Frings-Fürst <debian@jff.email> 2022-01-08 11:53:52 +0100
committer: Jörg Frings-Fürst <debian@jff.email> 2022-01-08 11:53:52 +0100
commit: fa838e76139763f902c7d27cb9e1d393ed6a15e4 (patch)
tree: 7d0ae09775ea950056193eaa2ca93844299d46f1 /doc/char32_t.texi
parent: c78359d9542c86b972aac373efcf7bc7a8a560e5 (diff)
parent: 2959e59fab3bab834368adefd90bd4b1b094366b (diff)
1 files changed, 50 insertions, 0 deletions
diff --git a/doc/char32_t.texi b/doc/char32_t.texi
new file mode 100644
index 0000000..040e298
--- /dev/null
+++ b/doc/char32_t.texi
@@ -0,0 +1,50 @@
+@node The char32_t problem
+@appendix The @code{char32_t} problem
+
+@cindex char32_t, type
+@cindex char16_t, type
+In response to the @code{wchar_t} mess described in the previous section,
+ISO C 11 introduces two new types: @code{char32_t} and @code{char16_t}.
+
+@code{char32_t} is a type like @code{wchar_t}, with the added guarantee that it
+is 32 bits wide.  So, it is a type that is appropriate for encoding a Unicode
+character.  It is meant to resolve the problems of the 16-bit wide
+@code{wchar_t} on AIX and Windows platforms, and allow a saner programming model
+for wide character strings across all platforms.
+
+@code{char16_t} is a type like @code{wchar_t}, with the added guarantee that it
+is 16 bits wide.  It is meant to allow porting programs that use the broken wide
+character strings programming model from Windows to all platforms.  Of course,
+no one needs this.
+
+These types come with a syntax for wide string literals of these element types:
+@code{u"..."} for @code{char16_t} and @code{U"..."} for @code{char32_t}.
+
+So far, so good.  What the ISO C designers forgot is to provide standardized C
+library functions that operate on these wide character strings.  They
+standardized only the most basic functions, @code{mbrtoc32} and @code{c32rtomb},
+which are analogous to @code{mbrtowc} and @code{wcrtomb}, respectively.  For the
+rest, GNU gnulib @url{https://www.gnu.org/software/gnulib/} provides the
+functions:
+@itemize @bullet
+@item
+Functions for converting an entire string: @code{mbstoc32s} (like
+@code{mbstowcs}) and @code{c32stombs} (like @code{wcstombs}).
+@item
+Functions for testing the properties of a 32-bit wide character:
+@code{c32isalnum}, @code{c32isalpha}, etc. -- like @code{iswalnum},
+@code{iswalpha}, etc.
+@end itemize
+
+Still, this API has two problems:
+@itemize @bullet
+@item
+The @code{char32_t} encoding is locale dependent and undocumented.  This means
+that if you want to know any property of a @code{char32_t} character other than
+those defined by @code{<wctype.h>} (such as whether it's a dash, currency
+symbol, paragraph separator, or similar), you have to convert it to
+@code{char *} encoding first, by use of the function @code{c32rtomb}.
+@item
+Even on platforms where @code{wchar_t} is 32 bits wide, the @code{char32_t}
+encoding may be different from the @code{wchar_t} encoding.
+@end itemize
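
Note on the added text: the two standardized functions it names, mbrtoc32 and c32rtomb, can be exercised with a short round trip. What follows is a minimal sketch and not part of the patch. It assumes a C11 library that ships <uchar.h>; the helper names decode_to_c32 and encode_from_c32 are hypothetical, chosen only for illustration. Both conversions depend on the current locale, which is the point the appendix makes.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: decode the multibyte string SRC (current locale's
   encoding) into at most DESTLEN char32_t units.  Returns the number of
   units written, or (size_t) -1 on invalid input or insufficient space.  */
size_t decode_to_c32(const char *src, char32_t *dest, size_t destlen)
{
    assert(src != NULL && dest != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    size_t remaining = strlen(src);
    size_t count = 0;

    while (remaining > 0) {
        char32_t wc;
        size_t ret = mbrtoc32(&wc, src, remaining, &state);
        if (ret == (size_t) -1 || ret == (size_t) -2)
            return (size_t) -1;        /* invalid or incomplete sequence */
        if (ret == (size_t) -3)
            ret = 0;                   /* extra unit from earlier input */
        else if (ret == 0)
            break;                     /* decoded a null character */
        if (count == destlen)
            return (size_t) -1;        /* destination too small */
        dest[count++] = wc;
        src += ret;
        remaining -= ret;
    }
    return count;
}

/* Hypothetical helper: encode SRCLEN char32_t units back into a
   NUL-terminated multibyte string in DEST (capacity DESTLEN bytes).
   Returns the number of bytes written, excluding the NUL, or
   (size_t) -1 on failure.  */
size_t encode_from_c32(const char32_t *src, size_t srclen,
                       char *dest, size_t destlen)
{
    assert(src != NULL && dest != NULL);
    if (destlen == 0)
        return (size_t) -1;
    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t used = 0;

    for (size_t i = 0; i < srclen; i++) {
        char buf[MB_LEN_MAX];
        size_t ret = c32rtomb(buf, src[i], &state);
        if (ret == (size_t) -1)
            return (size_t) -1;        /* not representable in this locale */
        if (used + ret >= destlen)
            return (size_t) -1;        /* keep one byte for the NUL */
        memcpy(dest + used, buf, ret);
        used += ret;
    }
    dest[used] = '\0';
    return used;
}
```

A caller would normally run setlocale (LC_ALL, "") first so that non-ASCII input is decoded according to the user's locale; without that call the program stays in the "C" locale, where only ASCII input can be relied upon.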