mirror of
https://github.com/fltk/fltk.git
synced 2026-06-07 09:13:58 +08:00
documentation/unicode.dox: added to the Unicode and UTF-8 Support chapter
added references to RFC 3629 as the source of the 21-bit U+10FFFF limit, outlined the illegal character strategy of fl_utf8decode(), and added warnings that fl_utf8len() is unsafe

git-svn-id: file:///fltk/svn/fltk/branches/branch-1.3@7610 ea41ed52-d2ee-0310-a9c1-e6b18d33e121
@@ -19,6 +19,7 @@ For further information, please see:
 - http://www.iso.org
 - http://en.wikipedia.org/wiki/Unicode
 - http://www.cl.cam.ac.uk/~mgk25/unicode.html
+- http://www.apps.ietf.org/rfc/rfc3629.html

 \par The Unicode Standard

@@ -64,9 +65,20 @@ and are usually shown using 'U+' and the code in hexadecimal,
 e.g. U+0041 is the "Latin capital letter A".
 The UCS characters U+0000 to U+007F correspond to US-ASCII,
 and U+0000 to U+00FF correspond to ISO 8859-1 (Latin1).

+ISO 10646 was originally designed to handle a 31-bit character set
+from U+00000000 to U+7FFFFFFF, but the current idea is that 21 bits
+will be sufficient for all future needs, giving characters up to
+U+10FFFF. The complete character set is sub-divided into \e planes.
+<i>Plane 0</i>, also known as the <b>Basic Multilingual Plane</b>
+(BMP), ranges from U+0000 to U+FFFF and consists of the most commonly
+used characters from previous encoding standards. Other planes
+contain characters for specialist applications.
+\todo
+Do we need this info about planes?
+
 The UCS also defines various methods of encoding characters as
 a sequence of bytes.

 UCS-2 encodes Unicode characters into two bytes,
 which is wasteful if you are only dealing with ASCII or Latin1 text,
 and insufficient if you need characters above U+00FFFF.
@@ -77,6 +89,8 @@ but this is even more wasteful for ASCII or Latin1.

 The Unicode standard defines various UCS Transformation Formats.
 UTF-16 and UTF-32 are based on units of two and four bytes.
+UCS characters requiring more than 16 bits are encoded using
+"surrogate pairs" in UTF-16.

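To illustrate the surrogate pair mechanism mentioned in this hunk, the following is a minimal sketch of combining a UTF-16 high/low surrogate pair into a single UCS code point. The helper names are hypothetical and are not part of FLTK; returning U+FFFD for a malformed pair is an assumption made for this example only.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; these are not FLTK functions.
constexpr bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low surrogate pair into one code point in the range
// U+10000..U+10FFFF. A malformed pair yields U+FFFD (an assumption of
// this sketch, not documented FLTK behaviour).
constexpr std::uint32_t surrogates_to_ucs(std::uint16_t high, std::uint16_t low) {
  return (is_high_surrogate(high) && is_low_surrogate(low))
    ? 0x10000u + ((std::uint32_t(high) - 0xD800u) << 10)
               + (std::uint32_t(low) - 0xDC00u)
    : 0xFFFDu;
}
```

Each surrogate carries 10 bits of payload, so a pair covers the 2^20 code points above the 16-bit range, which is where the U+10FFFF upper limit comes from.
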
 UTF-8 encodes all Unicode characters into variable length
 sequences of bytes. Unicode characters in the 7-bit ASCII
@@ -86,7 +100,7 @@ making the transformation to Unicode quick and easy.
 All UCS characters above U+007F are encoded as a sequence of
 several bytes. The top bits of the first byte are set to show
 the length of the byte sequence, and subsequent bytes are
-always in the range 0x80 to 8x8F. This combination provides
+always in the range 0x80 to 0xBF. This combination provides
 some level of synchronisation and error detection.

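As a rough sketch of how the first byte announces the sequence length, the function below classifies a single byte by its top bits. It is a hypothetical helper written for this note and is not FLTK's fl_utf8len(). It relies on the RFC 3629 layout, in which continuation bytes have the form 10xxxxxx, that is 0x80 to 0xBF.

```cpp
// Hypothetical helper for illustration; not FLTK's fl_utf8len().
// Returns the sequence length announced by a lead byte, or -1 if the byte
// cannot start a sequence (a continuation byte, or 0xF8..0xFF).
// This sketch looks at the top bits only: it does not reject overlong
// forms (0xC0, 0xC1) or lead bytes that would exceed U+10FFFF.
constexpr int utf8_length_from_lead(unsigned char b) {
  return b < 0x80 ? 1      // 0xxxxxxx: plain ASCII
       : b < 0xC0 ? -1     // 10xxxxxx: continuation byte
       : b < 0xE0 ? 2      // 110xxxxx
       : b < 0xF0 ? 3      // 1110xxxx
       : b < 0xF8 ? 4      // 11110xxx
       : -1;               // 0xF8..0xFF never appear in UTF-8
}

// True for bytes of the form 10xxxxxx (0x80..0xBF).
constexpr bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }
```

Because lead bytes and continuation bytes occupy disjoint ranges, a decoder that lands in the middle of a sequence can skip forward to the next byte that is not a continuation byte, which is the synchronisation property described above.
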
 <table summary="Unicode character byte sequences" align="center">
@@ -128,9 +142,13 @@ library.

 \section unicode_in_fltk Unicode in FLTK

-FLTK will be entirely converted to Unicode in UTF-8 encoding.
-If a different encoding is required by the underlying operatings
-system, FLTK will convert string as needed.
+\todo
+Work through the code and this documentation to harmonize
+the [<b>OksiD</b>] and [<b>fltk2</b>] functions.
+
+FLTK will be entirely converted to Unicode using UTF-8 encoding.
+If a different encoding is required by the underlying operating
+system, FLTK will convert the string as needed.

 It is important to note that the initial implementation of
 Unicode and UTF-8 in FLTK involves three important areas:
@@ -138,7 +156,7 @@ Unicode and UTF-8 in FLTK involves three important areas:
 - provision of Unicode character tables and some simple related functions;

 - conversion of char* variables and function parameters from single byte
-per character representation to UTF-8 variable length characters;
+per character representation to UTF-8 variable length sequences;

 - modifications to the display font interface to accept general
 Unicode character or UCS code numbers instead of just ASCII or Latin1
@@ -147,9 +165,15 @@ Unicode and UTF-8 in FLTK involves three important areas:
 The current implementation of Unicode / UTF-8 in FLTK will impose
 the following limitations:

-- An implementation note in the code says that all functions are
-LIMITED to 24 bit Unicode values, but also says that only 16 bits
+- An implementation note in the [<b>OksiD</b>] code says that all functions
+are LIMITED to 24 bit Unicode values, but also says that only 16 bits
 are really used under linux and win32.
+<b>[Can we verify this?]</b>
+
+- The [<b>fltk2</b>] %fl_utf8encode() and %fl_utf8decode() functions are
+designed to handle Unicode characters in the range U+000000 to U+10FFFF
+inclusive, which covers all UTF-16 characters, as specified in RFC 3629.
+<i>Note that the user must first convert UTF-16 surrogate pairs to UCS.</i>

 - FLTK will only handle single characters, so composed characters
 consisting of a base character and floating accent characters
@@ -164,8 +188,54 @@ the following limitations:
 Verify 16/24 bit Unicode limit for different character sets?
 OksiD's code appears limited to 16-bit whereas the FLTK2 code
 appears to handle a wider set. What about illegal characters?
-See comments in fl_utf8fromwc() and fl_utf8toUtf16().
+See comments in %fl_utf8fromwc() and %fl_utf8toUtf16().

+\section unicode_illegals Illegal Unicode and UTF-8 sequences
+
+Three pre-processor variables are defined in the source code that
+determine how %fl_utf8decode() handles illegal UTF-8 sequences:
+
+- if ERRORS_TO_CP1252 is set to 1 (the default), %fl_utf8decode() will
+assume that a byte sequence starting with a byte in the range 0x80
+to 0x9f represents a Microsoft CP1252 character, and will instead
+return the value of an equivalent UCS character. Otherwise, it
+will be processed as an illegal byte value as described below.
+
+- if STRICT_RFC3629 is set to 1 (not the default!) then UTF-8
+sequences that correspond to illegal UCS values are treated as
+errors. Illegal UCS values include those above U+10FFFF, or
+corresponding to UTF-16 surrogate pairs. Illegal byte values
+are handled as described below.
+
+- if ERRORS_TO_ISO8859_1 is set to 1 (the default), the illegal
+byte value is returned unchanged, otherwise 0xFFFD, the Unicode
+REPLACEMENT CHARACTER, is returned instead.
+
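The interaction of the ERRORS_TO_CP1252 and ERRORS_TO_ISO8859_1 settings described in this hunk can be sketched for the simplest case, a single byte that cannot start a UTF-8 sequence. This is an illustrative reconstruction, not the FLTK source: the function and table names are hypothetical, the two settings are modelled as run-time flags rather than pre-processor variables, and mapping the five bytes that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) to themselves is an assumption of this sketch.

```cpp
// Microsoft CP1252 code points for bytes 0x80..0x9F. Bytes that CP1252
// leaves undefined map to themselves here (an assumption of this sketch).
constexpr unsigned short kCp1252ToUcs[32] = {
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Hypothetical helper: choose the UCS value reported for one illegal byte.
// The flags stand in for the ERRORS_TO_CP1252 and ERRORS_TO_ISO8859_1
// pre-processor variables described in the text.
constexpr unsigned fallback_for_illegal_byte(unsigned char b,
                                             bool errors_to_cp1252,
                                             bool errors_to_iso8859_1) {
  return (errors_to_cp1252 && b >= 0x80 && b < 0xA0)
           ? static_cast<unsigned>(kCp1252ToUcs[b - 0x80])  // CP1252 guess
       : errors_to_iso8859_1
           ? static_cast<unsigned>(b)                       // byte unchanged
           : 0xFFFDu;                                       // REPLACEMENT CHARACTER
}
```

With both flags enabled, as in the defaults described above, the byte 0x93 would be reported as U+201C (a left double quotation mark) rather than as an error.
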
+%fl_utf8encode() is less strict, and only generates the UTF-8
+sequence for 0xFFFD, the Unicode REPLACEMENT CHARACTER, if it is
+asked to encode a UCS value above U+10FFFF.
+
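The encoding rule described here can be sketched as follows. This is a minimal, self-contained illustration (C++14 or later) of an encoder that substitutes U+FFFD only for values above U+10FFFF. The type and function names are hypothetical, and the code is not the %fl_utf8encode() implementation.

```cpp
// Hypothetical result type: up to four UTF-8 bytes plus the length used.
struct Utf8Bytes {
  unsigned char bytes[4];
  int len;
};

// Encode one UCS value as UTF-8. Values above U+10FFFF are replaced by
// U+FFFD. Like the "less strict" behaviour described in the text, this
// sketch does not reject UTF-16 surrogate values (U+D800..U+DFFF).
constexpr Utf8Bytes encode_utf8(unsigned long ucs) {
  Utf8Bytes out = {{0, 0, 0, 0}, 0};
  if (ucs > 0x10FFFFul) ucs = 0xFFFDul;  // out of range: REPLACEMENT CHARACTER
  if (ucs < 0x80ul) {                    // 1 byte:  0xxxxxxx
    out.bytes[0] = static_cast<unsigned char>(ucs);
    out.len = 1;
  } else if (ucs < 0x800ul) {            // 2 bytes: 110xxxxx 10xxxxxx
    out.bytes[0] = static_cast<unsigned char>(0xC0 | (ucs >> 6));
    out.bytes[1] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
    out.len = 2;
  } else if (ucs < 0x10000ul) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out.bytes[0] = static_cast<unsigned char>(0xE0 | (ucs >> 12));
    out.bytes[1] = static_cast<unsigned char>(0x80 | ((ucs >> 6) & 0x3F));
    out.bytes[2] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
    out.len = 3;
  } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
    out.bytes[0] = static_cast<unsigned char>(0xF0 | (ucs >> 18));
    out.bytes[1] = static_cast<unsigned char>(0x80 | ((ucs >> 12) & 0x3F));
    out.bytes[2] = static_cast<unsigned char>(0x80 | ((ucs >> 6) & 0x3F));
    out.bytes[3] = static_cast<unsigned char>(0x80 | (ucs & 0x3F));
    out.len = 4;
  }
  return out;
}
```

Under these assumptions, a request to encode 0x110000 produces the three bytes EF BF BD, which is the UTF-8 form of U+FFFD.
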
+Many of the [<b>fltk2</b>] functions below use %fl_utf8decode() and
+%fl_utf8encode() in their own implementation, and are therefore
+somewhat protected from bad UTF-8 sequences.
+
+The [<b>OksiD</b>] %fl_utf8len() function assumes that the byte it is
+passed is the first byte in a UTF-8 sequence, and returns the length
+of the sequence. Trailing bytes in a UTF-8 sequence will return -1.
+
+- \b WARNING:
+%fl_utf8len() cannot distinguish between single
+bytes representing Microsoft CP1252 characters 0x80-0x9f and
+those forming part of a valid UTF-8 sequence. You are strongly
+advised not to use %fl_utf8len() in your own code unless you
+know that the byte sequence contains only valid UTF-8 sequences.
+
+- \b WARNING:
+Some of the [OksiD] functions below still use %fl_utf8len() in
+their implementations. These may need further validation.
+
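One way to meet the precondition stated in the warnings (that the input contains only valid UTF-8) is to validate a buffer before using any function that trusts the lead byte. The following is a hedged sketch of such a check following RFC 3629 (C++14 or later). The function name is hypothetical and nothing here is taken from the FLTK sources.

```cpp
#include <cstddef>

// Hypothetical validator for illustration; not an FLTK function.
// Returns true only if the n bytes at s form well-formed UTF-8 as defined
// by RFC 3629: no stray continuation bytes, no truncated or overlong
// sequences, no surrogates, nothing above U+10FFFF.
constexpr bool is_valid_utf8(const unsigned char* s, std::size_t n) {
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b = s[i];
    std::size_t len = 1;          // sequence length announced by the lead byte
    unsigned long min = 0;        // smallest code point allowed for that length
    unsigned long cp = b;         // code point being assembled
    if (b < 0x80)      { /* ASCII, nothing more to do */ }
    else if (b < 0xC0) { return false; }  // stray continuation (incl. 0x80..0x9F)
    else if (b < 0xE0) { len = 2; min = 0x80ul;    cp = b & 0x1F; }
    else if (b < 0xF0) { len = 3; min = 0x800ul;   cp = b & 0x0F; }
    else if (b < 0xF8) { len = 4; min = 0x10000ul; cp = b & 0x07; }
    else               { return false; }  // 0xF8..0xFF are never valid
    if (len > n - i) return false;        // sequence runs past the buffer
    for (std::size_t k = 1; k < len; ++k) {
      if ((s[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }
    if (cp < min) return false;                      // overlong encoding
    if (cp > 0x10FFFFul) return false;               // beyond the Unicode range
    if (cp >= 0xD800ul && cp <= 0xDFFFul) return false;  // UTF-16 surrogate
    i += len;
  }
  return true;
}
```

A lone CP1252 byte such as 0x93 fails this check, so text in that encoding is detected before a length function that assumes valid UTF-8 is applied to it.
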
+Please see the individual function descriptions for further details
+about error handling and return values.
+
 \section unicode_fltk_calls FLTK Unicode and UTF-8 functions

