EuroHorseMotörheadPoo.txt - Test file for UTF-8 / Unicode-16 encoding
-----------------------------------------------------------------
€uro馬orseMotörhead💩oo.txt - Tëŝt fílè før ÜTF-⑻ / 𝓤𝓷𝓲𝓬𝓸𝓭𝓮-①⑥ 𝓮𝓷𝓬𝓸𝓭𝓲𝓷𝓰
VLSI Solution 2024-10-18, version 1.02



OVERVIEW
--------

This file explains the support VLSI Solution has for UTF-8 / Unicode-16
encodings.

VSOS doesn't have any 8-bit data types. Internally, VSOS handles its
characters as 16-bit words. While this creates certain incompatibilities
with the outside world, the upshot of this is that VSOS characters are
Unicode-16 encoded, and thus capable of natively representing almost all
of the world's scripts. Partial support for Unicode-16 surrogate pairs
extends this to the full 21-bit character range.

As its outside representation, VSOS uses UTF-8. However, UTF-8 support
doesn't come from the core of the operating system. Rather, it is
provided by the following pieces of software, which convert between
UTF-8 and VSOS's internal Unicode-16 representation:
- UARTIN.DL3: UART In/Out driver.
- More: Program to show text or binary files.
- Edit: File editor. Does not represent characters outside of the 7-bit
  range correctly, but will save them to file unmodified, so they will
  not be destroyed if left untouched.

VSOS natively only supports scripts that are read from left to right
and top to bottom.



CODE POINT RANGES
-----------------

UTF-8 sequences can be from 1 to 4 bytes.

Unicode-16 can directly represent all code points of up to 16 bits, but
for code points 0x10000 and above, a double-size surrogate pair
representation is required.

CODE
POINT    UNICODE-16     UTF-8               CHAR  DESCRIPTION
0x41     0x0041         0x41                A     'A', as in ABC.
0xD6     0x00D6         0xC3 0x96           Ö     O with two dots on top
0x3A3    0x03A3         0xCE 0xA3           Σ     Greek upper-case Sigma
0x20AC   0x20AC         0xE2 0x82 0xAC      €     Euro sign
0x99AC   0x99AC         0xE9 0xA6 0xAC      馬    Chinese horse
0x1F4A9  0xD83D 0xDCA9  0xF0 0x9F 0x92 0xA9 💩    Poo emoji

See file Utf8Test1.png to verify that you see the previous table correctly.
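
The lengths in the table can be cross-checked with a small sketch like
the one below. It is only an illustration: it is written in standard C
with <stdint.h> types instead of the VSOS u_int16 / u_int32 types, and
the function names Utf8Bytes and Utf16Words are made up for this example.

```c
#include <stdint.h>

/* Number of bytes in the UTF-8 encoding of a code point,
   or 0 if the code point is above 0x10FFFF.
   Not checked here: 0xD800 .. 0xDFFF are surrogate values,
   not characters. */
int Utf8Bytes(uint32_t codePoint) {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  if (codePoint < 0x110000) return 4;
  return 0;
}

/* Number of 16-bit words needed in Unicode-16: one word up to 0xFFFF,
   two words (a surrogate pair) above that, 0 if out of range. */
int Utf16Words(uint32_t codePoint) {
  if (codePoint < 0x10000) return 1;
  if (codePoint < 0x110000) return 2;
  return 0;
}
```

For the poo emoji, Utf8Bytes(0x1F4A9) gives 4 and Utf16Words(0x1F4A9)
gives 2, which agrees with the last row of the table.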

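UTF-8 sequences of different lengths cover the code point ranges
described in the sections below. In the bit patterns, the code point is
written as 0xUVWXYZ, where each letter stands for the bits of one
hexadecimal digit.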

There is overlap in the ranges: the longer UTF-8 encodings could represent
the whole range of all shorter representations. However, using overlong
representations is forbidden. In other words, even though 'A' could also
technically be represented with code sequences 0xF0 0x80 0x81 0x81,
0xE0 0x81 0x81, or 0xC1 0x81, its only valid representation is 0x41.
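
One way a decoder can enforce this rule is to compare the decoded value
against the smallest code point that needs the given sequence length.
The following is a minimal sketch of that idea in standard C. It is not
the code used by the VSOS programs listed in the overview, and the name
Utf8Decode is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence from s, where n bytes are available.
   On success stores the code point in *cp and returns the number of
   bytes consumed (1 to 4). Returns 0 for a truncated, malformed,
   overlong, or out-of-range (above 0x10FFFF) sequence. */
int Utf8Decode(const unsigned char *s, size_t n, uint32_t *cp) {
  /* Smallest code point allowed for each sequence length 1..4 */
  static const uint32_t minValue[5] = {0, 0x0, 0x80, 0x800, 0x10000};
  uint32_t c;
  int len, i;

  if (n == 0) return 0;
  if (s[0] < 0x80)                { len = 1; c = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
  else return 0;                  /* Not a valid first byte */

  if ((size_t)len > n) return 0;  /* Sequence is cut short */
  for (i = 1; i < len; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0; /* Must be 10xxxxxx */
    c = (c << 6) | (s[i] & 0x3F);
  }
  if (c < minValue[len]) return 0; /* Overlong encoding */
  if (c > 0x10FFFF) return 0;      /* Above the Unicode range */
  *cp = c;
  return len;
}
```

With this sketch, the byte 0x41 decodes to 'A', while the three overlong
forms of 'A' listed above are all rejected.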



1-BYTE UTF-8 SEQUENCE, example: 0x41 'A'
----------------------------------------

Code points 0x0 .. 0x7f.

UTF-8 bit/byte sequence: 0YYYZZZZ

This covers the whole 7-bit ASCII range, in which UTF-8 is compatible
with ASCII.

Example: The first letter of the alphabet, 'A', code point 0x41, is
presented as 0x41 in UTF-8.

8x8 ASCII graphics representation:
:   ##   :
:  #  #  :
: #    # :
: ###### :
: #    # :
: #    # :
: #    # :
:        :



2-BYTE UTF-8 SEQUENCE, example: 0xD6 'Ö' (Upper-Case O with Two Dots on Top)
----------------------------------------------------------------------------

Code points 0x80 .. 0x7ff

UTF-8 bit/byte sequence: 110XXXYY 10YYZZZZ

The UTF-8 sequence could technically also represent codes below 0x80,
but overlong encodings are forbidden.

This covers most Latin and Cyrillic alphabets. Code points 0x80-0xff are
the same as in ISO 8859-1 (Latin-1), but their 2-byte UTF-8
representations differ from the single bytes used in ISO 8859-1.
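
Because the code points are shared, converting an ISO 8859-1 byte to
UTF-8 only requires repacking its bits into the pattern above. The
following is a small sketch in standard C; the function name
Latin1ToUtf8 is hypothetical and not part of VSOS.

```c
/* Converts one ISO 8859-1 byte (0x00 .. 0xFF) to UTF-8.
   Writes 1 or 2 bytes to out and returns how many were written. */
int Latin1ToUtf8(unsigned char latin1, unsigned char out[2]) {
  if (latin1 < 0x80) {
    out[0] = latin1;                                /* 7-bit ASCII */
    return 1;
  }
  /* First byte 110XXXYY is always 0xC2 or 0xC3 for this range */
  out[0] = (unsigned char)(0xC0 | (latin1 >> 6));
  out[1] = (unsigned char)(0x80 | (latin1 & 0x3F)); /* 10YYZZZZ */
  return 2;
}
```

For the ISO 8859-1 byte 0xD6 this produces 0xC3 0x96, the same bytes
that the table in CODE POINT RANGES lists for code point 0xD6.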

Example: An upper-case O with two dots on top, 'Ö', code point 0xD6, is
presented as 0xC3 0x96 in UTF-8.

8x8 ASCII graphics representation:
: #    # :
:  ####  :
: #    # :
: #    # :
: #    # :
: #    # :
:  ####  :
:        :



3-BYTE UTF-8 SEQUENCE, examples: 0x20AC '€' (Euro Sign), 0x99AC '馬' (Horse)
----------------------------------------------------------------------------

Code points 0x800 .. 0xffff

UTF-8 bit/byte sequence: 1110WWWW 10XXXXYY 10YYZZZZ

The UTF-8 sequence could technically also represent codes below 0x800,
but overlong encodings are forbidden.

This covers most scripts used in the world.
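
The 3-byte bit pattern above can be sketched in standard C as follows.
The function name Utf8Encode3 is made up for this example, and
<stdint.h> types stand in for the VSOS integer types.

```c
#include <stdint.h>

/* Packs a code point in the range 0x800 .. 0xFFFF into three UTF-8
   bytes. Returns 0 on success, or -1 if the code point is outside
   the 3-byte range.
   Not checked here: 0xD800 .. 0xDFFF are surrogate values,
   not characters. */
int Utf8Encode3(uint32_t codePoint, unsigned char out[3]) {
  if (codePoint < 0x800 || codePoint > 0xFFFF) return -1;
  out[0] = (unsigned char)(0xE0 | (codePoint >> 12));         /* 1110WWWW */
  out[1] = (unsigned char)(0x80 | ((codePoint >> 6) & 0x3F)); /* 10XXXXYY */
  out[2] = (unsigned char)(0x80 | (codePoint & 0x3F));        /* 10YYZZZZ */
  return 0;
}
```

Code points below 0x800 are refused rather than encoded, because the
result would be an overlong sequence.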

Example 1: Euro sign '€', code point 0x20AC, is presented as
0xE2 0x82 0xAC in UTF-8.

8x8 ASCII graphics representation:
:  ####  :
: #    # :
:####    :
: #      :
:####    :
: #    # :
:  ####  :
:        :

Example 2: Chinese Horse symbol '馬', code point 0x99AC, is presented as
0xE9 0xA6 0xAC in UTF-8.

10x10 ASCII graphics representation:
: #######  :
: #  #     :
: ######   :
: #  #     :
: ######   :
: #  #     :
: ######## :
:# # # # # :
:#      #  :
:          :



4-BYTE UTF-8 SEQUENCE, example: 0x1F4A9 '💩' (Poo Emoji)
--------------------------------------------------------

Code points 0x10000 .. 0x10ffff

UTF-8 bit/byte sequence: 11110UVV 10VVWWWW 10XXXXYY 10YYZZZZ

The UTF-8 sequence could technically also represent codes below 0x10000,
but overlong encodings are forbidden. Code points above 0x10ffff that
could be represented with this number of bits, i.e. 0x110000 .. 0x1fffff,
are also forbidden.

This covers emojis and many other fringe cases.
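
Both restrictions, no overlong encodings and nothing above 0x10ffff,
can be expressed as a single range check before the bits are packed.
The following is a sketch in standard C with a hypothetical function
name, Utf8Encode4.

```c
#include <stdint.h>

/* Packs a code point in the range 0x10000 .. 0x10FFFF into four UTF-8
   bytes. Returns 0 on success, or -1 if the code point is below the
   4-byte range (would be overlong) or above 0x10FFFF (forbidden). */
int Utf8Encode4(uint32_t codePoint, unsigned char out[4]) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) return -1;
  out[0] = (unsigned char)(0xF0 | (codePoint >> 18));          /* 11110UVV */
  out[1] = (unsigned char)(0x80 | ((codePoint >> 12) & 0x3F)); /* 10VVWWWW */
  out[2] = (unsigned char)(0x80 | ((codePoint >> 6) & 0x3F));  /* 10XXXXYY */
  out[3] = (unsigned char)(0x80 | (codePoint & 0x3F));         /* 10YYZZZZ */
  return 0;
}
```

For code point 0x1F4A9 this yields 0xF0 0x9F 0x92 0xA9, the bytes listed
for the poo emoji in the table in CODE POINT RANGES.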

Example: The emoji for poo, '💩', code point 0x1F4A9, is presented as
0xF0 0x9F 0x92 0xA9 in UTF-8.

VSOS's internal representation for its character set is 16-bit Unicode.
Unfortunately, code points 0x10000 and above are outside of the range
that can be represented directly with a single Unicode-16 word. So, they
are presented as so-called surrogate pairs, where one code point is
encoded as two 16-bit Unicode characters, as in the following code:

  u_int32 codePoint = 0x1f4a9;       /* Range: 0x10000 .. 0x10FFFF; */
  u_int32 t32 = codePoint - 0x10000; /* Range:     0x0 ..  0xFFFFF; */
  u_int16 utf16[2];
  utf16[0] = 0xD800 | (u_int16)(t32 >> 10);   // High surrogate
  utf16[1] = 0xDC00 | ((u_int16)t32 & 0x3ff); // Low surrogate
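
The reverse direction, from a surrogate pair back to the code point,
can be sketched as follows. This is an illustration in standard C with
<stdint.h> types, and the function name SurrogatesToCodePoint is made
up for this example.

```c
#include <stdint.h>

/* Combines a high surrogate (0xD800 .. 0xDBFF) and a low surrogate
   (0xDC00 .. 0xDFFF) into a code point 0x10000 .. 0x10FFFF.
   Returns 0 if the two words do not form a surrogate pair. */
uint32_t SurrogatesToCodePoint(uint16_t high, uint16_t low) {
  if ((high & 0xFC00) != 0xD800 || (low & 0xFC00) != 0xDC00) {
    return 0;
  }
  /* Each surrogate carries 10 bits of the 20-bit offset */
  return 0x10000 + (((uint32_t)(high & 0x3FF) << 10) | (low & 0x3FF));
}
```

Applied to the pair 0xD83D 0xDCA9, this returns 0x1F4A9.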

VSOS support for surrogate pairs is partial. They show correctly on file
listings, using the display program More, and they may work partially on
the command line, but there is no protection against cutting a surrogate
pair into its two separate Unicode-16 symbols.
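
Application code that shortens Unicode-16 strings can guard against
this itself. The following is a minimal sketch of such a guard in
standard C. It is not something VSOS provides, and the function name
SafeCutLength is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the largest length not exceeding maxLen at which the string
   str (len 16-bit words) can be cut without separating a high
   surrogate from the low surrogate that follows it. */
size_t SafeCutLength(const uint16_t *str, size_t len, size_t maxLen) {
  if (maxLen >= len) return len;             /* Nothing is cut off */
  if (maxLen > 0 &&
      (str[maxLen - 1] & 0xFC00) == 0xD800 && /* Last kept: high */
      (str[maxLen] & 0xFC00) == 0xDC00) {     /* First dropped: low */
    return maxLen - 1;                        /* Drop the whole pair */
  }
  return maxLen;
}
```

For the string 'A', 0xD83D, 0xDCA9, 'B', a requested length of 2 is
reduced to 1, so that the emoji is removed as a whole instead of being
cut in half.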

10x10 ASCII graphics representation:
:      #   :
:     #  # :
:       #  :
:  #       :
:   #####  :
:  #     # :
:  ####### :
: #       #:
: #########:
:          :



TEST STRINGS FOR VLSI'S INTERNAL USE
------------------------------------

1234567890123456789012345678901234567890123456789012345678901234567890123456馬901234
1234567890123456789012345678901234567890123456789012345678901234567890123456馬馬1234
1234567890123456789012345678901234567890123456789012345678901234567890123456馬馬馬34
12345678901234567890123456789012345678901234567890123456789012345678901234567馬01234
12345678901234567890123456789012345678901234567890123456789012345678901234567馬馬234
12345678901234567890123456789012345678901234567890123456789012345678901234567馬馬馬4
1234567890123456789012345678901234567890123456789012345678901234567890123456馬
1234567890123456789012345678901234567890123456789012345678901234567890123456馬馬
1234567890123456789012345678901234567890123456789012345678901234567890123456馬馬馬
12345678901234567890123456789012345678901234567890123456789012345678901234567馬
12345678901234567890123456789012345678901234567890123456789012345678901234567馬馬
12345678901234567890123456789012345678901234567890123456789012345678901234567馬馬馬
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
