
Introduction to coded character sets - ASCII and Unicode

Doug Kerr

Well-known member
In connection with the work on metadata, there has been discussion of a number of coded characters set issues - ASCII, Unicode, Windows code pages, UTF-8, and so forth.

I thought a tutorial on these matters might be of value.

In this first part we will speak of ASCII and Windows code pages.

ASCII

The designation

A predecessor of the character set we call ASCII (having only upper-case alphabetic characters) was promulgated by the American Standards Association (ASA) in its standards document "American Standard Code for Information Interchange" on April 17, 1963 (two days before the birth of my second daughter). The name of the document also was taken to be the name of the code, and the acronym for that name, ASCII, became the short designation for the code.

Before the "complete" code was finished, the name of the cognizant organization was changed to the U.S.A. Standards Institute (USASI), and under the corresponding new doctrine for entitling standards it was known that the new standard would be entitled "U.S.A. Standard Code for Information Interchange". In anticipation of this, some groups began referring to the standard (perhaps implying its nascent "complete" form) as "USASCII". (There were some political ulterior motives behind this, but this is beyond the scope olf this note.)

Concerned that this situation, a precursor of other name changes to follow, would result in an ongoing confusing inconsistency in the short designation for the code, one of the editors of the 1967 standard (moi) inserted a clause that prescribed "ASCII" to be the "permanent" short designation for the code, independent of any changes in the title of the specifying standard.

The ASCII coded character set

The "completed" (1967 version and later) ASCII character set comprises 128 characters, coded as 7-bit values. These comprise 95 "graphic" characters (one of which, Space, is "invisible") and 33"control characters" (which have several different natures).

The standard does not prescribe how these 7-bit numbers are to be treated in various platform contexts (7-bit systems, 8-bit systems, 16-bit systems, etc), how they are to be electrically represented or transmitted, and so forth. These were to be the subjects of collateral standards.

This standard has received only trivial changes since its introduction in 1967.

ASCII in an 8-bit context

When character data in ASCII is stored in an 8-bit context, the most common convention is that the 7 bits of the ASCII representation are placed in the 7 lowest-order bits of the 8-bit "cell", and the highest-order bit of the 8-bit cell is "0".
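This convention is easy to verify. A small sketch (Python is my choice of illustration language here, not something from the original discussion):

```python
# In an 8-bit byte, an ASCII character occupies the 7 low-order bits;
# the high-order bit of the byte is 0.
for ch in ("A", "z", " "):
    b = ord(ch)             # the 7-bit ASCII code value
    assert b < 128          # every ASCII value fits in 7 bits
    assert (b & 0x80) == 0  # the high-order bit of the 8-bit cell is 0
    print(f"{ch!r}: 0x{b:02X} = {b:08b}")
```

Running this shows, for example, that "A" is 0x41, or 01000001 as an 8-bit cell, with the top bit 0.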

ASCII supersets

When the IBM PC was introduced (with the DOS operating system, "MS-DOS" outside IBM), the decision was made to have its native character set be a superset of ASCII, 256 characters in all, represented by 8-bit values (consistent with the 8-bit basic data structure of the system). The first 128 characters were identically those of ASCII, and the remaining 128 positions were assigned to additional graphic characters.

When the Windows environment (later recognized as an operating system) was introduced, a superset of ASCII was similarly used as its native character code (or rather, a catalog of supersets of ASCII, catering to the character/language needs of various markets worldwide). These 256-character coded character sets were called Windows "code pages".

The basic one for the US market was identified as Windows code page 1252. Microsoft visualized that the various Windows code pages would become industry standards, perhaps under the American National Standards Institute (ANSI) (by then the name of the body that had earlier been ASA and USASI). But that was not to be.

What did happen is that variants of Windows code page 1252 and other important code pages were standardized by the International Organization for Standardization (ISO) as their standard series 8859. The one based on Windows code page 1252 was ISO 8859-1.

The principal difference between ISO 8859-1 and Windows Code page 1252 is that in ISO 8859-1 the first 32 positions of the "second half" of the code table are assigned to exactly the same characters as the first 32 positions of the "first half" (control characters). In Windows code page 1252, those positions are assigned to further graphic characters. The philosophy behind this difference is beyond the scope of this note.
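The byte 0x93 illustrates the difference concretely. A sketch in Python (my choice of illustration language; the codec names "latin-1" and "cp1252" are Python's designations for ISO 8859-1 and Windows code page 1252):

```python
# In ISO 8859-1, positions 0x80-0x9F are control characters;
# in Windows code page 1252, they are additional graphic characters.
raw = b"\x93"
print(repr(raw.decode("latin-1")))  # U+0093, a C1 control character
print(repr(raw.decode("cp1252")))   # U+201C, the left double quotation mark
```

The same byte thus decodes to an invisible control character under one interpretation and to a typographic quotation mark under the other.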

The IANA (ICANN) coded character set designation for Windows code page 1252 (for use in Web documents to identify the specific coded character set in use) is "windows-1252" (note lower case "w").

It is fairly common to speak of information recorded in an ASCII superset (notably Windows code page 1252) as being "ASCII data", but this is incorrect.

However, if Windows code page 1252 or the ISO 8859-1 character set is nominally used, but the data comprises only characters from ASCII (from the "lower halves" of those 8-bit character sets), then the representation is the same as that described earlier for the storage of ASCII characters in an 8-bit context (and we cannot tell by inspection which of the three character sets is considered to be in use).
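Checking whether data falls into this indistinguishable "pure ASCII" case is a matter of examining the high-order bit of each byte. A sketch (Python, used here only for illustration; the function name is mine):

```python
def is_pure_ascii(data: bytes) -> bool:
    # If every byte has its high-order bit clear, the data is valid
    # ASCII -- and equally valid ISO 8859-1 or Windows code page 1252;
    # the three cannot be distinguished by inspection.
    return all(b < 0x80 for b in data)

print(is_pure_ascii(b"Hello, world!"))  # True
print(is_pure_ascii(b"na\xefve"))       # False: 0xEF uses the high bit
```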
 

Doug Kerr

Well-known member
Part 2 - Unicode

***********************

UNICODE

Introduction

Unicode is a coded character set (and coded character set structure) that represents an enormous repertoire of graphic characters (and a modest repertoire of control characters). It can be thought of as a superset of ASCII (or the corresponding international 7-bit coded character set, ISO 646). (In fact, the ISO standard for Unicode is ISO 10646 - isn't that cute?) But its characterization as a "superset of ASCII" hardly does justice to the enormous richness of Unicode.

The entries of the "Unicode code chart" are called code points, each one having a code point value [the latter being my term]. These values are just numbers, and do not imply any representation in bits, words, or bytes, or in electrical form in storage or in transmission.

Many of the code points are assigned to represent characters; for graphic characters, a glyph (graphic symbol) is implied. There are large areas of the code space, however, that are either reserved for future assignment or are reserved for "private" use.

The range of the code space is from 0x0 to 0x10FFFF (in decimal, 0 to 1114111). Thus, to represent the entire code space (abstractly), we need to use a 21-bit number.

The code space is divided into 17 code planes, each potentially having 65536 code points. "Code plane 0" (which embraces code point values from 0x0 through 0xFFFF) is called the "Basic Multilingual Plane". It covers the characters needed to write a large range of languages. Most of our concern with Unicode will be only within the Basic Multilingual Plane.
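Since each plane spans exactly 0x10000 code points, the plane number is simply the bits of the code point value above the low-order 16. A sketch (Python, my choice of illustration language; the function name is mine):

```python
def plane(code_point: int) -> int:
    # Each plane spans 0x10000 code points, so the plane number
    # is just the part of the value above the low 16 bits.
    return code_point >> 16

print(plane(0x0041))    # 0  (Latin "A": Basic Multilingual Plane)
print(plane(0x1F600))   # 1  (a character in plane 1)
print(plane(0x10FFFF))  # 16 (the last code point, in the last plane)
```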

Unicode code points are written in this form (to give an example):

U+01F7

where the number after the "+" is the code point value in hexadecimal form. For code point values not over 0xFFFF (that is, within the Basic Multilingual Plane), this number is always built out to four hexadecimal characters, for example:

U+000D
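This notation is easy to generate mechanically. A sketch (Python, used only for illustration; the function name is mine):

```python
def u_notation(ch: str) -> str:
    # Pad to at least four hexadecimal digits, as the conventional
    # notation requires for the Basic Multilingual Plane; larger
    # code point values naturally use more digits.
    return f"U+{ord(ch):04X}"

print(u_notation("\r"))          # U+000D (carriage return)
print(u_notation("A"))           # U+0041
print(u_notation("\U0001F600"))  # U+1F600 (beyond the BMP: five digits)
```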

The first 128 code points correspond to the characters of ASCII (or, more precisely, ISO 646). The next 128 code points correspond to the remaining characters of ISO 8859-1 (which, for the most part, corresponds to Windows code page 1252, the "ASCII superset" coded character set used in typical Western Windows systems). This includes a modest set of accented characters for use in various "Western" non-English languages.

Encoding

So far, code point values are just numbers (essentially, 21-bit numbers). When we actually want to store Unicode characters in computer memory or a data file, we need to employ some form of encoding suitable to the context.

The two encodings of most interest to us are forms of the Unicode Transformation Format, called UTF-16 and UTF-8.

UTF-16

The UTF-16 encoding is intended for the representation of Unicode characters in an environment or context emphasizing 16-bit words. (I use "word" here in its basic sense, not of itself implying any particular length in bits.)

For characters in the Basic Multilingual Plane, with code point values in the range 0x0-0xFFFF, the UTF-16 encoding is just the code point value itself, as a 16-bit word.

However, at the next more physical level, we come to grips with two conventions found in information systems with a 16-bit semantic architecture but an 8-bit physical structure. These are the big-endian and little-endian conventions.

In the big-endian convention, the highest-order 8 bits of a 16-bit word are carried in one byte, and the lowest-order 8 bits of the word in the next byte.

In the little-endian convention, the lowest-order 8 bits of a 16-bit word are carried in one byte, and the highest-order 8 bits of the word in the next byte.

Thus, at the byte level, the byte sequence for the UTF-16 encoding of a Unicode character will vary.
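The two byte orders can be seen directly. A sketch (Python is my choice of illustration language; "utf-16-be" and "utf-16-le" are Python's names for the two conventions):

```python
# The BMP character U+20AC (the euro sign) encoded in UTF-16
# under each byte-order convention.
euro = "\u20ac"
print(euro.encode("utf-16-be").hex())  # 20ac: high-order byte first
print(euro.encode("utf-16-le").hex())  # ac20: low-order byte first
```

The 16-bit value is the same in both cases; only the order of the two bytes differs.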

For characters beyond those in the Basic Multilingual Plane, the UTF-16 encoding involves two 16-bit words. The arrangement is made possible by the fact that certain code point values in the range of the Basic Multilingual Plane are not assigned to characters, but rather to a special role in connection with this "32-bit" encoding. The details are beyond the scope of this article. We will rarely, if ever, encounter this aspect of UTF-16 encoding.

UTF-8

The UTF-8 encoding is intended for the representation of Unicode characters in an environment or context emphasizing 8-bit words (octets or bytes). It uses a clever variable-length scheme.

Code points with values through 0x7F (the ASCII character set) are coded in single-byte form. These bytes can be recognized as single-byte representations since their highest-order bit is always "0":

0xxxxxxx

Code points with values from 0x80 through 0x7FF are coded in two-byte form. This includes all the characters needed to write the scripts of most of the non-Asian languages. The first byte of such a sequence can be recognized since its three highest-order bits will always have the value 110; for the second bytes of such sequences, the two highest-order bits will always have the value 10:

110xxxxx 10xxxxxx

Code points in the remainder of the Basic Multilingual Plane require three bytes. The first byte of such a sequence can be recognized since its four highest-order bits will always have the value 1110; for the second and third bytes of such sequences, the two highest-order bits will always have the value 10:

1110xxxx 10xxxxxx 10xxxxxx

Code points beyond the Basic Multilingual Plane are coded in four bytes (following the same scheme):

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For ASCII characters, the UTF-8 encoding is identical to a straight ASCII representation in a byte context (or, in fact, the Windows code page 1252 representation). That is not true, however, for characters in the "upper half" of Windows code pages, which natively are in one byte (the "ASCII superset" concept) but which, in UTF-8, require two bytes.
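A quick check of these properties (Python, used here only for illustration; "cp1252" is Python's name for Windows code page 1252):

```python
# An ASCII character is identical in code page 1252 and UTF-8 ...
print("A".encode("cp1252") == "A".encode("utf-8"))  # True

# ... but an "upper half" character is one byte in code page 1252
# and two bytes in UTF-8 (and a BMP character beyond 0x7FF is three).
print("é".encode("cp1252").hex())  # e9
print("é".encode("utf-8").hex())   # c3a9
print("€".encode("utf-8").hex())   # e282ac
```

Note that 0xC3 0xA9 follows the two-byte pattern above: 110xxxxx 10xxxxxx.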

There is no issue comparable to big-endian- vs. little-endian-ness in connection with the use of UTF-8 encoding.

A PERSONAL COMMENT

I was involved, in a modest way, in the development, political defense, documentation, implementation, application, and propagation of ASCII. It is very gratifying for me that, almost 50 years after the initial roll-out of ASCII, it (and its enlarged cousins, the Windows code pages) have continued to serve as the basic character-level language of information technology, a sine qua non of the Information Age.

But, just as a father takes pride in the accomplishments of his offspring, I am just thrilled to see how, building upon the structure and intellectual concepts of ASCII, and working in the far richer and more sophisticated context of today's information technology, with intellectual potency far beyond mine (then or now), the developers of Unicode have taken us not just to the stars, but beyond.

Bravi!

Best regards,

Doug
 

Doug Kerr

Well-known member
Hi, Winston,

I don't suppose that you were a big fan of EBCDIC.
Indeed.

As RCA's announcement of its flagship mainframe line, the Spectra 70, loomed (1965), there was great interest in what its "native" character set would be.

When RCA announced that it would be EBCDIC, I reported this in a memo to my standards colleagues entitled, "What's good for General Sarnoff is good for General Bullmoose."

Best regards,

Doug
 