Doug Kerr
In connection with the work on metadata, there has been discussion of a number of coded character set issues - ASCII, Unicode, Windows code pages, UTF-8, and so forth.
I thought a tutorial on these matters might be of value.
In this first part we will speak of ASCII and Windows code pages.
ASCII
The designation
A predecessor of the character set we call ASCII (having only upper-case alphabetic characters) was promulgated by the American Standards Association (ASA) in its standards document "American Standard Code for Information Interchange" on April 17, 1963 (two days before the birth of my second daughter). The name of the document also was taken to be the name of the code, and the acronym for that name, ASCII, became the short designation for the code.
Before the "complete" code was finished, the name of the cognizant organization was changed to the U.S.A. Standards Institute (USASI), and under the corresponding new doctrine for entitling standards it was known that the new standard would be entitled "U.S.A. Standard Code for Information Interchange". In anticipation of this, some groups began referring to the standard (perhaps implying its nascent "complete" form) as "USASCII". (There were some political ulterior motives behind this, but this is beyond the scope of this note.)
Concerned that this situation, a precursor of other name changes to follow, would result in an ongoing confusing inconsistency in the short designation for the code, one of the editors of the 1967 standard (moi) inserted a clause that prescribed "ASCII" to be the "permanent" short designation for the code, independent of any changes in the title of the specifying standard.
The ASCII coded character set
The "completed" (1967 version and later) ASCII character set comprises 128 characters, coded as 7-bit values. These comprise 95 "graphic" characters (one of which, Space, is "invisible") and 33 "control characters" (which have several different natures).
The standard does not prescribe how these 7-bit numbers are to be treated in various platform contexts (7-bit systems, 8-bit systems, 16-bit systems, etc.), how they are to be electrically represented or transmitted, and so forth. These were to be the subjects of collateral standards.
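As a quick check of that arithmetic, here is a minimal Python sketch. It assumes the usual layout of the code table (graphic characters at 0x20 through 0x7E, control characters at 0x00 through 0x1F plus DEL at 0x7F); the variable names are mine, for illustration only.

```python
# Classify the 128 ASCII code points into graphic and control characters.
# Assumed layout: graphic = 0x20 (Space) through 0x7E;
# control = 0x00 through 0x1F, plus 0x7F (DEL).
graphic = [cp for cp in range(128) if 0x20 <= cp <= 0x7E]
control = [cp for cp in range(128) if cp < 0x20 or cp == 0x7F]

print(len(graphic), len(control))  # 95 33
```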
This standard has received only trivial changes since its introduction in 1967.
ASCII in an 8-bit context
When character data in ASCII is stored in an 8-bit context, the most common convention is that the 7 bits of the ASCII representation are placed in the 7 lowest-order bits of the 8-bit "cell", and the highest-order bit of the 8-bit cell is "0".
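To illustrate that convention, the following small Python sketch encodes a short string as ASCII and shows each 8-bit cell in binary. The sample string and variable names are only assumptions for the demonstration.

```python
# Store ASCII text in 8-bit cells and inspect the bit pattern of each cell.
text = "ASCII"
cells = text.encode("ascii")  # one 8-bit cell (byte) per character

for ch, cell in zip(text, cells):
    # The leading (highest-order) bit of each 8-bit pattern is 0.
    print(ch, format(cell, "08b"))

# Extract the highest-order bit of every cell.
high_bits = [cell >> 7 for cell in cells]
```

For "A" (code value 65) the cell reads 01000001: the 7-bit value 1000001 sits in the low-order bits, with a leading 0.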
ASCII supersets
When the IBM PC was introduced (with the DOS operating system, "MS-DOS" beyond IBM), the decision was made to have its native character set be a superset of ASCII, 256 characters in all, to be represented by 8-bit values (consistent with the 8-bit basic data structure of the system). The first 128 characters were identically those of ASCII, and the remaining 128 positions were assigned to additional graphic characters.
When the Windows environment (later recognized as an operating system) was introduced, similarly a superset of ASCII was used as its native character code (or rather, a catalog of supersets of ASCII, catering to the character/language needs of various markets worldwide). These 256-character coded character sets were called Windows "code pages".
The basic one for the US market was identified as Windows code page 1252. Microsoft visualized that the various Windows code pages would become industry standards, perhaps under the American National Standards Institute (ANSI) (by then the name of the body that had earlier been ASA and USASI). But that was not to be.
What did happen is that variants of Windows code page 1252 and other important code pages were standardized by the International Organization for Standardization (ISO) as their standard series 8859. The one based on Windows code page 1252 was ISO 8859-1.
The principal difference between ISO 8859-1 and Windows code page 1252 is that in ISO 8859-1 the first 32 positions of the "second half" of the code table (0x80 through 0x9F) are, like the first 32 positions of the "first half", set aside for control characters (a second set, distinct from the ASCII control characters). In Windows code page 1252, most of those positions are assigned to further graphic characters. The philosophy behind this difference is beyond the scope of this note.
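The difference can be seen by decoding the same bytes both ways. This is only a sketch, relying on Python's built-in "latin-1" and "cp1252" codecs as stand-ins for ISO 8859-1 and Windows code page 1252; the four sample byte values are my own choice.

```python
import unicodedata

# Four bytes from the 0x80-0x9F range, where the two character sets differ.
raw = bytes([0x80, 0x93, 0x94, 0x99])

as_latin1 = raw.decode("latin-1")  # Python's ISO 8859-1 codec
as_cp1252 = raw.decode("cp1252")   # Python's Windows code page 1252 codec

for byte, a, b in zip(raw, as_latin1, as_cp1252):
    # Unicode general category "Cc" marks a control character.
    print(hex(byte), unicodedata.category(a), unicodedata.category(b), b)
```

Under "latin-1" all four bytes come out as control characters; under "cp1252" they are the euro sign, the two curly double quotation marks, and the trade mark sign.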
The IANA (ICANN) coded character set designation for Windows code page 1252 (for use in Web documents to identify the specific coded character set in use) is "windows-1252" (note lower case "w").
It is fairly common to speak of information recorded in an ASCII superset (notably Windows code page 1252) as being "ASCII data", but this is incorrect.
However, if Windows code page 1252 or the ISO 8859-1 character set is nominally used, but the data comprises only characters from ASCII (from the "lower halves" of those 8-bit character sets), then the representation is the same as that described earlier for the storage of ASCII characters in an 8-bit context (and we cannot tell by inspection which of the three character sets is considered to be in use).
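A short Python sketch of that point, again using the built-in "ascii", "latin-1", and "cp1252" codecs as stand-ins for the three character sets (the sample text is arbitrary):

```python
# Encode the same ASCII-only text under all three character sets.
text = "Plain ASCII text, 1967."
encoded = {name: text.encode(name) for name in ("ascii", "latin-1", "cp1252")}

# If the set of distinct byte strings has one member, the three are identical.
identical = len(set(encoded.values())) == 1
print(identical)  # True
```

Since the stored bytes are the same in all three cases, nothing in the data itself reveals which character set was intended.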