File formats: Non-Unicode

What counts as a character is determined by an encoding: a system that maps characters to sequences of bits.

The most ubiquitous character encoding is ASCII. It encodes a basic set of 128 characters: uppercase and lowercase letters, digits, punctuation, arithmetical symbols, the dollar sign, space, and control characters such as tab, newline (line feed), and carriage return.
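As a minimal sketch of this mapping, the Python snippet below shows the 7-bit pattern that ASCII assigns to each character. The helper name `ascii_bits` is hypothetical and chosen only for illustration.

```python
def ascii_bits(text):
    """Return the 7-bit ASCII pattern for each character in text.

    Raises UnicodeEncodeError if text contains a non-ASCII character.
    (Hypothetical helper, for illustration only.)
    """
    return [format(byte, "07b") for byte in text.encode("ascii")]


print(ascii_bits("Hi"))  # ['1001000', '1101001']

# A character outside the 128-character set cannot be encoded as ASCII.
try:
    ascii_bits("é")
except UnicodeEncodeError:
    print("not an ASCII character")
```

Every ASCII character fits in 7 bits, which is why the code points run from 0 to 127.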

Later came extensions for letters with accents and for other scripts such as Cyrillic and Greek. One of the earliest was IBM’s CP437. These extension sets were defined by code pages, each of which defined a limited supply of non-ASCII characters. Windows had its own family of code pages: Windows-125x, for example Windows-1252 for Western European languages.
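To illustrate why the code page matters, the sketch below decodes the same single byte under three code pages using Python's built-in codecs. The function name `decode_with_code_pages` and the choice of code pages are assumptions made for this example.

```python
def decode_with_code_pages(raw, code_pages=("cp437", "cp1252", "cp1251")):
    """Decode the same bytes under several code pages.

    Returns a dict mapping each code page name to the decoded text.
    (Hypothetical helper, for illustration only.)
    """
    return {name: raw.decode(name) for name in code_pages}


# The byte 0xE9 lies outside ASCII, so its meaning depends on the code page.
for name, text in decode_with_code_pages(b"\xe9").items():
    print(name, text)
# cp437 Θ
# cp1252 é
# cp1251 й
```

The byte itself is identical in all three cases; only the interpretation differs.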

All this was common before Unicode. Text files from this era pose a difficulty: nothing in the file itself declares which code page is being used. Determining the right code page is a matter of trial and error, and sometimes it is impossible. The problem carries over to older text-based formats such as CSV and SQL. While the structure of SQL and CSV files is usually well-defined, the use of undeclared code pages remains a liability.

Non-Unicode text is a non-preferred format for the Plain text file type.
