Text Encoding Refers to any means of preparing, storing, accessing, and exchanging digital text in software and hardware systems, and of sharing that text reliably between devices. Today, the international standard for text encoding is the Unicode Standard, which all major operating system and device manufacturers (Apple, Google, Microsoft), modern web browsers, and third-party applications have committed to following. For text encoding to work correctly, all characters must be part of the common standard (Unicode), and all devices, applications (such as Microsoft Word), and language tools (keyboards and fonts) must follow that standard.

Unicode The Unicode Consortium is a non-profit organization that maintains several projects related to digital text encoding, including the Unicode Standard, which is the international standard for storing and interchanging text on digital systems and devices. The Consortium also maintains other projects, including the CLDR project (please see "CLDR" for more information).

CLDR stands for "Common Locale Data Repository", and is a project of the Unicode Consortium. The CLDR is a repository of language data that systems use to provide language-specific environments on operating system platforms (Android, iOS (iPhone), macOS, Windows, etc.), in third-party applications (Microsoft Excel, Word), and in web browsers. This data allows software and hardware manufacturers to adapt their products to the conventions of different languages and regions for common tasks and navigation, such as menu labels, file names, and date and time formats.

ASCII Is the abbreviated form of American Standard Code for Information Interchange, which is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of the technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope. Modern computer systems have evolved to use Unicode, which has room for more than a million code points. For backward compatibility, the first 128 code points in the Unicode Standard are the same as the 128 ASCII characters.
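The relationship between ASCII and Unicode described above can be checked with a short Python sketch. The sample string and variable names are illustrative assumptions, not part of any standard; the sketch only relies on Python's built-in string handling.

```python
# Sample text made only of ASCII characters (an arbitrary example).
text = "Hello, world!"

# ASCII text produces identical bytes when encoded as ASCII or as UTF-8,
# because the first 128 Unicode code points match the ASCII set.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# Count the printable characters among the 128 ASCII code points.
# Python treats the space (code point 32) as printable.
printable_count = sum(1 for cp in range(128) if chr(cp).isprintable())

# The Unicode code space runs from U+0000 to U+10FFFF.
unicode_code_points = 0x10FFFF + 1

print(same_bytes)           # True
print(printable_count)      # 95
print(unicode_code_points)  # 1114112
```

Text that contains characters outside the ASCII range, such as "λ", cannot be encoded as ASCII at all, which is one reason modern systems store text in a Unicode encoding instead.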

Unicode Code Charts The Unicode code charts present a visual definition of every character encoded (that is, published and available) in the Unicode Standard. For each character, they provide a representative glyph, its unique code point, and an entry in a character names list. An example of a Unicode code chart is shown here.
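The code point and character name that a code chart lists can also be looked up programmatically. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe is hypothetical, and the names returned depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def describe(ch):
    """Return the code point (as U+XXXX) and the Unicode name of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

# Look up a Latin letter and a Greek letter, as a code chart would list them.
for ch in ["A", "\u03bb"]:
    code_point, name = describe(ch)
    print(code_point, name)
```

Note that the official name of U+03BB is spelled "LAMDA" in the Unicode Standard, so that spelling in character names is not a typo.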

Text Shaping Is the process of taking Unicode character sequences (text) input by a keyboard and, in conjunction with a font, representing that text in the way that it must be composed for a specific orthography. In the context of Indigenous languages in North America, many Latin script-based writing systems require the shaping of diacritical marks that must "stack" above a given base letter to modify its sound. This is made possible through mark-to-mark attachment, an OpenType font feature, and only works if the font used to display the text supports it. For more detailed information on shaping, please see this resource on HarfBuzz, a text shaping engine used by many modern web browsers, including Chrome and Firefox.

combining mark attachment
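To see what a shaping engine receives as input, the sketch below separates a short string into its base letter and its combining marks using Python's standard unicodedata module. The example sequence (a Latin "a" with a macron and an acute accent) and the helper name split_cluster are assumptions chosen for illustration; the code inspects the encoded text only and does not perform shaping itself.

```python
import unicodedata

def split_cluster(text):
    """Split a short string into base characters and combining marks."""
    bases = [ch for ch in text if unicodedata.combining(ch) == 0]
    marks = [ch for ch in text if unicodedata.combining(ch) != 0]
    return bases, marks

# One visible letter built from three code points:
# "a" + COMBINING MACRON + COMBINING ACUTE ACCENT.
stacked = "a\u0304\u0301"

bases, marks = split_cluster(stacked)
print(len(stacked))                          # 3
print(bases)                                 # ['a']
print([unicodedata.name(m) for m in marks])  # ['COMBINING MACRON', 'COMBINING ACUTE ACCENT']
```

The text stores the two marks one after the other; positioning the acute accent above the macron, rather than on top of it, is the job of the font and the shaping engine.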

Text Clipping Text clipping occurs when diacritical marks (or any other letterform elements) extend beyond a pre-defined boundary, specified by an application's developers, within which all text elements in a given line of text must fit:

clipping explanation

ISO The International Organization for Standardization provides technical standardization specifications for many products, including language encoding. ISO is a non-governmental organization that is international in scope, with around 170 member countries. Canada, for example, is a member of ISO, and is represented at ISO through the Standards Council of Canada (SCC).

ISO/IEC 10646, known as the "Universal Coded Character Set" (UCS), is a specification managed by ISO that is specifically concerned with character encoding standards. It is intentionally kept in sync with the Unicode Standard, and provides a way for national bodies to have a direct input path for character encoding requirements and concerns. ISO/IEC 10646 is concerned only with the character repertoires needed to support the orthography requirements of languages in its members' regions, and with the identity of those characters.

Unicode Casing In the Unicode Standard, each character is given a unique code point that distinguishes it from every other character. In scripts such as the Latin script, many languages distinguish between uppercase and lowercase letters, for example to mark proper nouns. In order for applications to convert lowercase letters to capital letters and vice versa, these letter "pairs" must be encoded as members of the same script.

For example, the recent proposal to add new capital Latin script characters for Haíɫzaqvḷa required the addition of a new lowercase Latin script "lambda", so that the lowercase λ character in Haíɫzaqvḷa can convert to the new capital letter (U+A7DA LATIN CAPITAL LETTER LAMBDA). Previously, Haíɫzaqvḷa orthography had made use of the lowercase Greek lambda character (U+03BB λ GREEK SMALL LETTER LAMDA), which can only convert to the Greek capital lambda (U+039B Λ GREEK CAPITAL LETTER LAMDA), which is not the correct letter pairing required for Haíɫzaqvḷa:

latin vs greek casing
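The casing behaviour described above can be observed with Python's built-in str.upper() and str.lower(), as in the sketch below. The variable names are illustrative. Whether the Latin capital lambda converts to a lowercase partner depends on the Unicode version bundled with the Python interpreter, so the sketch reports that result rather than assuming it.

```python
import unicodedata

# The Greek small lambda converts to the Greek capital lambda,
# because the two are encoded as a case pair in the Greek script.
greek_small = "\u03bb"
greek_upper = greek_small.upper()
print(f"U+{ord(greek_upper):04X}", unicodedata.name(greek_upper))

# The Latin capital lambda (U+A7DA) only has a lowercase mapping if this
# interpreter's Unicode data already includes the newer Latin pair.
latin_capital = "\uA7DA"
latin_lower = latin_capital.lower()
has_latin_pair = latin_lower != latin_capital

print("Unicode data version:", unicodedata.unidata_version)
print("Latin lambda case pair available:", has_latin_pair)
```

On an interpreter with older Unicode data, lower() returns U+A7DA unchanged, which shows why applications need up-to-date Unicode data in order to case these letters correctly.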

Unicode Confusable Characters

The Unicode Standard contains distinct characters whose graphic representations are so similar in common fonts that human readers can confuse them:

ktunaxa encoding variation

The above example shows confusable characters that can be used to represent one's language in digital text. The characters marked in orange in lines (1) and (2) above, respectively, are distinct character codes with similar graphic representations. Computer devices read them as distinct characters, but human readers may be confused by their visual form.

Depending on the script in question, some characters in Unicode may be even more confusable. The Syllabics script (UCAS), for example, features some characters that are almost, if not entirely, visually identical:

ucas confusables

Text Spoofing and Security Risks Text spoofing is the act of intentionally using visually similar characters in digital text for malicious purposes. This means that a bad actor could create visually confusing web domain addresses, email subject lines and text, or content in other applications in order to do harm.

To illustrate, the word “ᑭᐢᑫᔨᐦᑕᒼ” in ᓀᐦᐃᔭᐍᐏᐣ (nêhiyawêwin) (Plains Cree) is encoded using U+14BC ᒼ CANADIAN SYLLABICS WEST-CREE M for the final character ᒼ. However, one could also type this same word with U+1466 ᑦ CANADIAN SYLLABICS T as “ᑭᐢᑫᔨᐦᑕᑦ”. We can observe the difference here, but it would be very hard to tell the two apart visually in everyday situations, which could lead to the creation of "fake" words and labels that pose security risks.
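Although the two spellings look alike to a reader, software can tell them apart by comparing code points. The following is a minimal sketch of such a comparison in Python; the helper names code_points and differing_positions are hypothetical, and the comparison assumes both strings have the same length.

```python
def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

def differing_positions(a, b):
    """Return the indexes at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

# The shared part of the word, followed by the two look-alike finals.
stem = "ᑭᐢᑫᔨᐦᑕ"
word_m = stem + "\u14BC"  # final ᒼ (WEST-CREE M), the intended spelling
word_t = stem + "\u1466"  # final ᑦ (T), the look-alike spelling

print(word_m == word_t)  # False: the strings are different to a computer
for i in differing_positions(word_m, word_t):
    print(i, code_points(word_m)[i], code_points(word_t)[i])
```

A check like this only shows that two strings differ. Deciding whether a difference is an attempted spoof requires additional data about which characters are considered confusable with one another.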