Language data is any type of data related to a language that makes it possible for computer software to represent that language in digital text on devices. Operating system and device manufacturers (Apple, Microsoft, Google) and application developers require certain, but not all, aspects of a given language's data in order to accurately render and represent that language on their devices and in their applications. To make this rendering and representation possible, device and application developers require keyboard and font tools capable of inputting the required characters (keyboards) and rendering their correct visual appearance (fonts). As such, keyboard and font developers also require certain elements of a language's data and typographic knowledge in order to accurately represent the language's entry on the device and its graphic representation.

This section presents the language data principles to which the Typotheque Indigenous North American Type research project adheres, in order to facilitate the self-determination of all Indigenous language communities we collaborate with in this process, and to protect Indigenous language data and each language community's data sovereignty.

Principles

In our work in partnership with Indigenous communities, we adhere to the First Nations principles of OCAP as well as the CARE principles for Indigenous data governance in how Indigenous language data and information are collected, stored, used, and made available to the public for the purpose of enabling language support in digital systems and supporting overall sovereignty. Our project maintains, above all else, that each individual Indigenous community must always retain the right to full ownership of all aspects of its language data, and to self-determination over how that data may be accessed and used and whether it may be made publicly available. We ensure that the Indigenous communities we work in partnership with have full access to all of their language data at all times, during and after the project.

For more information on the First Nations principles of OCAP and the CARE principles for Indigenous data governance, please follow the links above.

Before beginning any language software work, and in order to establish outcomes for a project, it is advisable to first assess the current state of digital language support for your community, and to identify the goals and steps required for a given project. The First Peoples' Cultural Council (FPCC) provides the wonderful resource "Check Before you Tech", which offers information and a list of questions to consult first to help establish goals for a prospective project and partnership.

Purposes

Following our project's guiding principles for ensuring Indigenous language data sovereignty outlined above, this section presents the purposes for supporting one's language on digital devices, along with the language data required to achieve each purpose. Alongside each language data element listed under a purpose, you will find the minimum licensing type that developers and designers would need in order to work with the language data and incorporate it into language tools:

1. Font Support

  • Required Unicode character set for your language. public, reference-only
  • Required rendering of orthography. public, reference-only
  • Example corpus of the language (5,000 words). public, reference-only
  • Character / Kerning pairs. public, reference-only
  • Knowledge of typographic conventions. public, reference-only
  • Preferred typographic forms. public, reference-only
  • ISO and OpenType LangSys language tags. public domain

The above language data and knowledge are required in order to allow all fonts (those on Apple, Google, and Microsoft devices as well as third-party fonts) to support your language accurately and as readers in your community expect. By making this knowledge and data publicly available, device manufacturers can ensure that their core fonts (which are used on desktop computers, tablets, and smartphones) display your language with its required rendering and typography correctly. It also allows other font companies to meet the same standards of rendering and typography for your language community.
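As an illustration of how two of the items above (the required Unicode character set and candidate character pairs) might be derived from a corpus, here is a minimal Python sketch. It is not part of the project's own tooling. The function names, the choice of NFC normalization, and the sample text are all assumptions made for the example, and a community would substitute its own corpus and its own preferred normalization. Counting adjacent pairs only suggests which pairs occur often; it does not produce kerning values.

```python
import unicodedata
from collections import Counter


def character_set(text):
    """Return the sorted list of distinct non-whitespace code points in text."""
    # Assumption: NFC normalization; a community may prefer another form.
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized if not ch.isspace()})


def describe(ch):
    """Format one character as 'U+XXXX UNICODE NAME' for a reference list."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


def adjacent_pairs(text):
    """Count adjacent character pairs inside words (candidate kerning pairs)."""
    normalized = unicodedata.normalize("NFC", text)
    pairs = Counter()
    for word in normalized.split():
        # zip(word, word[1:]) walks every neighbouring pair within a word.
        for left, right in zip(word, word[1:]):
            pairs[left + right] += 1
    return pairs


if __name__ == "__main__":
    # Placeholder sample; replace with the community's own corpus text.
    sample = "ᐃᓄᒃᑎᑐᑦ ᐃᓄᒃᑎᑐᑦ"
    for ch in character_set(sample):
        print(describe(ch))
    for pair, count in adjacent_pairs(sample).most_common(5):
        print(pair, count)
```

The sketch works at the level of individual code points, so a base letter and a combining mark are listed separately. Whether that matches how readers perceive the characters of a given orthography is a question for the community to answer.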


2. Default Keyboard on Devices

  • Required Unicode character set. public, reference-only
  • Required rendering of orthography (shaping). public, reference-only
  • Character occurrence frequencies. public, reference-only
  • Keyboard source file made available on GitHub. open source CC0

In order for major operating system manufacturers (Apple, Google, Microsoft) to add your language's keyboard to their platforms, they require that the keyboard source file be available under an open source license so that they can legally implement it on their devices. An example of this can be seen on the Nattilik community's GitHub.
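The character occurrence frequencies listed above are typically used to decide which characters deserve the most accessible key positions. As a rough sketch of how such frequencies could be computed from a text sample, the Python below counts each non-whitespace character and reports its share of the total. The function name `character_frequencies` and the use of NFC normalization are assumptions for illustration only, not a method prescribed by any platform vendor.

```python
import unicodedata
from collections import Counter


def character_frequencies(text):
    """Return (character, relative frequency) pairs, most frequent first."""
    # Assumption: NFC normalization, so equivalent sequences are counted together.
    normalized = unicodedata.normalize("NFC", text)
    counts = Counter(ch for ch in normalized if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        # Empty or whitespace-only input has no frequencies to report.
        return []
    return [(ch, n / total) for ch, n in counts.most_common()]


if __name__ == "__main__":
    # Placeholder sample; replace with the community's own corpus text.
    sample = "ᐃᓄᒃᑎᑐᑦ ᐃᓄᒃᑎᑐᑦ"
    for ch, share in character_frequencies(sample):
        print(f"U+{ord(ch):04X} {ch} {share:.3f}")
```

A larger and more varied corpus gives more reliable frequencies; a short sample such as the placeholder here is only enough to show the shape of the output.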


3. Operating System Language Environment

  • Contribution to Unicode's CLDR locale data set for label and menu translations. Unicode CLA license
  • Knowledge of typographic conventions and expected behaviours. public, reference-only
  • Preferred typographic forms. public, reference-only
  • ISO and OpenType LangSys language tags. public domain

Unicode's CLDR (Common Locale Data Repository) project is a collection of language data and translations that allows operating system menu labels and date and time formats to be displayed in your language on your computer, tablet, or smartphone, and therefore provides a language environment on your device in your language. In order to contribute to CLDR, your community must register an organization account with CLDR and agree to Unicode's Contributor License Agreement (CLA). "The Unicode CLAs are license agreements that ensure that a contributor retains ownership of any intellectual property rights in their contribution while granting the Unicode Consortium the necessary legal rights to use and redistribute that contribution in the various Consortium products."

4. Map Place Names and Locations

  • List of correct community names, streets, rivers, lakes, etc. in your language. open source CC0
  • The geo locale data for each respective community for accurate location. open source CC0
  • Required Unicode character set. public, reference-only
  • Required rendering of orthography. public, reference-only
  • Knowledge of typographic conventions and expected behaviours. public, reference-only
  • Preferred typographic forms. public, reference-only
  • ISO and OpenType LangSys language tags. public domain

Making map place names and locations available for your community is important for two reasons. First, it ensures that place names are correctly represented in your community's geographic region and traditional lands. Second, it creates a very strong requirement for device manufacturers (Apple, Microsoft, Google) to adopt and implement full support for how your writing system must render and appear graphically in digital text, so that their Maps software displays names correctly and accurately. This in turn also pushes these companies to provide a default keyboard on their systems, so that users can input text in their language when using the Maps application.