Unicode

Roland Unger

  1. What is Unicode
  2. Problems of character encoding
    1. ASCII
    2. ANSI, Windows CP-1252
    3. ISO 8859
  3. Encoding with Unicode
  4. The various Unicode formats
  5. Representation of Unicode characters
  6. Unicode in domain names

What is Unicode

Unicode is an international standard whose aim is to define a code for every graphic character or element of all known writing systems and character sets. It is intended to eliminate the problem of the many incompatible encodings used in different countries, which in practice often made it impossible to use different languages in the same text. The encoding proposals are developed and maintained by the Unicode Consortium. Since version 2.0, Unicode has been kept in step with the international standard ISO/IEC 10646.

Problems of character encoding

In order to process letters in a digital system, they must be encoded: simply put, each character must be assigned a unique number, because computers can only “think” in numbers. The recurring question with any encoding is how large the value range should be: the larger the range, the more characters can be encoded, but the more storage space is needed per character. In the early days of computer technology, for example, encodings were limited to 7 or 8 bits. (The bit, the smallest unit in Boolean algebra, can represent two states; a byte is a group of 8 bits.) A 7-bit encoding can represent at most 2⁷ = 128 characters, an 8-bit encoding at most 2⁸ = 256.
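
To illustrate that characters are simply numbers, here is a minimal Python sketch (the sample characters are arbitrary):

  # Every character is stored as a number (its code point).
  print(ord("A"))   # 65   -> fits into 7 bits (ASCII)
  print(ord("ä"))   # 228  -> needs 8 bits, beyond ASCII
  print(ord("€"))   # 8364 -> does not even fit into 8 bits

  # The width of the encoding determines how many characters are possible.
  print(2**7)       # 128 characters with 7-bit encoding
  print(2**8)       # 256 characters with 8-bit encoding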

ASCII

The most important 7-bit character set is the ASCII character set (ISO 646). With this character set, however, it is not even possible to encode the umlauts and other Western European special characters that are common in German.
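
A short Python sketch of this limitation (the sample words are arbitrary):

  # ASCII covers only the code points 0-127; German umlauts lie above that.
  print("Zug".encode("ascii"))        # b'Zug' - works
  try:
      "Zügel".encode("ascii")
  except UnicodeEncodeError as err:
      print(err)                      # 'ü' (0xfc) is not in range(128)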

ANSI, Windows CP-1252

With the introduction of Windows 3.x, Windows 95 and so on, an attempt was made to support the most important Western languages. For this purpose, the character space was increased to 256 characters (8 bits), which technically requires no great encoding effort. The lower 128 characters are identical to ASCII; the special characters used in Western European languages sit in the upper half. However, this “ANSI” code page is not really compatible with anything either: the characters that Windows places in the range 128–159 are not taken over from ISO 8859-1 (Latin-1). And of course Windows needed further variants, e.g. CP-1255 for Hebrew support. The incompatibility remained: Japanese and German users still could not communicate in their native languages.
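
A brief Python sketch of this incompatibility; the byte value 0xE4 is just an example:

  # The same byte means different characters in different Windows code pages.
  raw = b"\xe4"
  print(raw.decode("cp1252"))   # ä (Western European)
  print(raw.decode("cp1255"))   # ה (Hebrew)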

ISO 8859

To resolve the dilemma of ISO 646, the ISO 8859 family defines additional characters only in the range 160–255, i.e. it leaves out the range 128–159 that the Windows code pages CP-xxxx occupy. Some typographically important characters, such as typographic quotation marks, are therefore missing here. But, like ANSI, ISO 8859 naturally cannot solve the problem of internationality either.
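
A small Python sketch of the difference, using the euro sign as an example of a character that CP-1252 places in the disputed range 128–159:

  # CP-1252 encodes the euro sign at byte 0x80 ...
  print("€".encode("cp1252"))         # b'\x80'
  # ... but ISO 8859-1 reserves 128-159 for control codes and cannot encode it.
  try:
      "€".encode("iso-8859-1")
  except UnicodeEncodeError as err:
      print(err)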

Encoding with Unicode

The only way out is ISO/IEC 10646, Unicode. Across its 17 planes of 65,536 code points each, all conceivable characters can probably be defined. In the extreme case, however, this requires 4 bytes (= 32 bits) of storage space per character.

Plane 0 (the “Basic Multilingual Plane”) covers the languages in current use, including approx. 31,000 characters for Chinese, Japanese and Korean. It also contains, among other things, control characters as well as typographical, mathematical, commercial and technical special characters. In addition, this plane contains an area for approx. 6,000 user-defined characters. The range 0 to 127 of plane 0 is identical to ASCII.

The following planes are used for rarely used or extinct scripts, e.g. cuneiform and ancient Egyptian hieroglyphs.
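
A brief Python sketch of code points inside and outside the Basic Multilingual Plane (the sample characters are arbitrary):

  # Each character is identified by its code point, usually written U+XXXX.
  print(hex(ord("A")))    # 0x41    -> plane 0 (BMP), identical to ASCII
  print(hex(ord("€")))    # 0x20ac  -> still plane 0
  print(hex(ord("𓀀")))    # 0x13000 -> plane 1, an Egyptian hieroglyph
  print(chr(0x13000))     # prints the hieroglyph if a suitable font is installed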

The various Unicode formats

However, since it is not very economical to transmit 4 bytes for every character, especially when ASCII would have been sufficient in most cases, several variants have been defined, which are nevertheless essentially compatible with one another:

  • UTF-16 and UCS-2: UTF-16 uses a two-byte (= 16-bit) encoding, which is sufficient to encode plane 0, the “Basic Multilingual Plane”.
  • UTF-32 and UCS-4: as a 4-byte representation, this allows all Unicode characters to be encoded.
  • In addition to the formats mentioned, UTF-8 is also very common. UTF-8 expresses each character as a sequence of 8-bit units; the number of bytes required varies, and only one byte is needed for the first 128 characters. A document that essentially uses only ASCII characters therefore stays compact and, at the same time, remains readable as if it were ASCII (see the short comparison after this list).
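
A small Python comparison of the three formats (the sample string is arbitrary; "utf-16-le" and "utf-32-le" are used here so that no byte-order mark is added):

  # The same text occupies a different number of bytes in each format.
  text = "Zürich €100"                    # 11 characters
  print(len(text.encode("utf-8")))        # 14 bytes (ü: 2 bytes, €: 3 bytes)
  print(len(text.encode("utf-16-le")))    # 22 bytes (2 bytes per character)
  print(len(text.encode("utf-32-le")))    # 44 bytes (4 bytes per character)

  # Pure ASCII text encoded as UTF-8 is byte-for-byte identical to ASCII.
  print("Hello".encode("utf-8") == "Hello".encode("ascii"))   # True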

Representation of Unicode characters

Even though the majority of all conceivable characters are now defined, two problems remain:

  1. Conventional fonts cannot be used for display, and new, Unicode-compliant fonts are only gradually becoming widespread. Modern operating systems should, however, at least partially display the characters of the first plane. A suitable font is, for example, Arial Unicode MS for Windows, which is included in Microsoft Office 2000 and 2003. Support for Semitic and/or Asian languages may have to be installed separately.
  2. The appearance of many characters can change through ligature formation or through use in different languages; for this, alternative fonts are occasionally necessary, which are selected via the language attribute. The formation of ligatures and the combination of base letters with diacritics are implemented using OpenType fonts (Microsoft Windows).

Unicode in domain names

With the standardization of Unicode, there was also interest in being able to use the enlarged character set in domain names. This is made possible by the “Internationalizing Domain Names in Applications” standard (IDNA, RFC 3490, Punycode). Modern browsers support this procedure.
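
As an illustration, a minimal Python sketch using the standard library's "idna" codec, which implements the RFC 3490 procedure (the domain is a made-up example):

  # An internationalized name is converted to an ASCII-only "Punycode" form.
  print("münchen.example".encode("idna"))          # b'xn--mnchen-3ya.example'
  print(b"xn--mnchen-3ya.example".decode("idna"))  # münchen.example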
