This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
For scripts not yet covered, research on how best to encode them for computer use is still under way, and they will be added eventually.
This is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical marks could be needed.
UCS assigns to each character not only a code number but also an official name.
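As a rough illustration of this pairing of code number and official name, the sketch below uses Python's standard `unicodedata` module. It assumes that the names in the Unicode character database shipped with Python match the official UCS names; the helper name `describe` is hypothetical and introduced only for this example.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code number and official name of one character, e.g. 'U+0041 ...'."""
    # ord() gives the code number; unicodedata.name() gives the character name.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("A"))
print(describe("\u00e9"))

# The mapping also works in the other direction, from name to character.
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")
print(describe(alpha))
```

Note that `unicodedata.name()` raises `ValueError` for characters without a name (for example most control characters) unless a default is passed as a second argument.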
ISO 10646 originally defined a 31-bit character set. The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD), which is called the Basic Multilingual Plane (BMP) or Plane 0.
The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation.
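The split between the 16-bit BMP and everything outside it can be sketched in a few lines. This is a minimal example, assuming code points are handled as plain integers within the original 31-bit range; the function names `plane` and `in_bmp` are hypothetical.

```python
def plane(code_point: int) -> int:
    """Return the index of the 65,536-character block that holds code_point."""
    if not 0 <= code_point < 2**31:
        raise ValueError("code point outside the 31-bit UCS range")
    # The low 16 bits select a position inside the block,
    # everything above them selects the block itself.
    return code_point >> 16

def in_bmp(code_point: int) -> bool:
    """True if code_point lies in the Basic Multilingual Plane (Plane 0)."""
    return plane(code_point) == 0

print(in_bmp(0x0041))    # LATIN CAPITAL LETTER A
print(in_bmp(0x1D11E))   # MUSICAL SYMBOL G CLEF, added outside the BMP
```

A check like `in_bmp` is mainly useful when deciding whether a character fits into a single 16-bit unit.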
Make sure that you are thoroughly familiar with it and that your software supports UTF-8 smoothly. This simply means that no information is lost if you convert any text string to UCS and then back to its original encoding.
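The lossless conversion described above can be checked mechanically. The following is a small sketch, assuming Python's built-in codecs as stand-ins for the older encodings; the function name `round_trips` is hypothetical. It converts legacy-encoded bytes to a Unicode string, passes the string through UTF-8, and converts it back to the original encoding.

```python
def round_trips(data: bytes, legacy_encoding: str) -> bool:
    """Return True if data survives legacy -> UCS -> UTF-8 -> UCS -> legacy unchanged."""
    text = data.decode(legacy_encoding)        # legacy encoding -> UCS string
    utf8 = text.encode("utf-8")                # UCS string serialized as UTF-8
    back = utf8.decode("utf-8").encode(legacy_encoding)
    return back == data

# Every possible byte value of ISO 8859-1.
print(round_trips(bytes(range(256)), "iso-8859-1"))

# A short Cyrillic string stored in KOI8-R.
sample = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")
print(round_trips(sample, "koi8-r"))
```

The check only covers the encodings and inputs it is given; it is not a proof that every older character set converts without loss.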
UCS contains the characters required to represent practically all known languages.
It is an accent or other diacritical mark that is added to the previous character.
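A short sketch can make the relationship between a base character and a following diacritical mark concrete. It uses Python's standard `unicodedata` module and treats any character whose general category starts with "M" as a combining mark, which is an assumption made for this example; the helper name `is_combining_mark` is hypothetical.

```python
import unicodedata

def is_combining_mark(ch: str) -> bool:
    """True if ch belongs to one of the Unicode mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

# Base character 'a' followed by COMBINING ACUTE ACCENT (U+0301).
sequence = "a\u0301"
for ch in sequence:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), is_combining_mark(ch))

# Where a precomposed character exists, NFC normalization merges the pair into it.
composed = unicodedata.normalize("NFC", sequence)
print(len(sequence), len(composed))
```

The mark always follows the character it modifies, so code that splits or truncates strings should avoid separating the two.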