Unicode is a single, large set of characters including all presently used scripts of the world, with remaining historic scripts being added. Because Unicode includes all the characters of all the well-used legacy encodings, mapping from older encodings to Unicode is usually not a problem, although there are some issues where care is necessary, in particular for East Asian character encodings.
In general, character encoding deals with how to denote characters by more basic or primitive elements, such as numbers, bytes (octets), or bits. This includes a number of separable decisions, and a number of abstract layers of representation, which we will look at in greater detail later. For the moment, we will use the term encoding somewhat loosely. The history of character encodings contains many ingenious designs, but also quite a few accidental developments. The search for the best encoding was always, to some extent, in conflict with the need to use a common encoding that met many needs, even if somewhat incompletely.
A brief history of character encoding is provided in Richard Gillam, Unicode Demystified. Because of ASCII's universal acceptance, some Unicode encodings will transform codepoints into a series of ASCII characters so they can be transmitted without issue. Now, when we author the data ourselves, as in the small example below, we know the data is text because we created it.
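As a rough illustration (a Python sketch standing in for the article's lost example; the string itself is arbitrary), here is what a short piece of ASCII text looks like as raw bytes:

```python
# A minimal sketch: view a short ASCII string as the raw bytes a file would hold.
text = "Hello"
data = text.encode("ascii")                 # each ASCII character becomes one byte
print(data)                                 # b'Hello'
print(" ".join(f"{b:02X}" for b in data))   # 48 65 6C 6C 6F
```

Nothing in those five bytes announces "this is text"; they could just as easily be a chunk of an image or an account number.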
But you can never be sure, and sometimes you can guess wrong. A simple approach is to store each codepoint in a fixed 16-bit (two-byte) number. At a base level, this can handle codepoints 0x0000 to 0xFFFF, or 0 to 65,535 for you humans out there. And 65,535 should be enough characters for anybody (there are ways to store codepoints above 0xFFFF, but read the spec for more details). Storing data in multiple bytes leads to my favorite conundrum: byte order! Some computers store the little byte first, others the big byte. To signal the order, a byte order mark (BOM), codepoint U+FEFF, is written at the front of the file. If you instead see FFFE, the data came from another type of machine, and needs to be converted to your architecture.
This involves swapping every pair of bytes in the file. But unfortunately, things are not that simple. The BOM is actually a valid Unicode character — what if someone sent a file without a header, and that character was actually part of the file?
This is an open issue in Unicode. This opens up design observation 2: Multi-byte data will have byte order issues! ASCII never had to worry about byte order — each character was a single byte, and could not be misinterpreted.
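To make the byte-order problem concrete, here is a small Python sketch (my illustration, not part of the original text) encoding the same string as UTF-16 in both byte orders, and once with a BOM:

```python
# Sketch: one string, three UTF-16 byte streams.
text = "Hi"

le  = text.encode("utf-16-le")   # little-endian, no BOM: 48 00 69 00
be  = text.encode("utf-16-be")   # big-endian,    no BOM: 00 48 00 69
bom = text.encode("utf-16")      # native order with a BOM prepended
                                 # (FF FE on a little-endian machine)

for label, data in [("LE ", le), ("BE ", be), ("BOM", bom)]:
    print(label, " ".join(f"{b:02X}" for b in data))
```

A reader that finds FE FF or FF FE at the front knows which order the 16-bit units were written in; reading the first unit as FFFE means the bytes need to be swapped for the local architecture.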
Aside: UCS-2 stores data in a flat 16-bit chunk. UTF-16 allows up to 20 bits split between two 16-bit code units, known as a surrogate pair. Each half of the surrogate pair is an invalid Unicode character by itself, but together a valid codepoint can be extracted.
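The arithmetic behind a surrogate pair fits in a few lines; this is a Python sketch of the standard UTF-16 calculation (the emoji is just an arbitrary codepoint above 0xFFFF):

```python
# Sketch: split a codepoint above U+FFFF into a UTF-16 surrogate pair.
cp = 0x1F600                       # 😀, too big for a single 16-bit unit
v = cp - 0x10000                   # 20 bits remain after subtracting 0x10000
high = 0xD800 + (v >> 10)          # top 10 bits    -> high surrogate
low  = 0xDC00 + (v & 0x3FF)        # bottom 10 bits -> low surrogate
print(hex(high), hex(low))         # 0xd83d 0xde00

# The hand-computed pair matches what a UTF-16 encoder produces:
print(chr(cp).encode("utf-16-be").hex(" "))   # d8 3d de 00
```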
Design observation 3: Consider backwards compatibility. How will an old program read new data? Ignoring new data is good.
Breaking on new data is bad. Enter UTF-8. It is the default encoding for XML. Codepoints 0 to 127, plain ASCII, are stored as single bytes, exactly as before; higher codepoints are stored as a sequence of 2 to 4 bytes, and the continuation bytes all start with the bits 10. A 2-byte example looks like this: the first byte begins with 110, which means there are 2 bytes in the sequence, and the second begins with 10. In any case, UTF-8 text may still carry a header (the byte order mark EF BB BF) to indicate how it was encoded. Feel free to use charmap to copy in some Unicode characters and see how they are stored in UTF-8. Or, you can experiment online.
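The bit patterns are easiest to see by encoding a few characters of different widths; the following Python sketch (my example, not the article's) prints the lead and continuation bytes in binary:

```python
# Sketch: UTF-8 lead bytes carry the sequence length in their high bits.
for ch in ["A", "é", "€", "😀"]:             # 1-, 2-, 3-, and 4-byte sequences
    data = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    print(f"U+{ord(ch):04X} -> {data.hex(' ')}  ({bits})")

# 0xxxxxxx                            = one byte (plain ASCII)
# 110xxxxx 10xxxxxx                   = two bytes
# 1110xxxx 10xxxxxx 10xxxxxx          = three bytes
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = four bytes
```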
Aside: some channels, such as older email systems, only pass 7-bit data safely. So how do we send Unicode data through them? UTF-7 works like this: plain ASCII characters are stored as themselves, while any other character is encoded as a Base64 run of its UTF-16 form, opened with a '+' and closed with a '-'. If a '-' immediately follows the '+', that item is interpreted literally as a plus sign. UTF-7 is pretty clever, eh?
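Python's codec machinery can show the effect directly; this is an illustrative snippet of mine, not something from the original article:

```python
# Sketch: UTF-7 keeps ASCII as-is and wraps everything else in a +...- Base64 run.
print("caf\u00e9".encode("utf-7"))   # b'caf+AOk-'  (é becomes the run +AOk-)
print("1 + 1".encode("utf-7"))       # b'1 +- 1'    (a literal '+' is written as '+-')
```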
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different systems, called character encodings, for assigning these numbers. These early character encodings were limited and could not contain enough characters to cover all the world's languages. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
Early character encodings also conflicted with one another. That is, two encodings could use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) would need to support many different encodings. However, when data is passed between different computers or between different encodings, that data runs the risk of corruption.
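A single byte is enough to see the conflict; in this Python sketch (my example), the same byte value decodes to three different characters under three legacy encodings:

```python
# Sketch: one byte, three meanings, depending on which legacy encoding you assume.
raw = b"\xe9"
print(raw.decode("latin-1"))   # é  (Western European, ISO 8859-1)
print(raw.decode("cp1251"))    # й  (Cyrillic, Windows-1251)
print(raw.decode("cp437"))     # Θ  (the original IBM PC code page)
```

Guess the wrong encoding and the text silently turns into the wrong characters.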
Unicode has changed all that! The Unicode Standard provides a unique number for every character, no matter what platform, device, application or language.