Question:
How does Unicode get parsed within a file?
M'enfin!
2010-03-16 09:19:46 UTC
I have been unable to find a simple explanation as to how Unicode characters get recognized and parsed within a file. For example, the heart character is listed as Unicode \u2665 (hexadecimal), and will appear as such when inserted in an HTML file, such as this page:

♥ = 0x2665

What I don't understand is that since a file (and an HTML page is just a file) is a stream of octets, why doesn't it show up as the two characters ampersand and 'e', like this?

& = 0x26
e = 0x65

I tried doing a hex dump of a file with Unicode characters in it, but I did not detect an obvious escape sequence that introduces the Unicode. And if there *were* an escape sequence, how is the escape sequence escaped itself?

I think this question requires an expert in the arcane realm of Unicode... or a really good programmer :-)

Thank you!
Four answers:
Jallan
2010-03-16 20:10:26 UTC
Unicode, at the highest level, doesn’t particularly recognize octets. There is obviously not room in an octet for more than 256 distinct values, while Unicode contains 1,114,112 code points, running from U+0000 up to U+10FFFF in hexadecimal. Assuming a straightforward fixed-width encoding on a computer system that uses octets, each character would take three octets, not one. More precisely, one could encode any Unicode code point in straightforward fashion in 21 bits.
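If you want to check those numbers, a couple of lines of Python will do (this is just arithmetic, nothing Unicode-specific, and any Python 3 interpreter works):

    last = 0x10FFFF               # highest Unicode code point
    print(last + 1)               # 1114112 code points in total
    print(last.bit_length())      # 21 bits needed for a fixed-width encoding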



See http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf , page 24, for the official Unicode explanation of how Unicode is encoded, in three different official encoding forms.



UTF-32 is the longest form, using 4 octets per character rather than the minimal 3, because current computers work more efficiently with 4-octet units. UTF-32 is very seldom used in files because of its length. It mostly tends to be used inside programs as part of conversion routines.
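For instance, a one-line Python sketch (assuming Python 3) shows the 4-octet form of the heart character directly:

    # U+2665 in big-endian UTF-32: four octets per character
    print('\u2665'.encode('utf-32-be').hex())   # 00002665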



UTF-16 uses 2 octets for the most commonly used characters and 4 octets for characters of lesser use. The characters using 4 octets are all built from specific 16-bit values known as surrogates, so each half can be immediately recognized as part of a four-octet character, either the beginning half or the ending half.
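A small Python sketch shows both cases; U+1F600 is just an arbitrary example of a character beyond the 16-bit range:

    # U+2665 fits in two octets; U+1F600 needs a surrogate pair (four octets)
    print('\u2665'.encode('utf-16-be').hex())       # 2665
    print('\U0001F600'.encode('utf-16-be').hex())   # d83dde00 (high surrogate D83D, low surrogate DE00)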



UTF-8 uses a coding in which a single character varies in size from 1 octet to 4 octets. Octets where the first bit is zero represent exactly the same values as standard 7-bit US ASCII. If the first bit is 1, then the octet is part of a longer character. The first octet of a multi-octet sequence begins with 110 for a 2-octet sequence, 1110 for a 3-octet sequence, and 11110 for a 4-octet sequence. The trailing octets in multi-octet sequences all begin with 10.
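Those lead bits are easy to see if you print the octets of a few characters in binary; here is a rough Python illustration (the sample characters are just arbitrary picks from the 1-, 2-, 3- and 4-octet ranges):

    # one sample character from each UTF-8 length class
    for ch in ('A', '\u00e9', '\u2665', '\U0001F600'):
        octets = ch.encode('utf-8')
        print(ch, [format(b, '08b') for b in octets])
    # 'A' -> ['01000001']                                       leading 0
    # 'é' -> ['11000011', '10101001']                           110..., 10...
    # '♥' -> ['11100010', '10011001', '10100101']               1110..., then 10...
    # '😀' -> ['11110000', '10011111', '10011000', '10000000']  11110..., then 10... three times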



There is no obvious or non-obvious escape sequence, any more than there is with ASCII or EBCDIC or any other character set. One really has to know which character set a file uses before one can parse it. One can often figure it out algorithmically. For example, if one finds ASCII NULL + ASCII SPACE occurring often in the first sections of a file, it is probably coded in UTF-16, because in UTF-16 the space character is 0x0020.
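As a rough illustration of that kind of guessing, here is a minimal Python sketch; the 30% threshold is made up purely for illustration, not taken from any standard:

    def looks_like_utf16(first_chunk: bytes) -> bool:
        # Mostly-ASCII text stored as UTF-16 has a NUL octet in every other
        # position, so many zero octets near the start of a file is a strong hint.
        if not first_chunk:
            return False
        return first_chunk.count(0) / len(first_chunk) > 0.3   # arbitrary threshold

    print(looks_like_utf16('hello world'.encode('utf-16-be')))  # True
    print(looks_like_utf16(b'hello world'))                     # False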



However, Unicode does contain a character known as the BOM or Byte Order Mark which may be placed at the beginning of a file to indicate whether its octets are in little-endian or big-endian order, and the presence of this character is a good indication that one is dealing with a Unicode file. The BOM has the Unicode value U+FEFF, which means its UTF-16 (big-endian) form looks like þÿ when misinterpreted as Latin-1, and its UTF-8 form looks like ï»¿.
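In Python the BOMs are available as constants, and you can reproduce the þÿ / ï»¿ effect by deliberately decoding them with the wrong character set:

    import codecs

    print(codecs.BOM_UTF16_BE.hex())              # feff
    print(codecs.BOM_UTF8.hex())                  # efbbbf
    print(codecs.BOM_UTF16_BE.decode('latin-1'))  # þÿ
    print(codecs.BOM_UTF8.decode('latin-1'))      # ï»¿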



Indicating what character encoding is being used is something to be done by a higher protocol like HTML or XML or coding in a word processor, not by an escape sequence. See http://en.wikipedia.org/wiki/Character_encodings_in_HTML for some examples.



Look up the heart symbol at http://www.fileformat.info/info/unicode/char/2665/index.htm . If you dump it in UTF-16, you get 0x2665, which would indeed be interpreted as &e if this string were part of an 8-bit encoding and interpreted as ASCII. But it is not part of an 8-bit encoding and is not to be interpreted as ASCII. In UTF-8 you get 0xE299A5, which would be â™¥ in Windows code page 1252 and other values in other code pages and character sets. But the higher protocol will instead interpret it properly as ♥ in either case if it can read the coding.
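You can reproduce both misreadings in a few lines of Python:

    heart = '\u2665'
    print(heart.encode('utf-16-be'))               # b'&e'  -- the two octets 0x26 0x65
    print(heart.encode('utf-8').hex())             # e299a5
    print(heart.encode('utf-8').decode('cp1252'))  # â™¥   -- UTF-8 octets misread as code page 1252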



Similarly, if you take a file encoded in a double-byte character set like Shift-JIS and read it as if it were Windows-1252, or as ASCII, you will also get garbage for most characters.
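That kind of mojibake is easy to demonstrate too; the Japanese greeting below is just an arbitrary sample string:

    jp = 'こんにちは'
    print(jp.encode('shift_jis').decode('cp1252'))   # ‚±‚ñ‚É‚¿‚Í  -- garbage when misread as Windows-1252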



If you read a UTF-16 file as if it were an ASCII encoding, i t    t e n d s    t o    l o o k    l i k e    t h i s .    The first byte of each character comes out as a NULL character and sometimes looks like a space, or possibly an empty-box symbol.
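You can see where those NULLs come from by looking at the raw octets in Python:

    text = 'it tends to look like this'
    print(text.encode('utf-16-be'))
    # b'\x00i\x00t\x00 \x00t\x00e\x00n\x00d\x00s...' -- a NUL octet before every letter,
    # which an ASCII-oriented viewer typically shows as spaces or empty boxes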



The 16 bits of 0x2665 unequivocally indicate ♥ in UTF-16. There is no confusion with &e, which would be the four octets 0x0026 0x0065 in UTF-16. Similarly an escape character, if it occurred, would be 0x001B in UTF-16 (but simply 0x1B in UTF-8, because UTF-8 includes the US ASCII characters as they stand).



Unicode was originally introduced as a fixed-width 16-bit encoding, and some early Unicode products treated it that way. An 8-bit character in UTF-16 makes no more sense than a 4-bit character would in ASCII. You don’t ask why z in ASCII, which is 0x7A, is not interpreted as Bell (0x7) followed by Linefeed (0xA), because Bell is actually 0x07 and Linefeed is actually 0x0A in ASCII in an 8-bit environment.



The Escape character is not used within Unicode, any more than it is used within plain ASCII or other character sets. It exists for use by higher protocols for control of formatting. In current technology it is mostly not used at all in creating files of formatted text.
2010-03-16 09:32:50 UTC
Depends on the encoding. There are several ways to encode Unicode characters (see sources). One of the most common, especially if you are mixing it with plain text (i.e. ASCII) is UTF-8 (see sources). In this encoding, 0x2665 will be encoded as follows:



1. In binary, 0x2665 = 0010 0110 0110 0101. Call these nibbles a, b, c and d in sequence.

2. Substitute into the sequence 1110 a, 10 b c1, 10 c2 d, where c1 and c2 are the first and second halves of c. This yields 11100010 10011001 10100101

3. Convert back to hex: E2 99 A5
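Those three steps can be checked mechanically; here is a short Python sketch of the same substitution, compared against the built-in encoder:

    cp = 0x2665
    a = (cp >> 12) & 0xF                           # nibble a
    b = (cp >> 8)  & 0xF                           # nibble b
    c = (cp >> 4)  & 0xF                           # nibble c
    d =  cp        & 0xF                           # nibble d
    octet1 = 0b11100000 | a                        # 1110 a
    octet2 = 0b10000000 | (b << 2) | (c >> 2)      # 10 b c1
    octet3 = 0b10000000 | ((c & 0b11) << 4) | d    # 10 c2 d
    print(bytes([octet1, octet2, octet3]).hex())   # e299a5
    print('\u2665'.encode('utf-8').hex())          # e299a5 -- same result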



EDIT: note that the way you encode UTF-8 depends on the value of the character. Again, see sources.
koguchi
2016-12-15 09:18:44 UTC
now no longer constructive with regard to the pipes. whether in all probability in case you in all probability did a replace and altered the |'s with , 's this manner you're in a position to insert the education with some style of CSV (comma separated vales ) to sq. script. or perhaps get very own dwelling house cyber web internet site to benefit a line, enter the education between the |'s to a variety of of strings $a $b $c etc. then to a INSERT into Blaaa values $a,$b,$c ? and repeat till at last the appropriate of checklist. those style of questions would be appropriate spoke back on a dedicated very own dwelling house cyber web internet site communicate board. a minimum of there you will discover extra useful very own dwelling house cyber web internet site programmers.
wessel
2016-12-12 10:45:39 UTC
not helpful with regard to the pipes. even regardless of the incontrovertible fact that probable in case you probable did a replace and adjusted the |'s with , 's this form you're able to insert the education with some variety of CSV (comma separated vales ) to sq. script. or perhaps get own abode information superhighway website to examine a line, enter the education between the |'s to a determination of strings $a $b $c etc. then to a INSERT into Blaaa values $a,$b,$c ? and repeat till finally the suited of checklist. those variety of questions would be appropriate responded on a committed own abode information superhighway website communicate board. a minimum of there you will come across extra advantageous own abode information superhighway website programmers.


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.