Sarves's Hut: BOM

The very recent and Important thing I learnt in CS is about BOM – Byte Order Mark

We have been doing Moodle localization for last 8 months or so. We do not use any special tool for this, mostly use Dreamweaver to do this! As you all expect, we saved our works in UTF8. Until today, we had small problems when we test our language pack with Moodle. But we didn’t much worry about it and suddenly yesterday we got a serious one. When we try to test our pack, the pages start to give ‘header already sent’ error messages. Then only we realized the seriousness and start to dig the problem.

Today we found that till today we have saved our works in UTF8+BOM. But Moodle doesn’t support for BOM. Then we removed that and now everything works fine !

Here is a small FAQ from Unicode site :

Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.

Q: Where is a BOM useful?

A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format—it can also serve as a hint indicating that the file is in Unicode, as opposed to in a legacy encoding and furthermore, it act as a signature for the specific encoding form used .

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:

Bytes	Encoding Form
00 00 FE FF	UTF-32, big-endian
FF FE 00 00	UTF-32, little-endian
FE FF	UTF-16, big-endian
FF FE	UTF-16, little-endian
EF BB BF	UTF-8

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" of at the beginning of Unix shell scripts.

Sarves's Hut

Sunday, April 13, 2008

BOM

No comments: