Sarves's Hut: Unicode


Wednesday, May 25, 2011

Enable Tamil and Sinhala input methods

Here is how I enabled the Tamil (Renganathan IM) and Sinhala (Wijesekara) input methods on Ubuntu 11.

Step 1: As root, run:

apt-get install ibus im-switch ibus-m17n m17n-db m17n-contrib ttf-tamil-fonts language-pack-ta-base ttf-sinhala-lklug language-pack-si-base

Step 2: Run this from your own user account (not as root):
rm -f ~/.xinput.d/* ; im-switch -z all_ALL -s ibus

Then restart the session (just log out and log back in).

After that, run "ibus-setup" and configure your preferred input methods. You can also configure where it appears on your screen.

Enjoy!

Reference : http://sinhala.sourceforge.net/

Thursday, June 12, 2008

Punycode

I came across this only recently.
Punycode is a simple and efficient transfer encoding syntax designed for use with Internationalized Domain Names in Applications (IDNA). It uniquely and reversibly transforms a Unicode string into an ASCII string. ASCII characters in the Unicode string are represented literally, and non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).
Punycode is an instance of a more general algorithm called Bootstring, which allows strings composed from a small set of "basic" code points to uniquely represent any string of code points drawn from a larger set.  Punycode is Bootstring with particular parameter values appropriate for IDNA.
-http://www.faqs.org/rfcs/rfc3492.html

Basically, the idea is to have domain names in local languages. The current naming conventions allow only ASCII letters, digits, and hyphens in host names. So to have names in Unicode, the Unicode text has to be mapped to ASCII. Punycode does that job.
For example, to use www.தமிழ்.com we need to enter its Punycode form in DNS, which looks like this: www.xn--rlcus7b3d.com. Notice that the encoded label starts with "xn--", the prefix that IDNA puts in front of a Punycode-encoded label.
There are converters we can use to get the Punycode form of our domain names (http://www.nameisp.com/puny.asp).
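To see the mapping without an online converter, here is a minimal Python sketch. It assumes Python 3 and its built-in "idna" codec, which as far as I know follows the older IDNA 2003 rules, so treat it as a demonstration and not a complete converter. The function names are my own.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (xn--) form."""
    # The "idna" codec works label by label (split on dots).
    # Pure-ASCII labels such as "www" and "com" pass through unchanged.
    return domain.encode("idna").decode("ascii")


def to_unicode_domain(domain: str) -> str:
    """Reverse the mapping: ASCII (xn--) form back to Unicode."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    ascii_name = to_ascii_domain("www.தமிழ்.com")
    print(ascii_name)                     # the form that goes into DNS
    print(to_unicode_domain(ascii_name))  # back to the original name
```

Because the transformation is reversible, decoding the ASCII form gives back the original Unicode name.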

I think there are conventions for transforming the top-level domains too, but I haven't studied them yet.

Sunday, April 13, 2008

BOM

The most recent important thing I learnt in CS is the BOM (Byte Order Mark).

We have been doing Moodle localization for the last 8 months or so. We do not use any special tool for this; we mostly use Dreamweaver! As you would expect, we saved our work in UTF-8. Until today, we had small problems when testing our language pack with Moodle, but we didn’t worry much about them. Then yesterday we suddenly hit a serious one: when we tried to test our pack, the pages started giving ‘headers already sent’ error messages. Only then did we realize how serious it was and start digging into the problem.

Today we found that all this time we had been saving our work as UTF-8 with a BOM, and Moodle doesn’t support a BOM. Presumably the BOM bytes at the start of each language file were being sent to the browser as output before Moodle could send its HTTP headers. We removed the BOM and now everything works fine!
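For anyone who would rather script the cleanup than re-save every file by hand, here is a minimal Python sketch. It is not what we actually did (we just removed the BOM from our files); the function names are hypothetical, and it assumes the files are small enough to read into memory.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def remove_bom_from_file(path: str) -> bool:
    """Rewrite the file without its UTF-8 BOM. Return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False  # no BOM found, leave the file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Calling remove_bom_from_file on each file of a language pack would report which files carried a BOM. Keep a backup before running something like this over real work.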

Here is a small FAQ from the Unicode site:

Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.

Q: Where is a BOM useful?

A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format. It can also serve as a hint indicating that the file is in Unicode, as opposed to a legacy encoding, and furthermore it acts as a signature for the specific encoding form used.

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:

Bytes	Encoding Form
00 00 FE FF	UTF-32, big-endian
FF FE 00 00	UTF-32, little-endian
FE FF	UTF-16, big-endian
FF FE	UTF-16, little-endian
EF BB BF	UTF-8
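The table above can be turned into a small detector. This is only a sketch, assuming Python 3 and the BOM constants from its standard codecs module; BOM_TABLE and detect_bom are names I made up. One caveat: a UTF-16 little-endian file whose first character is U+0000 would look like UTF-32 little-endian to it.

```python
import codecs

# Longer signatures come first: the UTF-32 little-endian BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 little-endian BOM (FF FE).
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]


def detect_bom(data: bytes):
    """Return the encoding form named by a leading BOM, or None if there is none."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None
```

Passing it the first few bytes of a file (read in binary mode) is enough, since the longest signature is four bytes.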

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature, an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts.
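The point about recipients that do not expect a BOM can be seen in a short Python sketch. It assumes Python 3's standard "utf-8" and "utf-8-sig" codecs; the two function names are only for illustration.

```python
def decode_keeping_bom(raw: bytes) -> str:
    # The plain "utf-8" codec treats the BOM as an ordinary character,
    # so the result starts with U+FEFF instead of the expected text.
    return raw.decode("utf-8")


def decode_dropping_bom(raw: bytes) -> str:
    # "utf-8-sig" skips one leading BOM if present, and otherwise
    # behaves like plain UTF-8.
    return raw.decode("utf-8-sig")


if __name__ == "__main__":
    script = b"\xef\xbb\xbf#!/bin/sh\n"  # UTF-8 BOM followed by a shebang line
    print(decode_keeping_bom(script).startswith("#!"))   # False
    print(decode_dropping_bom(script).startswith("#!"))  # True
```

A reader that only looks for "#!" at the very start, like the first function, no longer finds it once a BOM is in front.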