4. Locale Setup

4.1. Files and the kernel

You can now use any Unicode characters in file names. No kernel or file utilities need modifications. This is because file names in the kernel can be anything not containing a null byte, and '/' is used to delimit subdirectories. When encoded using UTF-8, non-ASCII characters will never be encoded using null bytes or slashes. All that happens is that file and directory names occupy more bytes than they contain characters. For example, a filename consisting of five Greek characters will appear to the kernel as a 10-byte filename. The kernel does not know (and does not need to know) that these bytes are displayed as Greek.
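The byte count claimed above can be checked from a shell. The following is a minimal sketch, assuming a POSIX shell with printf, wc, od and tr available; the five letters (alpha to epsilon) are an arbitrary choice and are written as octal escapes so that the example does not depend on how this document itself is encoded.

```shell
# Five Greek letters (alpha, beta, gamma, delta, epsilon) as UTF-8 bytes.
# Each letter is two bytes: 0xCE followed by 0xB1..0xB5.
name=$(printf '\316\261\316\262\316\263\316\264\316\265')

# Count the bytes, which is what the kernel sees as the name length.
bytes=$(printf '%s' "$name" | wc -c | tr -d '[:space:]')
echo "bytes: $bytes"

# Dump the bytes in hex. Neither 00 (null) nor 2f (slash) appears.
hex=$(printf '%s' "$name" | od -An -tx1 | tr -d '[:space:]')
echo "hex: $hex"
```

Every byte of a non-ASCII character in UTF-8 has its high bit set, which is why such bytes cannot collide with the null byte or with '/'.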

This is the general theory, so long as your files reside on Linux. On filesystems which are used from other operating systems, you have mount options to control conversion of filenames to or from UTF-8:

  • The "vfat" filesystem has a mount option "utf8". See file /usr/src/linux/Documentation/filesystems/vfat.txt. When you give an "iocharset" mount option different from the default (which is "iso8859-1"), the results with and without "utf8" are not consistent. Therefore, I do not recommend using the "iocharset" mount option.

  • The "msdos" and "umsdos" filesystems have the same mount option, but it appears to have no effect.

  • The "iso9660" filesystem has a mount option "utf8". See file /usr/src/linux/Documentation/filesystems/isofs.txt.

  • Since Linux 2.2.x kernels, the "ntfs" filesystem has a mount option "utf8". See file /usr/src/linux/Documentation/filesystems/ntfs.txt.

The other filesystems (nfs, smbfs, ncpfs, hpfs, etc.) don't convert filenames; therefore they support Unicode file names in UTF-8 encoding only if the other operating system supports them.

Please note that to enable a mount option for all future remounts, you add it to the fourth column (the mount options field) of the corresponding /etc/fstab line.
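As an illustration of where the option goes, here is a small sketch. The device /dev/sda1, the mount point /mnt/usb and the extra options "noauto,user" are placeholders chosen for this example, not values from this document; substitute your own.

```shell
# Hypothetical /etc/fstab entry for a vfat partition (placeholder device
# and mount point). Fields: device, mount point, type, options, dump, pass.
line='/dev/sda1  /mnt/usb  vfat  utf8,noauto,user  0  0'

# The mount options are the fourth whitespace-separated field.
opts=$(printf '%s\n' "$line" | awk '{ print $4 }')
echo "options: $opts"

# The one-off equivalent, run as root, would be:
#   mount -t vfat -o utf8 /dev/sda1 /mnt/usb
```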

4.2. Locale environment variables

The following environment variables hold locale names. You do not need to set all of them; setting LANG alone is often enough:

LANGUAGE

override for LC_MESSAGES

LC_ALL

override for all other LC_* variables

LC_CTYPE, LC_MESSAGES, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_TIME

individual variables for: character types and encoding, natural language messages, sorting rules, number formatting, money amount formatting, date and time display.

LANG

default value for all LC_* variables.

(See `man 7 locale' for a detailed description.)
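The precedence among these variables can be sketched as a small shell function. This is only an illustration of the lookup order (LC_ALL first, then the individual LC_* variable, then LANG, then the built-in POSIX locale); the function name resolve_locale is made up for this example, and LANGUAGE is not covered here. The real effective settings are printed by the `locale' command.

```shell
# resolve_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Print the locale name that applies to one category. Empty values are
# treated as unset. (Hypothetical helper, for illustration only.)
resolve_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"        # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"        # then the individual LC_* variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"        # then LANG as the default
    else
        printf '%s\n' "POSIX"     # nothing set: the built-in locale
    fi
}

# Which locale currently applies to character types and encoding?
resolve_locale "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```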

In order to tell your system and all applications that you are using UTF-8, you need to add a codeset suffix of UTF-8 to your locale names. For example, to run a single application in the Hindi UTF-8 locale from the bash shell, set the variable on the command line, so that it is passed to that application only:

  $ LANG=hi_IN.UTF-8 xman
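Locale names have the general form language[_territory][.codeset][@modifier], so "adding a codeset suffix" means putting ".UTF-8" after the territory and before any modifier. The helper below is a sketch of that rewriting; the name to_utf8_locale is invented for this example, and the function only edits the string. It does not check that the resulting locale is installed on your system.

```shell
# to_utf8_locale NAME
# Rewrite a locale name so that its codeset is UTF-8, keeping any
# @modifier. (Hypothetical helper, for illustration only.)
to_utf8_locale() {
    name=$1
    modifier=
    case $name in
        *@*)
            modifier="@${name#*@}"   # remember the modifier, e.g. @latin
            name=${name%%@*}         # and strip it from the name
            ;;
    esac
    name=${name%%.*}                 # drop an existing codeset, if any
    printf '%s.UTF-8%s\n' "$name" "$modifier"
}

to_utf8_locale hi_IN
to_utf8_locale de_DE.ISO-8859-1
to_utf8_locale sr_RS@latin
```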

To set the Hindi locale for every session of a particular user, append the following line to that user's ~/.bashrc file:

  export LANG=hi_IN.UTF-8

After that, you no longer need to set the LANG environment variable each time you run an application.
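If you script this change, it is worth making sure the line is not appended a second time. The following sketch shows one way to do that; the function name add_line_once is made up for this example, and the demonstration deliberately works on a temporary file rather than on your real ~/.bashrc. It assumes the target file, if it exists, ends with a newline.

```shell
# add_line_once FILE LINE
# Append LINE to FILE unless an identical line is already present.
# (Hypothetical helper, for illustration only.)
add_line_once() {
    file=$1
    line=$2
    touch "$file"
    # -x: match whole lines, -F: fixed string, -q: no output
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file standing in for ~/.bashrc.
rc=$(mktemp)
add_line_once "$rc" 'export LANG=hi_IN.UTF-8'
add_line_once "$rc" 'export LANG=hi_IN.UTF-8'   # second call changes nothing
count=$(grep -cxF -- 'export LANG=hi_IN.UTF-8' "$rc")
echo "matching lines: $count"
rm -f "$rc"
```

For real use, the call would be add_line_once "$HOME/.bashrc" 'export LANG=hi_IN.UTF-8'. New shells started afterwards pick up the setting.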