FB

Overview of Oracle Character Sets

I’ve been working with databases for a long time, and the topic of character sets took me a while to understand.

This article will answer some of the common questions, and questions I had about Oracle character sets and how they relate to the Oracle database.


What is a Character Set?

As defined by Webopedia:

A character set is a defined list of characters that is recognised by computers.

Each human language (such as English, French, Latin, or Japanese) has a range of characters that make up that language. A character set is a set of these characters.

An example of a character set is ASCII. It contains a range of numbers, symbols, letters, and some letters with accents on them.

Another example is Unicode, which contains a lot more characters than ASCII.

 

What’s the Difference Between a Character Set and a Character Encoding?

A character set is a range of characters. An encoding is how those characters are stored.

An encoding method matches a code against a character for a character set.

For example in the ASCII encoding method, the code of 23 is used for a # symbol.

Other encoding methods will have different codes to represent the same character, and the code of 23 will represent a different character.

Find out more about different SQL terms in my SQL glossary.

 

How Can I Find the Character Set Being Used by my Oracle Database?

Character sets are defined for the database in Oracle. Unlike MySQL, they cannot be defined at the table or the column level.

The character set being used by your database can be found in the NLS_DATABASE_PARAMETERS table.

SELECT value
FROM nls_database_parameters
WHERE parameter = 'NLS_CHARACTERSET';
VALUE
AL32UTF8

While you can change some of your session parameters (such as the language or the timezone), you can’t change the character set.

 

What Oracle Character Sets are Most Useful?

I’ve seen that UTF8 is quite popular (specifically called AL32UTF8). This is a character encoding for the Unicode character set, which uses one byte per character.

UTF16 is similar but uses two bytes per character.

The actual character set you choose will depend on your requirements for your data. When choosing a character set, note that it cannot be changed unless you drop and recreate the database, or perform a migration (which is not easy).

So, there’s my brief explanation of character sets and how you can see them in Oracle. I haven’t had a lot to do with them but sometimes it’s just good to know.

3 thoughts on “Overview of Oracle Character Sets”

  1. Just a minor point about UTF-8 encoding, it’s true that ASCII/ANSI characters encoded in UTF-8 are 1 byte long. However, UTF-8 is actually a *variable length* encoding, which means that a single “letter” may occupy from 1 to 4 *bytes*.

    As an example, the letter ‘a’ (Latin small letter a, character value 97, hex 0x61) is exactly the same when encoded in UTF-8.

    On the other hand the letter ‘ã’ (Latin small letter a with tilde, character value 227, hex 0xE3) is encoded in UTF-8 with TWO bytes: 0xC3 0xA3.

    The problem with this is when an application wants to store UTF-8 data in the database, but the developer created (or thought s/he was creating) a field to hold 200 characters:

    CREATE TABLE mytable (


    logfield VARCHAR2 (200)


    )

    During development the developer checks the length of the data they want to save in that field, just to be sure it fits:

    if (data.length() <= 200) {
    writeLogToDB(field);
    }

    It works, most of the time. But then one day it fails (database exception, field is not large enough).

    The developer steps through the code, hits the line with the length check and confirms it – the string is only 198 characters long, so why is it failing?

    The developer forgot two things:

    1. The string includes UTF-8 data (meaning the number of *bytes* may be – and in this case is – greater than the 200 BYTES the database has allocated);

    2. To guarantee that the database (which, as in the article, uses AL32UTF8 character encoding) can hold 200 UTF-8 *characters*, the definition of the field must be:


    logfield VARCHAR2 (200 CHAR)

    And yes, I got bitten by this for quite a while :-D

    And don't get me started about Oracle Linguistic sorting :-)

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Table of Contents