Character Set Conversion 10g and 11g – Part 1
Nov 14, 2012 / By Michael Abbey
This post is the first in a series of discussions. Perhaps you have never had the pleasure of needing to change a character set. If you have not and soon must, purchase a good sleeping bag as you may be at the office late for a few days to get the task accomplished.
A pair of clients are looking at this activity and let me tell you, at the database layer it’s FAR FROM trivial. There is a strong tool called csscan that can plow through the database and report on potential problems. Let me tell you: There is no shortage of potential issues. This is not the software’s fault; it’s just the way it is. MOS does a very nice job assisting with the process since a handful of very nice documents are there to accompany you throughout the journey. Installing the schema to support csscan (CSMIG) is the first step, and the appropriate document on MOS will hand-hold you through the process.
Then the fun starts. On the surface csscan behaves like export/import and data pump. It is parameter-driven, and its display resembles these two logical backup tools we all love. There are 4 types of conversion output displayed in scan.txt, one of csscan’s outputs. This is where the fun starts:
- CHANGELESS – The least of your headaches. It means that the data in the existing and soon-to-be character set will be exactly the same.
- CONVERTIBLE – The data will not “suffer” during the character set change, but to ensure this type of data makes it through the conversion, tables must be exported then truncated prior to running csalter to implement the change. Then the data is imported back into the empty tables.
- TRUNCATION – This section of output documents by TABLE.COLUMN situations where a column’s width in the existing character set will not accommodate all its data in the target set. This commonly occurs when going from a single to a double byte character set. The csscan output will tell you the number of bytes required in the target character set; the intervention required is to ALTER the reported columns and make sure they are at least as long as the Max Post Conversion Data Size.
- LOSSY – This is where the nightmare begins. Data reported in this section of the text file “will not make it through the conversion”. There are 2 dilemmas involved when you detect LOSSY data:
- A character represented in the existing character set is unknown in the target character set. Solution? Nothing more than this: Choose a different target character set that knows about these character. This is easier said than done as one could spend a lifetime running csscan with different character sets on the command-line for the TOCHAR parameter.
- The source database contains characters that are not defined in the source character set. That’s right, “the source character set“. This seems odd upon first reading, but I have experienced instances where characters NOT supported in a database character set (in theory) were found in that very same database! Initially one wonders (as do I) how, for example, a US7ASCII database may contain the Icelandic character ð not supported in US7ASCII. How that got in there is a mystery. This data will be lost in the character set change. MOS discusses how to detect invalid data in an existing database by specifying FROMCHAR and TOCHAR both as the existing character set when invoking csscan.
Stay tuned for part 2 of this database character set conversion discussion. You will not be disappointed!
One Response to “Character Set Conversion 10g and 11g – Part 1”
Leave a Reply