Oracle UTF8 Encoding and String Manipulation Overhead

Jun 16, 2010 / By Alex Gorbachev


For one of our customers, I recently reviewed a strategy for migrating from a single-byte encoding to variable-length multi-byte UTF8 (AL32UTF8 in Oracle naming standards). These types of projects come up again and again, so this must be a common task for many of you, and this bit might be useful. I’m also interested in your experience; perhaps you can run this simple simulation on your platforms and post the results in the comments?

Back to the project… One area to assess was string-manipulation overhead. Based on Pythian’s experience as well as feedback from my network, I could conclude that nobody had observed any measurable performance degradation or significant CPU usage growth clearly attributable to a UTF8 migration.

Thus, I decided to simulate at least some operations and measure them. One concern was sub-string extraction: with a fixed-length encoding, substring is trivial to implement because the byte offset and byte length are known up front. A variable-length character set requires scanning the string from the very beginning, because the byte offset and byte length are not known until the string has been traversed character by character.
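To make the scanning argument concrete, here is a minimal Python sketch of a character-indexed substring over UTF-8 bytes. This is only an illustration of UTF-8 mechanics, not Oracle’s actual implementation; the helper name utf8_substr is made up for this example. With a fixed-width encoding the byte offset would simply be (pos - 1) * width, but with UTF-8 we have to walk the bytes from the start:

```python
# Sketch only: character-indexed substring over UTF-8 bytes.
# With a fixed-width encoding the offset is (pos - 1) * width;
# with UTF-8 we must scan from the beginning to find it.

def utf8_substr(data: bytes, pos: int, length: int) -> str:
    """Extract `length` characters starting at 1-based character `pos`."""
    offsets = []  # byte offset where each character starts
    i = 0
    while i < len(data):
        offsets.append(i)
        b = data[i]
        # The lead byte tells us the sequence length (1-4 bytes).
        if b < 0x80:
            i += 1
        elif b < 0xE0:
            i += 2
        elif b < 0xF0:
            i += 3
        else:
            i += 4
    start = offsets[pos - 1]
    end_idx = pos - 1 + length
    end = offsets[end_idx] if end_idx < len(offsets) else len(data)
    return data[start:end].decode("utf-8")
```

For example, utf8_substr("aæb".encode("utf-8"), 2, 1) returns "æ" even though that character begins at byte offset 1 and spans two bytes.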

This is the PL/SQL block I came up with:

declare
  s1 varchar2(4000 char);
  s2 varchar2(4000 char);
begin
  s1 := rpad('a',3999,'a');
  for i in 1..10000000 loop
    -- leave exactly one of the three cases uncommented:
    null;                                -- case 1: baseline (empty loop)
    --s2 := 'a' || 'b' || 'c' || 'd';   -- case 2: concatenation
    --s2 := substr(s1,3000,1) || substr(s1,3001,1) || substr(s1,3002,1) || substr(s1,3003,1);  -- case 3: substring extraction
  end loop;
  dbms_output.put_line(s2);
end;
/

There are three cases, and exactly one of them should be left uncommented when running the block. The null; statement is there just to capture the baseline and measure the time required to run an empty loop. Concatenation (s2 := 'a' || 'b' || 'c' || 'd';) is the second case, and substring extraction (s2 := substr(s1,3000,1) || ...) is the third.

I ran it in VMs on my laptop as well as on the customer’s own physical servers to confirm the timing. Below are the results from my virtual machines, but the percentages were pretty much the same everywhere. The platform is Linux, Oracle versions 10.2 and 11.1. Note that 11.2 is smarter and optimizes away the NULL loop, so there you would need to trick the PL/SQL optimizer a bit harder.

Character set   null   concat   substr
WE8ISO8859P1     0.5      1.8      8.0
AL32UTF8         0.5      2.2      9.9

The time in the table is the number of seconds the PL/SQL block was executing. I ran it a few times and averaged the results, but they were very consistent anyway. Also, skip the very first measurement to avoid the impact of block parsing.

To interpret the results: pure concatenation (excluding the 0.5s of looping itself) accounts for 1.3s vs 1.7s, about a 30% UTF8 overhead. The SUBSTR calls (excluding both the loop and the concatenation) account for 6.2s vs 7.7s, which is only roughly a 24% overhead.
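The arithmetic behind those percentages can be spelled out explicitly, using only the figures from the table above (Python used here just as a calculator):

```python
# Overhead arithmetic from the timing table: subtract the shared cost
# of the surrounding work, then compare UTF8 against single-byte.
loop = 0.5

concat_8bit = 1.8 - loop   # 1.3 s of pure concatenation work
concat_utf8 = 2.2 - loop   # 1.7 s
concat_overhead = (concat_utf8 - concat_8bit) / concat_8bit  # ~0.31

substr_8bit = 8.0 - 1.8    # 6.2 s of pure SUBSTR work
substr_utf8 = 9.9 - 2.2    # 7.7 s
substr_overhead = (substr_utf8 - substr_8bit) / substr_8bit  # ~0.24
```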

I found the substring overhead rather small; to be honest, I expected an order-of-magnitude difference. However, 30% concatenation overhead seemed a little too high, and I don’t see why it should be.

Now, while the overhead is quite noticeable, I haven’t seen databases that spend most of their time on string manipulation in SQL and PL/SQL. I think that’s why, in the big picture, we generally don’t see much performance impact from moving to a variable-length character set.

If you have access to databases with fixed-length and variable-length character sets on similar hardware, feel free to post your results below.

9 Responses to “Oracle UTF8 Encoding and String Manipulation Overhead”

  • Alex,
    It’s possible that any “real-world” problems due to increased CPU for string manipulation are insignificant compared to other side effects. I’m thinking particularly of (a) increased storage for the same number of characters and (b) changes in execution plans because of the way the optimizer calculates stats for character columns.

    The optimization problem is particularly nasty since some of the figures are based (essentially) on the first six BYTES of the string representation – which means, in the worst case, the first 1.5 (?!) characters in a multibyte set that can use 4 bytes per character.
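The worst case described above can be illustrated with a small Python sketch (Python is used here purely to demonstrate UTF-8 byte mechanics, not Oracle internals):

```python
# Supplementary-plane characters (e.g. many emoji and rare CJK) take
# 4 bytes each in UTF-8, so 6 bytes covers only 1.5 of them.
ch = "\U0001F600"                      # one 4-byte UTF-8 character
assert len(ch.encode("utf-8")) == 4

prefix = (ch * 3).encode("utf-8")[:6]  # first 6 bytes of a 3-char string
# Those 6 bytes hold one whole character plus half of the next;
# the prefix is not even decodable on its own.
assert len(prefix) == 6
```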

    You might also examine the impact on parallel queries which do a lot of messaging between layers of slaves – with longer strings being passed around you get fewer rows into a PX message buffer.

    P.S. Having commented on the 6 byte thing, I realise I haven’t checked 11g to see if it has changed the way it handles a multi-byte set.

    • Jonathan,

      I agree regarding impact on the real-world workloads and I tried to mention it. Thanks for emphasizing it again.

      Regarding string sizes – this is something that’s emphasized in all guides to multi-byte encodings, so it’s tough to miss. Even then, as long as most of the text is in the US7ASCII range, the size increase during migration is usually insignificant, and that is what I’ve observed, but YMMV. What people often miss is how CLOB content doubles. Maybe that’s something for another blog.

      The histogram issue is a very tricky one, and I recall you mentioning it to me before. Maybe I’ll get around to a test case just to demonstrate it.

      Parallel messaging overhead: true, in case the text size does indeed grow significantly.

  • It looks strange to me that there is a difference at all in the concat case; shouldn’t concatenation be a matter of

    - summing the length “bytes”
    - concatenating the encodings

    whatever the charset is?

    Moreover, ASCII characters such as ‘a’,’b’,’c’ and ‘d’ have the same encoding in both WE8ISO8859P1 and AL32UTF8…
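The commenter’s point checks out: ASCII letters have byte-identical representations in ISO 8859-1 (Oracle’s WE8ISO8859P1) and UTF-8, while accented Latin-1 characters do not. A quick Python illustration (again, of the encodings themselves, not of Oracle internals):

```python
# ASCII letters encode to the same single byte in ISO-8859-1 and UTF-8.
for ch in "abcd":
    assert ch.encode("iso-8859-1") == ch.encode("utf-8")

# Accented Latin-1 characters differ: 1 byte vs 2 bytes.
assert len("é".encode("iso-8859-1")) == 1
assert len("é".encode("utf-8")) == 2
```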

  • Øyvind Isene says:

    I ran the same tests and also repeated them for Norwegian characters (æ,ø,å); they are single-byte in WE8ISO8859P1 and 2 bytes in UTF8. I found no difference in response time between tests with (a,b,c) and (æ,ø,å) on both databases. The UTF8 database is running 11.1, the other is running 10.2 (both linux, but with different hardware).

    Set    Test   Result (1/100 sec)
    8bit   null     25
    8bit   cc-E    146
    8bit   cc-N    145
    8bit   ss-E    574
    8bit   ss-N    576

    utf8   null     39
    utf8   cc-E    119
    utf8   cc-N    117
    utf8   ss-E    479
    utf8   ss-N    481

    My concern with UTF8 is not CPU, but refactoring. Columns containing non-English characters stored in varchar2(4000) require migration to CLOB when the byte length surpasses 4000 bytes (e.g. when close to 4000 characters in WE8ISO8859P1 contain Nordic characters, they expand to two-byte characters in UTF8 and the total length exceeds 4000 bytes). The limit is 4000 bytes, not 4000 characters, even if Oracle lets you define (foo varchar2(4000 char)) in UTF8.
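The expansion the commenter describes is easy to demonstrate outside the database (Python is used here only to show the encoded byte lengths; the 4000-byte VARCHAR2 limit itself is Oracle’s):

```python
# Norwegian characters are 1 byte in ISO-8859-1 but 2 bytes in UTF-8,
# so a string that fits comfortably in 4000 bytes before migration
# can blow past the 4000-byte VARCHAR2 limit after.
s = "æøå" * 1000  # 3000 characters
assert len(s.encode("iso-8859-1")) == 3000   # fits in 4000 bytes
assert len(s.encode("utf-8")) == 6000        # exceeds 4000 bytes
```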

    • @Øyvind: This is odd. Do I read your numbers right that the UTF8 database is performing faster than the fixed-length single-byte one? And it is odd that the NULL loop is actually slower in the UTF8 DB. Something seems inconsistent. Are they on the same hardware?

  • Øyvind Isene says:

    As stated somewhere in there, they are on different hardware :) I believe the null loop has not improved much from 10.2 to 11.1, so I imagine the 11.1/UTF8 database is running on “slower” hardware and that the string functions used here improved from 10.2 to 11.1. I ran the tests several times; there were deviations, of course, but the average was quite stable. I initially wanted to show that your tests on UTF8 in the same database do not change much when you replace characters encoded as a single byte in UTF8 with characters encoded as two or more bytes (as is the case for Norwegian characters); ref. the comments from that fine Italian :)

  • rashmi says:

    We are converting an Oracle database character set from WE8ISO8859P1 to AL32UTF8. Before the conversion, I was asked to check the implications for PL/SQL code that uses byte-based SQL functions. Which points should I consider? From searching online, I found three such functions: SUBSTRB, LENGTHB, and INSTRB. What should I check if these functions are used in PL/SQL code? What other functions should I check? What should I check if VARCHAR and CHAR type declarations exist in the code? How do I check the implications of the database character set change to AL32UTF8 for byte-based SQL functions? Please explain as if I know only CRUD statements in Oracle.

    Regards,
    Rashmi
