Wednesday, July 22, 2009

Using linguistic indexes for sorting in Oracle

In Estonia we have our own language and Estonian language has its own alphabet that is a little bit different from English :) And it would be good if a database application sorts text fields according to the correct language rules (I have seen many bug reports from clients, that "the sort order is wrong").

In Oracle database its quite easy to sort according to the correct language rules, you just need to set the correct database session language from the client. This can be done using ALTER SESSION SET NLS_LANGUAGE or quite often the database client picks up the correct language from OS locale.

Small example to demonstrate this (I use Oracle Database 11.1.0.7 for these examples):

create table t (
  t varchar2(100) not null
);

insert into t values ('a');
insert into t values ('ä');
insert into t values ('o');
insert into t values ('õ');
commit;

SQL> alter session set nls_language='estonian';

Session altered.

SQL> select t from t order by t;

T
---------
a
o
õ
ä

SQL> alter session set nls_language='german';

Session altered.

SQL> select t from t order by t;

T
---------
a
ä
o
õ

But what happens if I create an index to speed up the sorting?

SQL> truncate table t;

SQL> create index idxt on t(t);

SQL> 
declare
  s t.t%type;
begin
  for i in 1..100000 loop
    s:= case mod(i,4) 
          when 0 then 'a'
          when 1 then 'ä'
          when 2 then 'o'
          when 3 then 'õ'
        end;
    if mod(i,3) = 0 then
      s:= upper(s);
    end if;
    s:= s||dbms_random.string('x',50);
    insert into t values (s);
  end loop;
  commit;
end;
/
SQL> exec dbms_stats.gather_table_stats(user,'t',cascade=>true);

PL/SQL procedure successfully completed.

SQL> alter session set nls_language='estonian';

Session altered.

SQL> select t from t order by t;

100000 rows selected.

Elapsed: 00:00:02.89

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   100K|  5175K|       |  1482   (2)| 00:00:18 |
|   1 |  SORT ORDER BY     |      |   100K|  5175K|  5896K|  1482   (2)| 00:00:18 |
|   2 |   TABLE ACCESS FULL| T    |   100K|  5175K|       |   162   (2)| 00:00:02 |
-----------------------------------------------------------------------------------

SQL> alter session set nls_language='american';

Session altered.

SQL> select t from t order by t;

100000 rows selected.

Elapsed: 00:00:01.56

-------------------------------------------------------------------------
| Id  | Operation        | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      |   100K|  5175K|   895   (1)| 00:00:11 |
|   1 |  INDEX FULL SCAN | IDXT |   100K|  5175K|   895   (1)| 00:00:11 |
-------------------------------------------------------------------------

First I did a query with Estonian language and then for comparison with American language. If you compare these two, the query in American language works as expected, it reads the correct sort order directly from index. But query in Estonian language requires extra sorting. This is also clearly demonstrated in the query elapsed times. Usually this is not a problem at all, but if there is a need for more performance, what if the data sets are really large and could benefit a sorting help from an index?

This is because normal indexes in oracle are always built with BINARY sort mode. This is the fastest sorting mode that orders the characters by their numerical value in the character set. But this is only good for the English language.

Here come the linguistic indexes.

SQL> create index idx_ling on t (nlssort(t, 'nls_sort=estonian'));

Index created.

SQL> alter session set nls_language='estonian';

Session altered.

SQL> select /*+index(t idx_ling)*/ t from t order by t;

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |   100K|    14M|   101K  (1)| 00:20:20 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T        |   100K|    14M|   101K  (1)| 00:20:20 |
|   2 |   INDEX FULL SCAN           | IDX_LING |   100K|       |  1646   (1)| 00:00:20 |
----------------------------------------------------------------------------------------

So it works, there is no extra sorting step anymore. But it's useless in this case, because the index contains only the sorting value, not the actual column value and the slow TABLE ACCESS BY INDEX ROWID needs to be performed.

This isn't a problem when I do not need to return the actual value of column t and only need to do the sorting:

SQL> select 1 from t order by t;

-----------------------------------------------------------------------------
| Id  | Operation        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT |          |   100K|    14M|  1646   (1)| 00:00:20 |
|   1 |  INDEX FULL SCAN | IDX_LING |   100K|    14M|  1646   (1)| 00:00:20 |
-----------------------------------------------------------------------------
Or I could include the actual column value in the index itself:
SQL> create index idx_ling2 on t (nlssort(t, 'nls_sort=estonian'), t);

SQL> select t from t order by t;

------------------------------------------------------------------------------
| Id  | Operation        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |           |   100K|    14M|  2388   (1)| 00:00:29 |
|   1 |  INDEX FULL SCAN | IDX_LING2 |   100K|    14M|  2388   (1)| 00:00:29 |
------------------------------------------------------------------------------

If I change the language, the index is not used anymore.

SQL> drop index idxt;

SQL> alter session set nls_language='american';

Session altered.

SQL> select 1 from t order by t;

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   100K|  5175K|       |  1482   (2)| 00:00:18 |
|   1 |  SORT ORDER BY     |      |   100K|  5175K|  5896K|  1482   (2)| 00:00:18 |
|   2 |   TABLE ACCESS FULL| T    |   100K|  5175K|       |   162   (2)| 00:00:02 |
-----------------------------------------------------------------------------------

The same linguistic index can also be used for searching, but for that NLS_COMP needs to be set to LINGUISTIC:

SQL> alter session set nls_language='estonian';

Session altered.

SQL> select t from t where t = 'x';

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    53 |   163   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| T    |     1 |    53 |   163   (3)| 00:00:02 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("T"='x')

SQL> alter session set nls_comp='linguistic';

Session altered.

SQL> select t from t where t = 'x';

------------------------------------------------------------------------------
| Id  | Operation        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |           |     1 |   157 |     3   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| IDX_LING2 |     1 |   157 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access(NLSSORT("T",'nls_sort=''ESTONIAN''')=HEXTORAW('80000100')
              )

It is also possible to combine search and sorting:

SQL> select t from t where t between 'a' and 'b' order by t;

------------------------------------------------------------------------------
| Id  | Operation        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |           |   250 | 39250 |    13   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| IDX_LING2 |   250 | 39250 |    13   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access(NLSSORT("T",'nls_sort=''ESTONIAN''')>=HEXTORAW('14000100')
                AND NLSSORT("T",'nls_sort=''ESTONIAN''')<=HEXTORAW('19000100') )

In conclusion, its possible to use index for sorting in non-English languages, but the solution is not very straight-forward and simple. But at the same time it is powerful, you can index the same column in different languages at the same time! But I'm not yet sure, that I like this FBI kind of solution by Oracle, because when doing SQL tuning I really like the possibility of reading the actual column value from index and not doing the always painfully slow TABLE ACCESS BY INDEX ROWID, if at all possible.
In one of the next posts I will take a look at other databases, how have they solved sorting according to national language rules.

More examples about Linguistic Indexes can be found from Richard Foote blog, he focuses on the case-insensitive search feature:
Introduction To Linguistic Indexes – Part I

Also take a look at Metalink note 227335.1:
Linguistic Sorting - Frequently Asked Questions

And of course... documentation:
Oracle® Database Globalization Support Guide - Linguistic Sorting and String Searching

No comments:

Post a Comment