Monday, July 27, 2009

Using linguistic indexes for sorting in open source databases

Here I'm following up my previous post Using linguistic indexes for sorting in Oracle. I don't much like the Oracle solution, that requires creating a special index to speed up sorting, but... at the same time its very powerful, allows to index in many languages and no database changes are needed.

In this post I’ll take a look at the two popular open source databases MySQL and PostgreSQL. I'll take a look only at features, that the database has included and that can be used without any special application changes.

PostgreSQL 8.4



Starting from 8.4, collation (sorting) rules can be defined per database and there is no possibility to set it in session level. All sorting and all indexes are ordered according to the database collation locale. In previous versions there was only one collation locale allowed for the entire database cluster.

For my example I create two databases, one with Estonian locale and one with German.
$ createdb -E utf8 -l 'et_EE.utf8' -O ilmar ilmar_ee
$ createdb -E utf8 -l 'de_DE.utf8' -O ilmar -T template0 ilmar_de
Currently PostgreSQL relies on the underlying OS to do the collations, so the OS must also support the specified locale. Check it with:
$ locale -a | grep et_EE
et_EE
et_EE.iso88591
et_EE.iso885915
et_EE.utf8
To change the collation you need to dump the entire database to a text file (pg_dump), create a new database and reload the data. So, a pretty painful procedure for large databases.

A small test if it really works. In ilmar_ee and ilmar_de I create table test_coll and load it with 4 rows:
CREATE TABLE test_coll (
  t character varying(100) NOT NULL
);

begin;
insert into test_coll values ('a');
insert into test_coll values ('o');
insert into test_coll values ('õ');
insert into test_coll values ('ä');
commit;


ilmar_ee=> select t from test_coll order by t;
a
o
õ
ä

ilmar_de=> select t from test_coll order by t;
a
ä
o
õ
Now can index be used for sorting?
CREATE TABLE t (
  t character varying(100) NOT NULL
);

CREATE INDEX idxt
  ON t
  USING btree
  (t);

CREATE OR REPLACE FUNCTION fill_t(p_num_rows bigint) RETURNS bigint AS
$BODY$
declare
  s t.t%type;
  i integer;
begin
  for i in 1..p_num_rows loop
    s:= case mod(i, 4) when 0 then 'a' when 1 then 'ä' when 2 then 'o' when 3 then 'õ' end;
    if mod(i,2) = 0 then
      s:= upper(s);
    end if;
    s:= s ||' wqe wqe wqe wqeqwdsa asd asdasd sasss we qwewq dssas';
    insert into t (t) values (s);
  end loop;
  return 0;
end;
$BODY$
  LANGUAGE 'plpgsql' VOLATILE;

select fill_t(10000);
vacuum analyze t;
After the test data is created, some tests. Oracle users will find the following very strange:
ilmar=> explain select t from t order by t;
                          QUERY PLAN
--------------------------------------------------------------
 Sort  (cost=868.39..893.39 rows=10000 width=55)
   Sort Key: t
   ->  Seq Scan on t  (cost=0.00..204.00 rows=10000 width=55)

ilmar=> explain select t from t where t between 'a' and 'b' order by t;
                               QUERY PLAN
-------------------------------------------------------------------------
 Sort  (cost=395.10..401.35 rows=2500 width=55)
   Sort Key: t
   ->  Seq Scan on t  (cost=0.00..254.00 rows=2500 width=55)
         Filter: (((t)::text >= 'a'::text) AND ((t)::text <= 'b'::text))

ilmar=> explain select t from t where t between 'a' and 'ak' order by t;
                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using idxt on t  (cost=0.00..8.27 rows=1 width=55)
   Index Cond: (((t)::text >= 'a'::text) AND ((t)::text <= 'ak'::text))
It seems that Postgres optimizer only consideres using index for sorting, when there is only a small fraction of the table filtered. A reason for this in the documentation is:
The planner will consider satisfying an ORDER BY specification either by scanning an available index that matches the specification, or by scanning the table in physical order and doing an explicit sort. For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern. Indexes are more useful when only a few rows need be fetched.

So it seems that Postgres cannot fast full scan an index. If I fill the table up even more, then finally optimizer is costing the sort operation higher than index scan. But the query is slow, 4125ms on my system.
ilmar=> select fill_t(90000);
ilmar=> vacuum analyze t;

ilmar=> explain select t from t order by t;
                              QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using idxt on t  (cost=0.00..9771.11 rows=100000 width=55)
A note from documentation:
The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.


Better collation supports seems to be a work in progress:
Todo:Collate
Todo:IDU
Its also possible to use third party Orafce package, that enables the use on NLSSORT function, that is similar to Oracle.

MySQL 5.1



MySQL supports collations at the column level.
All available collations for the given charset can be queried like this:
mysql> show collation where charset = 'utf8';
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| utf8_general_ci    | utf8    |  33 | Yes     | Yes      |       1 |
| utf8_bin           | utf8    |  83 |         | Yes      |       1 |
| utf8_unicode_ci    | utf8    | 192 |         | Yes      |       8 |
| utf8_icelandic_ci  | utf8    | 193 |         | Yes      |       8 |
| utf8_latvian_ci    | utf8    | 194 |         | Yes      |       8 |
| utf8_romanian_ci   | utf8    | 195 |         | Yes      |       8 |
| utf8_slovenian_ci  | utf8    | 196 |         | Yes      |       8 |
| utf8_polish_ci     | utf8    | 197 |         | Yes      |       8 |
| utf8_estonian_ci   | utf8    | 198 |         | Yes      |       8 |
| utf8_spanish_ci    | utf8    | 199 |         | Yes      |       8 |
| utf8_swedish_ci    | utf8    | 200 |         | Yes      |       8 |
| utf8_turkish_ci    | utf8    | 201 |         | Yes      |       8 |
| utf8_czech_ci      | utf8    | 202 |         | Yes      |       8 |
| utf8_danish_ci     | utf8    | 203 |         | Yes      |       8 |
| utf8_lithuanian_ci | utf8    | 204 |         | Yes      |       8 |
| utf8_slovak_ci     | utf8    | 205 |         | Yes      |       8 |
| utf8_spanish2_ci   | utf8    | 206 |         | Yes      |       8 |
| utf8_roman_ci      | utf8    | 207 |         | Yes      |       8 |
| utf8_persian_ci    | utf8    | 208 |         | Yes      |       8 |
| utf8_esperanto_ci  | utf8    | 209 |         | Yes      |       8 |
| utf8_hungarian_ci  | utf8    | 210 |         | Yes      |       8 |
+--------------------+---------+-----+---------+----------+---------+
21 rows in set (0.04 sec)
I'll create a table test, in where column e uses Estonian sorting and column h uses hungarian sorting.
mysql> create table test (
  e varchar(100) not null collate 'utf8_estonian_ci',
  h varchar(100) not null collate 'utf8_hungarian_ci'
) charset=utf8;

mysql> insert into test values ('a','a');
mysql> insert into test values ('ä','ä');
mysql> insert into test values ('o','o');
mysql> insert into test values ('õ','õ');
mysql> select * from test;
+---+---+
| e | h |
+---+---+
| a | a |
| ä | ä |
| o | o |
| õ | õ |
+---+---+
4 rows in set (0.02 sec)

mysql> select * from test order by e;
+---+---+
| e | h |
+---+---+
| a | a |
| o | o |
| õ | õ |
| ä | ä |
+---+---+
4 rows in set (0.04 sec)

mysql> select * from test order by h;
+---+---+
| e | h |
+---+---+
| a | a |
| ä | ä |
| o | o |
| õ | õ |
+---+---+
4 rows in set (0.01 sec)
Perfect, now what about indexes?
mysql> create index idxe on test (e);
Query OK, 4 rows affected (0.71 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> create index idxh on test (h);
Query OK, 4 rows affected (0.18 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> explain select e from test order by e;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | idxe | 302     | NULL |    4 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql> explain select h from test order by h;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | idxh | 302     | NULL |    4 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)
If I force a different collation on a column, then values are read from index, but extra filesort step is needed:
mysql> explain select h from test order by h collate utf8_estonian_ci;
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | test  | index | NULL          | idxh | 302     | NULL |    4 | Using index; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)
I'll add more rows to see how much difference there is when reading the order from index and when doing extra sorting.
mysql> delimiter //
mysql> create procedure load_data(p_num_rows INT)
    -> BEGIN
    ->   SET @i = 0;
    ->   REPEAT
    ->     SET @i = @i + 1;
    ->     INSERT INTO test (e, h) VALUES (CONCAT('eeeaad sadsa dasd asd', @i),
    ->       CONCAT('213aad sadsa dasd asd', @i));
    ->   UNTIL @i > p_num_rows END REPEAT;
    -> END
    -> //
Query OK, 0 rows affected (0.44 sec)

mysql> delimiter ;
mysql> call load_data(10000);
mysql> call load_data(20000);
mysql> ANALYZE TABLE test;

mysql> select sql_no_cache count(*) from (select h from test order by h collate utf8_estonian_ci) a;
+----------+
| count(*) |
+----------+
|    30002 |
+----------+
1 row in set (1.22 sec)

mysql> select sql_no_cache count(*) from (select h from test order by h) a;
+----------+
| count(*) |
+----------+
|    30002 |
+----------+
1 row in set (0.09 sec)
It is also possible to set collation at connection level, but this does not change the row sorting order like in Oracle.
mysql> truncate table test;
Query OK, 0 rows affected (0.07 sec)

mysql> insert into test values ('a','a');
Query OK, 1 row affected (0.03 sec)

mysql> insert into test values ('ä','ä');
Query OK, 1 row affected (0.08 sec)

mysql> insert into test values ('o','o');
Query OK, 1 row affected (0.03 sec)

mysql> insert into test values ('õ','õ');
Query OK, 1 row affected (0.03 sec)

mysql> set collation_connection=utf8_estonian_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> select h from test order by h;
+---+
| h |
+---+
| a |
| ä |
| o |
| õ |
+---+
4 rows in set (0.00 sec)

mysql> set collation_connection=utf8_hungarian_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select h from test order by h;
+---+
| h |
+---+
| a |
| ä |
| o |
| õ |
+---+
4 rows in set (0.00 sec)

No comments:

Post a Comment