Considerations about text searches in big fields and planner costs

Posted in: Technical Track

Imagine the following scenario: you have a table with a small set of records, with columns of tsvector and text data types, but the text fields hold almost 20 megabytes of text or more.

So, what happens? The Postgres planner checks reltuples (the estimated row count) and the number of blocks in the relation to calculate the cost of extracting the data. If the query is expected to reach a large fraction of the table (roughly more than 20% of the rows), or the relation occupies very few blocks, the planner will choose a sequential scan; otherwise random access through an index will be the chosen one.
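You can look at the numbers the planner works with directly in pg_class (a quick sketch; documents_en is the example table created below):

```sql
-- reltuples and relpages are the planner's estimates
-- of row count and block count for the relation
SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'documents_en';
```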

But in this case we face a problem: the row and block counts of the relation do not reflect the “total” content. That happens because Postgres uses a technique called TOAST (The Oversized Attribute Storage Technique), which compresses and/or moves out-of-line any attribute large enough to push a tuple over a threshold (about 2 kB by default), storing it apart from the block. So, in the relation itself you will have only small values and pointers to the toasted data. You can modify some options about how the columns get toasted (the storage strategy and level of compression), but the default behaviour is to compress and store apart. This methodology is pretty cool: if you had, say, 500 MB of data in only one field, without TOAST your relation would carry all that data in the same file, and in the most usual cases that is not convenient.
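To see how much of a table actually lives outside the main data file, you can compare the size of the heap alone against the total size including the TOAST table and indexes (a sketch using the example table below):

```sql
-- Size of the main relation only (heap, no TOAST, no indexes)
SELECT pg_size_pretty(pg_relation_size('documents_en'));

-- Size including the TOAST table and all indexes
SELECT pg_size_pretty(pg_total_relation_size('documents_en'));

-- Name of the TOAST table attached to the relation, if any
SELECT reltoastrelid::regclass
FROM pg_class
WHERE relname = 'documents_en';
```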

So, while you have a small number of rows, the amount of data to read from disk is quite big compared with the data stored in the relfilenode (the main data file). For full text search we need to dig into all that content, and to make the searches faster we need a GIN or GiST index. These index types store the individual lexemes of the tsvector inside the index (unlike a plain B-Tree over the whole column value), making it possible to answer the search through the index.

So, you want to access your data through the index, but the planner estimates that the number of rows is so small that a sequential scan is cheaper than using the index. The only way to force the index read is *SET enable_seqscan TO off;* before executing your query. This variable is per session, so you may want to set it inside the transaction and set it back to “on” for other queries.
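A sketch of scoping the change to a single transaction with SET LOCAL, so the setting reverts automatically at COMMIT or ROLLBACK:

```sql
BEGIN;
-- Only affects the current transaction; reverts on COMMIT/ROLLBACK
SET LOCAL enable_seqscan TO off;
SELECT author, title
FROM documents_en
WHERE cont_ftsed @@ to_tsquery('english', 'editor & find');
COMMIT;
```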

Anyway, once your table is populated with enough data you won't need this trick anymore; but if your table is fairly static, it can be a little fix. Besides this, it doesn't represent a performance problem in itself, but this kind of thing shows you how the planner works.

Let’s see an example:

CREATE TABLE documents_en (
author_id char(2),
text_id serial,
author varchar(45) NOT NULL,
title varchar(127) NOT NULL,
content text NOT NULL,
cont_ftsed tsvector ,
PRIMARY KEY (author_id, text_id)
);

CREATE FUNCTION ftsed_upd_en() RETURNS trigger AS $$
BEGIN
  NEW.cont_ftsed := (setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
                     setweight(to_tsvector('english', coalesce(NEW.author, '')), 'B') ||
                     setweight(to_tsvector('english', coalesce(NEW.content, '')), 'C'));
  RETURN NEW;
END
$$
LANGUAGE plpgsql;

CREATE TRIGGER tgr_ins_ftsed_en BEFORE INSERT ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE TRIGGER tgr_upd_ftsed_en BEFORE UPDATE OF author,title,content
ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE INDEX textsearch_en ON documents_en USING gin(
(setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(author, '')), 'B') ||
setweight(to_tsvector('english', coalesce(content, '')), 'C'))
);

CREATE INDEX ix_ftsed_en ON documents_en USING gist(cont_ftsed);

Before continuing, we need to clarify three things. First, text_id is related to author_id, but just for testing we use it as a serial, to avoid collisions in the primary key. Second, we use a trigger to set up the tsvector column automatically. Third, I created two indexes for the examples: a functional GIN index (textsearch_en) and a GiST index over the stored column (ix_ftsed_en). Personally, I don't recommend functional indexes for performance purposes, but they are still an option.
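Since the trigger assigns weights A, B and C to the title, author and content, a ranked search could look like this (a sketch, not part of the original test):

```sql
-- Rank matches so that hits in the title (weight A) score higher
-- than hits in the author (B) or the body (C)
SELECT author, title,
       ts_rank(cont_ftsed, query) AS rank
FROM documents_en,
     to_tsquery('english', 'editor & find') AS query
WHERE cont_ftsed @@ query
ORDER BY rank DESC
LIMIT 10;
```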

To fill up the table, I made a script that collects all the READMEs on the server and loads them into the DB.

#!/bin/bash
FILENAME="$@"
PSQL=/opt/pg92dev/bin/psql
WORDS=words

for _FILE_ in $FILENAME
do
    # Keep only words of at least 4 letters
    egrep -oi '[a-z]{4,}' "$_FILE_" > $WORDS
    AUTHOR=$(sort -u $WORDS | head -n2 | awk '{printf("%s ", $0)}')
    AUTHOR_ID=${AUTHOR:0:2}
    TITLE=$(sort $WORDS | uniq -c | sort -nr | head -n3 | awk '{printf("%s ", $2)}')
    BODY=$(sort -u $WORDS | awk '{printf("%s ", $0)}')

    $PSQL -Upostgres fts -c "INSERT INTO documents_en (author_id, author, title, content) VALUES ('${AUTHOR_ID}', '${AUTHOR}', '${TITLE}', '${BODY}');"
done

I know, it's a bit hacky, but useful for this test. To execute it, you'll need something like:

locate README.txt | xargs ./script

First let’s get some stats from the table:

fts=# SELECT count(*) FROM documents_en;
count: 22
fts=# SELECT pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty: 24 kB (size of the table)
fts=# SELECT relpages FROM pg_class WHERE relname = 'documents_en';
relpages: 3 (8 kB each)

Now, a simple query with EXPLAIN ANALYZE:

fts=# EXPLAIN ANALYZE SELECT author, title
FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.228..0.464 rows=2 loops=1)
   Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
   Rows Removed by Filter: 20
 Total runtime: 0.501 ms
(4 rows)

Oh! It's using a sequential scan (as we expected), but now let's add a trick:

fts=# SET enable_seqscan TO off;
SET
fts=# EXPLAIN ANALYZE SELECT author, title
FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.127..0.191 rows=2 loops=1)
   Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
   Rows Removed by Index Recheck: 1
 Total runtime: 0.280 ms
(4 rows)

What happened? The cost is higher, but it's faster? The quick answer is YES, that is possible. Cost is an estimate based on the number of blocks and the type of access: by default, a random page read is costed four times higher than a sequential one. In this particular case the table is small, so the planner estimates it is cheaper to read it without the index. BUT that's not the real problem; the real one is that the data has already been TOASTed outside the main data file, and that data is not considered by the planner in this case.
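You can inspect (and, with care, tune) the cost parameters involved; a quick sketch:

```sql
-- Cost of a sequentially fetched page; the default is 1.0
SHOW seq_page_cost;
-- Cost of a randomly fetched page; the default is 4.0
SHOW random_page_cost;

-- On fast storage, lowering random_page_cost for the session
-- makes index scans look cheaper to the planner
SET random_page_cost = 2.0;
```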

Another question is: can we improve the access to this data? The answer is YES, as long as you don't mind the extra storage. What you can do is:

fts=# alter table documents_en alter column cont_ftsed set storage external;
ALTER TABLE

That will store the data of the column cont_ftsed uncompressed and force it out-of-line, outside the table (this is a simplification, but it is good enough to be didactic).
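To verify that the storage strategy actually changed, you can inspect pg_attribute (the attstorage codes are p: plain, m: main, x: extended, e: external):

```sql
-- After the ALTER, attstorage should be 'e'
-- (external: out-of-line, uncompressed)
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'documents_en'::regclass
  AND attname = 'cont_ftsed';
```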

fts=# VACUUM FULL ANALYZE documents_en;
VACUUM
fts=# EXPLAIN ANALYZE SELECT author, title
FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.071..0.127 rows=2 loops=1)
   Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
   Rows Removed by Index Recheck: 1
 Total runtime: 0.178 ms
(4 rows)

fts=# SET enable_seqscan TO on;
SET
fts=# EXPLAIN ANALYZE SELECT author, title
FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.067..0.292 rows=2 loops=1)
   Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
   Rows Removed by Filter: 20
 Total runtime: 0.329 ms
(4 rows)

We use VACUUM FULL here because the SET STORAGE option doesn't change previously stored data; it only takes effect on values stored afterwards. With VACUUM FULL, which rewrites the table, we force the new storage strategy onto all the existing rows.

OK, what happens when the table gets filled with more data? The planner will automatically start using the index:

fts=# SELECT pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty
----------------
 1744 kB
(1 row)

fts=# SELECT count(*) FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
count
-------
    45
(1 row)

fts=# EXPLAIN ANALYZE SELECT author, title
FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on documents_en  (cost=4.38..47.02 rows=13 width=31) (actual time=1.091..5.404 rows=45 loops=1)
   Recheck Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
   Rows Removed by Index Recheck: 80
   ->  Bitmap Index Scan on ix_ftsed_en  (cost=0.00..4.37 rows=13 width=0) (actual time=1.003..1.003 rows=125 loops=1)
         Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
 Total runtime: 5.489 ms
(6 rows)

fts=# SET enable_bitmapscan TO off;
SET
fts=# EXPLAIN ANALYZE SELECT author, title
FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ix_ftsed_en on documents_en  (cost=0.00..56.50 rows=13 width=31) (actual time=0.307..5.286 rows=45 loops=1)
   Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
   Rows Removed by Index Recheck: 80
 Total runtime: 5.390 ms
(4 rows)

We don't see a huge difference in time between the two, but we are still facing the question of the relation between estimated cost and real performance.

But don't forget the functional index! I think you will like the following example:

fts=# ANALYZE;
ANALYZE
fts=# EXPLAIN ANALYZE SELECT author_id, text_id FROM documents_en
WHERE (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on documents_en  (cost=0.00..292.20 rows=17 width=7) (actual time=2.960..3878.627 rows=45 loops=1)
   Filter: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
   Rows Removed by Filter: 2238
 Total runtime: 3878.714 ms
(4 rows)

fts=# SET enable_seqscan TO off;
SET
fts=# EXPLAIN ANALYZE SELECT author_id, text_id FROM documents_en
WHERE (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on documents_en  (cost=272.15..326.46 rows=17 width=7) (actual time=0.937..1.026 rows=45 loops=1)
   Recheck Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
   ->  Bitmap Index Scan on textsearch_en  (cost=0.00..272.15 rows=17 width=0) (actual time=0.916..0.916 rows=45 loops=1)
         Index Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
 Total runtime: 1.109 ms
(5 rows)

Again! But this time the difference is far bigger. Not so pretty, huh? It seems you need to be careful when you run searches over big, toasted columns. Maybe you should check your production database to see whether some tweaks like these would get the maximum performance out of your Postgres instance.

The version used in this post is PostgreSQL 9.2devel on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.2 20111027 (Red Hat 4.6.2-1), 64-bit. Besides this, I ran the tests on 9.1 and saw the same behaviour.

Happy hacking!

