Tags again, denormalized

There’s a problem with nested query I previously mentioned: query time is unstable. The worst case is at least as slow as multiple self-join method – when the first few search key result is too big.

After trying to find out faster solution, I find out another solution by denormalizing the table. The current Danbooru also has similar table but awesomely fucked up by using custom FTS parser.

PostgreSQL 8.4 and up added one aggregation function: array_agg. This function allows result of GROUP BY combined into single array. Normalized tables:

Posts >> PostsTags >> Tags

PostsTags contains list of tag a post own. Create additional column (tags_cache) of int[] and then denormalize Posts by using array_agg:

UPDATE posts SET tags_cache = (select array_agg(tag_id)) FROM posts_tags WHERE post_id = id)

Add GIN index to the column:

CREATE INDEX idx_posts__tags_cache ON posts USING gin (tags_cache)

Finally the query:

select count(tag_id), 
  (select name from tags where id = tag_id)
from (
  select unnest(tags_cache) tag_id 
  from posts where status <> 'deleted' and 
  tags_cache @> (select array_agg(id) from tags 
    where name in ('pantsu', 'amaha_miu', 'angelina_nanatsu_sewell', 'uryuu_sakuno', 'sena_airi', 'seifuku', 'thighhighs', 'izumi_tsubasu', 'palette', 'mashiroiro_symphony')))
tag_relation group by tag_id;

Result: ~20ms on my VPS. Not as fast as when using nested (5ms) but doesn’t choke on different queries (there’s one case where this type finishes in 40ms and in 470ms with nested). There’s additional bonus of the query being much simpler and shorter than nested method.

To handle updates, make sure to update tags_cache column on tag removal or addition.

This is also a good case on materialized view. I hope it gets implemented someday.

Leave a Reply

Your email address will not be published. Required fields are marked *