Searching Tags in (3NF) Database

This afternoon I experimented with tag searching. Current implementation doesn’t scale beyond 6 tags due to parsing requirement. You can argue about binding variable all day but in the end it can be abused easily if the limit is lifted.

Here’s the original, awesomely monstrous query:

SELECT (SELECT name
        FROM   tags
        WHERE  id = pt0.tag_id) AS tag,
       Count(pt0.*)             AS tag_count
FROM   posts_tags pt0,
       posts_tags pt1,
       posts_tags pt2,
       posts_tags pt3,
       posts_tags pt4,
       posts_tags pt5,
       posts_tags pt6,
       posts_tags pt7,
       posts_tags pt8,
       posts_tags pt9,
       posts_tags pt10
WHERE  pt0.post_id = pt1.post_id
       AND (SELECT TRUE
            FROM   POSTS p0
            WHERE  p0.id = pt0.post_id
                   AND p0.status <> 'deleted')
       AND pt1.post_id = pt2.post_id
       AND pt1.post_id = pt3.post_id
       AND pt1.post_id = pt4.post_id
       AND pt1.post_id = pt5.post_id
       AND pt1.post_id = pt6.post_id
       AND pt1.post_id = pt7.post_id
       AND pt1.post_id = pt8.post_id
       AND pt1.post_id = pt9.post_id
       AND pt1.post_id = pt10.post_id
       AND pt1.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'pantsu')
       AND pt2.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'amaha_miu')
       AND pt3.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'angelina_nanatsu_sewell')
       AND pt4.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'uryuu_sakuno')
       AND pt5.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'sena_airi')
       AND pt6.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'seifuku')
       AND pt7.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'thighhighs')
       AND pt8.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'izumi_tsubasu')
       AND pt9.tag_id = (SELECT id
                         FROM   tags
                         WHERE  name = 'palette')
       AND pt10.tag_id = (SELECT id
                          FROM   tags
                          WHERE  name = 'mashiroiro_symphony')
GROUP  BY pt0.tag_id
ORDER  BY tag_count DESC;

A kawaii~ 10 tables join. The query itself is fast (~30 ms)… after parsing. It takes ~2500 ms with parsing.

There’s an alternative for that query – Toxi solution (reference):

SELECT Count(1),
       pt.tag_id,
       t.name
FROM   posts_tags pt
       JOIN tags t
         ON ( pt.tag_id = t.id )
WHERE  post_id IN (SELECT post_id
                   FROM   posts_tags ptin
                   WHERE  tag_id IN ( 5858, 12271, 4264, 16822,
                                      43495, 20773, 168, 16823,
                                      1148, 539 )
                   GROUP  BY post_id
                   HAVING Count(post_id) = 10)
       AND (SELECT true
            FROM   posts p
            WHERE  p.id = pt.post_id
                   AND p.status <> 'deleted')
GROUP  BY pt.tag_id,
          t.name
ORDER  BY 1 DESC;

The parsing is fast (instant) but the query itself is a bit slow (~100 ms). The reference above suggested fetching all data per tag but sure sounds like a waste of query.

And then after few hours, I figured out another solution: just nest the damn queries. Its per tag search goes like this:

select post_id from posts_tags where tag_id = 1148

And then we need to further filter it like this (e.g. tag_id 1148 and 5858):

SELECT post_id
FROM   posts_tags
WHERE  tag_id = 1148
       AND post_id IN (SELECT post_id
                       FROM   posts_tags
                       WHERE  tag_id = 5858);

Each queries are fast and simple. The results from inner are also obviously cacheable (by db). To complete the query, just repeat the patter above ad infinitum (or 10 times) and apply relevant additional queries to obtain result just like the original query:

select count(post_id), (select name from tags where id = pt.tag_id) tag
from posts_tags pt where 
  post_id in 
    (select post_id from posts_tags where 
      tag_id = 66 and post_id in 
      (select post_id from posts_tags where 
        tag_id = 168 and post_id in 
          (select post_id from posts_tags where 
            tag_id = 16822 and post_id in 
              (select post_id from posts_tags where 
                tag_id = 16823 and post_id in 
                  (select post_id from posts_tags where 
                    tag_id = 43495 and post_id in 
                    (select post_id from posts_tags where 
                      tag_id = 20773 and post_id in 
                    (select post_id from posts_tags where 
                      tag_id = 539 and post_id in 
                    (select post_id from posts_tags where 
                      tag_id = 1148 and post_id in 
                    (select post_id from posts_tags where 
                      tag_id = 5858 and post_id in 
                    (select post_id from posts_tags where 
                      tag_id = 4264 and post_id in 
                    (select post_id from posts_tags where 
                      tag_id = 12271
  )))))))))))
group by tag_id order by 1 desc;

Looks a bit like monster but really just simple queries nested 10 times. Parsing speed is awesomely fast and the query itself runs at ~30 ms just like the original query.

And even better, this query is possible over Ruby on Rails (and perhaps any ORM) though not in single query but split per level.

2 thoughts on “Searching Tags in (3NF) Database

  1. Still, don’t you need another query for tag_id lookup? and I see you’ don’t test the result for deleted post at your solution, might end up with orphaned tags… I humbly think that some search limit was set for good as there’s not gonna be any solution would scaled up nicely both in code and performance for this case. Not sure about recursive query though.
    Just my 2 cents… I’ve been away from sql related works fo a while now. 😀

    • Put `post_id = (select id from posts p where p.id = pt.post_id and p.status ‘deleted’)` before `GROUP BY`. In fact, using recursive query is much faster after certain amount of keywords since the search range will be much smaller.

      The tag_id lookup is primary key lookup which is almost free.

Leave a Reply

Your email address will not be published. Required fields are marked *