September 25, 2018 · sql repost

Using sets for many-to-many relationships

This is a repost from a previous iteration of my blog. The original post was written in November, 2012.

What's a many-to-many?

It is very common when dealing with data modeling to find many-to-many relationship. Two entities having many related rows in the other table and viceversa.

A common example of is articles and tags. One article can have many tags and one single tag can be in many articles. Another is books and readers, one reader can read many books, and one book can be read by many readers.

Classic schema

How do we model an schema for such kind of relationship?

I will first describe the regular textbook way of doing it, along with its advantages and disadvantages. To help me in this exercise I will use articles/tags schema as an example, given that problem the model will probably end up looking like this:

classic-manytomany

We have the entities articles and tags, and we use a third table to help us relate it to each other, this is called a join table or junction tables.

The join table has foreign keys to both of our main tables. A row creates a single relationship between our tables, many rows create many-to-many relationships.

It's easy to go from one side to another, that is, to see what tags any given article has, or to find what articles have a certain tag. For example, the next query would give all the articles and their tags:

select 
  articles.title, 
  tags.name 
from articles
  inner join articles_tags on articles_tags.article_id = articles.id
  inner join tags on tags.id = articles_tags.tag_id;

           title           |    name
---------------------------+-------------
 Buzzword about buzzwords! | open source
 Buzzword about buzzwords! | ruby
 Buzzword about buzzwords! | programming

Advantages

This model is the only way to go when you need metadata about the relationship, for example, when it was created, by whom, or maybe the relationship status, like "Approved", "On Hold", "Canceled", etc.

Another advantage is that changing the relationships is easy. If you need to add or remove some you just modify the single rows you need, everything else is untouched.

Disadvantages

This model is very redundant if you only care about the relationships themselves and you have natural low cardinality in your data, that is, the same combinations repeat over and over.

On my given example, if many articles have the same tags.

  article_id | tag_id
-------------+-------------
      1      |    1
      1      |    2
      2      |    1
      2      |    2
      3      |    1
      3      |    2

Notice how the same tag combination, 1 and 2, is repeated over and over in multiple articles. If you have 1,000,000 articles with the same two tags, you would have 2,000,000 rows for them in the join table.

Although this may seem very unlikely to happen in a blog, the same principles hold for many other data sets that may be in the order of millions of rows.

This is not only wasted space, space is cheap, but space becomes important because the smaller your tables, the smaller your indexes, the less I/O operations you have to do to access your tables, your database manager can cache more of it, etc.

The less rows you have, the faster you can do operations on them.

This redundancy results from the fact that we have the reference to our main table in the join table. When we need to create a combination it doesn't matter if we have seen that particular combination before, we have to create it anyway because there is no way we can reuse it.

Set schema

set-manytomany

What is a set?

I am going to quote Wikipedia entry for sets:

In computer science, a set is an abstract data structure that can store certain values, without any particular order, and no repeated values. It is a computer implementation of the mathematical concept of a finite set.

A data structure for values with no repetitions. That sounds like something we could use for our "little" redundancy problem.

How?

Instead of introducing just a join table we also introduce another table, a sets table. Instead of relating articles and tags, we relate sets and tags.

Picture this scenario:

User 1:

articles_tags_set_id | tag_id
---------------------+--------
                   1 |      1
                   1 |      2

User 2:

See what we did here? We created two articles but we reused the set item and every time a new article comes that has this combination of tags we can reuse it.

Since we took the article_id from the join table and introduced the set_id instead, we can create a single copy and reuse it in as many articles as we want. If we have two tags and we have a 1,000,000 articles using these two tags, we would still have two single rows in our join table. Impressive isn't it‽

This is called a set because the combinations are unique, a set item is a distinct combination of tags.

Queries

How do we find what tags belong to an article?

select 
  articles.title, 
  tags.name 
from articles
  inner join articles_tags_sets_tags on articles_tags_sets_tags.articles_tags_set_id = articles.articles_tags_set_id
  inner join tags on tags.id = articles_tags_sets_tags.tag_id;

           title           |    name
---------------------------+-------------
 Buzzword about buzzwords! | programming
 Buzzword about buzzwords! | startups
 Buzzword about buzzwords! | ruby
                          ...

Looks familiar?

It is almost the same query but instead of having the article_id in the join table, the article table has the articles_tags_set_id. The behavior of this query is exactly the same as the first one, so finding your relationships has no cost whatsoever and since this table will grow to a fraction of a regular join table when you grow to hundreds of thousands of rows, this join will be faster because the database has a lot less rows to deal with.

How do we find if an existing set item already exists?

select 
  articles_tags_sets.id set_item_id
from articles_tags_sets_tags
    join articles_tags_sets on articles_tags_sets.id = articles_tags_sets_tags.articles_tags_set_id
where articles_tags_sets_tags.tag_id IN (3,1)
  and (
    select
      count(*) 
    from articles_tags_sets_tags c
    where c.articles_tags_set_id = articles_tags_sets_tags.articles_tags_set_id
  ) = 2
group by articles_tags_sets.id having count(*) = 2

 set_item_id
------------
          20

Not the simplest of queries but not exactly complicated either. We simply find first the set items that match the count of our row, two in this case, and then find the one that matches exactly our combination.

Caveats

This technique is not free. There are several disadvantages to it.

First and foremost, it has an extra cost when saving and updating. Every time we have to save a row in the parent table and related rows, we have to check first if this particular combination of children exists, if they do not, we have to create them, as opposed as just dumping rows in the join table.

When updating, if you add a new row to the combination we are now dealing with a new set item that has to be checked for, if it does not exist we have to create it entirely as opposed to inserting a single row to the join table.

This cost gets amortized with time and usage, when the system is fairly new and no set items yet exist, set items will get created all the time, after a while reuse will be more common since many different combinations will now exist.

Which bring us to another problem. Your data has to have a natural tendency for redundancy, like in our articles and tags. Otherwise you will be creating new set items all the time as opposed to be reusing them, you have to analyze your data and check if you have a lot of repetitions and redundancy and determine if this technique is for you. You can probably apply it to some tables as opposed to your entire database.

Yet another problem is that this only applies to when your relationship is the only data you need. If you need more information for each relationship you cannot use this technique because relationships are shared among many parent rows.

Real world usage

As an example of real world usage of this, let me share with you some data of one of the databases at my job.

We have songs, each song has clearance to some countries (3-5 average) and each song has different ways to be sold (streaming, download, radio, etc). Each song/country combination has many ways to be sold in that given country, so we have to have a many-to-many relationship here.

Numbers

Table# of rows
Song2.5 million
LocalizedSong12 million
DistributionType34

We have in average 10-15 different distribution type relationships for each song/country.

Using a regular join table we would have 125 million rows in average, we had that problem and it made it very difficult to us to scale our queries. (This is not the only many-to-many relationship we have, we have a complex data model).

Once we switched to this technique for this relationship because in this particular situation we have low cardinality (many albums from the same label have the same clearances and distribution types, or the songs in a given album have the same distribution types, etc), we have the outstanding number of 22k rows in our relationship table. See that number? 22,000 from 125,000,000.

When you operate at this scale, this is very, very helpful.

The End

For you Rails people, I took this concept and wrapped it nicely in a gem. It's called has-many-with-set and you can find the source code in github.

Do you like it? Have any doubts? Let me know, we can talk about it. This was designed at my job, INgrooves as part of our scaling efforts and now we share the concept with you. I hope you like it and at least learn a little bit by reading about it.

If you find a use for this and saves your company of buying more servers and you become your office's hero let me know, that will make me very happy.