I worked in the digital media distribution industry for almost four years at INgrooves, during my time there I was part of very interesting and ambitious projects such as helping get the platform ready for mobile products distribution (wallpapers, ring tones, ring back tones, etc.), upgrade our platform for ebook distribution (INscribe Digital), and the two major projects that INgrooves had at the time, Hercules and Zeus.
These two projects were part of a major initiative to replace several systems that were part of the digital distribution infrastructure of UMG’s (the largest music label in the world) with improved versions of our own, handled by us. The systems that we replaced from UMG are the ones that actually sent their content out to retailers, something that INgrooves excels at.
Zeus is a content metadata database, validation and ordering system, that is, it has the database with all the information and clearances for content, upon ordering, if the content is cleared to go, it generates the metadata files that go out to retailers and places an order to Hercules. Hercules takes this order and creates whatever binaries are needed to fulfill it, binaries include cover art and the actual encoded music files (mp3, aac, wma, etc.), after the binaries generation step is over, it takes the binaries and the generated metadata and creates a final package that goes off to the retailers.
Easy said than done though, these systems present several challenges along the way, for once scaling of our databases and code to support all functionality needed, business logic to process the very complex logic of marking a product cleared for distribution or not, the problem of massive storage for all these digital masters (no, we cannot use S3, labels are very protective of their content), having the infrastructure and capabilities of creating all the different audio encodings for all the content that goes out the door every day, etc. Fun stuff.
Alas, everything that has a beginning has an end.
After a few months contemplating the idea I left in December 2012. Although we faced interesting challenges every day and more were in the way, I was definitely craving to work in different technologies and a different problem space, I love music and applying my technical skills in that industry was a blessing but I decidedly needed to try my hand at something new.
Expensify’s goal simply stated is “expense reports that don’t suck!”. I only had to fill (thankfully!) an expense report two or three times in my life and boy, did I hate it. Having to keep an eye on not losing receipts didn’t help either. Well, Expensify is here to change that and I joined them to help achieve it!
No, really, why exactly did I join Expensify?
When I set to look at new opportunities I sat down and thought really hard about what I wanted to do next. Leaving INgrooves meant losing my H-1B and thus we would have to leave San Francisco. This thing in particular wasn’t entirely a deal breaker but I surely was sad that I didn’t get to enjoy San Francisco very much in the one year we spent here. Moving to a different country is a big deal, and doing it for only a year is kind of wasteful. On the other hand, if I went back to Mexico I could take a stab at starting my own thing, try to create a product to improve people lives and they would love and care.
In the end we (me and my wife) decided to try to stay in San Francisco. So the search began…
I was very stringent in the type of company I wanted to work for, it would have to have certain characteristics, like:
It would have to be a technology company. I want to be part of a company that looks at their technology and people as their main asset, not as a “tool”. It’s very different to work at say, Google, than work at Walmart’s IT department. I’ve received emails from recruiters from Amazon.
It would have to be a small company. I do not like bureaucracy, I don’t want to have to argue with a committee to do the right thing, nor to fill form TRX1-12 and get it signed before I could try to put a fire out in production.
Their people would need to be very smart and still have the will to live. Have you ever met that guy that is very smart but working his entire life at a bank sucked their soul out? I have. I need very smart people to work with, not because I am very smart, but because I hope to be some day, hopefully before I die.
I would have to believe in it, in what it does and what it stands for, this one is particularly hard, in the bay area there are thousands of companies but sadly, most of them are doing crap and pretend they are great so they have a chance to win the startup-lottery.
So I started browsing Github’s job board (I actually considered applying to Github as well) and Expensify caught my eye for two reasons, for one, finance is one of my hobbies as well, I like reading about business, personal finances, investing, etc, etc. I think the lack of finance education hurts people more than not having a degree or any other kind of education., and the fact that to apply they do not require you to send a resume (I HATE resumes), they simply want you to tell them about you, what have you done, when did you start to code? Why?
That is amazing. The one reason I hate resumes so much is because I feel I cannot convey the passion I felt my entire life about what I do in two pages and I figured that if they asked you this is because they felt the same way as I do. So, confirmation bias kicked in and I applied. I have a harder time trying to figure out what to put in two pages of a resume than to write passionately about what I do, I sent 8 pages.
So, the process began and after a coding challenge, and a few calls I was scheduled for a on-site interview which went pretty well but gave me a good insight in what they are planning to do and their philosophy.
I spent a long time talking with Expensify’s CEO, David Barrett about how they got to where they are, the kind of technology they use and have and I got very excited, I realized there were genuinely smart people working there and were doing amazing things and I wanted to be a part of it since Expensify did pass my “criteria” for a good place to work.
Did you know that in order to avoid having to maintain different codebases for the different mobile apps they created a mobile framework and API that abstracts each platform native API and controls so they can use a shared Javascript code base ? PhoneGap like. Yes, there are companies that do this for a living and here they did simply because it was more convenient. Yep. There is some internal effort to open source this and other in-house technology, that excites me and makes me very happy as well.
To infinity, and beyond!
Interesting problems, smart people and the right attitude lead to an amazing time, I’ve been in Expensify for three weeks (I started February 8th, 2013) and I already open sourced some code that I did to analyze some data for a problem that we had with some data structure.
Amazing times.
PS: I am baffled that apparently I am the only emacs user among IDE/vim people.
It is very common when dealing with data modeling to find many-to-many relationships, that is, given two entities or tables (in SQL parlor) having many related rows in both sides, that is, each table can have many related rows in the other table and viceversa.
A common example of this is articles and tags, one article can have many tags and one single tag can belong to many articles. Another example is books and readers, one reader can read many books, and one book can be read by many readers. And yet another more contrived example is the concept of “followers” in many social networks, when one user can follow many users, and one user can be followed by many users, even though there is a many to many relationship, it relates the table to itself.
Classic schema
So, how do we model an schema for such kind of relationship? I will first describe the regular textbook way of doing it, along with its advantages and disadvantages.
To help me in this exercise I will use articles/tags schema as an example, given that problem the model will probably end up looking like this:
We have our tables, articles and tags, and we use a third table to help us with the relationship, this is usually called join table, Wikipedia calls them junction tables.
The join table has foreign keys to both of our main tables, that is, a row creates a single relationship between our tables, many rows create many-to-many relationships.
Is easy to go from one side to another, that is, to see what tags any given article has, or to find what articles have a certain tag. For example, the next query would give all the articles and their tags:
select articles.title, tags.name from articles
inner join articles_tags on articles_tags.article_id = articles.id
inner join tags on tags.id = articles_tags.tag_id;
title | name
---------------------------+-------------
Buzzword about buzzwords! | open source
Buzzword about buzzwords! | ruby
Buzzword about buzzwords! | programming
...
Advantages
This model is the only way to go when you need more data about the relationship, for example, when it was created, by whom, or maybe the relationship status, like “Approved”, “On Hold”, “Canceled”, etc., it depends on your data and your business needs.
Yet another advantage is that changing the relationships is easy, if you need to add or remove some it is very straightforward and you just modify the single rows you need, everything else is untouched.
But…
This model is very redundant if you only care about the relationships themselves and you have natural low cardinality in your data, that is, the same combinations repeat over and over, like in my given example, if many articles have the same tags. That is very common.
Notice how the same tags (1 and 2) are repeated over and over and over, if you have 1,000,000 articles with the same two tags, you would have 2,000,000 rows for them in the join table, now, although this may seem very unlikely to happen in a blog application, the same principles hold for many other data sets that may be in the order of millions of rows.
This is not only wasted space, space is cheap, but when you are talking in the context of a database, space becomes important because the smaller your tables, the smaller your indexes, the less I/O you have to do to access your tables, your database manager will cache more of it, etc. All in all, the less rows you have, the faster you can do operations with them.
If you work with an application that is in the hundreds of thousands or millions of rows I strongly recommend you to check your join tables.
Here’s a query for postgresql to find out how many repetitions and redundancy you have in your join tables, although it’s tailored for the example it should be easy to change it for your own database and schema:
-- Sample database has 1000 random articles with 5 tags in random combinations.
select
tags, COUNT(article_id) repetitions
from (select
article_id,
array_agg(tag_id order by tag_id) tags
from articles_tags
group by article_id) as a
group by tags
order by COUNT(article_id) desc;
tags | repetitions
-------------+-------------
{3} | 51
{5} | 44
{4} | 40
{2} | 38
{3,4,5} | 34
{3,4} | 33
{1} | 32
{1,3,5} | 31
{1,3,4} | 29
{3,5} | 28
...
Scared?
What is the problem exactly?
Fun fact, there is really no other way to do many-to-many relationships where you have the full advantages of a relational database, like referential integrity, index usage, etc.
So how do we go about reducing the redundancy?
The redundancy results from the fact that we have the reference to our main table in the join table, so when we need to create a combination it doesn’t matter if we have seen that particular combination before, we have to create it anyway because there is no way we can reuse it because can’t stuff it in our previous rows.
How about we pull out the reference to the first table from the join table and we leave the reference to the second table?
This may sound a bit confusing so let me show you what I mean:
In computer science, a set is an abstract data structure that can store certain values, without any particular order, and no repeated values. It is a computer implementation of the mathematical concept of a finite set.
A data structure for values with no repetitions, that sounds like something we could use for our little “redundancy” problem…
What is going on here?
Basically, instead of introducing just a join table we also introduce a another table, a Sets table.
Instead of relating articles and tags, we relate sets and tags.
Image this work flow:
User one:
User creates an article, selects two given tags for this article.
We create the tags if they do not exist.
We ask ourselves this question, do we have this combination of tags already in articles_tags_sets_tags table?
No we do not, so we create an articles_tags_sets row.
Using the ID of this new set item we create the articles_tags_sets_tags rows like:
Now we have a suitable set item for our article, so we create the article and relate it to this using the column in articles articles_tags_set_id.
User two:
User creates an article, selects two given tags for this article.
Tags already exist so we do not have to create them.
We ask ourselves this question, do we have this combination of tags already in articles_tags_sets_tags table?
Yes we do, so we only grab the id of articles_tags_sets that represent this combination.
Using this suitable set item id we create the articles row.
See what we did here? We created two articles but we reused the set item and every time a new article comes that has a combination of tags that we have seen before, we can reuse it, not copy it over and over.
Since we took the article_id from the join table and introduced the set_id instead we can create a single copy and reuse it in as many articles as we want, if we have two tags and we have a 1,000,000 articles using these two tags, we would still have two single rows in our join table. Pretty impressive isn’t it?!
This is called a Set because the combinations are unique, a set item is a distinct combination of tags.
Queries
How do we find what tags belong to an article?
select articles.title, tags.name from articles
inner join articles_tags_sets_tags on articles_tags_sets_tags.articles_tags_set_id = articles.articles_tags_set_id
inner join tags on tags.id = articles_tags_sets_tags.tag_id;
title | name
---------------------------+-------------
Buzzword about buzzwords! | programming
Buzzword about buzzwords! | startups
Buzzword about buzzwords! | ruby
...
Looks familiar? It is, is almost the same query, but instead of having the article_id in the join table, the article table has the articles_tags_set_id. The behavior of this query is exactly the same as the first one, so finding your relationships has no cost whatsoever and since this table will grow to a fraction of a regular join table when you go to the order of hundreds of thousands of rows, this join will be faster because the database has a lot less rows to deal with.
How do we find if an existing set item already exists?
select articles_tags_sets.id set_item_id
from articles_tags_sets_tags
join articles_tags_sets on articles_tags_sets.id = articles_tags_sets_tags.articles_tags_set_id
where
articles_tags_sets_tags.tag_id IN (3,1)
and (select count(*) from articles_tags_sets_tags c
where c.articles_tags_set_id = articles_tags_sets_tags.articles_tags_set_id) = 2
group by articles_tags_sets.id
having count(*) = 2
set_item_id
------------
20
Not the most simple query, but not exactly complicated either, we simply find first the set items that match the count of our rows (two in this case) and then find the one that matches exactly our combination.
Caveats
This technique though is not exactly free. There are several disadvantages to it.
First and foremost, it has an extra cost when saving and updating. Every time we have to save a row in the parent table and related rows, we have to check first if this particular combination of children exists, if they do not, we have to create them, as opposed as just dumping rows in the join table.
When updating, if you add a new row to the combination we are now dealing with a new set item that has to be checked for, if it does not exist we have to create it entirely as opposed to inserting a single row to the join table.
This cost gets amortized with time and usage, when the system is fairly new and no set items yet exist, set items will get created all the time, after a while reuse will be more common since many different combinations will now exist.
Which bring us to another problem, your data has to have a natural tendency for redundancy, like in our articles and tags. Otherwise you will be creating new set items all the time as opposed to be reusing them, you have to analyze your data (you can use the query I gave above) and see if you have a lot of repetitions and redundancy and determine if this technique is for you. You can probably apply it to some tables as opposed to your entire database.
Yet another problem is that this only applies to when your relationship is the only data you need, if you need more information for each relationship you cannot use this technique because relationships are “shared” among many parent rows. You can try to move that data into your second table but that will reduce redundancy and compression.
Real world usage
As an example of real world usage of this, let me share with you some data of one of the databases at my job.
We have songs, each song has clearance to some countries (3-5 average) and each song has different ways to be sold (streaming, download, radio, etc).
Each song/country combination has many ways to be sold in that given country, so we have to have a many-to-many relationship here.
Numbers
Table
# of rows
Song
2.5 million
LocalizedSong (song/country)
12 million
DistributionType (ways to sell)
34
We have in average 10-15 different distribution type relationships for each song/country.
Using a regular join table we would have 125 million rows in average, we had that problem and it made it very difficult to us to scale our queries. (This is not the only many-to-many relationship we have, we have a complex data model).
Once we switched to this technique for this relationship because in this particular situation we have low cardinality (many albums from the same label have the same clearances and distribution types, or the songs in a given album have the same distribution types, etc), we have the outstanding number of 22k rows in our relationship table. See that number? 22k.
When you operate at that scale, this is very, very helpful.
Conclusions
For you Ruby people, I took this concept explained here and began working on wrapping it nicely in a gem that extends ActiveRecord to use it. The gem is called has-many-with-set and you can find the source code (and instructions for installation/usage) in github.
Do you like it? Have any doubts? Let me know, we can talk about it. This was designed at my job, INgrooves Fontana as part of our scaling efforts and now we share the concept with you, I hope you like it and at least learn a little bit reading it. (It’s worth noting that we do not use Rails nor Ruby at my work).
If you find a use for this and saves your company of buying more servers and you become your office’s hero let me know, that will make me very happy.
The purpose of combinatorial optimization is, in plain english, to find an optimal value in a finite set of said values. Usually it means finding the minimum or the maximum of a given function or problem space. Now, the fun part is that there are many problems where brute-force searching is not feasible due to the vast size of the set possible values. Some classical computer science problems like this are the Traveling Salesman Problem and Minimum Spanning Tree.
This has a lot of practical applications, some of them are (quoting Wikipedia):
Developing the best airline network of spokes and destinations
Deciding which taxis in a fleet to route to pick up fares
Determining the optimal way to deliver packages
Determining the right attributes of concept elements prior to concept testing
In this post I write about one of the many possible algorithms to tackle this kind of problem, it is called Simulated Annealing.
Annealing
Annealing in metallurgy is a process in which certain metals are altered to improve certain properties like hardness and ductility.
The process consist of heating the material to very high temperatures (depending on the material) and letting it cool down gradually allowing its atoms to progress into its equilibrium state. This heating process alters the internal structure of the material resulting in a more uniform composition of the atoms, hence improving hardness and ductility.
Simulated Annealing
This algorithm is a meta-heuristic, what that means is that it doesn’t guarantee a solution since it searches for a good approximation in a small amount of time by iteratively improving it.
These problems are represented as a solution space, a starting state (initial solution) and a function that calculates the “goodness” or “energy” of a solution.
The analogy with annealing is that the algorithm starts with a given temperature and a random solution and iteratively calculates a new random solution. A solution is selected depending on the result of a probability function that takes into account if the solution is better or worse than the current accepted solution and the temperature. As the temperature decreases, the probability of accepting worse solutions also decreases. Also, an “stabilization” period happens between changes in temperature to allow the system to stabilize.
Simulated Annealing Javascript:
function _probabilityFunction (temperature, delta) {
if (delta < 0) {
return true;
}
var C = Math.exp(-delta / temperature);
var R = Math.random();
if (R < C) {
return true;
}
return false;
}
function _doSimulationStep () {
if (currentSystemTemperature > freezingTemperature) {
for (var i = 0; i < currentStabilizer; i++) {
var newEnergy = generateNeighbor(),
energyDelta = newEnergy - currentSystemEnergy;
if (_probabilityFunction(currentSystemTemperature, energyDelta)) {
acceptNeighbor();
currentSystemEnergy = newEnergy;
}
}
currentSystemTemperature = currentSystemTemperature - coolingFactor;
currentStabilizer = currentStabilizer * stabilizingFactor;
return false;
}
currentSystemTemperature = freezingTemperature;
return true;
}
The code is pretty straightforward: While the temperature is above freezing, generate a neighbor (new solution based on our current one) and calculate the delta of the goodness of that solution against the goodness of our current solution, if the probability function returns true accept the solution.
I won’t go into much detail on the selected probability function but I will show a graph of the behavior just so we can be sure that it meets our criteria:
A better solution is always accepted.
A worse solution has less chance to be accepted than a “not so worse” solution.
Worse solutions (regardless of “worseness”) have less chance to be accepted as temperature decreases.
The larger the delta the worse the generated solution is compared to our current accepted solution.
Eight Queen Puzzle
The puzzle is about placing eight chess queens on an 8×8 chessboard so no two queens can attack each other, that is, they cannot share the same row, column or diagonal with each other.
There are 4,426,165,368 possible arrangements of eight queens on a 8×8 board but only 92 solutions. The problem is not trivial, although by today hardware standards even a brute force approach would probably do fine, specially if you reduce the problem space a bit like adding the restriction that queens cannot share columns or rows and you reduce the number of possibilities to 16,777,216 but for the sake of showing the algorithm let’s use the problem as it is.
To represent the puzzle I am simply going to use an array of eight elements with x and y as properties and to plug the problem into the simulated annealing implementation we need just a few functions, the first and probably most important we need a way to calculate the number of attacks between queens in any given configuration. Our goal is to optimize the result of this function:
// Simply count the queens that share row, column or diagonals.
function _calculateAttacks (board) {
var numAttacks = 0;
for (var iQueen = 0; iQueen < NUM_QUEENS - 1; iQueen++) {
for (var iAttackingQueen = iQueen + 1; iAttackingQueen < NUM_QUEENS; iAttackingQueen++) {
if (board[iQueen].x == board[iAttackingQueen].x) {
numAttacks++;
}
else if (board[iQueen].y == board[iAttackingQueen].y) {
numAttacks++;
}
else if (board[iQueen].x + board[iQueen].y ==
board[iAttackingQueen].x + board[iAttackingQueen].y) {
numAttacks++;
}
else if (board[iQueen].y - board[iQueen].x ==
board[iAttackingQueen].y - board[iAttackingQueen].x) {
numAttacks++;
}
}
}
return numAttacks;
}
For the operation of the algorithm we need three functions.
Generate a random initial solution.
Generate a “neighbor” solution, that is, based on our current configuration just change it slightly, so we don’t do huge jumps along the search space, we achieve this by taking a random queen and moving it a single step in a random direction.
Accept the neighbor if simulated annealing determines so.
// Generate 8 queens with random (x, y) and no repetitions. No queen can share the same space as another one.
function _generateRandomPositions () {
var done = false;
for (var iQueen = 0; iQueen < NUM_QUEENS; iQueen++) {
var repetitions = true;
currentQueensPositions[iQueen] = {};
while (repetitions) {
currentQueensPositions[iQueen].x = parseInt(Math.random() * 8);
currentQueensPositions[iQueen].y = parseInt(Math.random() * 8);
if (!_checkRepetitions(currentQueensPositions)) {
repetitions = false;
}
}
}
return _calculateAttacks(currentQueensPositions);
}
// Take our current solution and generate a neighbor solution by taking a random queen and moving it a single random step.
function _generateNeighbor () {
for (var iQueen = 0; iQueen < NUM_QUEENS; iQueen++) {
newQueensPositions[iQueen] = {
x: currentQueensPositions[iQueen].x,
y: currentQueensPositions[iQueen].y
};
}
var changingQueen = parseInt(Math.random() * NUM_QUEENS);
var repetitions = true;
while (repetitions) {
var oldX = newQueensPositions[changingQueen].x;
var oldY = newQueensPositions[changingQueen].y;
// The MOD is so if a queen moves out of the board it comes through the other side. The extra mod logic is to avoid a javascript bug
// that makes negative numbers mod into -1.
newQueensPositions[changingQueen].x = (((newQueensPositions[changingQueen].x + (parseInt(Math.random() * 3) - 1)) % 8) + 8) % 8;
newQueensPositions[changingQueen].y = (((newQueensPositions[changingQueen].y + (parseInt(Math.random() * 3) - 1)) % 8) + 8) % 8;
if (!_checkRepetitions(newQueensPositions)) {
repetitions = false;
}
else {
newQueensPositions[changingQueen].x = oldX;
newQueensPositions[changingQueen].y = oldY;
}
}
return _calculateAttacks(newQueensPositions);
}
// Accept the current generated solution.
function _acceptNeighbor () {
for (var iQueen = 0; iQueen < NUM_QUEENS; iQueen++) {
currentQueensPositions[iQueen] = { x: newQueensPositions[iQueen].x, y: newQueensPositions[iQueen].y } ;
}
}
Sample implementation
And without further ado, running code (if you are reading this on a rss reader you won’t be able to see this, please go to website, I apologize):
Play around with the parameters!
Pretty simple code, it still amazes me how simple the simulated annealing code is and that it still gets the right answers most of the time given good parameters. The parameters for this particular implementation I found simply by trial and error and experimentation, even the original paper suggest that but certainly there may be better ways to determine them for a given particular problem.
The full source code is available here feel free to play around with it and if you find a bug let me know, I wrote this code some years ago.