Francisco Soto Ramblings about code, life and stuff.

Fun with Javascript and function tracing
May 18, 2013

Big Bad Bug

The problem

It was a normal day at my job, my team mates were putting the finishing touches and bug fixes into our latest features, due to be released during Finovate Spring. I wasn’t part of that project but found myself free of my own tasks and decided to jump in to help. And so a bug report came to me.

The bug was simple, when adding some expenses to a report they weren’t being displayed until you refreshed (this was our webapp) which is a bad thing, you want to see stuff as you do it. So I embarked into a journey of code exploring, setting breakpoints, running through the debugger. It just took a couple of minutes to realize that the new expenses were being displayed just to be immediately hidden again by some other part of the code.

What to do?

It didn’t take much guessing to know that the culprit was an incorrect call to jQuery.hide(). So I needed to find out if the call was being made on the element directly or on one of its parents. So I set up to monitor all the calls to jQuery.hide() after adding an expense and find out what was being hidden and where in the code the call originated from. So what to do?

Normally is just a couple of choices:

  • rgrep all the occurrences of the id or class of the element being hidden as to see what parts of the code were making a reference to it, this of course would only work if one of the parents of said element weren’t being hidden as opposed to just said element.
  • Place a conditional breakpoint on jQuery’s hide function. This is a moderately complex web application, we have dialogs, popover tips, etc., so the calls to jQuery.hide() are many.

None of these seemed very appealing or particularly fast. Specially the debugging one which certainly would take me to the right answer albeit after taking the time to filter out all the calls that weren’t the one I was looking for.

I reminisced of Common Lisp’s trace/untrace macros. That would be the ideal, having some code run whenever jQuery.hide() was being called and display the parameters/context so I could see what it was being called with, that would allow me to narrow my search significantly. Then I realized this is easily doable in Javascript (comments added for clarity):

$.fn.hide = (function() {             // Executed on the spot to create a closure.
    var oldfn = $.fn.hide;            // Save jQuery's hide().
    return function () {              // Create the closure and return the function to replace it.
        console.log(this);            // Simply print the context.
        oldfn.apply(this, arguments); // Apply jQuery's hide() so everything works normally.
    };
}());                                 // Execute it and rewrite jQuery's.

I ran that in Chrome’s console, did the steps to reproduce the bug and voilĂ . After that I knew which HTML elements were being hidden and I easily found the one that was hiding the element I wanted. After that I refreshed the page, and ran the same code as before, just with a conditional to check for the element id and printing the stack trace when that element was hidden and I found the part of the code that was doing it and after that it was an easy fix.

Tracing.js

After wrapping up the bug fix, I realized that the little trick to hook into another function could be easily abstracted into a tiny library and so I took it as a weekend project to do it. And so Tracing.js was born.

The premise is simple, it allows you to hook functions on your own to other functions. There are two types of hooks, before and after, as their name imply, before functions will be called before the actual function call, and after will be called after the fact. It is been pointed out to me that this pattern can also be classified as Aspect Oriented Programming but that wasn’t what I had in mind when I wrote this.

As a plus and because I always found this very useful in Common Lisp, I added a trace() functionality that works exactly the same as the source of inspiration.

// Assuming we have previously defined these functions:
Tracing.trace('square', 'multiply', 'add');

square(2);
>  square called with arguments: (2)
>    multiply called with arguments: (2, 2)
>      add called with arguments: (0, 2)
>        add called with arguments: (1, 1)
>          add called with arguments: (2, 0)
>          add returned: 2
>        add returned: 2
>      add returned: 2
>      multiply called with arguments: (2, 1, 2)
>        add called with arguments: (2, 2)
>          add called with arguments: (3, 1)
>            add called with arguments: (4, 0)
>            add returned: 4
>          add returned: 4
>        add returned: 4
>        multiply called with arguments: (2, 0, 4)
>        multiply returned: 4
>      multiply returned: 4
>    multiply returned: 4
>  square returned: 4

You can find this code on Github or read more about it and other of my projects here.


Link

I joined Expensify
Mar 02, 2013

Music Business

INgrooves

I worked in the digital media distribution industry for almost four years at INgrooves, during my time there I was part of very interesting and ambitious projects such as helping get the platform ready for mobile products distribution (wallpapers, ring tones, ring back tones, etc.), upgrade our platform for ebook distribution (INscribe Digital), and the two major projects that INgrooves had at the time, Hercules and Zeus.

These two projects were part of a major initiative to replace several systems that were part of the digital distribution infrastructure of UMG’s (the largest music label in the world) with improved versions of our own, handled by us. The systems that we replaced from UMG are the ones that actually sent their content out to retailers, something that INgrooves excels at.

Zeus is a content metadata database, validation and ordering system, that is, it has the database with all the information and clearances for content, upon ordering, if the content is cleared to go, it generates the metadata files that go out to retailers and places an order to Hercules. Hercules takes this order and creates whatever binaries are needed to fulfill it, binaries include cover art and the actual encoded music files (mp3, aac, wma, etc.), after the binaries generation step is over, it takes the binaries and the generated metadata and creates a final package that goes off to the retailers.

Easy said than done though, these systems present several challenges along the way, for once scaling of our databases and code to support all functionality needed, business logic to process the very complex logic of marking a product cleared for distribution or not, the problem of massive storage for all these digital masters (no, we cannot use S3, labels are very protective of their content), having the infrastructure and capabilities of creating all the different audio encodings for all the content that goes out the door every day, etc. Fun stuff.

Alas, everything that has a beginning has an end.

After a few months contemplating the idea I left in December 2012. Although we faced interesting challenges every day and more were in the way, I was definitely craving to work in different technologies and a different problem space, I love music and applying my technical skills in that industry was a blessing but I decidedly needed to try my hand at something new.

Expensify

Expensify

And so, I joined the ranks of Expensify.

Expensify’s goal simply stated is “expense reports that don’t suck!”. I only had to fill (thankfully!) an expense report two or three times in my life and boy, did I hate it. Having to keep an eye on not losing receipts didn’t help either. Well, Expensify is here to change that and I joined them to help achieve it!

No, really, why exactly did I join Expensify?

When I set to look at new opportunities I sat down and thought really hard about what I wanted to do next. Leaving INgrooves meant losing my H-1B and thus we would have to leave San Francisco. This thing in particular wasn’t entirely a deal breaker but I surely was sad that I didn’t get to enjoy San Francisco very much in the one year we spent here. Moving to a different country is a big deal, and doing it for only a year is kind of wasteful. On the other hand, if I went back to Mexico I could take a stab at starting my own thing, try to create a product to improve people lives and they would love and care.

In the end we (me and my wife) decided to try to stay in San Francisco. So the search began…

I was very stringent in the type of company I wanted to work for, it would have to have certain characteristics, like:

  • It would have to be a technology company. I want to be part of a company that looks at their technology and people as their main asset, not as a “tool”. It’s very different to work at say, Google, than work at Walmart’s IT department. I’ve received emails from recruiters from Amazon.
  • It would have to be a small company. I do not like bureaucracy, I don’t want to have to argue with a committee to do the right thing, nor to fill form TRX1-12 and get it signed before I could try to put a fire out in production.
  • Their people would need to be very smart and still have the will to live. Have you ever met that guy that is very smart but working his entire life at a bank sucked their soul out? I have. I need very smart people to work with, not because I am very smart, but because I hope to be some day, hopefully before I die.
  • I would have to believe in it, in what it does and what it stands for, this one is particularly hard, in the bay area there are thousands of companies but sadly, most of them are doing crap and pretend they are great so they have a chance to win the startup-lottery.

So I started browsing Github’s job board (I actually considered applying to Github as well) and Expensify caught my eye for two reasons, for one, finance is one of my hobbies as well, I like reading about business, personal finances, investing, etc, etc. I think the lack of finance education hurts people more than not having a degree or any other kind of education., and the fact that to apply they do not require you to send a resume (I HATE resumes), they simply want you to tell them about you, what have you done, when did you start to code? Why?

That is amazing. The one reason I hate resumes so much is because I feel I cannot convey the passion I felt my entire life about what I do in two pages and I figured that if they asked you this is because they felt the same way as I do. So, confirmation bias kicked in and I applied. I have a harder time trying to figure out what to put in two pages of a resume than to write passionately about what I do, I sent 8 pages.

So, the process began and after a coding challenge, and a few calls I was scheduled for a on-site interview which went pretty well but gave me a good insight in what they are planning to do and their philosophy.

I spent a long time talking with Expensify’s CEO, David Barrett about how they got to where they are, the kind of technology they use and have and I got very excited, I realized there were genuinely smart people working there and were doing amazing things and I wanted to be a part of it since Expensify did pass my “criteria” for a good place to work.

Did you know that in order to avoid having to maintain different codebases for the different mobile apps they created a mobile framework and API that abstracts each platform native API and controls so they can use a shared Javascript code base ? PhoneGap like. Yes, there are companies that do this for a living and here they did simply because it was more convenient. Yep. There is some internal effort to open source this and other in-house technology, that excites me and makes me very happy as well.

To infinity, and beyond!

Interesting problems, smart people and the right attitude lead to an amazing time, I’ve been in Expensify for three weeks (I started February 8th, 2013) and I already open sourced some code that I did to analyze some data for a problem that we had with some data structure.

Amazing times.

PS: I am baffled that apparently I am the only emacs user among IDE/vim people.


Link

Using sets for many-to-many relationships
Nov 11, 2012

What’s a many-to-many?

It is very common when dealing with data modeling to find many-to-many relationships, that is, given two entities or tables (in SQL parlor) having many related rows in both sides, that is, each table can have many related rows in the other table and viceversa.

A common example of this is articles and tags, one article can have many tags and one single tag can belong to many articles. Another example is books and readers, one reader can read many books, and one book can be read by many readers. And yet another more contrived example is the concept of “followers” in many social networks, when one user can follow many users, and one user can be followed by many users, even though there is a many to many relationship, it relates the table to itself.

Classic schema

So, how do we model an schema for such kind of relationship? I will first describe the regular textbook way of doing it, along with its advantages and disadvantages.

To help me in this exercise I will use articles/tags schema as an example, given that problem the model will probably end up looking like this:

We have our tables, articles and tags, and we use a third table to help us with the relationship, this is usually called join table, Wikipedia calls them junction tables.

The join table has foreign keys to both of our main tables, that is, a row creates a single relationship between our tables, many rows create many-to-many relationships.

Is easy to go from one side to another, that is, to see what tags any given article has, or to find what articles have a certain tag. For example, the next query would give all the articles and their tags:

select articles.title, tags.name from articles
    inner join articles_tags on articles_tags.article_id = articles.id
    inner join tags on tags.id = articles_tags.tag_id;

           title           |    name
---------------------------+-------------
 Buzzword about buzzwords! | open source
 Buzzword about buzzwords! | ruby
 Buzzword about buzzwords! | programming
                          ...

Advantages

This model is the only way to go when you need more data about the relationship, for example, when it was created, by whom, or maybe the relationship status, like “Approved”, “On Hold”, “Canceled”, etc., it depends on your data and your business needs.

Yet another advantage is that changing the relationships is easy, if you need to add or remove some it is very straightforward and you just modify the single rows you need, everything else is untouched.

But…

This model is very redundant if you only care about the relationships themselves and you have natural low cardinality in your data, that is, the same combinations repeat over and over, like in my given example, if many articles have the same tags. That is very common.

It would look something like:

  article_id | tag_id
-------------+-------------
      1      |    1
      1      |    2
      2      |    1
      2      |    2
      3      |    1
      3      |    2
            ...
      N      |    1
      N      |    2

Notice how the same tags (1 and 2) are repeated over and over and over, if you have 1,000,000 articles with the same two tags, you would have 2,000,000 rows for them in the join table, now, although this may seem very unlikely to happen in a blog application, the same principles hold for many other data sets that may be in the order of millions of rows.

This is not only wasted space, space is cheap, but when you are talking in the context of a database, space becomes important because the smaller your tables, the smaller your indexes, the less I/O you have to do to access your tables, your database manager will cache more of it, etc. All in all, the less rows you have, the faster you can do operations with them.

If you work with an application that is in the hundreds of thousands or millions of rows I strongly recommend you to check your join tables.

Here’s a query for postgresql to find out how many repetitions and redundancy you have in your join tables, although it’s tailored for the example it should be easy to change it for your own database and schema:

--  Sample database has 1000 random articles with 5 tags in random combinations.
select
    tags, COUNT(article_id) repetitions
from (select
          article_id,
          array_agg(tag_id order by tag_id) tags
      from articles_tags
      group by article_id) as a
group by tags
order by COUNT(article_id) desc;

     tags    | repetitions
-------------+-------------
 {3}         |          51
 {5}         |          44
 {4}         |          40
 {2}         |          38
 {3,4,5}     |          34
 {3,4}       |          33
 {1}         |          32
 {1,3,5}     |          31
 {1,3,4}     |          29
 {3,5}       |          28
            ...

Scared?

What is the problem exactly?

Fun fact, there is really no other way to do many-to-many relationships where you have the full advantages of a relational database, like referential integrity, index usage, etc.

So how do we go about reducing the redundancy?

The redundancy results from the fact that we have the reference to our main table in the join table, so when we need to create a combination it doesn’t matter if we have seen that particular combination before, we have to create it anyway because there is no way we can reuse it because can’t stuff it in our previous rows.

How about we pull out the reference to the first table from the join table and we leave the reference to the second table?

This may sound a bit confusing so let me show you what I mean:

Set schema

What is a set?

I am going to quote Wikipedia entry for sets:

In computer science, a set is an abstract data structure that can store certain values, without any particular order, and no repeated values. It is a computer implementation of the mathematical concept of a finite set.

A data structure for values with no repetitions, that sounds like something we could use for our little “redundancy” problem…

What is going on here?

Basically, instead of introducing just a join table we also introduce a another table, a Sets table.

Instead of relating articles and tags, we relate sets and tags.

Image this work flow:

User one:

  • User creates an article, selects two given tags for this article.
  • We create the tags if they do not exist.
  • We ask ourselves this question, do we have this combination of tags already in articles_tags_sets_tags table?
  • No we do not, so we create an articles_tags_sets row.
  • Using the ID of this new set item we create the articles_tags_sets_tags rows like:
articles_tags_set_id | tag_id
---------------------+--------
                   1 |      1
                   1 |      2
  • Now we have a suitable set item for our article, so we create the article and relate it to this using the column in articles articles_tags_set_id.

User two:

  • User creates an article, selects two given tags for this article.
  • Tags already exist so we do not have to create them.
  • We ask ourselves this question, do we have this combination of tags already in articles_tags_sets_tags table?
  • Yes we do, so we only grab the id of articles_tags_sets that represent this combination.
  • Using this suitable set item id we create the articles row.

See what we did here? We created two articles but we reused the set item and every time a new article comes that has a combination of tags that we have seen before, we can reuse it, not copy it over and over.

Since we took the article_id from the join table and introduced the set_id instead we can create a single copy and reuse it in as many articles as we want, if we have two tags and we have a 1,000,000 articles using these two tags, we would still have two single rows in our join table. Pretty impressive isn’t it?!

This is called a Set because the combinations are unique, a set item is a distinct combination of tags.

Queries

How do we find what tags belong to an article?

select articles.title, tags.name from articles
    inner join articles_tags_sets_tags on articles_tags_sets_tags.articles_tags_set_id = articles.articles_tags_set_id
    inner join tags on tags.id = articles_tags_sets_tags.tag_id;

           title           |    name
---------------------------+-------------
 Buzzword about buzzwords! | programming
 Buzzword about buzzwords! | startups
 Buzzword about buzzwords! | ruby
                          ...

Looks familiar? It is, is almost the same query, but instead of having the article_id in the join table, the article table has the articles_tags_set_id. The behavior of this query is exactly the same as the first one, so finding your relationships has no cost whatsoever and since this table will grow to a fraction of a regular join table when you go to the order of hundreds of thousands of rows, this join will be faster because the database has a lot less rows to deal with.

How do we find if an existing set item already exists?

select articles_tags_sets.id set_item_id
from articles_tags_sets_tags
    join articles_tags_sets on articles_tags_sets.id = articles_tags_sets_tags.articles_tags_set_id
where
    articles_tags_sets_tags.tag_id IN (3,1)
and (select count(*) from articles_tags_sets_tags c
     where c.articles_tags_set_id = articles_tags_sets_tags.articles_tags_set_id) = 2
group by articles_tags_sets.id
having count(*) = 2

 set_item_id
------------
          20

Not the most simple query, but not exactly complicated either, we simply find first the set items that match the count of our rows (two in this case) and then find the one that matches exactly our combination.

Caveats

This technique though is not exactly free. There are several disadvantages to it.

First and foremost, it has an extra cost when saving and updating. Every time we have to save a row in the parent table and related rows, we have to check first if this particular combination of children exists, if they do not, we have to create them, as opposed as just dumping rows in the join table.

When updating, if you add a new row to the combination we are now dealing with a new set item that has to be checked for, if it does not exist we have to create it entirely as opposed to inserting a single row to the join table.

This cost gets amortized with time and usage, when the system is fairly new and no set items yet exist, set items will get created all the time, after a while reuse will be more common since many different combinations will now exist.

Which bring us to another problem, your data has to have a natural tendency for redundancy, like in our articles and tags. Otherwise you will be creating new set items all the time as opposed to be reusing them, you have to analyze your data (you can use the query I gave above) and see if you have a lot of repetitions and redundancy and determine if this technique is for you. You can probably apply it to some tables as opposed to your entire database.

Yet another problem is that this only applies to when your relationship is the only data you need, if you need more information for each relationship you cannot use this technique because relationships are “shared” among many parent rows. You can try to move that data into your second table but that will reduce redundancy and compression.

Real world usage

As an example of real world usage of this, let me share with you some data of one of the databases at my job.

We have songs, each song has clearance to some countries (3-5 average) and each song has different ways to be sold (streaming, download, radio, etc).

Each song/country combination has many ways to be sold in that given country, so we have to have a many-to-many relationship here.

Numbers

Table# of rows
Song2.5 million
LocalizedSong (song/country)12 million
DistributionType (ways to sell)34

We have in average 10-15 different distribution type relationships for each song/country.

Using a regular join table we would have 125 million rows in average, we had that problem and it made it very difficult to us to scale our queries. (This is not the only many-to-many relationship we have, we have a complex data model).

Once we switched to this technique for this relationship because in this particular situation we have low cardinality (many albums from the same label have the same clearances and distribution types, or the songs in a given album have the same distribution types, etc), we have the outstanding number of 22k rows in our relationship table. See that number? 22k.

When you operate at that scale, this is very, very helpful.

Conclusions

For you Ruby people, I took this concept explained here and began working on wrapping it nicely in a gem that extends ActiveRecord to use it. The gem is called has-many-with-set and you can find the source code (and instructions for installation/usage) in github.

Do you like it? Have any doubts? Let me know, we can talk about it. This was designed at my job, INgrooves Fontana as part of our scaling efforts and now we share the concept with you, I hope you like it and at least learn a little bit reading it. (It’s worth noting that we do not use Rails nor Ruby at my work).

If you find a use for this and saves your company of buying more servers and you become your office’s hero let me know, that will make me very happy.


Link

Older entries... / Go up...