I personally consider “debugging” (figuring out how something works and how it
is broken) a very important skill for a software developer, and one that I don’t
think is taught explicitly. It is simply left for us to figure out.
For this reason I have wanted to write about it for a while, especially with
examples of real-life problems and debugging sessions, as I find they are far
more useful than simple made-up examples.
This is the first in what likely will be a long series of posts as software
breaks all the time…
Software breaks all the time…
I’ve been using vagrant and virtualbox to set up my development environments for
a few years now. It’s a pretty convenient way to create reproducible
environments. Since I moved to Linux, though, I’ve been using vagrant-lxc instead
of virtualbox, as containers are just as isolated as virtual machines and they
are significantly faster.
Recently though, this stopped working as I tried to set up a new dev env. It
simply errored out.
Yeeey… fun.
Investigation
Apparently when we try to create our container, vagrant is failing to run a
command while trying to set up private networks. We have the command that it’s
trying to run but we don’t have any error output. Let’s follow vagrant’s
suggestion and check the debug log.
We don’t get much additional detail from that. So my next step is simply to try
to run the command by hand and see if I can catch anything.
(EDIT: It was pointed out to me that the error I was looking for is indeed
present in the debug log and I missed it.)
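Since the error turned out to be hiding in the debug log, here is a minimal sketch of how one might filter a saved log instead of reading it top to bottom. The file name vagrant-debug.log, the function name find_log_errors, and the list of keywords are all assumptions made for this example, not anything vagrant prescribes.

```shell
#!/usr/bin/env bash
# Sketch: surface likely error lines from a saved vagrant debug log.
# Capture the log first, for example:
#   vagrant up --debug &> vagrant-debug.log
# (vagrant-debug.log is a file name chosen for this example.)

# Print lines that mention common failure words, prefixed with line numbers.
# The keyword list is a guess; adjust it to whatever your tools print.
find_log_errors() {
    local logfile="$1"
    grep -n -i -E 'error|fail|stderr' "$logfile"
}
```

Calling `find_log_errors vagrant-debug.log` prints only the matching lines, which is usually a much shorter list to scan by eye.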
A clue! pipework seems to think we have two containers with the same name.
Well. We don’t. What is going on? Let’s see if we can figure out what pipework
is trying to do. Thankfully, it is a bash script so we can simply open it in our
favorite text editor and search for “more than one”, which brings us to this
piece of code:
(This code is included in vagrant-lxc 1.4.3, under scripts/pipework, starting at line 146.)
It may look complicated if you haven’t done shell scripting before, but if you
look at it closely it is relatively easy to find out what it is doing. Let’s
dissect it.
In the line:
It’s trying to find something in the path $CGROUPMNT that is named after our
container “devenv” and if it finds more than one it returns the error we
got. What is $CGROUPMNT?
We find that here:
It’s iterating over the contents of /proc/mounts and searching for anything that
has cgroup as its file system type and devices in the path.
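A rough sketch of that kind of lookup follows. It is not the verbatim pipework source: the function name find_devices_cgroup_mount and the optional file argument are assumptions added so the loop can be tried against a sample file. It relies only on the standard /proc/mounts field order (device, mount point, file system type, options).

```shell
#!/usr/bin/env bash
# Sketch: find the mount point of the "devices" cgroup hierarchy.
# Each line of /proc/mounts is: device mountpoint fstype options dump pass

find_devices_cgroup_mount() {
    # Hypothetical parameter: lets the function read a sample file in tests.
    local mounts_file="${1:-/proc/mounts}"
    local found="" _dev mnt fstype options _rest
    while read -r _dev mnt fstype options _rest; do
        # Skip everything that is not a (v1) cgroup mount.
        [ "$fstype" = "cgroup" ] || continue
        # Keep the entry if "devices" shows up in the path or the options.
        case "$mnt,$options" in
            *devices*) found="$mnt" ;;
        esac
    done < "$mounts_file"
    # Print the result; return non-zero if nothing matched.
    [ -n "$found" ] && printf '%s\n' "$found"
}
```

With this in place, something like `CGROUPMNT=$(find_devices_cgroup_mount)` would hold the directory that the later find command searches.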
Let’s try that out:
Now that we have the path, let’s run the same find command that the script runs.
It returns two entries which seem related and valid. So, what’s going on?
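To make it concrete why two valid-looking entries are a problem, here is a sketch of the kind of check described above: count what find returns and accept exactly one match. The function name check_container_matches and the wording of the messages are illustrative assumptions, not pipework's own.

```shell
#!/usr/bin/env bash
# Sketch: accept a container name only if exactly one cgroup entry matches it.

check_container_matches() {
    local cgroup_mnt="$1" guest="$2" count
    # Count the entries under the cgroup mount that are named after the guest.
    # tr strips the padding some wc implementations add around the number.
    count=$(find "$cgroup_mnt" -name "$guest" | wc -l | tr -d ' ')
    case "$count" in
        0) echo "no container named $guest found" >&2; return 1 ;;
        1) return 0 ;;
        *) echo "more than one match for $guest" >&2; return 1 ;;
    esac
}
```

With a check shaped like this, any change that makes the container name appear twice under the mount point is enough to trigger the error, even though there is still only one container.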
At this point I am thinking that lxc itself, the container implementation, may
be the problem. vagrant-lxc hasn’t been updated in a while.
So it seems that lxc was upgraded a few weeks ago (it’s 2019-01 at the time of
this writing) and since I haven’t used it during these weeks I didn’t run into
this before.
My next step is to downgrade my installed version and see if that fixes this:
It works! So it was the newer version of lxc that broke my setup. Let’s see what
this version has in the cgroup mount point.
Yep. This returns one entry. The newer version changed that and it broke what
pipework expected. But now we know exactly what and we can fix it:
The fix
We add a new case: if we get exactly two entries from find, we check whether one
of them is lxc.monitor, one of the additions in lxc 3.1.0, and if so, we don’t
error out.
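The idea can be sketched as follows. This is an illustration of the approach under stated assumptions, not the exact patch: the function name check_container_matches_patched and the message wording are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: tolerate a second match when one of the two is an lxc.monitor entry.

check_container_matches_patched() {
    local cgroup_mnt="$1" guest="$2" matches count
    matches=$(find "$cgroup_mnt" -name "$guest")
    # Count matching paths, treating empty output as zero matches.
    if [ -z "$matches" ]; then
        count=0
    else
        count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
    fi
    case "$count" in
        0) echo "no container named $guest found" >&2; return 1 ;;
        1) return 0 ;;
        2)
            # Newer lxc: two entries are fine if one is under lxc.monitor.
            if printf '%s\n' "$matches" | grep -q 'lxc\.monitor'; then
                return 0
            fi
            echo "more than one match for $guest" >&2
            return 1
            ;;
        *) echo "more than one match for $guest" >&2; return 1 ;;
    esac
}
```

Keeping the old behaviour for every other count means that a genuine name clash still fails loudly. Only the specific two-entry situation is let through.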
We bring lxc up to date, apply this patch by hand to vagrant-lxc’s copy of
pipework, and try again:
Success!!
Conclusion
We fixed our broken dev environment and everything is almost right in the world again.
The other thing we need to do, of course, is to submit this patch to pipework
and submit another patch to vagrant-lxc to use this new version of pipework, so
that people who run it with lxc 3.1.0 or later have a working setup again.