I was recently sitting, staring at a progress bar, which is how very nerdy adventures start.
The particular progress bar was telling me about the packages being installed as part of upgrading my workstation from Ubuntu 16.04 to the newer Ubuntu 18.04. As the package names whizzed by, one after the other, the thing that annoyed me was that it took So. Damned. Long. My day job often involves trying to understand why Linux systems don’t go as fast as I would like, so I naturally started firing up some basic utilities to see what was happening.

The most obvious thing to check is always CPU usage. top showed me that my CPU cores were sitting almost entirely idle. CPU usage is a metric that I often describe as convenient to measure, relatively easy to understand, and generally useless. But it’s still a good place to start. I wasn’t really surprised that the installation process wasn’t CPU bound, so I fired up iotop, which is a much more useful utility for seeing which processes on a system are I/O bound, and saw… nothing interesting.

And it was then that I fell into a curiosity. If you count all the many servers I have caused package installations to happen on, I have probably installed many millions of Debian packages over the years. Some with Salt, others with apt-get, and some with dpkg, but I had never really studied in detail exactly how the ecosystem worked.
I started by trying to figure out exactly what a Debian package is. It seems like a silly question with a simple answer. Of course, “a debian package is just a common standard ar archive,” as a friend of mine pointed out while I was talking to him. But that rather understates things. First off, ar archives aren’t that common, or particularly standardised. Ar archives are ‘common’ only as the format for static libraries and Debian packages. They just aren’t used as general-purpose archives the way tarballs or zip files are. Which is sort of interesting in its own right.
Let’s consider just how standard the format actually is… Wikipedia has a good breakdown of the format. Is the diagram on Wikipedia all we’d need to know to read a Debian package? Well, man 5 ar notes “There have been at least four ar formats” and “No archive format is currently specified by any standard. AT&T System V UNIX has historically distributed archives in a different format from all of the above.” Eep, that’s not terribly promising. Thankfully, Debian packages are at least consistent among themselves in their ar dialect, since they can generally be assumed to be made with the ar on a Debian Linux distribution.
There’s a whole side-story here about how there is a C system header for reading ar archives in an old-school “read a struct” way. But the format uses a slightly odd whitespace-padded text pattern, so getting trimmed filenames as C++ std::strings and integer numeric values out of it is more of a pain in the neck than you’d hope. There isn’t a good C++ library with a modern API for the format. So I wrote a YAML definition for Kaitai Struct in order to have a convenient C++ API for reading it, and used the SPI pystring library for some of the string manipulation. In any event, I could read the format. Yay, I could read a Debian package myself!
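To make that whitespace-padded header pain concrete, here’s a minimal sketch of reading ar member headers. This is Python rather than the C++ described above, purely to keep the example short, and the `ar_member` helper fabricates a tiny test archive in memory instead of reading a real .deb:

```python
AR_MAGIC = b"!<arch>\n"  # 8-byte global header that starts every ar archive

def read_ar_members(data: bytes):
    """Yield (name, payload) for each member of an ar archive.

    Every member has a 60-byte header of fixed-width, space-padded
    ASCII fields -- the slightly odd text pattern the post complains
    about -- followed by the payload, padded to a 2-byte boundary.
    """
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    offset = len(AR_MAGIC)
    while offset + 60 <= len(data):
        header = data[offset:offset + 60]
        # name: bytes 0-15, size: bytes 48-57, terminator: bytes 58-59
        name = header[0:16].decode("ascii").rstrip(" /")
        size = int(header[48:58].decode("ascii").strip())
        if header[58:60] != b"`\n":
            raise ValueError("bad ar member terminator")
        yield name, data[offset + 60:offset + 60 + size]
        offset += 60 + size + (size % 2)  # members are 2-byte aligned

def ar_member(name: str, payload: bytes) -> bytes:
    """Build one member (name, mtime, owner, group, mode, size fields)."""
    header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}".format(
        name, 0, 0, 0, "100644", len(payload)).encode("ascii") + b"`\n"
    return header + payload + (b"\n" if len(payload) % 2 else b"")

# A fake, minimal .deb-shaped archive, just to exercise the parser.
archive = AR_MAGIC + ar_member("debian-binary", b"2.0\n")
for name, payload in read_ar_members(archive):
    print(name, payload)
```

The fixed slicing offsets (0:16 for the name, 48:58 for the size) are exactly the kind of magic numbers that a struct definition, or a Kaitai YAML spec, spares you from scattering around your code.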
A Debian package consists of just three things when you unpack it. A file called ‘debian-binary’ that tells you the version number of the format, and two tarballs: one with control metadata about the package and the other with the actual contents of the package.
At this point, anybody trying to write their own code to unpack a Debian package in order to better understand the process will want to punch a wall. Because we’ve just figured out how to write code to read this relatively uncommon ar format, and the first thing we find inside of it is two tarballs, which is a completely different format! Surely, we could have designed the package files to either be an ar archive with ar archives in it, or a tar file with tar files in it! Well, okay, my friend’s assertion that I just needed to know about ar archives was a lie, but I only need to know about two formats. That’s not too bad. Oh, wait, tarballs are actually two formats unto themselves: there’s a compression format, and then the actual tar archive inside it. So, you need to handle three file formats to install a Debian package. I have some code that will unpack the ar layer, so let’s see which compression method is used on the tar files…
If you unpack the apturl-common package, you get the debian-binary file, and the data and control archives. It’s totally arbitrary that I used apturl-common as a test file for my code. It just happened to be a package that I had downloaded. Other packages will vary slightly.
Wait, those two tar files have different compression formats. One is a .gz file, and the other is a .xz! And these aren’t just different compression formats from Debian packages of different eras. If, say, Ubuntu 12.04 packages used gz and Ubuntu 18.04 packages used xz, you would only need to support one or the other to install packages from any particular distribution. As it turns out, there can be different compression formats inside a single package. So to unpack and install a Debian package, you actually need to support several compression formats. Let’s say xz, bz2, and gz at a minimum. Okay, so you need to support five different formats. So, what’s in that control archive?
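Since the member names inside the package don’t reliably tell you which compression you’ll get, one straightforward approach is to sniff the magic bytes at the front of each tarball. Here’s a rough sketch in Python (the real tools are C, of course), covering the three formats just mentioned:

```python
import bz2
import gzip
import lzma

# Leading magic bytes for the compression formats found inside .debs.
MAGICS = {
    b"\x1f\x8b": "gz",          # gzip
    b"\xfd7zXZ\x00": "xz",      # xz
    b"BZh": "bz2",              # bzip2
}

DECOMPRESSORS = {
    "gz": gzip.decompress,
    "xz": lzma.decompress,
    "bz2": bz2.decompress,
}

def sniff(blob: bytes) -> str:
    """Identify a compressed blob by its leading magic bytes."""
    for magic, name in MAGICS.items():
        if blob.startswith(magic):
            return name
    return "unknown"

def decompress(blob: bytes) -> bytes:
    """Sniff the format, then hand off to the matching decompressor."""
    return DECOMPRESSORS[sniff(blob)](blob)

payload = b"pretend this is a tar archive"
for compress in (gzip.compress, bz2.compress, lzma.compress):
    blob = compress(payload)
    print(sniff(blob), decompress(blob) == payload)
```

The same dispatch-on-magic-bytes idea extends naturally if you later need to add more formats, since each one is just another entry in the two tables.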
You get a few scripts: preinst, postinst, and prerm. Those scripts get run when you would expect. Before install, after install, and before removing the package if you uninstall it. Languages like Python can be embedded in native applications, but shell scripts aren’t really intended to be used that way. (And actually, if I were embedding Python today, I’d probably use PyBind11 instead of Boost.Python like I did in my old blog post. But that’s neither here nor there.) So, if you are trying to implement something to install the packages, you can pass on being responsible for running the scripts in-process, and just shell out to do it. (Writing a shell is definitely at least a whole other blog post unto itself.) You also have files called md5sums, control, and conffiles. conffiles is just a newline-separated list of files that the package uses for configuration, so the install program can warn you about merging local changes during install. It’s barely a file format, so we’ll count it as half. md5sums is a listing of checksums of all the files in the content archive called “data,” in the format produced by the md5sum utility.
This is also a pretty simple format, but you need to split on the whitespace after the hash, while correctly handling the possibility of things like spaces in filenames. (And I’m not entirely sure what you do if you have a newline in a filename, which is possible, in these simple formats.) So we are up to six and a half file formats.
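One way to sidestep the spaces-in-filenames problem is to exploit the fact that an md5 digest is always exactly 32 hex characters, so you can slice instead of splitting on whitespace. A sketch (the sample paths below are made up for illustration):

```python
def parse_md5sums(text: str) -> dict:
    """Map each packaged path to its md5 hex digest.

    Each line is '<32 hex chars><two spaces><path>'. Slicing at fixed
    offsets, rather than str.split, keeps spaces in filenames intact.
    (A newline in a filename would still break this, as the post notes.)
    """
    sums = {}
    for line in text.splitlines():
        if not line:
            continue
        digest, path = line[:32], line[34:]
        sums[path] = digest
    return sums

# Hypothetical example lines in the md5sums format.
sample = (
    "d41d8cd98f00b204e9800998ecf8427e  usr/share/doc/example/copyright\n"
    "d41d8cd98f00b204e9800998ecf8427e  usr/share/example/file with spaces\n"
)
print(parse_md5sums(sample))
```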
Here’s an excerpt from the control file in apturl-common:

```
Maintainer: Michael Vogt <firstname.lastname@example.org>
Depends: python3:any (>= 3.3.2-2~), python3-apt, python3-update-manager
Replaces: apturl (<< 0.3.6ubuntu2)
Description: install packages using the apt protocol - common data
 AptUrl is a simple graphical application that takes an URL (which follows the
 apt-protocol) as a command line option, parses it and carries out the
 operations that the URL describes (that is, it asks the user if he wants the
 indicated packages to be installed and if the answer is positive does so for
 him).
 .
 This package contains the common data shared between the frontends.
```
The “control” file is yet another text file, but the format is different from conffiles or md5sums. We are now up to seven and a half file formats. Which is surely a far cry from the original “you just need to know the ar format!” that I got as received wisdom when I first fell into this rabbit hole.
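The control format is the “Key: value” stanza style that shows up all over Debian tooling: a line starting with a space continues the previous field, and a lone “.” on a continuation line stands in for a blank line. A rough sketch of a parser, ignoring multi-stanza files and other corner cases:

```python
def parse_control(text: str) -> dict:
    """Parse a single Debian control stanza into a dict of fields.

    'Key: value' starts a field; a line beginning with a space or tab
    continues the previous field; a continuation line of just '.'
    represents a blank line within a multi-line value.
    """
    fields = {}
    key = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and key is not None:
            cont = "" if line.strip() == "." else line.strip()
            fields[key] += "\n" + cont
        elif ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

# A trimmed-down stanza, loosely based on the excerpt above.
sample = """Package: apturl-common
Depends: python3:any (>= 3.3.2-2~), python3-apt
Description: install packages using the apt protocol - common data
 AptUrl is a simple graphical application.
 .
 This package contains the common data shared between the frontends.
"""
info = parse_control(sample)
print(info["Package"])
print(info["Description"])
```

Note that `partition(":")` splits on the first colon only, which matters because values like `python3:any` contain colons of their own.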
On the bright side, this does give us enough information to unpack and install the data in the package. (And I’d like to complain about how vague a name “data” is for the archive with the actual contents. As if the rest of the package were somehow something other than data!) But we still haven’t covered any of the local database that keeps track of which packages are available, which are installed, how dependency resolution works, and so on. Some of that will have to wait for another blog post. This is certainly enough content that the original progress bar that inspired me finished what it was doing long before I made it this far with my own code.
Learning how to unpack packages wound up being just the first step of a project to try to do my own simple implementations of a whole raft of common UNIX command line utilities that I depend on every day. Trying to implement a useful subset of a complete userland is what inspired the blog post’s title, “Adventures in Userland.” The UNIX userland is full of fascinating history, layers of cruft, clever design, and features you never even realised were there. Even implementing my own cat turned out to be an interesting project, despite how simple that utility seems. I am hoping to make time to document some of the things I learned while poking around the things I have long taken for granted, and how shaky and wobbly some of the underpinnings of modern state-of-the-art cloud and container systems are.
convenient modern C++ API’s for things like machine learning and image processing are easy to find, but not so much for things like .debs, and .tars. The utilities in GNU coreutils sometimes have surprising limitations, and some files haven’t had any commits since Star Trek: The Next Generation was in first run. I think it’s fair to say some of that stuff is about due for a fresh look.