Adventures in Userland I – Brown Paper Packages Tied Up in std::string, xz, ar, tar, gz, and spaghetti

I was recently sitting, staring at a progress bar, which is how very nerdy adventures start.


This photo of a bar popped up when I did a search for “progress bar.”  It’s rather more colorful and visually interesting than the actual progress bar I was staring at, the one that inspired me to figure out what happens when you install a Debian package.  Photo by Chris F on Pexels.com

The particular progress bar was telling me about the packages being installed as part of upgrading my workstation from Ubuntu 16.04 to the newer Ubuntu 18.04.  As the package names whizzed by, one after the other, the thing that annoyed me was that it took So. Damned. Long.  My day job often involves trying to understand why Linux systems don’t go as fast as I would like, so I naturally started firing up some basic utilities to see what was happening.  The most obvious thing to check is always CPU usage.  top showed me that my CPU cores were sitting almost entirely idle.  CPU usage is a metric that I often describe as convenient to measure, relatively easy to understand, and generally useless.  But it’s still a good place to start.  I wasn’t really surprised that the installation process wasn’t CPU bound, so I fired up iotop, which is a much more useful utility for seeing which processes on a system are I/O bound, and saw…  Nothing interesting.  And it was then that I sort of fell into a curiosity.  If you count all the many servers I have caused package installations to happen on, I have probably installed many millions of Debian packages over the years.  Some with Salt, others with apt-get, and some with dpkg, but I never really studied in detail exactly how the ecosystem worked.

I started by trying to figure out exactly what a Debian package is.  It seems like a silly question, with a simple answer.  Of course, “a Debian package is just a common standard ar archive,” as a friend of mine pointed out while I was talking to him.  But that sort of understates things.  First off, ar archives aren’t that common, or particularly standardised.  Ar archives are ‘common’ only as the format for static libraries and Debian packages.  They just aren’t common as general purpose archives, like tarballs or zip files.  Which is sort of interesting in its own right.

Let’s consider just how standard the format actually is…  Wikipedia has a good breakdown of the format.  Is the diagram on Wikipedia all we’d need to know to read a Debian package?  Well, man 5 ar notes “There have been at least four ar formats” and “No archive format is currently specified by any standard.  AT&T System V UNIX has historically distributed archives in a different format from all of the above.”  Eep, that’s not terribly promising.  Thankfully, Debian packages are at least consistent among themselves in their Ar dialect, since they can generally be assumed to be made with the ar on a Debian Linux distribution.

There’s a whole side-story here about how there is a C system header for reading ar archives in an old-school “read a struct” way.  But the format uses a slightly odd whitespace-padded text pattern, so getting trimmed filenames as C++ std::strings and numbers as actual integer values is more of a pain in the neck than you’d hope.  There isn’t a good C++ library with a modern API for the format.  So I wrote a YAML definition for Kaitai Struct in order to have a convenient C++ API for reading it, and used the SPI Pystring library for some of the string manipulation.  In any event, I could read the format.  Yay, I could read a Debian package myself!
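To make the “whitespace padded text” layout concrete, here is a minimal sketch in Python (chosen for brevity; it is not the Kaitai-based C++ reader described above).  It assumes the common layout of an 8-byte global magic followed by 60-byte member headers whose fields are name (16 bytes), mtime (12), uid (6), gid (6), mode (8), size (10), and a 2-byte terminator.  The name read_ar_members is made up for illustration, and long-filename extensions are deliberately not handled.

```python
AR_MAGIC = b"!<arch>\n"   # global header at the start of the archive
HEADER_SIZE = 60          # each member header is 60 bytes of padded ASCII


def read_ar_members(data: bytes):
    """Yield (name, size, payload) for each member of an ar archive.

    Assumes the common/GNU-style layout; long-name tables are not handled.
    """
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    offset = len(AR_MAGIC)
    while offset + HEADER_SIZE <= len(data):
        header = data[offset:offset + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header terminator")
        # Every field is fixed-width text, right-padded with spaces.
        name = header[0:16].decode("ascii").rstrip(" ")
        # Some ar variants end short names with "/"; drop it if present,
        # but leave special all-slash names alone.
        if name.endswith("/") and name.strip("/"):
            name = name[:-1]
        size = int(header[48:58].decode("ascii").strip())
        start = offset + HEADER_SIZE
        payload = data[start:start + size]
        yield name, size, payload
        # Member data is padded to an even number of bytes.
        offset = start + size + (size % 2)
```

Calling list(read_ar_members(open(path, "rb").read())) on a .deb should give one tuple per member, with the name already trimmed and the size already an integer, which is exactly the conversion that is awkward with the raw C struct.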

A Debian package consists of just three things when you unpack it.  A file called ‘debian-binary’ that tells you the version number of the format, and two tarballs: one with control metadata about the package and the other with the actual contents of the package.
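As a rough illustration of that three-part layout, the sketch below takes a list of (name, payload) pairs, however you obtained them, and picks out the three pieces.  The function name describe_deb_members and the returned dictionary keys are hypothetical, and the check assumes the control and data members are named control.tar.* and data.tar.*, as they were in the packages I looked at.

```python
def describe_deb_members(members):
    """Summarise the members of an unpacked .deb.

    `members` is a list of (name, payload) pairs in archive order.
    """
    names = [name for name, _ in members]
    if not names or names[0] != "debian-binary":
        raise ValueError("expected debian-binary as the first member")
    # debian-binary holds the format version as text, e.g. "2.0\n".
    version = members[0][1].decode("ascii").strip()
    control = next((n for n in names if n.startswith("control.tar")), None)
    data = next((n for n in names if n.startswith("data.tar")), None)
    if control is None or data is None:
        raise ValueError("missing control or data tarball")
    return {"format_version": version, "control": control, "data": data}
```

The suffix on each tarball name is what hints at the compression used, which matters for the next step.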

At this point, anybody trying to write their own code to unpack a Debian package in order to better understand the process will try and punch a wall.  Because we’ve just figured out how to write code to read this relatively uncommon Ar format, and the first thing we find inside of it is two tarballs, which are a completely different format!  Surely, we could have designed the package files to either be an Ar with Ar archives in it, or a tar file with tar files in it!  Well, okay, my friend’s assertion that I just needed to know about Ar archives was a lie, but I only need to know about two formats.  That’s not too bad.  Oh, well, tarballs are actually two formats unto themselves.  There’s a compression format, and then the actual tar archive.  So, you need to handle three file formats to install a Debian package.  I have some code that will unpack the Ar layer, so let’s see which compression method is used on the tar files…

 

Wait, why aren't they using the same compression?!

If you unpack the apturl-common package, you get the debian-binary file, and the data and control archives.  It’s totally arbitrary that I used apturl-common as a test file for my code.  It just happened to be a package that I downloaded.  Other packages will vary slightly.

Wait, those two tar files have different compression formats.  One is a .gz file, and the other is a .xz!  And this isn’t just a case of Debian packages from different eras using different compression.  If, for example, Ubuntu 12.04 packages used gz and Ubuntu 18.04 used xz, you would only need to support one or the other to install packages from any particular distribution.  As it turns out, there are different compression formats inside a single package.  Okay, so to unpack and install a Debian package, you actually need to support a few compression formats.  Let’s say xz, bz2, and gz at a minimum.  Okay, so you need to support five different formats: ar, tar, and the three compression formats.  So, what’s in that control archive?
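One way to cope with mixed compression is to sniff the first few bytes of each tarball rather than trusting the file name.  The sketch below is an illustration in Python, not what my C++ code does; the helper names detect_compression and list_tar_names are made up.  It relies on the well-known magic numbers for gzip, bzip2, and xz, and on the standard library tarfile module, whose "r:*" mode detects the compression for you.

```python
import io
import tarfile

# Leading magic bytes for the three compression formats discussed above.
MAGIC = {
    b"\x1f\x8b": "gz",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
}


def detect_compression(payload: bytes) -> str:
    """Return "gz", "bz2", "xz", or "none" based on the leading bytes."""
    for magic, name in MAGIC.items():
        if payload.startswith(magic):
            return name
    return "none"


def list_tar_names(payload: bytes):
    """List member names of a tarball held in memory, whatever its compression."""
    # Mode "r:*" asks tarfile to work out the compression transparently.
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:*") as tar:
        return tar.getnames()
```

In a language without a batteries-included tar module, detect_compression is the piece you would use to decide which decompressor to hand the bytes to before parsing the tar layer.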

You get a few scripts: preinst, postinst, and prerm.  Those scripts get run when you would expect: before install, after install, and before removing the package if you uninstall it.  Languages like Python can be embedded in native applications, but shell scripts aren’t really intended to be used that way.  (And actually, if I were embedding Python today, I’d probably use PyBind11 instead of Boost.Python like I did in my old blog post.  But that’s neither here nor there.)  So, you can pass on being responsible for running the scripts in-process if you are trying to implement something to install the packages, and just shell out to do it.  (Writing a shell is definitely at least a whole other blog post unto itself.)  You also have files called md5sums, control, and conffiles.  Conffiles is just a newline-separated list of files that the package uses for configuration, so the install program can warn you about merging local changes during install.  It’s barely a file format, so we’ll count it as half.  md5sums is a listing of checksums of all the files in the content archive, called “data,” in the format produced by the md5sum utility.

b25977509ca6665bd7f390db59555b92  usr/bin/apturl 
da0e92f4f035935dc8cacbba395818f2  usr/lib/python3/dist-packages/AptUrl/AptUrl.py 
2c645156bfd8c963600cd7aed5d0fc0b  usr/lib/python3/dist-packages/AptUrl/Helpers.py 
927320b1041af741eb41557f607046a7  usr/lib/python3/dist-packages/AptUrl/Parser.py 
b697ac30c6e945c0d80426a8a4205ef8  usr/lib/python3/dist-packages/AptUrl/UI.py 
d41d8cd98f00b204e9800998ecf8427e  usr/lib/python3/dist-packages/AptUrl/Version.py 
d41d8cd98f00b204e9800998ecf8427e  usr/lib/python3/dist-packages/AptUrl/__init__.py 
a8f4538391be3cd2ecac685fe98b8bca  usr/lib/python3/dist-packages/apturl-0.5.2.egg-info 
4bd6e933c4d337fdb27eee28abbd289d  usr/share/applications/apturl.desktop 
3824814ef04af582f716067990b7808f  usr/share/doc/apturl-common/changelog.gz 
2ae15dd4b643380e1fbb9c44cf8e9c54  usr/share/doc/apturl-common/copyright 
019ea97889973f086dfd4af9d82cf2fb  usr/share/kde4/services/apt+http.protocol

This is also a pretty simple format, but you need to split on the whitespace after the hash, while correctly handling the possibility of things like spaces in filenames.  (And I’m not entirely sure what you do if you have a newline in a filename, which is possible, in these simple formats.)  So we are up to six and a half file formats.
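Since an MD5 digest is always 32 hex characters, one way to sidestep the spaces-in-filenames problem is to slice each line at fixed offsets instead of splitting on whitespace.  The following Python sketch assumes the usual md5sum layout of a 32-character digest, a two-character separator, and then the path; parse_md5sums and md5_matches are hypothetical names, and the sketch makes no attempt to handle newlines in filenames.

```python
import hashlib


def parse_md5sums(text: str) -> dict:
    """Map each path in an md5sums listing to its hex digest."""
    entries = {}
    # Split on "\n" only, so other odd characters in a path are left alone.
    for line in text.split("\n"):
        if not line:
            continue
        digest, sep, path = line[:32], line[32:34], line[34:]
        # The separator is two spaces (or " *" for binary mode in md5sum output).
        if len(digest) != 32 or sep not in ("  ", " *") or not path:
            raise ValueError(f"malformed md5sums line: {line!r}")
        entries[path] = digest.lower()
    return entries


def md5_matches(expected_hex: str, payload: bytes) -> bool:
    """Check a file's contents against the digest from the listing."""
    return hashlib.md5(payload).hexdigest() == expected_hex
```

As a sanity check, d41d8cd98f00b204e9800998ecf8427e, which appears twice in the listing above, is the MD5 of zero bytes of input, so those two files are empty.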

Package: apturl-common 
Source: apturl 
Version: 0.5.2ubuntu11.2 
Architecture: amd64 
Maintainer: Michael Vogt <mvo@ubuntu.com> 
Installed-Size: 168 
Depends: python3:any (>= 3.3.2-2~), python3-apt, python3-update-manager 
Replaces: apturl (<< 0.3.6ubuntu2) 
Section: admin 
Priority: optional 
Description: install packages using the apt protocol - common data 
 AptUrl is a simple graphical application that takes an URL (which follows the 
 apt-protocol) as a command line option, parses it and carries out the 
 operations that the URL describes (that is, it asks the user if he wants the 
 indicated packages to be installed and if the answer is positive does so for 
 him). 
 . 
 This package contains the common data shared between the frontends.

The “control” file is yet another text file, but the format is different from conffiles or md5sums.  We are now up to seven and a half file formats.  Which is surely a far cry from the original “you just need to know the Ar format!” that I got as received wisdom when I first fell into this rabbit hole.
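Judging from the example above, the control format is a series of “Field: value” lines, where a line starting with a space continues the previous field and a lone “.” stands in for a blank line.  Here is a minimal Python sketch under exactly those assumptions; parse_control is a made-up name, it handles only a single block of fields, and it ignores comments and any other corners of the real format.

```python
def parse_control(text: str) -> dict:
    """Parse a single block of "Field: value" lines into a dict."""
    fields = {}
    current = None
    for line in text.split("\n"):
        if not line.strip():
            continue  # ignore blank lines in this simplified sketch
        if line[0] in " \t":
            # Continuation line: belongs to the most recent field.
            if current is None:
                raise ValueError("continuation line before any field")
            extra = line[1:].rstrip()
            # A lone "." marks an empty line inside a multi-line value.
            fields[current] += "\n" + ("" if extra == "." else extra)
        else:
            # Split at the first colon only; values may contain colons.
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"malformed control line: {line!r}")
            current = name.strip()
            fields[current] = value.strip()
    return fields
```

Note that the Depends value in the example contains a colon (python3:any), which is why the sketch splits each line only at the first colon.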

On the bright side, this does give us enough information to unpack and install the data in the package.  (And I’d like to complain about how vague a name “data” is for the archive with the actual contents.  As if the rest of the package was somehow something other than data!)  But we still haven’t covered any of the local database that keeps track of which packages are available, which are installed, how dependency resolution works, etc.  But some of that will have to wait for another blog post.  This is certainly enough content that the original progress bar that inspired me did finish what it was doing long before I made it this far with my own code.

Learning how to unpack packages wound up just being the first step of a project to try and do my own simple implementations of a whole raft of common UNIX command line utilities that I depend on every day.  Trying to implement a useful subset of a complete userland is what inspired the blog post’s title, “Adventures in Userland.”  The UNIX userland is full of fascinating history, layers of cruft, clever design, and features you never even realised were there.  Even implementing my own cat turned out to be an interesting project, despite how simple that utility seems.  I am hoping to make time to document some of the things I learned while poking around the things I have long taken for granted, and how shaky and wobbly some of the underpinnings of modern state-of-the-art cloud and container systems are.

Convenient modern C++ APIs for things like machine learning and image processing are easy to find, but not so much for things like .debs and .tars.  The utilities in GNU coreutils sometimes have surprising limitations, and some files haven’t had any commits since Star Trek: The Next Generation was in first run.  I think it’s fair to say some of that stuff is about due for a fresh look.
