The underappreciated scaling of memory access costs hidden by the cult of Big-O

In any CS-101 textbook, you learn about the scaling of certain algorithms.  I’ve had a few conversations with colleagues recently about this topic, and about how badly the “cult of complexity” can lead you astray when trying to scale real systems rather than textbook ones.  I wanted to write a bit about it, and share a simple program that demonstrates the realities.

This post sadly has nothing in particular to do with the giant fighting mecha anime, “Big O” from 1999… Actually, now that I think about it, scaling is bound to be a concern when making giant robots. But the square-cube law is a bit of a different topic than memory time complexity, so we’ll have to leave it for another day.

In the textbook, one algorithm might scale linearly, O(n).  Another might do better at O(log(n)), or take a constant number of operations regardless of working set size and be O(1), and so on.  It’s a pretty straightforward way of keeping track of the terribleness of an algorithm.  But the key is that Big O notation only estimates the number of operations as n grows large, rather than the amount of time it takes to run.  There is a pretty natural and intuitive implicit assumption that makes Big O useful: that one read operation from memory is about as slow as any other, so you just need to count up the number of read operations to get an idea of how much time you’ll spend waiting on memory.

This is, of course, pants-on-fire nonsense.
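If you want to see this for yourself before reading on, here is a minimal sketch of the kind of program I mean (not necessarily the exact one discussed in the full post): chase random pointers through working sets of increasing size and time each read.  If every read really cost the same, the ns/read column would stay flat as the working set grows; on real hardware it climbs like a staircase.

// A minimal sketch: time dependent random reads over working sets of
// increasing size. The "every read costs the same" assumption predicts a
// flat line; caches turn it into a staircase.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    std::mt19937_64 rng(42);
    for (size_t elems = size_t(1) << 10; elems <= (size_t(1) << 24); elems <<= 2) {
        // Sattolo's algorithm builds one big cycle, so every read depends on
        // the previous one and the prefetcher can't guess the next address.
        std::vector<size_t> next(elems);
        std::iota(next.begin(), next.end(), size_t(0));
        for (size_t i = elems - 1; i > 0; --i) {
            std::uniform_int_distribution<size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        const size_t reads = 10'000'000;
        size_t idx = 0;
        auto start = std::chrono::steady_clock::now();
        for (size_t i = 0; i < reads; ++i) idx = next[idx];  // O(1) reads, supposedly
        auto stop = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(stop - start).count();
        std::printf("%9zu KiB working set: %6.2f ns/read (sink: %zu)\n",
                    elems * sizeof(size_t) / 1024, ns / reads, idx);
    }
}

Build it with something like g++ -O2 -std=c++17 and watch the per-read cost jump each time the working set falls out of another level of cache.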

Continue reading


Adventures in Userland I – Brown Paper Packages Tied Up in std::string, xz, ar, tar, gz, and spaghetti

I was recently sitting, staring at a progress bar, which is how very nerdy adventures start.

This photo of a bar popped up when I did a search for “progress bar.”  It’s rather more colorful and visually interesting than the actual progress bar I was staring at, the one that inspired me to figure out what happens when you install a debian package.  Photo by Chris F on Pexels.com

The particular progress bar was telling me about the packages being installed as part of upgrading my workstation from Ubuntu 16.04 to the newer Ubuntu 18.04.  As the package names whizzed by, one after the other, the thing that annoyed me was that it took So. Damned. Long.  My day job often involves trying to understand why Linux systems don’t go as fast as I would like, so I naturally started firing up some basic utilities to see what was happening.  The most obvious thing to check is always CPU usage.  top showed me that my CPU cores were sitting almost entirely idle.  CPU usage is a metric that I often describe as convenient to measure, relatively easy to understand, and generally useless.  But it’s still a good place to start.  I wasn’t really surprised that the installation process wasn’t CPU bound, so I fired up iotop, which is a much more useful utility for seeing which processes on a system are I/O bound, and saw…  Nothing interesting.

And that was when curiosity got the better of me.  If you count all the many servers I have caused package installations to happen on, I have probably installed many millions of debian packages over the years.  Some with salt, others with apt-get, and some with dpkg, but I never really studied in detail exactly how the ecosystem worked.

I started by trying to figure out exactly what a debian package is.  It seems like a silly question with a simple answer.  Of course, “a debian package is just a common standard ar archive,” as a friend of mine pointed out.  But that sort of understates things.  First off, ar archives aren’t that common, or particularly standardised.  They are ‘common’ only as the format for static libraries and debian packages; they just aren’t used as general purpose archives the way tarballs or zip files are.  Which is sort of interesting in its own right.

Let’s consider just how standard the format actually is…  Wikipedia has a good breakdown of the format.  Is the diagram on Wikipedia all we’d need to know to read a debian package?  Well, man 5 ar  notes “There have been at least four ar formats” and “No archive format is currently specified by any standard.  AT&T System V UNIX has historically distributed archives in a different format from all of the above.”  Eep, that’s not terribly promising.  Thankfully, debian packages are at least consistent among themselves in their Ar dialect, since they can generally be assumed to be made with the ar on a debian Linux distribution.

There’s a whole side-story here about how there is a C system header for reading ar archives in an old-school “read a struct” way.  But the format uses a slightly odd whitespace-padded text layout, so getting trimmed filenames as C++ std::strings and proper integer values out of it is more of a pain in the neck than you’d hope.  There isn’t a good C++ library with a modern API for the format.  So I wrote a YAML definition for Kaitai Struct in order to have a convenient C++ API for reading it, and used the SPI Pystring library for some of the string manipulation.  In any event, I could read the format.  Yay, I could read a debian package myself!
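For anyone who would rather see the ar framing without the Kaitai layer, a hand-rolled reader is only a screenful.  This is a rough sketch rather than my actual generated code, with helper names of my own choosing, and it skips wrinkles like the GNU extended-name table (which a debian package shouldn’t need anyway, since its three member names are all short):

// A rough sketch of walking the ar framing by hand: an 8-byte global magic,
// then a 60-byte header of space-padded ASCII fields in front of each member.
#include <fstream>
#include <iostream>
#include <string>

// Trim the space padding (and the trailing '/' some ar variants put on names).
static std::string trim(std::string s) {
    while (!s.empty() && (s.back() == ' ' || s.back() == '/')) s.pop_back();
    return s;
}

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: arls <package.deb>\n"; return 1; }
    std::ifstream in(argv[1], std::ios::binary);

    std::string magic(8, '\0');
    in.read(&magic[0], 8);
    if (magic != "!<arch>\n") { std::cerr << "not an ar archive\n"; return 1; }

    char hdr[60];
    while (in.read(hdr, 60)) {
        std::string name(hdr + 0, 16);       // member name, space padded
        std::string size_str(hdr + 48, 10);  // member size, decimal ASCII
        long size = std::stol(size_str);     // hdr[58..59] should be "`\n"; unchecked here
        std::cout << trim(name) << "\t" << size << " bytes\n";
        in.seekg(size + (size & 1), std::ios::cur);  // data is padded to an even offset
    }
}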

A debian package consists of just three things when you unpack it.  A file called ‘debian-binary’ that tells you the version number of the format, and two tarballs: one with control metadata about the package and the other with the actual contents of the package.

At this point, anybody trying to write their own code to unpack a debian package in order to better understand the process will try and punch a wall.  Because we’ve just figured out how to write code to read this relatively uncommon Ar format, and the first thing we find inside of it is two tarballs, which is a completely different format!  Surely, we could have designed the package files to either be an Ar with Ar archives in it, or a tar file with tar files in it!  Well, okay, my friend’s assertion that I just needed to know about Ar archives was a lie, but I only need to know about two formats.  That’s not too bad.  Oh, well, tarballs are actually two formats unto themselves.  There’s a compression format, and then the actual tar archive.  So, you need to handle three file formats to install a debian package.  I have some code that will unpack the Ar layer, so let’s see which compression method is used on the tar files…

 

Wait, why aren't they using the same compression?!

If you unpack the apturl-common package, you get the debian-binary file, and the data and control archives.  It’s totally arbitrary that I used apturl-common as a test file for my code; it just happened to be a package that I had downloaded.  Other packages will vary slightly.

Wait, those two tar files have different compression formats.  One is a .gz file, and the other is a .xz!  This isn’t just a matter of debian files from different eras using different compression.  If, say, Ubuntu 12.04 packages used gz and Ubuntu 18.04 packages used xz, you would only need to support one or the other to install packages from any particular distribution.  As it turns out, there are different compression formats inside a single package.  So to unpack and install a debian file, you actually need to support a few compression formats: let’s say xz, bz2, and gz at a minimum.  That brings us to five different formats.  So, what’s in that control archive?
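As an aside before we open it: once you accept that the compression inside a package can be anything, it is safer to sniff magic bytes than to trust the member names.  A quick sketch of what that looks like; the signatures are the standard published ones, but the function itself is just mine for illustration:

// A sketch of identifying the compression wrapped around each tar member by
// its leading magic bytes instead of trusting the .gz/.xz/.bz2 suffix.
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

static std::string sniff_compression(const unsigned char* buf, size_t len) {
    if (len >= 2 && buf[0] == 0x1F && buf[1] == 0x8B) return "gzip";
    if (len >= 6 && std::memcmp(buf, "\xFD" "7zXZ\x00", 6) == 0) return "xz";
    if (len >= 3 && std::memcmp(buf, "BZh", 3) == 0) return "bzip2";
    if (len >= 262 && std::memcmp(buf + 257, "ustar", 5) == 0) return "none (plain tar)";
    return "unknown";
}

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: sniff <file>\n"); return 1; }
    unsigned char buf[512] = {0};
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }
    size_t n = std::fread(buf, 1, sizeof buf, f);
    std::fclose(f);
    std::printf("%s: %s\n", argv[1], sniff_compression(buf, n).c_str());
}

Anyway, back to what is actually in that control archive.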

You get a few scripts: preinst, postinst, and prerm.  Those scripts get run when you would expect: before install, after install, and before removing the package if you uninstall it.  Languages like Python can be embedded in native applications, but shell scripts aren’t really intended to be used that way.  (And actually, if I were embedding Python today, I’d probably use PyBind11 instead of Boost.Python like I did in my old blog post.  But that’s neither here nor there.)  So, if you are trying to implement something to install the packages, you can pass on being responsible for running the scripts in-process and just shell out to do it.  (Writing a shell is definitely at least a whole other blog post unto itself.)

You also have files called md5sums, control, and conffiles.  The conffiles file is just a newline-separated list of files that the package uses for configuration, so the install program can warn you about merging local changes during install.  It’s barely a file format, so we’ll count it as half.  md5sums is a listing of checksums of all the files in the content archive (the one called “data”), in the same format the md5sum utility produces:

b25977509ca6665bd7f390db59555b92  usr/bin/apturl 
da0e92f4f035935dc8cacbba395818f2  usr/lib/python3/dist-packages/AptUrl/AptUrl.py 
2c645156bfd8c963600cd7aed5d0fc0b  usr/lib/python3/dist-packages/AptUrl/Helpers.py 
927320b1041af741eb41557f607046a7  usr/lib/python3/dist-packages/AptUrl/Parser.py 
b697ac30c6e945c0d80426a8a4205ef8  usr/lib/python3/dist-packages/AptUrl/UI.py 
d41d8cd98f00b204e9800998ecf8427e  usr/lib/python3/dist-packages/AptUrl/Version.py 
d41d8cd98f00b204e9800998ecf8427e  usr/lib/python3/dist-packages/AptUrl/__init__.py 
a8f4538391be3cd2ecac685fe98b8bca  usr/lib/python3/dist-packages/apturl-0.5.2.egg-info 
4bd6e933c4d337fdb27eee28abbd289d  usr/share/applications/apturl.desktop 
3824814ef04af582f716067990b7808f  usr/share/doc/apturl-common/changelog.gz 
2ae15dd4b643380e1fbb9c44cf8e9c54  usr/share/doc/apturl-common/copyright 
019ea97889973f086dfd4af9d82cf2fb  usr/share/kde4/services/apt+http.protocol

This is also a pretty simple format, but you need to split on the spaces after the hash while correctly handling the possibility of things like spaces in filenames.  (And I’m not entirely sure what you do if you have a newline in a filename, which is possible, in these simple formats.)  So we are up to six and a half file formats.
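The parsing itself is only a few lines once you commit to the fixed-width layout: 32 hex digits, a two-character separator, then the path, which may contain spaces.  A rough sketch, with names of my own choosing:

// A tiny sketch of parsing md5sums lines: 32 hex digits, a two-character
// separator ("  " for text mode, " *" for binary), then the path, which may
// itself contain spaces.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, std::string>> parse_md5sums(std::istream& in) {
    std::vector<std::pair<std::string, std::string>> out;
    std::string line;
    while (std::getline(in, line)) {
        // Strip trailing spaces or CR that some tools leave behind.
        while (!line.empty() && (line.back() == ' ' || line.back() == '\r'))
            line.pop_back();
        if (line.size() < 35) continue;          // hash + separator + at least one char
        std::string hash = line.substr(0, 32);   // the checksum itself
        std::string path = line.substr(34);      // everything after the 2-char gap
        out.emplace_back(hash, path);
    }
    return out;
}

int main() {
    for (const auto& [hash, path] : parse_md5sums(std::cin))
        std::cout << path << " -> " << hash << "\n";
}

And here is the control file from that same apturl-common package: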

Package: apturl-common 
Source: apturl 
Version: 0.5.2ubuntu11.2 
Architecture: amd64 
Maintainer: Michael Vogt <mvo@ubuntu.com> 
Installed-Size: 168 
Depends: python3:any (>= 3.3.2-2~), python3-apt, python3-update-manager 
Replaces: apturl (<< 0.3.6ubuntu2) 
Section: admin 
Priority: optional 
Description: install packages using the apt protocol - common data 
 AptUrl is a simple graphical application that takes an URL (which follows the 
 apt-protocol) as a command line option, parses it and carries out the 
 operations that the URL describes (that is, it asks the user if he wants the 
 indicated packages to be installed and if the answer is positive does so for 
 him). 
 . 
 This package contains the common data shared between the frontends.

The “control” file is yet another text file, but the format is different from conffiles or md5sums.  We are now up to seven and a half file formats.  Which is surely a far cry from the original “you just need to know the Ar format!” that I got as received wisdom when I first fell into this rabbit hole.
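For completeness, here is roughly what reading it looks like: “Key: value” lines, with continuation lines indented by a single space, and a lone “.” standing in for a blank line inside the Description field.  This is a sketch of my own, not how dpkg actually does it:

// A rough sketch of reading the control file: "Key: value" lines, with
// continuation lines indented by a space, and " ." meaning an empty line.
#include <iostream>
#include <map>
#include <string>

std::map<std::string, std::string> parse_control(std::istream& in) {
    std::map<std::string, std::string> fields;
    std::string line, current;
    while (std::getline(in, line)) {
        if (!line.empty() && (line[0] == ' ' || line[0] == '\t')) {
            // Continuation of the previous field.
            if (!current.empty())
                fields[current] += "\n" + (line == " ." ? "" : line.substr(1));
            continue;
        }
        auto colon = line.find(':');
        if (colon == std::string::npos) continue;  // blank or malformed line
        current = line.substr(0, colon);
        auto value = line.substr(colon + 1);
        if (!value.empty() && value[0] == ' ') value.erase(0, 1);
        fields[current] = value;
    }
    return fields;
}

int main() {
    auto fields = parse_control(std::cin);
    std::cout << fields["Package"] << " " << fields["Version"]
              << " depends on: " << fields["Depends"] << "\n";
}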

On the bright side, this does give us enough information to unpack and install the data in the package.  (And I’d like to complain about how vague a name “data” is for the archive with the actual contents.  As if the rest of the package were somehow something other than data!)  But we still haven’t covered any of the local database that keeps track of which packages are available, which are installed, how dependency resolution works, and so on.  Some of that will have to wait for another blog post.  Needless to say, the original progress bar that inspired all this finished what it was doing long before I made it this far with my own code.

Learning how to unpack packages wound up just being the first steps of a project to try and do my own simple implementations of a whole raft of common UNIX command line utilities that I depend on every day.  Trying to implement a useful subset of a complete userland is what inspired the blog post’s title, “Adventures in Userland.”  The UNIX userland is full of fascinating history, layers of cruft, clever design, and features you never even realised were there.  Even implementing my own cat turned out to be an interesting project, despite how simple that utility seems.  I am hoping to make time to document some of the things I learned while poking around the things I have long taken for granted, and how shaky and wobbly some of the underpinnings of modern state of the art cloud and container systems are.

Convenient modern C++ APIs for things like machine learning and image processing are easy to find, but not so much for things like .debs and .tars.  The utilities in GNU coreutils sometimes have surprising limitations, and some files haven’t had any commits since Star Trek: The Next Generation was in first run.  I think it’s fair to say some of that stuff is about due for a fresh look.

Installer Downloader Installer Downloaders

I couldn’t figure out how to fit this into a 140 character tweet, so now it’s a blog post.  Recently on a mailing list that I am subscribed to, a friend and former coworker posted:

Went to download the Unity 5 updates, what I ended up with was:

UnityDownloadAssistant-5.0.1f1.dmg
UnityDownloadAssistant-5.0.1f1.exe

I can pretty much guarantee that at no time in the history of electronic software distribution has anyone ever said “gee, I really wish this application had its own custom downloader, because those guys at Microsoft / Apple / Mozilla / Google clearly don’t know what they are doing with those web browser things, and don’t even get me started about those curl / wget people…”. I feel slightly better now.

And I had unfortunately spent the morning wrestling with Adobe CC, so I chimed in with my response…

That damned dress.

I posted a reddit comment about the picture of a dress that has been confusing people.  Depending on viewing conditions it looks either white and gold, or black and blue.  It’s a rather extraordinary demonstration of how malleable human perception is, and it really shows off the fact that while we believe we are capable of observing an “objective” reality, our minds construct our perception of the world with only modest input from our senses.  Given that nobody can agree on the color of a dress when looking at a photo of it, the picture also says something about how much we rely on things like human recollection of events in court cases that literally decide matters of life and death.  A witness can be 100% sure that they saw one thing, but that doesn’t mean that what they saw had a strong correlation with the objective reality that matters so much.  The reddit comment was in response to a post that then got deleted, so I assume nobody will ever read it there, and I am reprinting it here.

Continue reading

Google Support is analogous to awesome

But some analogies are pretty awful.  Which is the case here.  I used to use a Nexus 5 phone from Google.  It was pretty good.  I won’t claim it was perfect: it had about half the battery life I wished it did, and it always felt slightly too large in my hand.  But nobody makes small phones anymore for some reason, and it certainly got the job done.  Or at least it got the job done until I had dinner at a particular sushi restaurant with a very hard floor, and shattered the screen.  ::Sigh::  Oh well, it was time to contact Google support to see if they would repair it, and if so, how much it would cost…

Continue reading

FFMPEG API. The Agony and the Adequacy.

The FFMPEG API is one of those things that I love to hate.  On one hand, it always seems to be the best tool for the job whenever I write video code.  It’s portable, and integrates relatively easily into any application without needing to use a special framework.  And it has existed long enough that I can reasonably expect it to still be there when I am making version 2 of my project.  It also supports basically every format I need to deal with.  That’s more than I can say about QuickTime[-X].  (Not portable to Linux.  QuickTime-X is Mac/iOS only.  Requires dealing with Carbon/Cocoa APIs.  Hard to use in a command line app.  Etc.)  Or Windows Media.  (Obviously Windows only.  Is it DirectShow or Media Foundation now?  Oh yeah, also surprisingly difficult to integrate into a command line app.)  Or GStreamer.  (More portable than the Windows Media stuff, sure.  But still a lot of baggage to add to a port, and keeping consistent format support between platforms is essentially impossible.  And my app isn’t meant to be a “GStreamer client.”  It’s an app that I want to get video into.  So, just give me the damn pixels already.  I’ll worry about displaying them, thank you very much.)  Or QtMultimedia.  (In practice, I am using Qt, and it is fairly portable, but I still just want the damn pixels.  And I’ll need to encode at some point.)

Most of these APIs suffer from the fascinating delusion that people just want to write simple video players.  Who the hell actually wants to do that?  Why is it such a well supported use case, given that most users already have a video player installed?  It makes a neat party trick to do your own, but I’m not sure what I get out of writing another vlc/mplayer/totem/mediaPlayer/QuickTimePlayer given that I already have one.  I wonder who these legions of developers are who look at VLC and think, “I’m gonna basically do that, except probably worse and less mature.”  Every developer I ever talked to about using a video API was also doing something “interesting.”  They needed the pixels more than they needed a 20-line demo of making a video player in Python.  They needed easier encoding much more than they needed trivialised presentation.  And they needed good documentation.

So, FFMPEG stays the king of a motley crew of video APIs that aren’t that great.  But I keep running into things like the fact that the API keeps gradually evolving in place, keeping most of the cruft, while making it so the random example code I found on the net won’t actually compile anymore.  The API cleanups are never quite sweeping enough to elevate the API to “nice,” but are just enough to make a lot of extra work out of figuring out which tutorials and samples are actually valid when starting a project.  Which wouldn’t be so bad if the main project documentation were first-rate.

For example:  http://ffmpeg.org/doxygen/trunk/group__lavc__encoding.html#gaa2dc9e9ea2567ebb2801a08153c7306b from the documentation for “avcodec_encode_video2.”  First off, there is the fact that in their API redesigns, in order to preserve some sense of backwards compatibility, since they mutate the API in place rather than just having a “version 2” of the API itself, they have version numbers on individual functions.  This function replaced the older avcodec_encode_video after it was deprecated.  I suppose it is practical, and it does serve a purpose, but nothing else I depend on works this way.  As a matter of personal opinion, I really don’t like it.  But my actual complaint with the docs here is the explanation of the return value.

“0 on success, negative error code on failure.”  Okay, so I can know whether it worked, but that negative error code doesn’t actually seem to be documented anywhere.  And given the state of the docs and examples, I am pretty much guaranteed to do something that causes an error.  Unfortunately, I won’t get any kind of hint about what exactly I did wrong.  The documentation, such as it is, is pretty much all just autogenerated Doxygen HTML pages presenting a slightly prettier view of exactly what I would see if I just read the headers directly, without any documentation at all.  And that’s where I go bonkers.
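For what it’s worth, the one crumb libavutil does throw you is av_strerror() in libavutil/error.h, which at least turns those negative codes into printable strings, even if the docs for each function never say which codes to expect.  Here is a sketch of the little wrapper I end up writing in every FFMPEG project; the helper name is mine:

// Wrap av_strerror() so negative FFMPEG return codes become readable text.
extern "C" {
#include <libavutil/error.h>
}
#include <cerrno>
#include <iostream>
#include <string>

static std::string ff_error(int errnum) {
    char buf[AV_ERROR_MAX_STRING_SIZE] = {0};
    if (av_strerror(errnum, buf, sizeof buf) < 0)
        return "unknown FFMPEG error " + std::to_string(errnum);
    return buf;
}

int main() {
    // For example, the code you typically get back for a bad argument:
    std::cout << ff_error(AVERROR(EINVAL)) << "\n";
}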

I will hopefully be able to talk more about the app that I am writing that uses FFMPEG quite soon.  It’s an interesting project.  I’ve learned a lot about a lot of things, and it has some interesting features.  It’s part of the post production pipeline for a project that will be shooting in January that will hopefully turn out quite fun.