Every once in a while, somebody gets so frustrated by the modern dynamic linking ecosystem that they suggest just throwing it out entirely and going back to static linking everything. Seriously, the topic comes up occasionally in some very nerdy circles. I was inspired to write this blog post by a recent conversation on Twitter.
So, what’s dynamic linking? Why is it good? Why is it bad? Is it actually bad? What’s static linking, and how did things work before dynamic linking? Why did we mostly abandon it? And… What would static linking look like if we “invented” it today with some lessons learned from the era of dynamic linking? (And no, I am not talking about containers and Docker images when I say modern static linking. This isn’t one of those blog posts where we just treat a container image as morally equivalent to a static linked binary. Though I have some sympathy for that madness given that the modern software stack often feels un-fixable.)
In the beginning
I like history, so I tend to go pretty far back in time when I try to tell a story. Some people would say I sometimes go too far back when telling a story, and that the Byzantine Empire isn’t actually the reason I was late to work today. But don’t worry, I am going to completely skip over programming ENIAC with plug boards. This is keeping things quite fresh and relevant by my standards!
We can start our story as recently as the early 1950’s. The development of what we’d now call a linker started with what was called “Automatic Coding” and “Compilation” at the time. (Grace Hopper’s part starts at page 21 in that PDF.) In those days, programs were mainly written in assembly. (Writing exclusively raw machine code was already falling out of vogue, but only recently.) Many machines in those days already had instructions that would be immediately familiar to modern assembly programmers like
- `add` to do addition,
- `inc` to increment a value,
- `jmp` to do an unconditional jump, etc.
To gloss over some things and make a simple mental model, many of those machines didn’t have some things we’d take for granted today like a multiply instruction. If you wanted to multiply, you could do it by repeated addition. (4×2 is the same as 2+2+2+2.) But it was obviously kind of inconvenient to write something every single time you wanted to multiply two numbers like:
(not exactly real assembly code)
```
// Multiply (X * Y) by repeatedly Adding X, Y times.
start_multiply:
    load A Y          // A = Y. We'll use A as a counter for how many times to add.
    load B 0          // B = 0
loop_start:           // We'll keep coming back here until we finish.
    add B X           // B = B + X
    dec A             // Decrement the counter. We'll stop when it hits zero.
    jnz A loop_start  // "Jump if Not Zero." Which is to say,
                      // check if we've finished the countdown from Y to zero,
                      // go back if we haven't finished the countdown.
```
As soon as computers had the capacity to store a bunch of such commonly useful bits of code, it stopped making sense to keep re-writing that sort of thing by hand over and over. And I can’t be bothered to check if I’ve made an off-by-one error in the example pseudo code snippet I just wrote for a blog post. The version from a known-good library is probably much more trustworthy.
The early systems were pretty simple. You just copied the code from the library into the final executable program. This would mean redundant copies of library code in various programs. But there weren’t very many programs. And everything was pretty much done with tape, so you wanted a program to be a simple sequence of data on the tape when you needed to load it into memory to run it. It would have taken ages to run a program if you had to switch tapes every few bytes while loading. It would certainly be silly to say that computer storage was especially cheap in the 1950’s. But tape was certainly cheaper than computer time, and the amount of data to be stored was pretty small because the programs and libraries were all pretty small. We simply weren’t asking computers to do very much in those days.
By the early 1990’s, surprisingly little had really changed about the way linkers and libraries worked, even though we are talking about the age of UNIX workstations with disks and GUIs instead of UNIVAC systems with James Bond movie style reel-to-reel tapes. Technically dynamic linking long pre-dates the 90’s, but it was far from universal. And as of 1993 dynamic linking was still a bleeding edge feature in Irix, which led to leaks of a bunch of internal emails about the struggles they had with software bloat that are in retrospect a bit hilarious and quite informative. And those struggles are as good a place as any to pick up our story.
In the bits between the 1950’s and the 1990’s, UNIX had been invented, along with C. Writing in assembly had fallen out of favor. Some stuff was written in C++ even though it wasn’t standardized yet. But a linker doesn’t really care much about your operating system or what language you originally write your code in. By the time you get to the linker, it’s pretty much just machine code with maybe a few annotations sprinkled in here and there. You could have entered raw machine code in using toggle switches, or it could have been compiled down through some complicated set of steps from a high level language. At this level, it all looks the same. On a UNIX workstation you would write some code, compile it down to object files, and then link the object files into an executable along with any relevant library code that would get copied in.
This has some drawbacks. In particular, these modern systems could have hundreds of different binaries! (It seemed like a lot at the time.) And if you copy the same code into many binaries, you waste many megabytes of space. (And modern libraries have grown so much in the meantime that static linking a modern Linux workstation would mean zillions of gigabytes of duplication. Every KDE application links to hundreds of megabytes of Qt libraries.) That redundancy doesn’t just waste disk space. It also takes up RAM, and it has a performance impact: you have to load a binary off disk to execute it. If the binary is big, it takes longer to load. If you can just load some commonly used code once, loading just the application specific code to run it is a lot quicker.
The solution here is pretty obvious. Instead of copying library code into each binary when you build it, just install the library onto the system once and load it on-demand along with every program that uses it. This is dynamic linking. In Irix, the new kind of libraries were called “Dynamically Shared Objects.” The system could load a library once, save on memory and disk space, and work faster. One nice bonus is that if you need to fix a bug or install a security update in a binary, you can just install it once and everything that uses that library will also be instantly updated. Which brings us back to the internal memos that I mentioned previously…
“DSOs were supposed to reduce physical memory usage, but have had just the opposite effect,
and their indirection has reduced performance.”— Tom Davis, 1993 memo on software bloat
Whoopsie! Turns out that theory and practice aren’t always entirely in sync. The exact details of the dynamic linker implementation in Irix 5 are not terribly relevant to modern systems. And not every issue they had was related to DSO’s. But they make a useful counterpoint when it’s taken on faith that we can’t use static linking at all in modern systems because dynamic linking is clearly always superior.
Smaller individual binaries are great. But the larger total amount of data due to added complexity means there is more that has to be loaded overall. And the cost of indirection is certainly much cheaper on modern systems than it would have been rushing to swap tape spools on an old UNIVAC, but it’s never quite zero. And a lot of small costs of indirection add up measurably.
So what would modern static linking look like?
Oh, yeah, the point of writing this blog post. I am finally getting to it here at the end. Hopefully somebody read this far.
We want some of the stuff that dynamic linking promises:
- Don’t waste disk
- Don’t waste memory
- Make updates easy
- Fast loading
But we also want some of the stuff that static linking offers
- Can’t bork a whole system by messing up one file that all programs need.
  - glibc upgrades are often terrifying for a sysadmin on Linux.
- Simple software deployment.
  - Deploying a static binary often just involves copying the one file.
- Avoid the complexity of needing to worry about ABI breakage.
  - Sure, dynamic linking claims you can just drop in a new library, but there are a lot of quirks that affect whether or not it will explode, and they are difficult to check for automatically in a completely reliable way, so you need a bunch of dev process fussiness.
- Whole program optimization (fast execution once it’s loaded).
  - It’s impossible to optimize across a shared library boundary because you need to guarantee that stable public interface for the library. The toolchain may see that you want to do an operation on a constant known at compile time, but it can’t optimize out that operation and replace it with the result because the behavior of the operation can’t be known until runtime.
At a glance, these are some pretty incompatible goals!
But modern operating systems are pretty neat. If we think a bit about the services they provide, we may be able to fudge things a bit. For example, the core goal of dynamic linking is deduplication. Yeah, it has other benefits. But that’s the main selling point. But modern filesystems like btrfs and ZFS support deduplication of all kinds of files at the filesystem level, automatically with no special user action on the files. If we could find a way to take advantage of that, it shifts the disk space savings argument of dynamic linking a lot.
Sadly, just dumping a static linked /bin onto ZFS won’t actually dedupe very effectively. ZFS isn’t looking for completely arbitrary shared file contents. It only notices when whole blocks are completely identical. Suppose I have two simple programs. One prints “Hello World.” The other prints “Hello Beautiful World!!!” All other things being equal, you’d expect the second program to be 12 bytes bigger than the first. So if both programs have the exact same subset of libc linked into them after the message, the contents of the second program will be offset by 12 bytes relative to the first. So even if you have many megabytes of identical library code between those two programs, there won’t be any whole blocks that are exactly identical on disk. The identical content has to be block aligned for ZFS to notice it.
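To make that concrete, here’s a toy model of how a block-hashing dedupe pass sees those two programs. The block size and the stand-in “library” bytes are made up for illustration; the point is just that a 12-byte prefix shift leaves zero blocks in common.

```python
import hashlib
import os

BLOCK = 4096  # assumed filesystem block size

def block_hashes(data: bytes) -> set:
    """Hash each fixed-size block, the way a block-level dedupe pass sees a file."""
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

library = os.urandom(1024 * 1024)  # 1 MiB of stand-in "library code"

prog_a = b"Hello World." + library              # library starts at offset 12
prog_b = b"Hello Beautiful World!!!" + library  # same library, shifted 12 more bytes

shared = block_hashes(prog_a) & block_hashes(prog_b)
print(len(shared))  # 0 -- the shift misaligns every block, so nothing dedupes
```

A whole megabyte of byte-for-byte identical content, and the block-level view finds none of it.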
This is a… solvable problem. It would require tinkering with the linker to make it aware of such ideas. And the linker would have to be willing to waste space inside a static linked binary in order to ensure that all libraries that get linked into a program have consistent alignment. And if you only link a subset of a library, the surviving chunks of the library also need to be aligned in a dedupe favorable way. And you probably need to sacrifice some degree of LTO / Whole Program Optimization in order to preserve some of those alignment patterns. But none of this is insurmountable.
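Here’s a sketch of what that buys you, using the same toy model: if a hypothetical dedupe-friendly linker padded each program’s unique bits out to a block boundary before appending the library, every library block lines up and dedupes. (Block size, padding scheme, and “library” contents are all assumptions for illustration.)

```python
import hashlib
import os

BLOCK = 4096  # assumed filesystem block size

def pad_to_block(data: bytes) -> bytes:
    """Pad with zeros so whatever comes next starts on a block boundary."""
    return data + b"\0" * (-len(data) % BLOCK)

def block_hashes(data: bytes) -> set:
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

library = os.urandom(1024 * 1024)  # 1 MiB = 256 blocks of stand-in library code

# A dedupe-aware linker would waste a little space padding the program-specific
# code so that the library sections land at identical block offsets in both files.
prog_a = pad_to_block(b"Hello World.") + library
prog_b = pad_to_block(b"Hello Beautiful World!!!") + library

shared = block_hashes(prog_a) & block_hashes(prog_b)
print(len(shared))  # 256 -- all 1 MiB of library blocks now dedupe
```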
I’m honestly not 100% sure if this can be accomplished with linker scripts, or if it requires a custom linker to really work well. My best understanding is that nobody has ever actually written a linker script, and nobody really knows where they come from. OSDevs just pass down scripts and make copypasta out of them until they work. It’s a poorly documented bit of the toolchain. Maybe the first linker script came from a time loop from the future in a causality paradox. But we can clearly see it would save disk, so even though I am too stupid and lazy to try to prototype a more dedupe friendly linker, let’s consider how such a system would perform.
Modern OS page cache is very good. And it exists below and separate from the filesystem layer. So if two binaries in separate files had data on the same disk blocks, the OS would only load those disk blocks into the page cache once. The fact that the redundant content is in two separate files would not matter on a modern system. The page cache does not care about files or have any concept of them. (Even on UNIX, not everything is a file!) When you load a new uncached program to be able to execute it, any pages with previously accessed content would already be in the page cache. So if you have a 100MB binary with 99 MB of shared library content in it, loading the new program would only require loading 1 MB from disk. Exactly the same disk I/O as if you were loading a 1 MB dynamically linked binary and had already loaded a 99 MB .so library!
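You can model that bookkeeping with a deliberately simplified cache keyed by block contents. This is the behavior the scheme relies on, not a description of any particular kernel’s internals; sizes and data are made up.

```python
import hashlib
import os

BLOCK = 4096
page_cache = set()  # simplified model: "which block contents are already resident"

def load(data: bytes) -> int:
    """Load a program; return how many blocks actually had to come off disk."""
    reads = 0
    for i in range(0, len(data), BLOCK):
        h = hashlib.sha256(data[i:i + BLOCK]).hexdigest()
        if h not in page_cache:
            page_cache.add(h)
            reads += 1  # block not cached yet: pay for a disk read
    return reads

library = os.urandom(99 * BLOCK)      # 99 block-aligned shared "library" blocks
prog_a = os.urandom(BLOCK) + library  # 1 unique block + the shared library
prog_b = os.urandom(BLOCK) + library

a_reads = load(prog_a)  # cold cache: all 100 blocks read from disk
b_reads = load(prog_b)  # only the 1 unique block; the library is already cached
print(a_reads, b_reads)
```

Loading the second program costs one block of I/O, not a hundred, which is the same win a dynamic linker gets from an already-loaded .so.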
Block storage dedupe, in the context of a purpose built toolchain, could give you all of the disk space efficiency, memory (page cache) efficiency, and loading time efficiency benefits of a dynamic linker.
I know this blog post is already too long, but I am going to say it again. Block storage dedupe, in the context of a purpose built toolchain and a modern kernel page cache, could give you all of the disk space efficiency, memory (page cache) efficiency, and loading time efficiency benefits of a dynamic linker. (Assuming you let me get away with ignoring the space wasted with padding for the alignment rules. It might be possible to use some of that padding space for something sometimes, anyway.)
So that just leaves “software updates” as a big selling point for dynamic linking. And, uh, that was never even the original driver. These days, we mostly update software using a package manager. So, can we take advantage of more-or-less the existing package ecosystem to update static binaries? I think so. Currently, you only bother to express a library as a dependency in an executable package’s metadata when you need the dynamic library’s package to be installed. We could just use the “depends” field of a control file in a package manager on an all static linking system to represent that a binary was built using that library. Keep the library as a sort of metapackage with no real content. And whenever a new upstream version of the library is released, you need a new version of the metapackage, and that triggers everything that depends on that library to get rebuilt and updated. Voila.
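A sketch of that rebuild logic, with entirely hypothetical package names: walk the reverse dependency graph from the updated library metapackage and collect everything that was built against it, directly or transitively.

```python
# Hypothetical package metadata: each statically linked binary "depends" on the
# (content-free) library metapackages it was built against.
depends = {
    "coreutils": ["libc"],
    "openssh":   ["libc", "openssl"],
    "nginx":     ["libc", "openssl", "zlib"],
    "openssl":   ["libc"],
    "zlib":      [],
    "libc":      [],
}

def rebuild_set(updated: str) -> set:
    """Everything that links (directly or transitively) against `updated`."""
    out, frontier = set(), {updated}
    while frontier:
        # Anything that depends on something in the frontier needs a rebuild too.
        frontier = {pkg for pkg, deps in depends.items()
                    if pkg not in out and frontier & set(deps)}
        out |= frontier
    return out

print(sorted(rebuild_set("openssl")))  # ['nginx', 'openssh']
```

A new openssl metapackage version would trigger rebuilds of nginx and openssh; a new libc would trigger rebuilds of everything that links it.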
It will admittedly take some more install time and bandwidth to get 100 updated packages instead of one. In an ideal world, you’d also have some de-dupe intelligence in the protocol used for downloading all the updated packages. Maybe something like bit torrent can accomplish something analogous to what ZFS does for storage. In any event, the downloading of updates is the least important part of the system to focus on optimizing. Hopefully programs are being run more often than they are being installed. If a program isn’t being run, you can save the most space and bandwidth by uninstalling it rather than trying to optimize the update process.
So, there it is, my argument in favor of static linking in 2021. I’m not really 100% convinced that we should immediately throw out our dynamic linkers. Some applications will still need to be able to do things like load plugins at run time after the program has started with dlopen() or the equivalent. But I do think that the idea of static linking gets dismissed a bit too readily as being plainly absurd. If you think about what the whole OS+toolchain could be doing in a coordinated way to make static linking work better than it did in the 90’s, it seems a lot less absurd.