The underappreciated scaling of memory access costs hidden by the cult of Big-O

In any CS-101 textbook, you learn about the scaling of certain algorithms. I’ve had a few conversations with colleagues recently about this topic, and about how badly the “cult of complexity” can lead you astray when trying to scale real systems rather than textbook ones. I wanted to write a bit about it, and share a simple program that demonstrates the reality.

This post sadly has nothing in particular to do with the giant fighting mecha anime, “Big O” from 1999… Actually, now that I think about it, scaling is bound to be a concern when making giant robots. But the square-cube law is a bit of a different topic than memory time complexity, so we’ll have to leave it for another day.

In the textbook, one algorithm might scale linearly, O(n). Another might do better at O(log(n)), or do a constant number of operations regardless of working set size and be O(1), and so on. It’s a pretty straightforward way of keeping track of the terribleness of an algorithm. But the key is that Big O notation only estimates the number of operations as n grows large, not the amount of time it takes to run. There is a natural, intuitive, implicit assumption that makes Big O useful: that one read operation from memory is about as slow as any other. So you just need to count up the read operations to get an idea of how much time you’ll spend waiting on memory.
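To make the bookkeeping concrete, here is a toy illustration of my own (not lifted from any particular textbook): a linear search is O(n) because in the worst case it touches every element once, while an indexed lookup is O(1) because it does a fixed amount of work no matter how big the container is. Big O counts those operations and says nothing about how long each one takes.

```cpp
#include <cstddef>
#include <vector>

// O(n): in the worst case, every element gets compared once.
bool contains(const std::vector<int>& v, int target) {
    for (int x : v) {
        if (x == target) return true;
    }
    return false;
}

// O(1): one index computation and one load, regardless of v.size().
int element_at(const std::vector<int>& v, std::size_t i) {
    return v[i];
}
```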

This is, of course, pants-on-fire nonsense.

Real-world systems never actually scale according to the textbook formulae. This really sank in for me a few years ago when I was debugging an obscure performance issue with very large filesystems. The exact details are hazy now (and part of that story revolves around the quirks of a really old XFS implementation that are no longer relevant). But it doesn’t really matter, because the sort of problems you get walking inodes in a filesystem are fundamentally the same sort of problems you get reading things from RAM, or fetching stuff from some geo-distributed caching system over a network. Whatever scale you are working with, things are too danged slow, and they scale worse than you’d like.

One of the core problems from that stretch of digging into filesystem stuff turned out to be, basically, that you “pay for what you don’t use.” In theory, a bunch of inodes for files on a disk that you aren’t opening and reading shouldn’t matter to you. At any given moment, you should only be “paying for what you use.” That is to say, the theoretical cost of opening a file like /storage/projects/fizzbuzz-pro/Readme.md should not be affected in any way by the existence of /storage/projects/ultimate-hamburger/recipe{1-99}.jpg. You access the root inode of the disk, look up the pointer to “storage”, then “projects”, work your way into the “fizzbuzz-pro” directory, get a pointer to where the file lives, and you can start reading bytes from it. N is five (the root plus four path components), so that open() syscall should take a bit longer than an open() on something sitting in the root. But the mere existence of a bunch of hamburger pictures in another directory (no matter how savory) isn’t a part of the process.
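Here is roughly what that textbook cost model looks like as code: a toy, in-memory sketch of my own (nothing like real kernel code), where each path component costs one lookup and files in unrelated directories never appear anywhere in the loop.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model: an "inode" is just a map from child name to child inode number.
struct Inode {
    std::map<std::string, int> children;
};

// The "disk": inode number -> inode. Each .at() below stands in for one read.
std::map<int, Inode> disk;

// Textbook cost model: one inode read per path component, O(n) in path depth.
// Nothing in /storage/projects/ultimate-hamburger ever shows up in this loop.
int resolve(const std::vector<std::string>& components) {
    int cur = 2;  // root directory inode in this toy model
    for (const auto& name : components) {
        cur = disk.at(cur).children.at(name);  // one "read" per component
    }
    return cur;
}

int main() {
    disk[2].children["storage"] = 10;
    disk[10].children["projects"] = 11;
    disk[11].children["fizzbuzz-pro"] = 12;
    disk[12].children["Readme.md"] = 13;
    return resolve({"storage", "projects", "fizzbuzz-pro", "Readme.md"}) == 13 ? 0 : 1;
}
```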

Reality, of course, is worse.

Whenever you do an I/O operation on a disk, you are generally reading data into a page cache in memory.  If you need to access something that is already in the page cache, you can skip the I/O and just read it from memory much more quickly.  The drive itself also has some DRAM buffers where you might incur the latency of a SATA transaction, but not an actual physical disk operation.  And that memory could potentially be in CPU cache.  On a NUMA system, it could be in a local or remote node, etc.  The point is that it’s not just an I/O operation.  It’s an operation in the context of a complicated memory cache hierarchy, and you can never be 100% sure what’s in which cache at any given time.

So, imagine that when you start doing lookups to open() that Readme file, the Readme itself is actually already sitting in the page cache. Winning! Unfortunately, we don’t know that, because the page cache isn’t a file cache. This is important! It caches chunks of raw bytes from the I/O device, but it doesn’t know anything about files, filenames, Readme’s, etc. It sits below the filesystem driver, so the filesystem implementation can take advantage of the page cache, but the page cache can’t take advantage of the filesystem. (In general. Some filesystems like squashfs add their own caching layer, but we can ignore that kind of thing for a simplified example. My simplified examples already tend to be way too complicated because they involve computers and stuff.)

So the implementation of open() reads a block near where the root inode is. Inodes are scattered all around, so it may have accidentally read in some other useful data. But since there are 99x as many pictures of hamburgers with just the right combination of cheeses as there are Readme’s, the page cache is much more likely to contain an inode referencing a hamburger picture than an inode we want. And the jpeg files themselves are much larger than the inodes that point to them, so our arbitrary chunk of page cache is much more likely to contain some chunk of an image than any sort of inode, anyway. The open() syscall digs through the chain of inodes recursively to get to our target file, and each step probably “poisons” the cache a little bit with irrelevant data referring to files we don’t care about. By the time it is iterating through the fizzbuzz-pro directory entry, it has probably already evicted the inode for that Readme from the cache! (Assuming a comically small cache for the sake of the example.)

We already had what we needed, but the process of trying to find it evicted it and populated our page cache with stuff from that big directory full of files, just because those files exist. As a result, the I/O operation takes much longer on a system that has a bunch of other crap on the disk, even though the theoretical O(n) lookup only cares about the number of path segments and, on paper, isn’t affected in any way by all the unused crap sitting in directories we never access. Oops.

Another example is accessing an element of an array. Nice, simple O(1) complexity. In theory, the time to access an arbitrary element of an array is not affected in any way by the total size of that array, right?

Nope.

I wrote a little example test program to demonstrate the memory time scaling of accessing an element. I’ll tidy it up and put it on GitHub shortly, but in the meantime, you can get the gist of it.
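The gist is something along the lines of the sketch below. This is a reconstruction of the general idea rather than the exact program I ran, using the classic pointer-chasing trick: the vector holds a random cycle of indices, so every read depends on the result of the previous one and the CPU can’t overlap or prefetch them.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::mt19937_64 rng(42);
    constexpr std::size_t kReads = 1 << 22;  // same number of reads at every size

    // Working set sizes from 64 B up to 512 MB; each element is an 8-byte index.
    for (std::size_t bytes = 64; bytes <= (512ull << 20); bytes *= 2) {
        const std::size_t n = bytes / sizeof(std::uint64_t);

        // Build a random cycle: v[i] holds the index to visit after i.
        std::vector<std::uint64_t> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), rng);
        std::vector<std::uint64_t> v(n);
        for (std::size_t i = 0; i + 1 < n; ++i) v[order[i]] = order[i + 1];
        v[order[n - 1]] = order[0];

        // Chase the chain and report the average time per read.
        std::uint64_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < kReads; ++i) idx = v[idx];
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / kReads;
        // Print idx too, so the compiler can't optimize the loop away.
        std::printf("%zu B, %g ns (final idx %llu)\n", bytes, ns,
                    static_cast<unsigned long long>(idx));
    }
    return 0;
}
```

Compile it with optimizations on (something like g++ -O2 bench.cpp), and expect the exact numbers to depend heavily on your CPU, its cache sizes, and how the accesses are generated; the shape of the curve is the point, not the specific nanosecond values.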

Here is some data showing the size of a randomized std::vector and the amount of time it takes to access a single element from the vector. To be clear, it is doing the same number of reads from the vector at every size; the increased time isn’t coming from doing more total reads. The time increase comes from the fact that each read is, by itself, significantly slower on average when the working set is larger: any given read is less likely to be “near” the CPU in a fast cache. Thus, an O(1)-complexity algorithm actually takes something more along the lines of O(log(n)) time to execute. And the runtime of algorithms that are assumed to be O(n) because they depend on several steps that are individually just O(1) or O(log(m)) may in fact be completely dominated by the steps that are theoretically supposed to be negligible when n grows large.

Here’s a (noisy) graph showing how much slower reads get as the array you are picking random elements from grows. Worth noting is that you never hit an obvious “ceiling” at this scale (the point where the working set is so much larger than the cache that hits become rare and you are just looking at a flat “uncached” number), even though intuition says that should happen long before the gigabyte range. Caching and locality effects are still clearly playing a role well past the point where you might expect simpler behavior.

Big O implies this line should be flat.

Here is the raw data (working set size, average time per read):

64 B, 5.71254 ns
128 B, 5.96568 ns
256 B, 5.59057 ns
512 B, 5.65889 ns
1024 B, 5.68297 ns
2 KB, 5.61811 ns
4 KB, 5.76851 ns
8 KB, 5.60438 ns
16 KB, 5.74249 ns
32 KB, 5.80321 ns
64 KB, 6.61791 ns
128 KB, 7.17731 ns
256 KB, 8.08497 ns
512 KB, 11.0743 ns
1024 KB, 12.9287 ns
2 MB, 20.0106 ns
4 MB, 24.3959 ns
8 MB, 46.1736 ns
16 MB, 66.629 ns
32 MB, 74.4211 ns
64 MB, 80.331 ns
128 MB, 85.2592 ns
256 MB, 88.2065 ns
512 MB, 99.6501 ns
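As a rough sanity check on the “more like O(log(n))” claim: going from 64 B to 512 MB multiplies the working set by 2^23, while the measured time per read climbs from about 5.7 ns to about 99.7 ns. That is roughly 94 ns spread across 23 doublings, or about 4 ns per doubling of the working set. A roughly constant cost per doubling is exactly what logarithmic growth looks like, not the flat line the O(1) label promises. (The climb isn’t smooth, of course; the cost arrives in steps as the working set outgrows each level of cache.)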

 
