Embedding Python in my App, Part I: An Unexpected Journey

Boost Python is pretty awesome.  It’s a way to wire up your C++ code to Python without having to go all the way down into the fairly low-level Python C API.  I have been working on a C++ application using Qt, and I wanted to embed Python in the app to allow user scripting.  There’s a lot I could write about that, but I’m lazy, so maybe some of the subtleties of allowing users to script your app will wind up in another post.

I looked at a few alternatives to Boost before I settled on it.  Since I am working on a Qt app, I looked first at PyQt and PySide and how they do things.  Both of them are bindings for the whole Qt framework, with a bunch of UI and miscellaneous stuff, but they also have systems for wrapping that stuff which you can use for your own code and classes.  Exposing Qt to my users would have been pretty neat, so they could build their own UIs, or talk to databases or network services with the Qt classes pretty much “for free.”

PyQt licensing is a bit tricky, and I’m not certain if my app will eventually be a commercial product, so I didn’t want to deal with the complexities.  On a personal project like this one, sometimes laziness is the best way forward.  That said, it works well, and the license cost is not at all unreasonable.  If I had a boss on this project who handed me PyQt as the way forward, I probably would have been able to make it work.

PySide is similar to PyQt.  Instead of being managed by a third party like PyQt, PySide is developed in-house by the maintainers of Qt, Nokia, erm, I mean Digia.  Nope, I think it’s the Qt Company for the moment.  That said, I need Qt 5.4 support, and PySide seems to have given up at Qt 4.x.  There may be some work maintaining it someplace that I am not looking, but the official release doesn’t seem to be very current.  They also don’t support Python 3, which I want to support going forward.  My current build is done with Python 2.7, but I’d rather support Python 3 from the get-go than have a big panic when I decide I can’t possibly live without the latest, greatest Python next month.  Shiboken, the wrapper generator used with PySide, is also terribly documented.  In theory it supports doing your own stuff without involving PySide, but finding a simple guide to doing it was frustrating, and I had no idea what I was doing.

I tried SWIG, but it hates namespaces.  Dealing with it in a real live modern-ish C++ code base proved to be really annoying.  I could wrap simple classes, but wrapping QObject-derived classes spread across multiple modules in different namespaces using SWIG proved to be remarkably similar to bashing my face against the mouth of an angry lion.  It’s not that it can’t be done, but you’ll frequently find your face being ripped off if you do.  Also, SWIG is very much targeted toward making Python modules, rather than embedding Python.  While I did make simple Python modules in my experiments, I never did manage to get it embedded in my app properly.

Which brings me to Boost Python.  It builds as a part of my existing code, so it works fine with whatever my real app is.  On the other hand, while SWIG and Shiboken are binding generators, Boost requires me to actively expose stuff by hand.  I would have much preferred to have my Python bindings ‘just work,’ which is basically impossible with Boost.  In an ideal world, I’d just write doxygen comments in my header files, run a generator, and magically have a fully documented, working Python API for my app.  Oh well.  Programming sucks, and you have to do work to make the program that you want.  Now that I have laid out the rationale that led me down this path, I’ll dig more into the nitty gritty of using Boost.Python in my next post.

A small weapon in the war on redundant data

I recently discovered a neat utility called rdfind that searches a path for duplicate files.  Dedup can be super useful when you realise you have hundreds of GB of redundant data floating around your PC.  (I had about 400 GB after moving a bunch of scattered data from several smaller hard drives to a new 3TB drive I just bought.  A lot of the drives had copies of some of the same data.)  It’s pretty easy to install (it’s in the standard repositories for apt-get on Ubuntu) and to use:

will@will-desktop:/storage/test$ sudo rdfind /mnt/Quantum2/
 Now scanning "/mnt/Quantum2", found 259947 files.
 Now have 259947 files in total.
 Removed 0 files due to nonunique device and inode.
 Now removing files with zero size from list...removed 651 files
 Total size is 1445615350230 bytes or 1 Tib
 Now sorting on size:removed 72229 files due to unique sizes from list.187067 files left.
 Now eliminating candidates based on first bytes:removed 117614 files from list.69453 files left.
 Now eliminating candidates based on last bytes:removed 5872 files from list.63581 files left.
 Now eliminating candidates based on md5 checksum:removed 5166 files from list.58415 files left.
 It seems like you have 58415 files that are not unique
 Totally, 394 Gib can be reduced.
 Now making results file results.txt

It has some other options to do things like delete duplicate files automatically.  But, that terrifies me.  So, I don’t use that feature.  The result after it cranks away for quite a while is a file called results.txt with everything you need to know to go wastehunting.  Unfortunately, the output format is a bit obscure if you just want to know what will free up the most space easiest:

# Automatically generated
 # duptype id depth size device inode priority name
 DUPTYPE_FIRST_OCCURRENCE 58483 3 1 2082 1710946 1 /home/will/Downloads/pattern-2.6/Pattern.egg-info/not-zip-safe
 DUPTYPE_WITHIN_SAME_TREE -58483 3 1 2082 1710957 1 /home/will/Downloads/pattern-2.6/Pattern.egg-info/dependency_links.txt

It doesn’t directly give you the count of a given file, or the total waste for a given file.  It just gives you a file ID for each file, and the size of each copy.  You have to count, multiply, and sort by yourself to understand where your worst offenders are.  So, I wrote a little Python script to process that file and save me counting file IDs on my fingers.  It’s not anything fancy, but its output looks like this:

will@will-desktop:~$ python Documents/rdproc.py /storage/test/results.txt
(4618027008, 1539342336, 3, '/mnt/Quantum2/vidprod/libby/TableRead/2013.09.17/PRIVATE/AVCHD/BDMV/STREAM/00000.MTS', 152486)
(4633560010, 2316780005, 2, '/mnt/Quantum2/vidprod/mdwm/WholeDriveBackup/Caitlin/MDWM/Transcoded', 162706)
(4927753612, 2463876806, 2, '/mnt/Quantum2/vidprod/mdwm/WholeDriveBackup/Caitlin/MDWM/Transcoded', 162593)
(5807821978, 2903910989, 2, '/mnt/Quantum2/vidprod/mdwm/WholeDriveBackup/MDWM', 160710)
(7562474188, 3781237094, 2, '/mnt/Quantum2/vidprod/mdwm/WholeDriveBackup/Caitlin/MDWM/Transcoded', 162601)

The order of output is explained at the GitHub link.  But, it lets me easily see that my biggest waste comes from having a bunch of footage from My Dinner With Megatron, as well as a backup of a Whole Drive that was used during production.  Hence, 2 copies of a bunch of that stuff that I can merge back down quite easily.  I also have no fewer than 3 copies of a Table Read that I shot for a friend quite a while ago, because I never cleared that memory card and wound up re-importing it a few extra times after I shot more stuff on it.  As you can see, having everything sorted and summed makes it a lot easier to understand than if you were to try and use the results.txt file yourself.  So, feel free to use the Python script I wrote.  It’s not complicated or fancy, but I figure it may be useful enough to save somebody from having to reinvent it for themselves.  Let me know if you find it useful.
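For anyone who just wants the gist of the count, multiply, and sort step, here is a minimal sketch.  It is not the actual rdproc.py, just an illustration written against the results.txt columns shown above (duptype, id, depth, size, device, inode, priority, name).  The function name summarize and the tuple layout (total bytes, size per copy, copies, first name seen, id) are assumptions inferred from the sample output, and it assumes duplicates share the ID of their first occurrence with a minus sign, as in the excerpt.

```python
def summarize(lines):
    """Group rdfind results by file ID and total up the space each group uses.

    `lines` is any iterable of text lines in the results.txt layout:
    duptype id depth size device inode priority name
    """
    groups = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        # Split into at most 8 fields so spaces inside the name survive.
        parts = line.split(None, 7)
        if len(parts) < 8:
            continue  # not a record in the expected layout
        _duptype, file_id, _depth, size, _device, _inode, _priority, name = parts
        key = abs(int(file_id))  # duplicates carry the negated ID
        entry = groups.setdefault(
            key, {"size": int(size), "count": 0, "name": name}
        )
        entry["count"] += 1
    rows = [
        (g["size"] * g["count"], g["size"], g["count"], g["name"], key)
        for key, g in groups.items()
    ]
    rows.sort()  # ascending, so the worst offenders end up at the bottom
    return rows


# Tiny demo using the two records from the excerpt above.
SAMPLE = """\
# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURRENCE 58483 3 1 2082 1710946 1 /home/will/Downloads/pattern-2.6/Pattern.egg-info/not-zip-safe
DUPTYPE_WITHIN_SAME_TREE -58483 3 1 2082 1710957 1 /home/will/Downloads/pattern-2.6/Pattern.egg-info/dependency_links.txt
"""

for row in summarize(SAMPLE.splitlines()):
    print(row)
```

To run it on a real file, something like `summarize(open("results.txt"))` should do, since the function only needs an iterable of lines.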

FFMPEG API. The Agony and the Adequacy.

The FFMPEG API is one of those things that I love to hate.  On one hand, it always seems to be the best tool for the job whenever I write video code.  It’s portable, and integrates relatively easily into any application without needing to use a special framework.  And it has existed long enough that I can reasonably expect it to still be there when I am making version 2 of my project.  It also supports basically every format I need to deal with.  That’s more than I can say about QuickTime[-X].  (Not portable to Linux.  QuickTime-X is Mac/iOS only.  Requires dealing with Carbon/Cocoa APIs.  Hard to use in a command line app.  Etc.)  Or Windows Media.  (Obviously Windows only.  Is it DirectShow or Windows Media Framework now?  Oh yeah, also surprisingly difficult to integrate into a command line app.)  Or GStreamer.  (More portable than WinMedia stuff, sure.  But still a lot of baggage to add to a port, and keeping consistent format support between platforms is essentially impossible.  And my app isn’t meant to be a “GStreamer client.”  It’s an app that I want to get video into.  So, just give me the damn pixels already.  I’ll worry about displaying them, thank you very much.)  Or QtMultimedia.  (In practice, I am using Qt, and it is fairly portable, but I still just want the damn pixels.  And I’ll need to encode at some point.)

Most of these APIs suffer from the fascinating delusion that people just want to write simple video players.  Who the hell actually wants to do that?  Why is it such a well supported use case, given that most users already have a video player installed?  It makes a neat party trick to do your own, but I’m not sure what I get out of writing another vlc/mplayer/totem/mediaPlayer/QuickTimePlayer given that I already have one.  I wonder who these legions of developers are who look at VLC and think, “I’m gonna basically do that, except probably worse and less mature.”  Every developer I ever talked to about using a video API was also doing something “interesting.”  They needed the pixels more than they needed a 20-line demo of making a video player in Python.  They needed easier encoding much more than they needed trivialised presentation.  And they needed good documentation.

So, FFMPEG stays the king of a motley crew of video APIs that aren’t that great.  But I keep running into the fact that they gradually evolve the API in place, keeping most of the cruft but changing just enough that the random example code I found on the net won’t actually compile anymore.  The API cleanups are never quite sweeping enough to elevate the API to “nice,” but they are just enough to make a lot of extra work out of figuring out which tutorials and samples are actually valid when starting a project.  Which wouldn’t be so bad if the main project documentation was first-rate.

For example, take this entry from the documentation for “avcodec_encode_video2”:  http://ffmpeg.org/doxygen/trunk/group__lavc__encoding.html#gaa2dc9e9ea2567ebb2801a08153c7306b  Now, first off, there is the fact that they mutate the API in place rather than just having a “version 2” of the API itself, so in order to preserve some sense of backwards compatibility through their redesigns, they put version numbers on individual functions.  This function replaced the older avcodec_encode_video after it was deprecated.  I suppose it is practical, and it does serve a purpose, but nothing else I depend on works this way.  As a matter of personal opinion, I really don’t like it.  But my actual complaint with the docs here is the explanation of the return value.

“0 on success, negative error code on failure”  Okay, so I can know if it worked, but that negative error code doesn’t actually seem to be documented anywhere.  And given the state of the docs and examples, I am pretty much guaranteed to do something that causes an error.  Unfortunately, I won’t get any kind of a hint about what exactly I did wrong.  The documentation, such as it is, is pretty much all just autogenerated Doxygen HTML pages presenting a slightly prettier view of exactly what I would see if I just read the headers directly.  And that’s where I go bonkers.
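For what it’s worth, the negative codes do seem to follow a pattern in libavutil/error.h: as far as I can tell, most are either a negated errno value (via the AVERROR() macro) or a negated four-character tag built with the FFERRTAG() macro, and libavutil has an av_strerror() function to turn a code into text.  The Python sketch below only mimics that decoding for illustration.  The names fferrtag and describe_averror are hypothetical helpers, not part of FFMPEG, and the errno branch assumes a platform where AVERROR(e) is simply -e.

```python
import os


def fferrtag(a, b, c, d):
    """Mimic FFmpeg's FFERRTAG macro: four characters packed
    little-endian into an integer, then negated."""
    return -(ord(a) | (ord(b) << 8) | (ord(c) << 16) | (ord(d) << 24))


def describe_averror(code):
    """Give a rough human-readable description of an FFmpeg return code.

    Hypothetical helper for illustration; in C, av_strerror() is the
    real way to get the message text.
    """
    if code >= 0:
        return "success"
    value = -code
    if value <= 0xFFFFFFFF:
        raw = value.to_bytes(4, "little")
        # Tag codes have printable ASCII in (at least) their last three
        # bytes; a small errno value never looks like that.
        if all(32 <= b < 127 for b in raw[1:]):
            return "FFmpeg tag %r" % raw.decode("latin-1")
    # Assumption: otherwise it is a negated errno (AVERROR(e) == -e).
    return "errno %d: %s" % (value, os.strerror(value))


# AVERROR_EOF is defined as FFERRTAG('E', 'O', 'F', ' ').
print(describe_averror(fferrtag("E", "O", "F", " ")))
print(describe_averror(-22))  # 22 is EINVAL on Linux
```

That still doesn’t tell you which errors a given function can return, but it at least turns the mystery number into something you can search for.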

I will hopefully be able to talk more about the app that I am writing that uses FFMPEG quite soon.  It’s an interesting project.  I’ve learned a lot about a lot of things, and it has some interesting features.  It’s part of the post production pipeline for a project that will be shooting in January that will hopefully turn out quite fun.