Lost in Userland (Talk at SCALE17x)

The slides for my talk at SCALE17x are available here: In Google Slides

The synopsis is here: At the SCALE website

My repository with W-Utils, the “Worst Utils” versions of a few common userland utilities, is here: On GitHub. The implementations are all incomplete, proof-of-concept sorts of things, but they are useful examples that are much simpler than the “real thing” in GNU Coreutils.

A video of the talk is up on the SCALE YouTube page.  My talk in the “Ballroom C” video starts at 4:05:30.  The camera doesn’t cover the screen, so you may want to click along with the slide deck if you want the full experience.  (The embedded video starts near the right time, but it doesn’t seem to seek to the exact spot, so you may need to scrub forward or back a little.)


Python Script Editor with HTML Output

File this one under Stupid Python Tricks.  I have written a bit about working on an app with an embedded Python runtime.  It’s good fun.  I recently added a new feature to the script editor that was relatively easy to implement but, for some reason, isn’t very common.  There are a few small quirks to making a script editor do this, so in case anybody is curious how to do it in their own app, this is how I did it.


Behold the rich text glory!  HTML output from Python directly in the script editor!  It really is like Christmas!


A small weapon in the war on redundant data

I recently discovered a neat utility called rdfind that searches a path for duplicate files.  Dedup can be super useful when you realise you have hundreds of GB of redundant data floating around your PC.  (I had about 400 GB after moving a bunch of scattered data from several smaller hard drives to a new 3 TB drive I just bought.  A lot of the drives had copies of some of the same data.)  It’s pretty easy to install (it’s in the standard repositories for apt-get on Ubuntu) and to use:

will@will-desktop:/storage/test$ sudo rdfind /mnt/Quantum2/
Now scanning "/mnt/Quantum2", found 259947 files.
Now have 259947 files in total.
Removed 0 files due to nonunique device and inode.
Now removing files with zero size from list...removed 651 files
Total size is 1445615350230 bytes or 1 Tib
Now sorting on size:removed 72229 files due to unique sizes from list.187067 files left.
Now eliminating candidates based on first bytes:removed 117614 files from list.69453 files left.
Now eliminating candidates based on last bytes:removed 5872 files from list.63581 files left.
Now eliminating candidates based on md5 checksum:removed 5166 files from list.58415 files left.
It seems like you have 58415 files that are not unique
Totally, 394 Gib can be reduced.
Now making results file results.txt

It has some other options to do things like delete duplicate files automatically.  But that terrifies me, so I don’t use that feature.  The result, after it cranks away for quite a while, is a file called results.txt with everything you need to know to go waste-hunting.  Unfortunately, the output format is a bit obscure if you just want to know what will free up the most space with the least effort:

# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURRENCE 58483 3 1 2082 1710946 1 /home/will/Downloads/pattern-2.6/Pattern.egg-info/not-zip-safe
DUPTYPE_WITHIN_SAME_TREE -58483 3 1 2082 1710957 1 /home/will/Downloads/pattern-2.6/Pattern.egg-info/dependency_links.txt

It doesn’t directly give you the count of copies of a given file, or the total waste for a given file.  It just gives you a group ID for each file, and the size of each copy.  You have to count, multiply, and sort for yourself to understand where your worst offenders are.  So, I wrote a little Python script to process that file and save me counting file IDs on my fingers.  It’s not anything fancy, but it looks like this:

will@will-desktop:~$ python Documents/rdproc.py /storage/test/results.txt
...
(4618027008, 1539342336, 3, '/mnt/Quantum2/vidprod/libby/TableRead/2013.09.17/PRIVATE/AVCHD/BDMV/STREAM/00000.MTS', 152486)
(4633560010, 2316780005, 2, '/mnt/Quantum2/vidprod/mdwm/WholeDriveBackup/Caitlin/MDWM/Transcoded', 162706)
(4927753612, 2463876806, 2, '/mnt/Quantum2/vidprod/mdwm/WholeDriveBackup/Caitlin/MDWM/Transcoded', 162593)
(5807821978, 2903910989, 2, '/mnt/Quantum2/vidprod/mdwm/WholeDriveBackup/MDWM', 160710)
(7562474188, 3781237094, 2, '/mnt/Quantum2/vidprod/mdwm/WholeDriveBackup/Caitlin/MDWM/Transcoded', 162601)


The order of output is explained at the GitHub link.  But it lets me easily see that my biggest waste comes from having a bunch of footage from My Dinner With Megatron, as well as a backup of a whole drive that was used during production.  Hence, I have two copies of a bunch of that stuff that I can merge back down quite easily.  I also have no fewer than three copies of a table read that I shot for a friend quite a while ago, because I never cleared that memory card and wound up re-importing it a few extra times after I shot more stuff on it.  As you can see, having everything sorted and summed makes it a lot easier to understand than trying to use the results.txt file yourself.  So, feel free to use the Python script I wrote.  It’s not complicated or fancy, but I figure it may be useful enough to save somebody from having to reinvent it for themselves.  Let me know if you find it useful.
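If you’d rather see the idea than fetch the script, here is a minimal sketch of this kind of results.txt summarizer.  To be clear, this is a hypothetical reimplementation, not the actual rdproc.py: it groups entries by the duplicate-group ID from the second column (negative IDs mark copies within the same group), tallies the copies, and emits tuples in the same spirit as the output above, sorted so the biggest offenders land at the bottom:

```python
# Hypothetical results.txt summarizer (a sketch, not the author's rdproc.py).
# rdfind's results.txt columns are: duptype id depth size device inode priority name
import sys

def summarize(path):
    groups = {}  # abs(group id) -> [size, copy count, example path]
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip the comment header
            fields = line.split(None, 7)  # maxsplit keeps spaces in the name
            gid = abs(int(fields[1]))     # copies carry a negated group id
            size = int(fields[3])
            name = fields[7]
            entry = groups.setdefault(gid, [size, 0, name])
            entry[1] += 1
    # (total bytes, size per copy, copy count, example path, group id),
    # ascending so the worst offenders print last
    rows = [(size * count, size, count, name, gid)
            for gid, (size, count, name) in groups.items()
            if count > 1]
    return sorted(rows)

if __name__ == '__main__':
    for row in summarize(sys.argv[1]):
        print(row)
```

Run it as `python summarize.py results.txt`; everything printed near the end is where merging copies buys you the most space.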