A little anniversary passed me by in November:

pedal:data alan$ whois

[…] Updated Date: 17-nov-2011 Creation Date: 25-nov-2001 Expiration Date: 25-nov-2012

Creation Date: 25-nov-2001! I’ve held this little vanity domain for ten years now, making both it and me unquestionably ancient in real- and internet-years.

As I ego-dove a few years ago:

… due to a squirrely web host disappearing entirely one night, I don’t have any records of the first site I built except for a few miscellaneous graphics floating around. It was wicked cool (I maintain), though, using a simple perl-based templating system to display the most recent of a set of dated text files within a design, with navigation via an index to the other files…

Seriously, it was awesome. That’s how we rolled in the aughts: We cobbled together our own custom templating tools and uploaded text files to our web host using some godawful Gnome FTP client. My hosts since then have been more stable (well, none of them have vanished, anyway), so now I have the complete blog record from January 2002 onward.

So I pulled the numbers to see my activity over time (which, by the way, is one good reason to work on a platform that one can control directly, rather than a hosted service: Want to make data from a database? Just run a query against it!). Here’s the per-month data for 2002 through November 2011:

Cool, right? Check that downward slope into 2007 as I finished graduate school, spent some time in Seattle, and thought about what to do next. Aside from a bit of a bump towards the end of 2008 (I was doing a lot of Lightroom tinkering and writing then), I’ve kept it pretty quiet around here the last few years (lots more casual posting to twitter and facebook, family-blogging at posterous, and brief flirtations with various devoweled platforms). I don’t know if more blogging is on the horizon, but it’s fun to explore the past ten years a bit (Kieran recently did this in style, producing a full-on ebook).

Ride the Divide - race visualization

I watched Ride the Divide on Netflix tonight. It’s a really well put-together documentary about a mountain bike race from Banff, Canada, along the entire Great Divide, to the New Mexico-Mexico border. It features great photography and strong characters in the semi-nuts enthusiasts who take on the adventure, and it turns out to be a pretty moving story.

It also has a bang-up cool race visualization:

Ride the Divide image
Ride the Divide image

The image features the relative positions of all the racers along the route — leaders, followers, and distance between them — their current altitudes, the mileage and location of the current subject, day of the race, relative distances to travel through each state, elevation of the overall route, and total travel distance for the entire race (2711 miles!). In a single, dense image, you get a ton of data. Quite cool.

The Setup by the Numbers

I recently found myself browsing interviews at The Setup, where various nerdy and creative types describe the tools they use to do their work, and my curiosity sparked at this brief statement on the about page: “Despite appearances, the site is not actually sponsored by Apple – people just seem to like using their tools. We’re a fan, too.”

Wouldn’t it be interesting, I wondered, to know just how many of the interviewees were Apple users? And what they used? And, for that matter, how many were into Android, or Lightroom versus Aperture, or emacs or obscure outlining applications that absolutely nobody else uses?

The code for The Setup is on github, and it includes the text of all the posts! The interviews are in markdown format, and the processor generates links for product names by referencing an index of hardware and software. This is the key for someone like me who wants to make data, because it provides a (mostly) standardized catalog of gear, no content coding required. It also means that the interview text can be descriptive while also referring to the standardized name for that bit of equipment, as in [15" MacBook Pro][macbook-pro].

The ruby code that builds The Setup from those files even helpfully includes a ready-to-go regular expression that finds those hardware and software references. So with a few of my own inelegant but functional passes using grep, perl and awk, I built a tab-delimited data set from which we can learn all sorts of things, such as:

  • More of The Setup interviewees are into Lightroom than Aperture
  • Apple machines really are popular (and so are iOS devices)
  • Textmate still has a lot of adherents
  • Canons are more popular than Nikons (though it’s pretty close)
  • Nobody yet interviewed has a Xoom or Galaxy tablet
  • Very few iOS apps are named more than once (not even Angry Birds)
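
The bracketed product references are easy to pick out with a one-liner. The pattern below is a rough stand-in I wrote for illustration — the real regular expression lives in The Setup’s ruby source — and `interview.md` is a made-up sample file, not an actual interview:

```shell
# Pull the [text][product-id] reference pairs out of a markdown interview.
# The pattern is a rough stand-in written for illustration (the real regex
# lives in The Setup's ruby source); interview.md is a made-up sample.
printf 'I use a [15" MacBook Pro][macbook-pro] and [Lightroom][lightroom].\n' > interview.md
grep -oE '\[[^]]+\]\[[a-z0-9-]+\]' interview.md
```

From there it’s a short hop, via cut/perl/awk, to a tab-delimited line per reference.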

I’ve used R to put together an easy-to-update, full rundown of the numbers (see usesthis-summary.txt) that I thought were interesting and/or fun, but you can easily explore via awk, too. For example, the following finds and counts all unique iOS applications:

awk -F '\t' '$4 ~ /-ios$/ { print $4 }' thesetup-data.txt | sort | uniq | wc -l

There are a few limitations to making grand statements about this data: Each interview is a static snapshot, of course, and we have no idea (without asking) whether, say, Marco Arment has moved his work to an HP TouchSmart, or Kieran Healy has switched to SPSS and MS Word, or whether all the reported 3G users are still using that model of the iPhone. [Idea: break down some of the numbers by year.] There are also occasional instances in the interviews where someone says something like, “I can’t imagine using something,” and due to the context-dumb nature of this data, that becomes a count of that something in the index. Finally, the counts rely on some skimming of the hardware/software catalogs and subsequent manual coding to identify models of gear that fit into various categories (Windows PCs and Android devices come in all makes and models, for example); these will probably need periodic updating.

The data, the code to build the dataset, and the R code to run some numbers are all available at github.

All of this is possible thanks to the cool coding behind The Setup (imagine the work required to build a catalog if the interviews were simply static, hand-built html), the careful curation of interviews to make use of the hardware and software catalog, and the Attribution-ShareAlike licensing of the original — licensing that applies to this effort as well. Thanks to Daniel Bogan and the contributors to The Setup!

(One note about the interview count: the interview with why the lucky stiff is hand-written and so doesn’t register with the scraper. But it’s good, so you should read it.)

Pics or it didn’t happen

It’s May again (I know, I don’t know how that happened either, except that it followed April, and don’t get me started on April). Among other things, May is home to the “IronMan in May” competition at my place of work. The goal is to complete a sum of distances equivalent to an IronMan triathlon in the span of 31 days: 26 miles running, 2.4 miles swimming, and 112 miles biking. It’s a fun challenge and makes for a nice opportunity to mix up my workout routine, especially since I’m really, really not much of a runner.

And of course it means data to keep track of! There’s a paper chart on the wall at work, but who would want to use that when there’s such a glut of data visualization tools making the rounds now? Two years ago I used Joe Gregorio’s sparklines tool to plot my progress as I went:

Well, it’s 2010 now (don’t get me started on 2009; that’s when I dislocated my shoulder — again — and then got nasty tendinitis to boot, so I didn’t take on IronMay last year), and this year I’m using Processing to plot my month’s data. Just as R is a programming language oriented around data, Processing is a language oriented around visualization (unlike R, though, which makes it easy to work with vectors and data frames, Processing requires going back to array manipulation, so it took me some getting used to). Having seen the spectacular stuff in the Processing gallery, I just hope that I’m not insulting the poor software by using it to make bar graphs (which look an awful lot like the old sparklines! I figured I had to start somewhere as I learned something brand-new-to-me). This quick Processing tutorial is a great place to start.

Update May 31 2010: It’s a gorgeous day in Flagstaff and I knocked out the last of my run and bike miles on the urban trail this morning. Iron June, anybody?

Update shortly after the prior update: I wasn’t content with just the bar plot, so I tinkered just enough with streamgraph.js to come up with this (click through for the full-sized version at flickr):

IronMay 2010 - streamgraph

Pretty rad, I think.

Local culture revisited

Several years ago I stumbled over the Netflix “local favorites” list and had a good time exploring it. Well, the New York Times has gone and made a really cool presentation of that data, for 2009, for a dozen U.S. cities. Check it out. Good stuff.

Another year of photo data!

Following in the modest two-year tradition I’ve established (see 2007 and 2008 posts), here is my 2009 photo data from my Lightroom catalog!

[ quick howto: Lightroom 1 & 2 (and 3) databases are in sqlite3 format, which means that freely-available tools can extract data from them. I use sqlite3, some shell scripting, and R (and occasionally excel) to produce summaries of that data. Why? Data offers some insight into the kinds of photos I take. Mostly, though, it’s fun. I’d be happy to expand on the actual code that goes into these plots, if there’s interest. ]
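
As a sketch of that extraction step: the table and column names below are what I’ve found poking at my own catalog — Adobe doesn’t document the schema, so verify them against your Lightroom version. To keep this runnable anywhere, the example builds a tiny stand-in table of the same shape rather than touching a real catalog:

```shell
# Build a toy table shaped like Lightroom's AgHarvestedExifMetadata
# (the table/column names are assumptions from my own catalog, not
# documented by Adobe), then run the kind of query behind the plots.
sqlite3 demo.lrcat <<'SQL'
CREATE TABLE AgHarvestedExifMetadata (image INTEGER, focalLength REAL);
INSERT INTO AgHarvestedExifMetadata VALUES (1, 35.0);
INSERT INTO AgHarvestedExifMetadata VALUES (2, 50.0);
INSERT INTO AgHarvestedExifMetadata VALUES (3, 35.0);
SQL
# Shots per focal length; on a real catalog, point sqlite3 at the .lrcat file.
sqlite3 demo.lrcat \
  'SELECT focalLength, COUNT(*) FROM AgHarvestedExifMetadata GROUP BY focalLength;'
```

The counts then go to R (or excel) for plotting.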

Below is a set of plots that summarize some of this year’s data. Click through to flickr to see the larger version.

2009 photo data!

What’s interesting this year? Well, crop ratios looked pretty similar to last year, so this year, for the first time (suggested in a post by Keitha, whose photos I admire tremendously, and whose Pentax lens set I envy with the fire of a million anti-glare-coated nine-aperture-bladed all-metal suns) I pulled out some information about aperture for each of the prime lenses that I shoot with. You can see these four frequency plots (for each of the Pentax DA 70mm F2.4 ltd, FA 50mm F1.4, FA 35mm F2.0 and DA 21mm F3.2 ltd lenses) in the left-hand column of the image. Right off the bat you can see that I shot a lot with the FA 35mm this year (confirmed by the “overall lens use” plot in the right-hand column). In fact, I took that lens along as my sole lens on a few long weekend trips to Ventura, CA, and the San Juan Islands, and really loved its performance. It does great at large apertures, but I also used it a lot for street shooting at f/8 and smaller apertures.

Runner-up in frequency this year is the FA 50mm F/1.4, which ordinarily I would say is my favorite lens (and it very much still is; it just wasn’t as convenient a focal length to take as my only lens on those vacations). Its sweet spot [where it’s sharpest but still has a nice narrow depth of field] is about F/4, which is where I primarily use it.

Neither the DA 70mm F/2.4 nor the DA 21mm F/3.2 got as much use this year, but I really love some of the photos I took with those lenses. In fact, I carried these two lenses specifically for their light weight and trim size on the Flagstaff photowalk I organized in July.

[photos: Car, Cat, Ranch house / wide, Crow, Pomegranite, Backside, Doorman]

How did 2009 stack up to 2008? In terms of absolute frequency, nearly identical: I kept 1308 frames in 2009, compared to 1340 in 2008. Far fewer of those are picks or posted to flickr — though a good number are waiting for me to come back and finish workup or make a print.

And that’s it for the 2009 photo stats! I did re-work my keyword network code, so perhaps can follow up this post with a little more about keyword relationships.

If you’d like to know more about extracting and summarizing info from your own Lightroom catalog, please let me know (and check out my other Lightroom-related posts).

And, as last year, I hope soon to follow up with a report on my 2009 photo goals, and to set a few for 2010.


Energy visibility

Our electric utility recently replaced our meter box with a digital one that somewhat magically sends its data back to the mothership. The cool bonus of this is that we can track our energy usage along a number of metrics. The online monitoring application isn’t super-sophisticated and I’d like more flexibility in how it aggregates data, but it does give us a window into how we use power. Not surprisingly, our usage increases in the mornings, in the evenings, and on weekends. But what we hadn’t expected to see was just how much the usage spiked when the hot tub timer switched on. Turns out it’s a whole lot easier to not be “hot tub people” when you can see just how much power it’s using (and money it’s costing).

Coming around again

Matt Yglesias asks “What are Today’s Protests Missing?” Turns out he asked much the same question a few years ago, and I had some thoughts at the time about what seems to be a common feature of both the left and right: When compared to the protests of ye old days, contemporary mass mobilization is greeted by public intellectuals with a sigh and either a) regret that it isn’t ye old days anymore, when protests were coherent and organized, or b) dismissive sneering about how the hippies have never been good for anything and still aren’t good for anything.

This time around, Matt makes a really important point, that coherence of movements often is really only sensible in hindsight:

Both Gandhi and King led movements that were committed to vaguely defined and quite sweeping visions of social change that, among other things, included opposition to capitalism and all forms of war. Their goals look well-defined in retrospect because they achieved a great deal so, in retrospect, MLK’s leadership resulted in the Civil Rights Act and the Voting Rights Act and Gandhi’s leadership led to independence for India. But all mass-movements are prone to ill-defined goals.

That’s a part of one of the key observations I made in response to this same thread a few years ago:

The single largest event of the period was a Washington, D.C., antiwar rally of November 15, 1969, attended by an estimated 250,000 people. A quick read of the coverage of that weekend—like yesterday’s march, it really was a series of events, not a single event—demonstrates that participants were there to take part for many reasons, although they all ended up under the anti-war banner: Students protested the draft; religious activists ranging from Catholic to Quaker participated; radical leftists were there, as were elderly women and parents with their children, as were small groups seeking violent confrontations; also present were African American organizers and advocates for the poor, protesting the war’s diversion of funds from domestic programs. This is still an oversimplified list of participants; it’s clear that while the war was the most tangible target of the protests, many grievances actually brought protesters out. Like this weekend’s march, officially organized by United for Peace and Justice, that series of events had a nominal set of organizers, but plenty of other groups also participated. In a sister protest across the country, where another 100,000 people demonstrated, Physicians for Social Responsibility and the Gay Liberation Front were among notable organizations represented.

This is not to say that the context for contemporary protest hasn’t changed: Political opportunity structure is different, modes and tools of mobilization are transforming, and movement organizations are functioning in some very different ways. But we need to be aware of the reality of the good old days of American protest in order to make sense of what has changed and what hasn’t changed.


Update: Brayden King, one of my old office-mates, has more thoughts on this topic. Typically for him, it’s good, smart, well-researched stuff.

Facebook network visualization

Quick ‘n dirty visualization of the clusters of relationships among my facebook friends:

Facebook network visualization

Data generated with Bernie Hogan’s My Online Social Network app on facebook, and visualized with GUESS. Good stuff, Bernie!

Thanks to Marc Smith — he’s one of the nodes up there — for the link to the flickr version of this image over at Connected Action.

Brayden King and Kieran Healy (they’re up there in my visualization, too) have posted their own plots over at orgtheory: one, two.

The year in Lightroom, by the numbers

I started last year to play with pulling data right out of my Lightroom catalog. How fun to combine interests in photography with my need to make data out of things. Last year about this time I posted some 2007 photo stats, and with the release of Lightroom 2 I came up with some keyword network maps of my flickr images.

Over at The Online Photographer, Marc Rochkind did some writing about meta metadata and released a tool for OS X that produces much more summary information than I had previously considered: His tool produces by-lens statistics on cropping and aspect ratio in addition to focal length usage. This generated some thoughtful conversation about composing in the viewfinder versus cropping, and Marc’s work spurred me to think more about my own stats, and so I went back to my own Lightroom 2 catalog with the sqlite browser and R to see if I could reproduce for myself some of the more interesting data that Marc’s tool generated. After some tinkering, I think I have a functional, reusable set of R tools for generalized reporting of Lightroom image data.

Like Marc’s ImageReporter, I can filter by image type, picks, ratings, Quick Collection, camera model (though this matters less for me since I have one P&S and one DSLR) and time period, and I added filtering by color label as well — hey, just for fun, even though I don’t use the color labels (I generally get rating fatigue using anything more than picks.)

So, what do I have? First, a reproduction of the stats I checked out last year: Monthly photos and focal length:

The year in Lightroom

I continue to primarily use my prime lenses, and my picture-taking appears to have notched down dramatically as compared to 2007. This is partly because of work, of course, but also because I’ve become much more selective about what I actually keep in the catalog.

We can break out focal length a bit more. For the two zooms that I use on my K100D, what are the mean focal lengths?

	> lensFL
	5.8-23.2 mm                          15
	85.0 mm f/1.8                        85
	smc PENTAX-DA 18-55mm F3.5-5.6 AL    31
	smc PENTAX-DA 21mm F3.2 AL Limited   21
	smc PENTAX-DA 50-200mm F4-5.6 ED    121
	smc PENTAX-DA 70mm F2.4 Limited      70
	smc PENTAX-FA 35mm F2 AL             35
	smc PENTAX-FA 50mm F1.4              50

So that’s kind of interesting, suggesting that I use the 200mm zoom at about the middle of its range. But the mean isn’t necessarily informative. Here’s a plot of focal length for one of those zooms:

Focal lengths plot, DA 50-200mm lens, 2008

So, I use the 50-200mm lens primarily for shots at either extreme of its range, and I already have a 50mm fixed lens that takes better photos than the zoom at that focal length. Moreover, breaking out just the picks with this lens shows a three-to-one preference for 200mm over 50mm. I think that means I need a long prime. Ka-ching!

I can also consider crop: How am I doing at composing in-camera? Here’s how often I crop, by lens, as a percentage:

	smc PENTAX-DA 18-55mm F3.5-5.6 AL   9.13 %
	smc PENTAX-DA 21mm F3.2 AL Limited 17.67 %
	smc PENTAX-DA 50-200mm F4-5.6 ED    6.93 %
	smc PENTAX-DA 70mm F2.4 Limited    23.78 %
	smc PENTAX-FA 35mm F2 AL           10.71 %
	smc PENTAX-FA 50mm F1.4            24.67 %

And, when I do crop, how much of the original composition do I keep?

	smc PENTAX-DA 18-55mm F3.5-5.6 AL  78.3 %                            
	smc PENTAX-DA 21mm F3.2 AL Limited 81.8 %                            
	smc PENTAX-DA 50-200mm F4-5.6 ED   81.6 %                            
	smc PENTAX-DA 70mm F2.4 Limited    80.9 %                            
	smc PENTAX-FA 35mm F2 AL           83.4 %                            
	smc PENTAX-FA 50mm F1.4            82.5 %

So, I’m cropping quite a bit. As Marc found in his exploration, these numbers go up when I filter by picks. I was surprised that I crop as much as I do with the DA21mm in particular, since I think of my use of it as being mostly for wide landscapes; but even those often enough are a bit crooked, enough to warrant at least some adjustment of tilt — and Lightroom calls that adjustment a crop (fairly).

Does cropping mean I do a poor job at composing in-camera? Possibly. I have to admit that knowing I can crop gives me a conscientious freedom when I’m shooting, but these numbers give me something to think about. Maybe careful composition will be something to work on as I go forward.

We can cut all this in a few other ways. I’d like to take a look at my common keywords during a given time period, for example, but that will wait for the follow-up post, I think. This is more than enough nerdery for one January 1st afternoon.

My Lightroom 2 Backup Strategy

Related: A more recent post about archive and backup in Lightroom.

With this morning’s comment from martie asking about a crashed hard drive, I got to thinking about making my own Lightroom 2 backup plan a bit more automated and reliable. My general approach is to periodically copy my catalog file and image directories to an external hard drive, but there’s been nothing systematic about it until now.

I’ve previously described a bit of my Lightroom file structure, noting that I import new photos into a single directory per import. As part of a strategy to save space on the MacBook where I do my actual work, I periodically move those folders to an external hard disk currently named Grundle. This is simply a matter of dragging the folder, in the left-hand directories pane of Lightroom, from one hard drive to another.

Lightroom display of multiple drives

While this copying step is manual, the rest of the system is now automated, thanks to this tutorial at MacResearch and a bash script by Aidan Clark. The bash script took just a bit of tinkering to work with Lightroom’s catalog file, which by default will have rsync-breaking spaces in it, and to perform the second backup from the external volume to the iMac. I’ll post those specific and very minor modifications if there is interest.

Here’s the final result: Using OS X’s launchd tool, whenever I mount Grundle on my MacBook, whether via network or direct firewire connection, my Lightroom 2 catalog file is copied to Grundle using rsync. And, whenever I mount Grundle on the upstairs iMac, a similar combination of launchd and rsync copies both the catalog file and the image directories from Grundle to the iMac. This means that in the course of regular use of my two Macs and that external drive, both my Lightroom catalog and folders full of images get backed up.
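
For anyone wanting to replicate the shape of this, here is a sketch of the two pieces. The label, paths, and script name are placeholders from my setup, not anything canonical — adjust them for your own machine:

```shell
# Sketch of the two pieces: a LaunchAgent plist that fires when anything
# mounts under /Volumes, and the rsync wrapper it runs. All paths and the
# label below are placeholders, not canonical names.
mkdir -p demo-launchd
cat > demo-launchd/com.example.lrbackup.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.lrbackup</string>
  <key>ProgramArguments</key>
  <array><string>/Users/alan/bin/lrbackup.sh</string></array>
  <key>WatchPaths</key>
  <array><string>/Volumes</string></array>
</dict>
</plist>
EOF
cat > demo-launchd/lrbackup.sh <<'EOF'
#!/bin/sh
# Bail quietly unless the backup volume is mounted; quote the catalog path,
# since "Lightroom 2 Catalog.lrcat" has rsync-breaking spaces in it.
[ -d /Volumes/Grundle ] || exit 0
rsync -a "$HOME/Pictures/Lightroom 2 Catalog.lrcat" /Volumes/Grundle/backups/
EOF
chmod +x demo-launchd/lrbackup.sh
```

Dropping the plist into ~/Library/LaunchAgents and loading it with launchctl wires it up; launchd then runs the script on every mount event, and the script itself decides whether Grundle is the volume that appeared.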

One caveat to this system is that the backup of the image folders still involves that manual step of moving them from the laptop to Grundle. I could automate this the same way the catalog backup is done, but that could mean trying to back up a gig or more at a time over the wifi network — a time- and bandwidth-consuming process that isn’t really necessary. The obvious downside is that my newest photos are always the ones most vulnerable to data loss. But I’m satisfied with my current workflow of moving folders to Grundle generally when I’m done working with that set of images. I’ll continue to think about this situation and may come up with some additional redundancy for that stage of processing.

Update: Okay, I buckled. A bit more tinkering and I now have my current folders of raw images copied to Grundle. After I relocate the folder using Lightroom, the folder will disappear from the backup directory, so I don’t have redundant backup files stacking up anywhere. Nice and clean, and everything’s safe.

Update the second: One item I neglected to mention in the original post is the automated backup feature built into Lightroom: Available in the catalog settings menu (alt-cmd-,), this feature performs scheduled backups of your catalog file only, to a location you specify, on any of several schedules. My process above allows that backup to run weekly — it never hurts to have a little more failsafe security. Automatically copying that backup to another hard drive then adds one more important layer of protection for your data.

mycrocosm beats me out of the gate

A little while back I had a fun idea: I bet I could use twitter to collect and store little, ad-hoc data statements; with a simple parser, those statements could be used to make data. A little ad-hoc database right inside twitter! I even got myself a domain name where I could tinker with it.

Well, mycrocosm beat me to it. It’s cool. It makes graphs. Rad. Exhibit A, on my time spent engaged with the Olympics:


Also. Tinkering with mycrocosm, I found Google Charts. Holy smokes!


And today I see daytum, another service of the same sort. It’s invitation-only, dammit. But it looks cool.

Lightroom 2: Related Keywords are Dreamy

As it happens, Lightroom 2.0 has just the thing I daydreamed about a handful of months ago. The new version’s catalog includes a table of keyword co-occurrences, which makes it possible to produce things like this:

My flickr tag neighborhood

This graph shows keyword relationships that occur within a hop from my “flickr” keyword — which I use to keep track of photos that I upload there. In other words, it’s sort of a descriptive keyword neighborhood of what I’ve put up on flickr.

Color is a little subjective. The darker the blue, the higher the ratio between unique neighbors and total neighbors. That is, darker blue nodes are connected to relatively fewer unique neighbors than the lighter blue nodes.

Of course, you could use any focal keyword for this kind of thing: Starting with a lens-specific keyword would produce a rough map of the neighborhood associated with that lens, and might reveal how I tend to use that lens. The possibilities are pretty endless — and totally a fun kick in the pants to tinker with.
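
The query behind the neighborhood is simple once you find the co-occurrence table. In my catalog it shows up as AgLibraryKeywordCooccurrence — an observed name, not documented by Adobe, so check your own version. A runnable toy version with a few faked rows:

```shell
# Toy version of the one-hop keyword neighborhood lookup. The table and
# column names (AgLibraryKeywordCooccurrence, tag1/tag2/value) are what I
# see in my own Lightroom 2 catalog -- verify against yours before relying
# on this. We fake a few rows, then pull everything one hop from "flickr".
sqlite3 keywords.db <<'SQL'
CREATE TABLE AgLibraryKeywordCooccurrence (tag1 TEXT, tag2 TEXT, value REAL);
INSERT INTO AgLibraryKeywordCooccurrence VALUES ('flickr','arizona',12);
INSERT INTO AgLibraryKeywordCooccurrence VALUES ('flickr','portrait',7);
INSERT INTO AgLibraryKeywordCooccurrence VALUES ('cat','portrait',3);
SQL
sqlite3 keywords.db \
  "SELECT tag2, value FROM AgLibraryKeywordCooccurrence WHERE tag1 = 'flickr' ORDER BY value DESC;"
```

Swap in any focal keyword for 'flickr' and the same query sketches that keyword’s neighborhood, ready for a graph tool.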

Dots and lines

Sparklines are fun to tinker with and can provide quick glimpses of data. Here are some not-quite-realtime twitter sparklines, built with this small and useful tool and a bit of scripting.

30 days of twitter:

How about a change plot:

Or, if you like, the straight-up histogram:

Feelin' fine

Way cool:

(image page at flickr)

We Feel Fine aggregates and provides clicky-feely visualizations of expressions of emotions online, via text found in blogs, flickr pages and google.

I spent a good chunk of today trying to figure out why a single dumb plot was coming out all hinky; these guys have colored affect balls swirling apparently effortlessly around your mouse cursor. I feel inadequate, sure, but I feel wildly enthusiastic, as well. This is cool stuff.

(Via Chris at Ruminate.)

Nice find: Dataninja

Dataninja was just the right site to stumble across tonight. The production of an “economist and (future) economics PhD student,” Dataninja is packed full of good data and workflow stuff: Techniques to convert from spreadsheets to LaTeX code, tips for working with Stata, R pointers (including homemade reference cards), applescripts, programming tools, links to data sets, and more. As they say, read the whole thing.

Packed full. It’s a great resource.

Data collection

I very much enjoyed Drek’s thoughts about data today, and I am looking forward to his following up on this post with some discussion of important elements of research design: For example, the differences between collecting experimental data, conducting various sorts of field research, and performing simulations.

Easier done than said

Outrage fatigue has set in, making it hard to get steamed about stuff like this anymore. These guys just stand up and lie, with contrary evidence right in front of them. We get lies about the economy, lies about the tax cut, and lies about going to war.

The same continues to happen with regard to Tim Lambert’s ongoing whacking of John Lott with the honesty stick. On the efforts among Lott supporters to debunk a study that contradicts their own “research,” Lambert points out that, contrary to repeated claims otherwise, the study’s data is publicly available from ICPSR.

I checked, and yep, Lambert’s right. It took exactly seven seconds and a single click of a “search” button to find the study and whole mess of downloadable data.

Evidence. Right there. Data. Available. How do people get away with this crap? Unfortunately, the ability to readily disprove an egregious lie—er, excuse me, “extension of the truth,” as I’m told we’re calling it now—seems to be easier done than said.

Edit: Oops. Accidentally dropped the “S” from ICP*S*R (to the joy of political scientists?).

About, the short version

I’m a sociologist-errant. This site is powered by Textpattern, Pair Networks and the sociological imagination. For more about me and this site, see the long version.

RSS feed