Last year I started playing with pulling data straight out of my Lightroom catalog; what fun to combine an interest in photography with my need to make data out of things. Around this time last year I posted some 2007 photo stats, and with the release of Lightroom 2 I came up with some keyword network maps of my flickr images.
Over at The Online Photographer, Marc Rochkind wrote about meta-metadata and released an OS X tool that produces far more summary information than I had previously considered: by-lens statistics on cropping and aspect ratio, in addition to focal length usage. This generated some thoughtful conversation about composing in the viewfinder versus cropping. Marc’s work spurred me to think more about my own stats, so I went back to my Lightroom 2 catalog with the sqlite browser and R to see whether I could reproduce some of the more interesting data his tool generates. After some tinkering, I think I have a functional, reusable set of R tools for general reporting on Lightroom image data.
Like Marc’s ImageReporter, I can filter by image type, picks, ratings, Quick Collection, camera model (though this matters less for me, since I have one P&S and one DSLR), and time period. Just for fun, I added filtering by color label too, even though I don’t use color labels (I get rating fatigue using anything more than picks).
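For anyone curious about the plumbing: the Lightroom catalog is just a SQLite file, so RSQLite can query it directly. The table and column names below (AgInternedExifLens, AgHarvestedExifMetadata, lensRef, focalLength) are my reading of the Lightroom 2 schema and may differ between versions; this sketch builds a tiny in-memory database of the same shape so the join itself can run anywhere.

```r
## A toy, self-contained version of the lens/focal-length query.
## Table and column names are assumptions about the LR2 schema.
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

## Stand-ins for the (assumed) catalog tables
dbExecute(con, "CREATE TABLE AgInternedExifLens (id_local INTEGER, value TEXT)")
dbExecute(con, "CREATE TABLE AgHarvestedExifMetadata (lensRef INTEGER, focalLength REAL)")
dbExecute(con, "INSERT INTO AgInternedExifLens VALUES
                  (1, 'smc PENTAX-FA 50mm F1.4'),
                  (2, 'smc PENTAX-DA 50-200mm F4-5.6 ED')")
dbExecute(con, "INSERT INTO AgHarvestedExifMetadata VALUES (1, 50), (2, 200), (2, 55)")

## One row per image: lens name plus focal length
shots <- dbGetQuery(con, "
  SELECT l.value AS lens, m.focalLength AS focal
  FROM AgHarvestedExifMetadata AS m
  JOIN AgInternedExifLens AS l ON l.id_local = m.lensRef")
dbDisconnect(con)
```

Against a real catalog you would point dbConnect at a copy of the .lrcat file instead of ":memory:" (a copy, so Lightroom and your queries never fight over the file).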
So, what do I have? First, a reproduction of the stats I looked at last year: monthly photo counts and focal lengths:
I continue to use my prime lenses primarily, and my picture-taking appears to have dropped off dramatically compared to 2007. This is partly because of work, of course, but also because I’ve become much more selective about what I actually keep in the catalog.
We can break out focal length a bit more. For the two zooms that I use on my K100D, what are the mean focal lengths?
5.8-23.2 mm                          15 mm
85.0 mm f/1.8                        85 mm
smc PENTAX-DA 18-55mm F3.5-5.6 AL    31 mm
smc PENTAX-DA 21mm F3.2 AL Limited   21 mm
smc PENTAX-DA 50-200mm F4-5.6 ED    121 mm
smc PENTAX-DA 70mm F2.4 Limited      70 mm
smc PENTAX-FA 35mm F2 AL             35 mm
smc PENTAX-FA 50mm F1.4              50 mm
So that’s kind of interesting, suggesting that I use the 50-200mm zoom at about the middle of its range. But the mean isn’t necessarily informative. Here’s a plot of focal length for one of those zooms:
So, I use the 50-200mm lens primarily at either extreme of its range, and I already have a 50mm prime that takes better photos than the zoom at that focal length. Moreover, breaking out just the picks taken with this lens shows a three-to-one preference for 200mm over 50mm. I think that means I need a long prime. Ka-ching!
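The per-lens means and the distribution plot above reduce to a few lines of base R once the rows are out of the catalog. A sketch with invented data (the numbers below are placeholders, not my real shot data):

```r
## Invented rows standing in for the result of a catalog query
shots <- data.frame(
  lens  = c(rep("smc PENTAX-DA 50-200mm F4-5.6 ED", 5),
            rep("smc PENTAX-FA 50mm F1.4", 3)),
  focal = c(50, 200, 200, 55, 200, 50, 50, 50)
)

## Mean focal length by lens, as in the table above
means <- round(tapply(shots$focal, shots$lens, mean))

## Distribution for one zoom: is use clustered at the extremes?
zoom <- subset(shots, lens == "smc PENTAX-DA 50-200mm F4-5.6 ED")
hist(zoom$focal, breaks = 10,
     main = "Focal length use, 50-200mm zoom",
     xlab = "focal length (mm)")
```

The mean alone would report this toy zoom at 141mm; the histogram shows the shots actually piling up at 50mm and 200mm, which is exactly the mean-versus-distribution point above.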
I can also consider crop: How am I doing at composing in-camera? Here’s how often I crop, by lens, as a percentage:
smc PENTAX-DA 18-55mm F3.5-5.6 AL 9.13 %
smc PENTAX-DA 21mm F3.2 AL Limited 17.67 %
smc PENTAX-DA 50-200mm F4-5.6 ED 6.93 %
smc PENTAX-DA 70mm F2.4 Limited 23.78 %
smc PENTAX-FA 35mm F2 AL 10.71 %
smc PENTAX-FA 50mm F1.4 24.67 %
And, when I do crop, how much of the original composition do I keep?
smc PENTAX-DA 18-55mm F3.5-5.6 AL 78.3 %
smc PENTAX-DA 21mm F3.2 AL Limited 81.8 %
smc PENTAX-DA 50-200mm F4-5.6 ED 81.6 %
smc PENTAX-DA 70mm F2.4 Limited 80.9 %
smc PENTAX-FA 35mm F2 AL 83.4 %
smc PENTAX-FA 50mm F1.4 82.5 %
So, I’m cropping quite a bit. As Marc found in his exploration, these numbers go up when I filter by picks. I was surprised that I crop as much as I do with the DA21mm in particular, since I think of using it mostly for wide landscapes; but even those are often a bit crooked, enough to warrant at least some adjustment of tilt, and Lightroom (fairly) counts that adjustment as a crop.
Does cropping mean I do a poor job of composing in-camera? Possibly. I have to admit that knowing I can crop gives me a certain freedom when I’m shooting, but these numbers give me something to think about. Careful composition may be something to work on going forward.
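Both crop tables derive from two quantities per image: whether a crop was applied, and what fraction of the frame survived it. Assuming you have original and cropped pixel dimensions for each image (the develop-settings fields that hold these vary by catalog version, so the column names here are placeholders), the arithmetic is:

```r
## Invented rows; in practice these come from the catalog's develop settings
imgs <- data.frame(
  lens   = c(rep("smc PENTAX-FA 50mm F1.4", 4),
             rep("smc PENTAX-DA 21mm F3.2 AL Limited", 4)),
  orig_w = rep(3008, 8), orig_h = rep(2008, 8),
  crop_w = c(3008, 2700, 3008, 2500, 3008, 3008, 2900, 3008),
  crop_h = c(2008, 1800, 2008, 1700, 2008, 2008, 1900, 2008)
)

## A crop happened if either dimension shrank
imgs$cropped  <- imgs$crop_w < imgs$orig_w | imgs$crop_h < imgs$orig_h
## Fraction of the original frame area kept
imgs$retained <- (imgs$crop_w * imgs$crop_h) / (imgs$orig_w * imgs$orig_h)

## How often each lens's shots get cropped, as a percentage
crop_rate <- round(100 * tapply(imgs$cropped, imgs$lens, mean), 2)

## Of the cropped shots only, mean percentage of the frame kept
cropped <- imgs[imgs$cropped, ]
kept    <- round(100 * tapply(cropped$retained, cropped$lens, mean), 1)
```

The two tapply calls correspond to the two tables above: crop frequency over all shots, and retained area over cropped shots only.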
We can cut all this in a few other ways. I’d like to take a look at my common keywords during a given time period, for example, but that will wait for the follow-up post, I think. This is more than enough nerdery for one January 1st afternoon.
I like this post on regression and linear models by Drew Thomas. Drew argues that casual use of the term “regression” doesn’t adequately describe linear modeling:
“Regression” literally means “the act of going back.” If we accept this definition in this context, we have to have something to which we can return. Clearly, this implies discovering the mean – but chronologically, it can only mean discovering the cause, that which came before.
Linear modelling makes no explicit assumptions about cause and effect, a major source of headache in our discipline, but the word itself, consciously or otherwise, binds us to this fact.
It also grates on his sensibilities to hear “regress” used as a verb. “We regressed bar on foo” has always seemed like an awkward phrasing to me, as well. Drew acknowledges that changing terminology is a tough business, but the call to be more precise in the language of our methodologies and models is one that I can get behind.
The second international conference of R users recently took place in Vienna, and the conference site has now posted slides and abstracts of both the keynotes and the regular presentations. There’s a ton of stuff there: Discussions of R for all sorts of statistical and graphics purposes, using R in teaching, and talks about R in a wide variety of disciplines and practices. It’s a gold mine.
The R-Help mailing list is a wealth of information. While I have no doubt (well, mostly) that the people who frequent the mailing list are all nice people in real life, woe is the newbie who asks a question that is easily answerable by consulting one of a half-dozen arcane texts, conducting an exhaustive search of list archives, or using R’s internal help system. “Read the posting guide!” will be accompanied by a curt response that often suggests how truly easy it was to find this answer for anyone not still working on their own cell division. (Okay, it’s not really that bad, but it can certainly be an intimidating place due to the sheer number of super-smart residents who have little tolerance for perceived time-wasters.)
This all adds up to my not being quite sure how to take today’s April Fools’ post by list heavyweight Frank Harrell:
I have never taken a statistics class nor read a statistics text, but I am in dire need of help with a trivial data analysis problem for which I need to write a report in two hours. I have spent 10,000 hours of study in my field of expertise (high frequency noise-making plant biology) but I’ve always thought that statistics is something that can be mastered on short notice.
Briefly, I have an experiment in which a response variable is repeatedly measured at 1-day intervals, except that after a plant becomes sick, it is measured every three days. We forgot to randomize on one of the important variables (soil pH) and we forgot to measure the soil pH. Plants that begin to respond to treatment are harvested and eaten (deep fried if they don’t look so good), but we want to make an inference about long-term responses.
There’s more, including a couple of helpful responses, so you know, read the whole thing. The message ends with this conclusion, which is actually fairly representative of a good number of frantic help-me posts: “I would appreciate receiving a few paragraphs of description of the analysis that I can include in my report, and I would like to receive R code to analyze the data no matter which variables I collect. I do value your time, so you will get my everlasting thanks.”
Take-home message: Read the posting guide, design your analysis carefully, and don’t look crossways at Frank Harrell in a dark alley.
Looking up some information on ordinal logit models in R today, I came across Zelig. Produced by Kosuke Imai, Gary King, and Olivia Lau, Zelig is a sort of R meta-package that wraps, for example, various functions of the MASS or VGAM libraries into a smaller set of commands. The authors subtitled Zelig “Everyone’s Statistical Software,” with the idea that it will make R more accessible while not cutting off any of its power. Zelig appears to be a couple of years old now, but I hadn’t run into it before; maybe I’m hanging around on the wrong mailing lists?
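For reference, here is what the un-wrapped version of an ordinal logit looks like in plain MASS, one of the libraries Zelig wraps. The data are simulated, just to show the shape of the call; polr wants the response as an ordered factor:

```r
## Simulated ordinal data: the latent scale depends on x
library(MASS)

set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- cut(2 * d$x + rnorm(200), breaks = 3,
           labels = c("low", "mid", "high"), ordered_result = TRUE)

## Proportional-odds (ordinal logit) fit
fit <- polr(y ~ x, data = d)
coef(fit)
```

The appeal of a wrapper like Zelig is that calls of this shape look the same across model families, instead of each backing library having its own conventions.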
Dataninja was just the right site to stumble across tonight. The production of an “economist and (future) economics PhD student,” Dataninja is packed full of good data and workflow stuff: Techniques to convert from spreadsheets to LaTeX code, tips for working with Stata, R pointers (including homemade reference cards), applescripts, programming tools, links to data sets, and more. As they say, read the whole thing.
It’s a great resource.