schussman.com logo

Dealing with duplicate keywords in Lightroom 2

I’ve noted recently how slick it is to be able to use nested keywords in Lightroom: It’s a piece of cake to select a set of photos, hit cmd-K, and enter “mount humphries > mountains” to assign the “mount humphries” as a child keyword of “mountains”.

However, as noted by a poster to a thread about keywording over on the Flickr Lightroom group, this creates a potential problem: If the child keyword is already used, Lightroom 2 will end up creating a duplicate keyword; you’ll end up with one “mount humphries” without a parent, and one “mount humphries” keyword with the parent of “mountains.” So, by trying to be hierarchical with your keywording, you’ve actually splintered your keywords. Not helpful!

Having already found a fun way to explore relationships between and frequencies of my keywords, it occurred to me that I might have some ready-made tools to help with this situation: The need to find and deal with duplicate keywords.

As this article became more popular, I worked through a couple of alternative methods and organized things a bit further. To date, I describe three methods of identifying duplicate keywords:

  • Full auto: Requires some scripting but is the most expedient way to go about it (and my favorite).
  • Semi-auto: Requires the awk tool to identify duplicates but doesn’t rely on any sqlite3 code to pull from the LR database.
  • Full manual: Uses LR’s built-in export tool and MS Excel to get what you want. Lots of steps, but it works.

After finding your preferred method, read on to see what to do with all the duplicates you identify.

The full auto/scripted method

Thanks to Dieter for putting me on to a much quicker, straightforward way to identify duplicates using SQLite only — no awk:

select count(name) as num, name from AgLibraryKeyword group by name having num > 1

I’ve kept the original SQLite+awk method below for posterity:

First I needed to find any duplicates created — potentially — by the ad-hoc nesting of keywords. A quick modification to my keyword frequency script produces a list of dupes:

# display a list of duplicate keywords in Lightroom 2 cp ~/Pictures/Lightroom/Lightroom\ 2\ Catalog.lrcat ~/lightroom.lrdb /usr/local/bin/sqlite3 -csv ~/lightroom.lrdb 'select ImageCountCache, name from AgLibraryKeyword;' > /Users/alan/lr-keywords.csv awk -F , '{print $2}' lr-keywords.csv | sort -n | uniq -d rm ~/lightroom.lrdb rm ~/lr-keywords.csv

As with the frequencies, this short script makes a backup copy of my Lightroom 2 database, then calls on sqlite3 to extract the list of keywords; the difference is in the 4th line, where I use awk to pull out the tags and then pipe them through two built-in unix functions to print a list of any duplicates. In my case, it yielded the following:

Running the duplicate keyword script

Perfect! A list of keywords that appear to be duplicates.

Note: The script above works with OS X Tiger and requires an upgrade to the default version sqlite. It ought to work with Leopard as long as sqlite is present. Windows? Don’t know; either via cygwin or separate binaries, awk and uniq should be available for Windows, and there is a sqlite for Windows download at the above link.

Skipping the sqlite step

Update 17/Dec/2008: If you’re not eager to delve into sqlite, you can still make this work, but you’ll still need to have the awk tool. OS X users, you should be good to go, since awk comes with the OS; Windows users, you can download awk for Windows. First, manually export your keywords list from your catalog: Metadata > Export Keywords…, and save the file as lrkeywords.txt, and then run the following one-liner script from a shell/terminal:

awk 'BEGIN {FS = " "}; {$2 = $2; if (match($0, /{/)0) print}’ lrkeywords.txt | sort -n | uniq -d

Just as in the sqlite version, this one-liner parses your keywords file and returns the list of keywords that appear more than once and are not identified as synonyms. You can then reconcile the duplicates as described above. I prefer the single-step version that extracts directly from the database, but hope this is useful to a few folks.

The Full Manual Process

If you’re averse to both the sql steps and to using awk, you can use MS Excel to identify the duplicates. I think it’s far more cumbersome than either of the above processes, but it works. Here goes:

Manually export your keywords from Lightroom: Metadata > Export Keywords ….

Open that export file in Excel: Open as a text delimited file, but uncheck all the delimiters; you don’t want excel to parse along spaces or tabs, since both of those characters appear in the file but not as record separators.

You should see a single column of keywords something like this:

screenshot

We need to slightly clean up that column to remove leading whitespace. We’ll use the CLEAN function to do that: In the column next to your keywords column, enter the formula =clean(a1), and then drag that formula all the way down the keywords column.

screenshot

In col B, you now have a whitespace-trimmed set of keywords, but because of the clean() formula, you can’t manipulate it further. Select that entire column, copy it, and then use paste special to paste the column as values into column C.

screenshot - cleaning up keywords list with paste special

Now we’re ready for the final steps: Sort, flag, and filter. Select column C, then go to Data > Sort, and sort ascending by column C. Now, in column D, drag the following formula from the first row to the last: =IF(C2=C1, "!", "").

screenshot - applying the duplicate flag formula to the sorted list

That just fills the column with a flag if column C has a duplicate. It’s a low-budget search, but it works as long as the list of keywords is sorted. Finally, use Data > Filter > Autofilter, and click-select the ! in column D. You’ll now have a filtered list of duplicates from your original keyword list, which you can resolve as described above. Note that you’ll have a number of keywords surrounded by {} or [] brackets; these are keywords entered as synonyms or categories, and you should ignore them when you are addressing duplicates.

screenshot - the list of filtered duplicate keywords

You can see from my current duplicate list that I’ve been working heavily on food-related keywords as we cruise through the holiday season.

All told, that whole manual process should just take a couple of minutes once you have the steps sorted out. Because the steps are manual, it’s not as easily-repeatable as the automated sqlite+awk approach, but it does work. I hope someone finds it useful!

Dealing with the duplicates

Whatever method you’ve employed, at this point you have a list now — let’s check out if it means what I think it means. Switching back over to Lightroom, I can filter for all photos with the “mount humphries” keyword:

Lightroom listing duplicate keywords

Sure enough, I have 12 images tagged with “mount humphries”, and 11 images with the same tag set as a child of “mountains” (as an aside, I see that I have well over a hundred images with the “mountains” tag that could probably use some more granular tagging).

My first impulse was to try to just drag the non-child “mount humphries” into the “mountains” tag; this works, after all, with other keywords. But in this case, it won’t do the trick, presumably because there is already a “mount humphries” keyword there — Lightroom won’t let me add a same-named child.

To reassign the keyword to the parent, you need to take a few more steps: First, click the right-pointed arrow to the right side of the duplicate, non-child keyword; this will navigate to all images assigned that keyword. Then, in the grid view, simply select all (cmd-A), and then check the child keyword to add it to all of the selected images (the checkbox is to the left of the keyword, and appears when the mouse cursor hovers over the keyword), and you’ll see the count increase accordingly. Next, un-check the duplicate, non-child keyword in the keywords panel. You’ll see its count drop to 0. The order of those two check-uncheck steps is important: If you uncheck the non-child keyword first, you’ll end up with an empty selection and nothing to apply the proper keyword to.

Lightroom displaying duplicate keywords unchecked

There! My “mount humphries” keyword as a child of “mountains” is now assigned to all 23 original images, and I can delete the duplicate, non-child tag.

So, with an approach like this, ad-hoc keyword nesting shouldn’t be feared: We can identify duplicates created by nesting, and, in a matter of seconds, apply the same nesting to any previously-tagged images. And, once you’ve resolved the duplicate, any further assignment of the focal keyword will always assign it, appropriately, as a nested tag. Pretty slick, I do say.