Dealing with data

Warning: The post below is much longer than I expected and delayed by several days.

Tim Lambert has a really nice discussion of the importance of appropriate statistical models. He has done an extraordinary job of taking John Lott to task for both coding errors and the misuse of statistics, and his recent work is no exception. His discussion puts not only standard errors but also clustering into a highly understandable context.

Lambert’s service to the honest use of data is really important, and his most recent discussions coincide with a presentation last week by Erin Leahey on practices of data cleaning and data editing. Leahey’s research is still ongoing, and her published work on faculty practices is being extended to investigate student practices. At the core of the work is the assertion that cleaning or otherwise editing data is a common part of research. Data may be edited to make it amenable to particular statistical techniques, it may be cleaned when different variables contradict one another, or cases may be dropped from analysis when their values are overly influential.
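Dropping overly influential cases is usually grounded in a formal diagnostic rather than eyeballing. As a minimal sketch (the measure and the common 4/n cutoff are standard textbook practice, but nothing here reflects Leahey’s or any particular researcher’s actual procedure), here is Cook’s distance computed from scratch for an ordinary least squares fit:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each case in an OLS fit.

    X: (n, p) design matrix (include an intercept column); y: (n,) outcome.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Diagonal of the hat matrix: each case's leverage
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    mse = resid @ resid / (n - p)
    return (resid**2 / (p * mse)) * (h / (1 - h) ** 2)

# Toy data: a clean linear trend with one aberrant case at the end
X = np.column_stack([np.ones(10), np.arange(10.0)])
y = np.arange(10.0) * 2.0
y[9] += 15.0
d = cooks_distance(X, y)
flagged = np.where(d > 4 / len(y))[0]  # the common 4/n rule of thumb
```

Whether a flagged case should actually be dropped is exactly the kind of judgment call the presentation was about; the diagnostic only tells you where to look.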

Depending on circumstances, there are legitimate reasons to do all of those things, but Leahey finds that these practices are only occasionally taught in much detail; seasoned researchers as well as students are frequently unsure of what the best practices are when it comes to cleaning data. In my own case, I have been cleaning protest data for four years now, using a set of programs that search for logically inconsistent values to identify sources of potential error. The programs, for example, flag cases in which events are coded as taking place in multiple locations, but where only a single location is specifically identified. Coders then take those reports back to the original source material (in this case, news articles about protest) to find the discrepancies and correct the data.
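The kind of consistency check described above can be sketched in a few lines. The record layout here (a `multi_location` flag alongside the coded `locations` list) is entirely hypothetical, invented for illustration; the actual coding scheme for the protest data is not specified in this post:

```python
# Hypothetical record layout: each event carries a flag saying whether it
# occurred in multiple locations, plus the locations actually coded.
events = [
    {"id": 1, "multi_location": True,  "locations": ["Albany", "Buffalo"]},
    {"id": 2, "multi_location": True,  "locations": ["Chicago"]},   # contradiction
    {"id": 3, "multi_location": False, "locations": ["Detroit"]},
]

def flag_location_inconsistencies(events):
    """Return ids of events whose multi-location flag contradicts the coded locations."""
    return [e["id"] for e in events
            if e["multi_location"] != (len(e["locations"]) > 1)]

flag_location_inconsistencies(events)  # → [2]
```

Note that the program only flags the contradiction; resolving it still requires a human coder returning to the source article, as described above.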

Even a fairly systematic procedure like this one is subject to problems. As the data grew from dozens to hundreds of files coded by many individuals, not only did the cleaning become a larger and more distributed task, but the nature of protest events also shifted substantively over the time period covered by the data. As a result, issues which were of little importance early on became critical. Lawsuits and other legal proceedings, for instance, comprise just a smattering of data in the early time period. By the 1970s and 1980s, however, when lawsuits rise in frequency, the complications of recording their duration and other characteristics become more important.

Some of these problems can be addressed by carefully evaluating coding practices and doing our best to ensure consistency among coders. Occasionally, as happened a year or so ago, addressing past coding problems meant digging into older data and undertaking a massive *re*cleaning and recoding effort. The point here, and what I take from Leahey’s work, is that dealing with data is a serious task; as Leahey finds, grad students and faculty take it very much to heart, particularly as best practices seem uncertain at times. The validity and reliability of our work with our data depend on our ability not only to perform honest analysis, but to make the claim that the data itself represents the phenomena we seek to explain. In the case of some of the protest data, this means acknowledging that when dealing with lawsuits we have less precision than in cases of sit-ins, marches, or rallies; we have much better data on the duration of the latter events than the former, so subsequent work with the data should take that into account.

What does this all mean with respect to the Lott/Lambert situation? Lambert reminds us of the centrality of integrity in work that involves data and analysis. Such work is hard, and demands extraordinary time and effort. When one’s work is performed with rigorous attention (to both the uncertain best practices of cleaning and editing and the far more certain best practices of statistical analysis) and is open to peer review, it can make a contribution. Lott’s research continually fails this standard. The absurdity of adopting a false identity to cover one’s tracks and blaming obvious deceptions on one’s Macintosh goes so far beyond what any self-respecting community of peers would accept. Rather than generate a scientific dialogue, what Lott has consistently done is obscure his data, dismiss criticism, and conceal (or try to conceal, anyway) his methods. He has done a real disservice to both the substance and method of social scientific work.