06 March 2007

my new filing technique is unstoppable

friends, it is high time that i tell you what i have been doing with myself (other than gorging on exotic produce and playing with frisky kittens) since i came to california.

i'll tell you: i've been filing. here at hrdag, incredible amounts of data are constantly falling out of the sky and into our laps, and all this data must be filed. it must all be filed in such a way that things that fall out of the sky are kept separate from things that are lovingly cultivated by the resident geeks. it must all be filed in such a way that we know by looking at its name what each of the thousands of files in each of our projects will do if we open it. it must all be filed so that everyone has access to it all the time. it must all be filed such that everyone who has access to it always knows its complete history. and so on.

filing is not necessarily the strong point of the social scientist. just for example, i'm not sure where i put most of the papers from my prospectus bibliography. (also, i don't remember what they say. but that is an entry for another day.) so for filing we put down KKV and pick up bash, python, svn, (GNU) emacs, and some of the rather more obscure portions of stata and R.

needless to say, learning all of this at once after a computer science career consisting entirely of...nothing...is a bit daunting. luckily, the fact that data fall from the sky makes difficulties *both* endemic (if you want good inference) and worthwhile (because you want good inference). and we do, in the end, want good inference. what has astounded me so far is the extent to which folks who purport to study violence in a quantitative way haven't dealt with (all that much of) the problems native to that sort of data. and, maybe more troublingly, how consequential that can actually be.

meghan has the best explanation of the problem. "it's not that they think they have it right," she said, "it's that they think they don't have it that wrong." we've been taught to believe, implicitly at least, that magnitudes somehow don't matter, and that the only thing that should concern us is whether we've correctly "approximated" the "variation." so it should be fine to generalize from the sample of killings reported to police or newspapers or survey researchers or truth commissions or archdiocesan commissions! except, whoops, each of those sources is biased in a different way; each source's biases change over time; we might be barking up entirely the wrong tree.

[also of note: turns out that in the real world, and the course of the history of the real world, magnitudes really fucking matter (read the first paragraph of that link, comparing the number of deaths reported to the number eventually estimated. then think about that for a while.), and it's nasty for social scientists to say or imply that they don't.]

so the measurement problems matter (not just historically, but yes, friends, they are going to fuck with the quality of your causal assertions). basically, the filing i'm doing is the first piece of the necessarily elaborate method for fixing those problems. because: before i can figure out the counting piece of "what really happened in el salvador," i need to catch at least three old, scary, broken datasets as they fall from the sky; i need to fix them (not by hand, but rather by python); i need to compare all possible overlaps between the three (not by hand, but by an exciting matching pipeline i don't fully understand); and i need to apply some probability theory and nonparametric statistics in order to make the overlaps tell me about all the cases that weren't counted. also, that whole process has to be sufficiently transparent that someone could take the same old, scary, broken datasets and come to the same conclusions. or so that i could do it all again when more data rained like manna from heaven.
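
for the curious, here is a toy sketch of the "compare all possible overlaps" step and of the simplest way overlaps can say something about uncounted cases. to be clear about what is assumed: the list names (`a`, `b`, `c`), the record ids, and the helper functions `overlap_cells` and `chapman` are all made up for illustration, and the records are assumed to be matched already. the estimator is chapman's textbook two-list capture-recapture formula, swapped in because it fits in a few lines. it is not the method we actually use, and it assumes the lists are independent, which is exactly what biased sources are not.

```python
from itertools import combinations


def overlap_cells(lists):
    """count records by the exact combination of lists they appear in."""
    all_ids = set().union(*lists.values())
    cells = {}
    for rid in all_ids:
        key = tuple(sorted(name for name, ids in lists.items() if rid in ids))
        cells[key] = cells.get(key, 0) + 1
    return cells


def chapman(n1, n2, m):
    """chapman's two-list estimate of the total, counted or not.

    n1, n2: sizes of the two lists; m: records found on both.
    assumes independent lists, which real sources rarely are.
    """
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# hypothetical, already-matched record ids for three made-up sources
lists = {
    "a": {1, 2, 3, 4, 5, 6},
    "b": {4, 5, 6, 7, 8},
    "c": {1, 6, 8, 9},
}

cells = overlap_cells(lists)

# one rough estimate per pair of lists
pairwise = {
    (x, y): chapman(len(lists[x]), len(lists[y]), len(lists[x] & lists[y]))
    for x, y in combinations(sorted(lists), 2)
}

for key, count in sorted(cells.items()):
    print("+".join(key), count)
for pair, estimate in pairwise.items():
    print(pair, round(estimate, 1))
```

the three pairwise estimates disagree with one another, which is the point: with dependent, differently biased lists, two at a time is not enough, and that is where the heavier statistics come in.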

there are exciting things afoot here: in addition to learning a boatload of basic computer science skills, i've nearly finished my prospectus (knock on wood, duh). i'm also filling my head with the details i hadn't yet mastered when i first got stuck on this problem, preparing for talks and conferences, and...yeah, i really am *not* just hanging out at the bowl all the damn time.

ps: about the title