Friday October 18, 2013
Simon DeDeo, a research fellow in applied mathematics and complex systems at the Santa Fe Institute, had a problem. He was collaborating on a new project analyzing 300 years’ worth of data from the archives of London’s Old Bailey, the central criminal court of England and Wales. Granted, there was clean data in the usual straightforward Excel spreadsheet format, including such variables as indictment, verdict, and sentence for each case. But there were also full court transcripts, containing some 10 million words recorded during just under 200,000 trials.
“How the hell do you analyze that data?” DeDeo wondered. It wasn’t the size of the data set that was daunting; by big data standards, the size was quite manageable. It was the sheer complexity and lack of formal structure that posed a problem. This “big data” looked nothing like the kinds of traditional data sets the former physicist would have encountered earlier in his career, when the research paradigm involved forming a hypothesis, deciding precisely what one wished to measure, then building an apparatus to make that measurement as accurately as possible.