This talk is about what might be called the Achilles heel of data science. It is a general talk, making reference to algorithmic trading, but applying much more generally to the applications of machine learning, AI, and statistics in the modern world of what is often called “big data”.
Data science is fundamentally statistics, with a leavening of computer science, data visualisation, mathematics, and of course heavy involvement of expertise from the application domain. Statistics is about describing data and making inferences from it. “Making inferences” means saying something about the reality underlying the data, and about the population from which your data were drawn. In trading terms, this means looking at past data and trying to describe and perhaps understand the processes which led to the kind of data you have got, and hence enabling prediction of what might happen in the future.
The key to using past data in this way is a sound understanding of how they were drawn. Are they representative of the entire population of data? If not, in what way do they fail to be representative? Were they drawn by a probability sampling method, so that one can say how confident one can be in the results? Or were they drawn in some purposive or unspecified way, which means that inferences about the overall population or the future might be risky?
Since we have recently witnessed a dramatic revolution in data capture tools and methods, these questions have become particularly pointed. No longer do we painstakingly collect each data value by hand with the specific aim of understanding and prediction, using a clipboard, ruler, or questionnaire, slowly writing down the result, and later entering it into a computer for analysis. Nowadays data are captured automatically for some operational reason, and then go straight into the database. The details of trading transactions flash electronically from the exchange to the data store. Sales details are scanned and added up automatically to give the total bill, but then also accumulate automatically in the company’s computers. Tax returns form the basis of payments, but tax records are then built up over time as each year’s payments are made. Travellers scan their travelcards to pay the fare automatically, but then details of the routes people take are electronically aggregated into a database. Web searches are made to find things out, but then those search details are gathered and stored. And so on.
In short, many of the data we analyse have been collected for some operational purpose, not with the aim of subsequent analysis to understand why things happened as they did, or what might happen in the future. And this difference in aims has consequences. At the very least, it means that there are other data, data you did not collect, which are nonetheless very informative about the underlying process and future changes.
A very simple example is given by consumer loans. Using machine learning on past customers to distinguish those who defaulted from those who did not could well be useless as a predictive tool for a bank. After all, the data on which the algorithm is trained have all been obtained from people the bank previously thought were low risk, so the algorithm has never seen how higher-risk applicants behave, while it is unlikely that only low-risk customers will apply in the future.
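The loan example can be illustrated with a small simulation. This is only a sketch under invented assumptions: each applicant has a latent risk drawn uniformly from [0, 1), the probability of default equals that risk, and the bank historically accepted only applicants with risk below a hypothetical cut-off of 0.3. None of these numbers come from real lending data.

```python
import random

def simulate_applicants(n, seed=0):
    """Return (risk, defaulted) pairs for n hypothetical loan applicants."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n):
        risk = rng.random()              # latent risk in [0, 1): an assumption
        defaulted = rng.random() < risk  # default probability equals risk: an assumption
        applicants.append((risk, defaulted))
    return applicants

def default_rate(group):
    """Fraction of the group that defaulted."""
    return sum(defaulted for _, defaulted in group) / len(group)

ACCEPT_BELOW = 0.3  # hypothetical historical acceptance rule

applicants = simulate_applicants(100_000)
# Only accepted applicants ever generate repayment records.
accepted = [a for a in applicants if a[0] < ACCEPT_BELOW]

rate_all = default_rate(applicants)
rate_accepted = default_rate(accepted)
max_risk_seen = max(risk for risk, _ in accepted)

print(f"default rate, all applicants:      {rate_all:.3f}")
print(f"default rate, accepted applicants: {rate_accepted:.3f}")
print(f"highest risk in the training data: {max_risk_seen:.3f}")
```

In this toy setting the recorded data understate the default rate of the applicant population, and they contain no applicants above the cut-off at all, so any model fitted to them has to extrapolate to exactly the customers for whom a prediction matters most.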
A familiar example from trading is the retrospective evaluation of predictive success rates. Companies or strategies which failed are likely to have dropped out of the data, giving a misleadingly positive impression of overall performance. Worse, regression to the mean implies that those companies or strategies which have done well in the past should be expected to do less well in the future.
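Both effects can be seen in a toy simulation. The sketch below assumes, as a deliberately extreme case, that every strategy has zero true skill and that its return in each of two periods is independent Gaussian noise. The survival threshold and the choice of the top tenth are hypothetical illustrative values, not figures from any real market.

```python
import random
from statistics import mean

rng = random.Random(42)
N = 10_000

# Zero-skill strategies: returns in each period are pure noise (an assumption).
period1 = [rng.gauss(0.0, 0.1) for _ in range(N)]
period2 = [rng.gauss(0.0, 0.1) for _ in range(N)]

# Survivorship: strategies that lost more than 5% in period 1 are assumed
# to have been closed down and to have vanished from the database.
survivors = [i for i in range(N) if period1[i] > -0.05]
mean_all_p1 = mean(period1)
mean_surv_p1 = mean(period1[i] for i in survivors)

# Regression to the mean: follow the top 10% of period 1 into period 2.
top = sorted(range(N), key=lambda i: period1[i], reverse=True)[: N // 10]
top_p1 = mean(period1[i] for i in top)
top_p2 = mean(period2[i] for i in top)

print(f"period 1 mean, all strategies: {mean_all_p1:+.4f}")
print(f"period 1 mean, survivors only: {mean_surv_p1:+.4f}")
print(f"top decile, period 1 mean:     {top_p1:+.4f}")
print(f"top decile, period 2 mean:     {top_p2:+.4f}")
```

Looking only at survivors inflates the apparent average return, and the strategies selected for their good past record fall back towards the overall average in the next period. With some genuine skill in the mix the fall would be only partial, but the direction of the effect would be the same.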
Covering various examples, this talk gives a brief introduction to my forthcoming book Dark Data*, showing you why the data you don’t have can matter even more than the data you do have, how to recognise that you have a problem, and then what to do about it.
* Dark Data: Why What You Don’t Know Matters, David Hand, Princeton University Press, January 2020.