It has been almost two years since I last posted to this blog, but time is becoming a little more available and the topic seems to be worth some words. The scientist's dilemma was perhaps best articulated by Robert Oppenheimer reflecting on the development of the atomic bomb. He said, “We knew the world would not be the same. A few people laughed, a few people cried. Most people were silent. I remembered the line from the Hindu scripture, the Bhagavad-Gita; Vishnu is trying to persuade the Prince that he should do his duty, and to impress him, takes on his multi-armed form and says, 'Now I am become Death, the destroyer of worlds.'”
I have often opined that my job as a researcher is to explore what can be done with technology. It has always been my hope that what we learn will be used for the betterment of humanity, but I do recognize that any technology used for good can also be used for harm. There has been significant discussion in the press about the development of systems that are self-regulating and potentially capable of reflective intelligence, a form we tend to reserve for humans. For some, the issue is so profound as to raise an Oppenheimer-like reflection -- have we created something that threatens human existence?
Over the last few weeks I have been thinking about a less cataclysmic but, I believe, similarly profound development. Specifically, I have been thinking about data analytics. The story begins in the 1970s, with supermarkets using the Universal Product Code (UPC) barcode. The goal was faster checkout. Later, grocery stores began issuing cards to customers and using "loyalty programs" to reward customers AND track their buying habits. With the growth of the web, the phenomenon has been extended in a variety of ways. In the "click" world, a business can track our purchases AND our exploratory activity in the marketplace. In my classes, I talk to students about how we use this data to segment markets and understand how to treat our customers. We know what people purchase, and using aggregate data we can do a great job of cross-selling -- "people who purchased this item also purchased these items" -- and upselling -- "people who purchased this item also looked at another item, which for just a few dollars more will also do...". Taken further, we can tell a lot about who buys a little from us and who buys a lot, who seems to be buying more and who seems to be buying less, who is happy, who is frustrated, etc. All in all, this process is often described in the media under the rubric of "big data". From a theoretical point of view, there are lots of opportunities these days to solve problems that fall within the domain of big data. While some will never be happy with weather forecasts, we have made significant progress in weather forecasting by processing vast amounts of sensor data in parallel to track weather and do a better job of predicting developments. Similarly, human genomics, astronomy, and particle physics research have benefited from new techniques that are being developed to process the vast data stores involved.
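To make the "people who purchased this item also purchased these items" idea a little more concrete, here is a minimal Python sketch of how such a suggestion might be computed from aggregate purchase data. The baskets are invented for illustration, and the function name `also_purchased` is hypothetical. Real retail systems are surely more elaborate than this.

```python
from collections import Counter


def also_purchased(baskets, item):
    """Among baskets that contain `item`, return the share that also
    contain each other item, most frequent first."""
    with_item = [set(b) for b in baskets if item in b]
    if not with_item:
        return {}
    counts = Counter(
        other for basket in with_item for other in basket if other != item
    )
    return {other: n / len(with_item) for other, n in counts.most_common()}


# Hypothetical loyalty-card baskets, made up for this example.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["milk", "cereal"],
]

shares = also_purchased(baskets, "bread")
for other, share in shares.items():
    print(f"{share:.0%} of bread buyers also bought {other}")
```

In this toy data, two of the three baskets containing bread also contain butter, so butter would be the first cross-selling suggestion for a bread buyer. The same counting, applied to millions of baskets, is one plausible starting point for the recommendations we see online.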
What started with UPC codes and laser scanners in supermarkets has been extended on the web to tracking our product explorations, general web browsing, and our commentary on social websites. Further developments are occurring related to personal movements tracked by our mobile phones as well as a plethora of other forms of personal data being gathered from sensors -- thermostats connected to the internet are early contenders for devices that use information about us to provide better service. It will not be too long before hundreds of devices operate in such ways that they will store data that can be used to reconstruct some picture of our behavior -- where we drive, how hot we like our shower, when we are at home, what we watch on TV, etc.
Perhaps even more frightening, concerted efforts are underway in most fields of human endeavor to do a better job by analyzing vast stores of data about people. There are developments in a couple of domains that are worth thinking about in some detail. In medicine, we are seeing the beginning of personalized drugs that are targeted to work for particular individuals. This is a spectacular development. We are also making progress on cognitive behavioral treatments. Put simply, we are learning that how we approach issues mentally and what we do behaviorally can have significant impacts on our well-being. These benefits will surely come with expensive price tags and some strong advice about what we should and shouldn’t do, but I guess few would argue about the costs and benefits.
But in other domains, the costs and benefits may be things we want to argue about. Take, for example, law enforcement. We are getting to the point where we can predict where and when crimes will occur. “Hot-spot” maps of neighborhoods are being used to deploy police so as to be more effective. One has to wonder if these predictions will have a new sort of Heisenberg effect. Rather than observation introducing uncertainty, our measurements may introduce more certainty. Will predicting the likelihood of a crime at a given time and place, and acting on that prediction, have the effect of inducing the criminal activity? One might look to recent disturbances in black neighborhoods as a result of police action as an example.
Another domain being worked on is education. It would seem that for as long as education has been an institutionalized activity, we have been asking how we might do a better job. Significant efforts have been expended over the last half century to improve the educational experience by making it more individualized. When I started my graduate work fifty years ago, we were "closing in on the solution" by providing individually prescribed instruction that modeled both the subject matter and the individual. While they were heady days, the envisioned solution eluded us. What became clear, though, was the conceptual framework. If we had perfect knowledge about what we wanted to teach, perfect knowledge about what the learner knew, perfect knowledge about the learner's preferred learning style, and unlimited funds, we could devise a program that would maximize the learning outcome for that person. Put in more personal terms, it would be possible to devise an institutional program of instruction that came close to approximating the way a devoted and intelligent parent interacted with their child -- knowing exactly what would and would not work with their child. (A parent might not know what or how they knew, but they did know.) Now we could have the perfect teacher in a public school use that same level of knowledge and awareness to work with the thirty students assigned to them!
So far, I would hope that you would agree with me that all of these developments and the promises they hold are worth pursuing. What comes next, though, is a little more disturbing. Let's go back to the business of selling something in a grocery store and meld that approach to big data with the educational realm. What we are developing based on the processing of vast data stores is the ability to predict behavior. (For those of you who like science fiction, watch the "Minority Report" again or, if you are like me, go back and reread Asimov's Foundation Trilogy and think about Hari Seldon and "psychohistory" -- a field in which scientists predict the future based on probabilities.) What we know with increasing precision are certain things about human behavior. We know that a percentage of people who buy product X will also purchase product Y. Further, we know that we can present product Y to the purchaser of product X using techniques A, B, or C and that each method will have a different impact on different classes of people in terms of the purchasing decision. Some of this knowledge is generalizable; other facts are more idiosyncratic. What is most impressive is that with time we get a better sense of the accuracy and usefulness of the data. In education, we are increasingly, but in a less focused way, beginning to come to similar knowledge. For example, we are beginning to understand the probabilities that certain classes of students will succeed in certain academic programs. With this knowledge we can target our limited resources on those who need help.
Here comes the rub. In the "Minority Report", these predictions came from genetic mutations that allowed a couple of humans to predict the future in select areas -- crime -- with a high level of certainty. In the Foundation trilogy, Hari Seldon was able to predict human behavior based on data from the activity of billions of humans collected over a long period of time. We are in the nascent stages of being able to predict a variety of different things with some degree of certainty. For example, we may be able to predict that an individual from a given zip code, who comes from a household of a given socioeconomic status, with other characteristics -- e.g. SAT scores of X and Y, a single parent, a misdemeanor record, Facebook pages containing certain information, grades in certain subjects, etc. -- will or will not do well in a given program of study. Right now, crude information of this kind is beginning to help in decision making about services to students, or advising decisions. If we were to take the process out 50 or 100 years, we might imagine developing higher levels of confidence in our ability to predict outcomes. We might move from "It is likely that student X will benefit from counseling help if they take this program" to "There is a 99% probability that student X will not complete this program of study and therefore we should not admit them." In some ways, this is already how business analytics are beginning to be used. If we can predict with a high level of certainty that customer X will be a better (higher-spending) customer than customer Y, we will spend more effort to keep customer X than Y.
So long as social systems are designed in a way that our predictive skills are used to the benefit of those we serve, few would imagine that we should not use them to the greatest extent possible. To the extent that we find ourselves in the position of deciding for our clients, a conflict will develop between the lofty human expression of free will and self-determination and the systemic optimization of resources with a consequent denial of opportunity for some.
P.S. Some will note a disturbing assumption in these trends. Namely, the general trend in "Big Data" analyses is to develop conclusions that are based on correlations. As young researchers, we had it drilled into our heads that correlations are not indicators of causality. Correlational data was almost always a spur to search for causal relationships. Causal relationships may be viewed as the holy grail of science. While many continue in that tradition, there is a growing role for the use of correlational data. Recently, the Centers for Disease Control issued a set of travel advisories because of a concern about the Zika virus -- correlational data was enough to suggest caution. Similarly, it would seem that businesses are sometimes satisfied to make business decisions based on strong correlations without bothering to uncover causal relationships. While the purist in me wants to know the causal relationships, it is clear that some set of decisions might be reached by reasonable people when the correlations are near perfect.
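For readers who want to see the correlation-versus-causality point in numbers, here is a small Python sketch. The data are entirely made up and the variable names are hypothetical. The scenario assumes that both ice cream sales and sunburn cases rise on hotter days, so the two track each other closely even though neither causes the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented daily figures, ordered from the coolest day to the hottest.
# Temperature is the assumed hidden driver and is deliberately left out.
ice_cream_sales = [30, 41, 50, 58, 68, 74]
sunburn_cases = [1, 3, 4, 6, 7, 9]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation between ice cream sales and sunburn cases: {r:.2f}")
```

The correlation here is very close to 1, yet banning ice cream would not prevent a single sunburn. A decision maker who only needs to predict sunburn cases might be content with the correlation; one who wants to reduce them needs the causal story.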