Big data: How the University of Michigan navigates ethics, unpredictability of data science research
In recent years, big data has emerged as a powerful tool, spurring the University of Michigan to dedicate an initiative and an institute to its study and implementation.
In a statement in 2015, Jack Hu, vice president for research at the University, wrote that big data, meaning extremely large data sets that can be parsed to show trends and associations, was “revolutionizing research in an extraordinary range of disciplines.”
The University’s financial and professional investment in data science has since proven beneficial. With $100 million funding the University’s Data Science Initiative (DSI) over five years, faculty members from departments across disciplines have helped ground the University’s research in modern data computation, analytics and management.
“With this initiative, our goal is to spark innovation in research across campus while inspiring further advances in the techniques of data science itself,” Hu also said in 2015.
Big data has been the crux of medical initiatives and disease research, the uncovering of international corruption scandals such as the Panama Papers, and worldwide technological and economic development.
For example, access to massive deposits of information on issues such as on-field behavior can help predict athletics-related injuries, teaching and learning analytics can lead researchers to create better educational techniques, and patterns in accumulated patient data could help doctors discover preventative measures and diagnose health conditions.
Now celebrating two years at the University, the Michigan Institute for Data Science (MIDAS), an institute under the umbrella of the DSI, has expanded as a multidisciplinary and interdepartmental hub for data research, alongside two other units of the DSI: Advanced Research Computing – Technology Services (ARC-TS) and Consulting for Statistics, Computing and Analytics Research (CSCAR).
Co-director of MIDAS Brian Athey, a professor of computational medicine and bioinformatics, said innovations continue as the quest for finding new ways to use big data grows.
Even before this influx of data-driven institutions, researchers at the University had been using large data sets to tackle issues such as disease prediction and to augment computer science research.
Eric Michielssen, the University’s associate vice president for advanced research computing, said there is a “tsunami” of data available for these issues and more, but he and Athey cited the importance of a concept called the Four V’s of Big Data.
Aside from the sheer volume of data, Michielssen said, researchers consider velocity, or the ability to receive vast amounts of data at unprecedented speeds. They must also note the variety of formats available, not just data received through structured spreadsheets, and the veracity, or the uncertainty and trustworthiness, of the data received.
Michielssen, whose office is home to MIDAS and the overarching Data Science Initiative, said data is now collected and generated daily at unprecedented speeds, particularly through social media, financial transactions and the newly coined “internet of things,” in which objects share data through sensors, wireless technology and complex networks. That, he said, requires researchers to consider how to leverage the data for the good of society and science.
“It’s nothing short of a game changer for society, as well as for science and education,” Michielssen said. “Research in just about every field is being affected by this new phenomenon, this big data phenomenon.”
According to Athey, data science carries particular significance at the University, given its top-ranking research across the board, because much of its history in machine learning, data analytics, statistics and computation has already paved the way for today’s methodologies.
“We had an ideal environment to do this at U of M,” Athey said. Noting the steady increase in computing power and the decrease in its cost, a trend commonly associated with Moore’s law, Athey said, “Data science and big data are transformational to society and higher ed.”
The modern data scientist isn’t surrounded by rows of outdated computers that flash neon green numbers, MIDAS Managing Director Kevin Smith said.
Instead, Smith said, managing data today requires a three-fold collaboration: MIDAS as the University’s academic hub, CSCAR for data science services such as consulting, and ARC-TS as a central avenue for high-performance computing infrastructure.
“This is really thinking holistically about how you take data and be able to integrate and aggregate it in a meaningful way so that you can explore that data in the context of whatever scientific or business problems you’re trying to solve so that the analysis, the visualization, the exploration that you’re doing is valid,” Smith said.
Transportation and Tech
In much of its data science research, Michielssen said, the University focuses mostly on application rather than theory and method to yield more tangible results, especially through applications to policy, education and infrastructure.
“In many ways, we’re trying to frame this as an opportunity to advance the field of data science from a methodological perspective but in the context of the application of data science,” Smith added.
For example, researchers within MIDAS’s Centers for Data-Driven Transportation Research and for Data-Intensive Learning Analytics Research are using data to improve automobile use and transportation, as well as to create new tools for examining the modern learning process.
Carol Flannagan, an associate research scientist at the University’s Transportation Research Institute, said her team applies big data to analyze and simulate driver behavior, traffic regulations and transportation systems across traditional, automated and connected vehicles.
“Transportation data is changing so fast,” Flannagan said. “It’s a really great area for novel applications of existing methods or even extensions of methods and new methods.”
In addition to working to expand the amount of data available to researchers in transportation data analysis, Flannagan and her team have been successful in surveying motor vehicle crashes and establishing applications of crash-avoidance technologies. These applications can be carried over to vehicle occupant protection and, ideally, to total crash prevention.
With visualization tools and abundant traffic crash data, vehicle-centric countermeasures against crashes, particularly vehicle design, work in conjunction with behavior-centric measures, roadway design and enforcement to produce new options for policymakers and infrastructure designers. This helps optimize designs and laws and eliminate unnecessary crashes.
By taking a comprehensive view of safety, making predictions and accounting for laws that already work well for some aspects of driving, researchers, policymakers and others can integrate their efforts and focus improvements where they are still needed.
Flannagan said the quick accessibility and shareability of data pose a difficulty of their own.
“We are not freed from the requirement to think first before engaging with data analysis,” Flannagan pointed out. “The requirement to think first just got harder.”
Data science research does not come without other challenges, however, particularly in a changing landscape of big data.
Social Science and Challenges
In May 2014, the Executive Office of then-President Barack Obama released a White House report detailing the significance of upholding privacy values, educating responsibly in a digital age and using data as a public resource.
“Properly implemented, big data will become an historic driver of progress, helping our nation perpetuate the civic and economic dynamism that has long been its hallmark,” the report reads.
More recently, the nationwide use of big data has been at the heart of controversy surrounding the 2016 presidential election. Big data was at first seen as generally predictive of the election’s results, though some now claim that now-President Donald Trump’s initial dismissal of the importance of data masked his use of the information to target rural voters.
The work of research professor Michael Traugott, for example, could ultimately be applied to help prevent the effects of social media and news content on election outcomes.
Traugott and his team, in collaboration with Gallup and Georgetown University, have been collecting data to examine political communication in the 2016 presidential campaign. By using computer software to search for key topics and sentiment in open-ended question responses from thousands of participants throughout the election, researchers can check how representative people’s opinions are of the news they have seen on social media.
A computerized content analysis of nine major U.S. newspapers can also be compared with attitudes and content in the tweets of journalists, providing another indicator of how news content relates to later citizen sentiment.
“I have been surprised by the emphasis on personality at the very earliest stage of the campaign and the disproportionate amount of coverage that (Trump) received,” Traugott said. “We hope to be able to track the news content sentiment against the favorability measures, as well as the topics mentioned in the open-ended (questions with participants).”
Traugott will use this data to establish whether traditional framing and agenda-setting themes are still present in contemporary reporting within a social media environment.
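As a rough illustration of the kind of software search for topics and sentiment described above, the following Python sketch tags open-ended responses by keyword. It is not the method Traugott’s team uses; the topic keywords, the sentiment word lists and the sample responses are all hypothetical and invented here for illustration only.

```python
import re
from collections import Counter

# Hypothetical topic keywords and sentiment word lists (assumptions, not from the study).
TOPIC_KEYWORDS = {
    "economy": {"jobs", "economy", "taxes"},
    "immigration": {"immigration", "border"},
}
POSITIVE = {"good", "strong", "honest"}
NEGATIVE = {"bad", "weak", "dishonest"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tag_response(text):
    """Return (sorted topic list, sentiment score) for one open-ended response."""
    tokens = tokenize(text)
    token_set = set(tokens)
    topics = sorted(t for t, words in TOPIC_KEYWORDS.items() if words & token_set)
    # Score = count of positive words minus count of negative words.
    score = sum(tok in POSITIVE for tok in tokens) - sum(tok in NEGATIVE for tok in tokens)
    return topics, score


def summarize(responses):
    """Count mentions per topic and sum the sentiment score per topic."""
    mentions = Counter()
    sentiment = Counter()
    for text in responses:
        topics, score = tag_response(text)
        for topic in topics:
            mentions[topic] += 1
            sentiment[topic] += score
    return mentions, sentiment


# Made-up responses, not real survey data.
sample = [
    "I heard a strong plan for jobs.",
    "The coverage of the border debate was bad and dishonest.",
    "Taxes came up again; nothing good or bad to say.",
]
mentions, sentiment = summarize(sample)
print(dict(mentions))
print(dict(sentiment))
```

A real content analysis would likely rely on validated dictionaries or trained models rather than a handful of hand-picked words, but the sketch shows the basic idea of tracking topic mentions and sentiment across many responses.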
Because of the possibilities big data holds in a politically charged and competitive social sphere, The Washington Post has called for further privacy considerations and tech policy. It also noted that an inherent bias in data’s original collection, no matter how big or small, will continue to have consequences, resulting in what some refer to as a dangerous and ominous future.
Concerns about privacy and confidentiality are other unexpected hurdles University researchers grapple with daily. Transportation data, medical records and social science information are just three sources of data that need to be kept secure.
“Someone has to put in place measures, software systems that allow one to do research and developments in this space while at the same time, guaranteeing the privacy of the person who gave rights to the data in the first place,” Michielssen said.
Fear of the unknown is another complication, Athey said, as the dearth of research on big data’s capabilities makes future uses frighteningly ambiguous.
“Data science could be used to help society or frankly, could be used to take advantage of different groups within society or manipulate things,” Athey said. “Those that have the methodology and the computing and the access to the data have a distinct advantage on everybody. That could be used for good or for bad.”
Being data-illiterate comes with consequences, he explained.
“If you’re uneducated about data science and if you don’t know how to practice it, you would be a victim of it,” Athey said. “This is the society that we’re living in — Google and all these things — are not going to go away.”
Likewise, Smith noted an issue of ethics clouding global data science.
“While there is huge potential to take data and derive something that might give a company a competitive edge over its competitors, I think this is something that as a society and the community we continue to see evolve,” Smith said.
H.V. Jagadish, a professor of electrical engineering and computer science, created a massive open online course examining the ethics of data science. Jagadish intended for the course, or modules of it, to be incorporated into data science training curricula, ideally educating data scientists about “responsible data science.”
There are several other issues aside from privacy to be considered, Jagadish said, namely inaccuracies in algorithms, algorithmic discrimination and bias. Algorithms are trained on data and can unintentionally carry undesired patterns from that data into their results.
However, Jagadish believes that while there are issues, they can be contained, controlled and managed, which ultimately requires social consensus on the rules of data science.
“With a lot of what we’re doing in data science in terms of having algorithms making decisions for us, or in terms of bringing data together from multiple sources and violating people’s notions of privacy and things of this nature, my assumption is that most actors — not all, but most — will want to do the right thing,” Jagadish said, “and we just need to talk about it to agree what those right things are.”
Misconceptions and the Future of Data
Big data can be used to predict social changes or the way a disease will advance; ironically, however, how the field of data science itself will change is rather unpredictable.
“The challenges come with the fact that the field is moving so quickly,” Michielssen added.
Updates to the computing infrastructure on which analytic tools run, newly developed analysis techniques and changes in methodology will continue to drive innovation. Yet none of this negates the work that came before the recent surge of interest in big data, Michielssen said.
“The misconception is perhaps that all old science is out the window, that data science will replace all these older techniques that people have developed over the last decades to enable discovery,” Michielssen said, adding that this simply isn’t true.
“Data science will augment existing techniques, data science will be a new tool, an absolutely necessary new tool in the toolbox of scientists and engineers and just about every branch of industry,” Michielssen said. “But it’s not going to replace, necessarily, existing techniques.”