Rice statistician and neuroscientist Genevera Allen is hoping to provide data scientists with new tools that can uncover hidden patterns and correlations in complex data sets, thanks to a new CAREER Development Award from the National Science Foundation (NSF).
CAREER awards support the research and educational development of young scholars who are likely to become leaders in their field. The five-year grants, which are among the most competitive awarded by the NSF, are given to only about 400 scholars per year across all disciplines.
Allen, Rice’s Dobelman Family Junior Chair of Statistics and an assistant professor of statistics and of electrical and computer engineering, is a statistician, mathematician and neuroscientist who holds a joint appointment in pediatric neurology at Baylor College of Medicine’s Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital. Her CAREER research has two aims: to develop statistical machine learning methods for the exploration of large, complex data sets and to create new techniques in pattern recognition for integrated data sets. Such tools are critically needed to make data-driven discoveries from big biomedical data, both to better understand the basis of complex diseases and to identify possible personalized medical therapies.
“I hope to develop a new suite of statistical learning technologies and multivariate analysis techniques that can be used to make discoveries in big scientific data,” Allen said of the $400,000, five-year research program. “This type of data is common in genomics and neuroimaging, and the new techniques I hope to develop will be useful for identifying potential genomic drug targets, for modeling genetic networks and brain networks, as well as for brain decoding from neuroimaging and neural recordings.”
Allen specializes in creating tools that allow “big data” to speak for itself and guide analysts to what they most want to know and understand. She said one aspect of her CAREER project will focus on “high-dimensional data,” richly complex data sets that include thousands or even millions of pieces of information on each research subject.
“This is a big issue in medical data,” she said. “In cancer genetics, for example, it’s common to have data from hundreds of patients. But the gene-expression assays from each of those patients may show whether any of 20,000 genes are turned on or off, and we may also have a half-million epigenetic measurements, not to mention whole genome sequencing, which gives us around 7 million points on the DNA to measure mutations. So, for only a couple hundred patients, we literally have millions of genetic features.”
Allen said such data is a particular challenge for machine learning — the area of statistics and computer science that involves creating computer applications that can learn and hone their skills with experience.
“The problem goes beyond the size of the data set,” Allen said. “It’s also very complex and highly correlated, which means that if gene A and gene B are both talking to each other, how do we know whether it’s really gene A that’s influencing the disease rather than gene B? From just observational data alone, that’s really hard to tell. With this grant, we hope to develop techniques that can address those sorts of questions for this type of big and complex data.”
Allen gave an example of the second aim of the project, which is to develop new techniques for pattern recognition for integrated data, also known as data fusion.
“This is the type of problem where you have one set of patients, but for that one set of patients, you have measured multiple different types of data,” Allen said. “This is also very common in medical studies. For one set of patients, you might have neuroimaging scans, genetic data and epidemiological data as well as clinical reports, including the notes from the treating physician.
“These are very different types of data, and you could clearly look at each data set separately and find a set of patterns. But what we would prefer to do is combine all the data and look for patterns that aren’t obvious from any single data set. What are the joint patterns if we look at neuroimaging and genetics, or if we look at the neuroimaging and those clinical features? Developing tools to answer these questions is critical for discovering the basis of complex diseases as well as developing therapies for personalized medicine.”