This article originally appeared in the 2020 issue of Rice Engineering Magazine.
Daniel Kowal announced on Twitter back in April: “Seamlessly adapt your favorite continuous data models to the count data setting!”
Not, perhaps, your customary tweet, but welcome news for applied statisticians and data scientists interested in regression models for count data. Kowal calls his innovative approach STAR (Simultaneously Transforming and Rounding).
“We’ve developed a new way to construct statistical models for integer-valued data – such things as counts, test scores and rounded data. It’s built from a continuous-valued process, like many of the best statistical and machine learning models available, but includes an extra layer to account for the integer nature of the data,” said Kowal, the Dobelman Chair Assistant Professor of Statistics (STAT).
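To make the "transform and round" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the latent continuous process is assumed to be Gaussian, the transformation is assumed to be a logarithm, and the rounding step is assumed to be a simple floor. The function name `simulate_star_counts` and its parameters are hypothetical.

```python
import math
import random


def simulate_star_counts(n, mu=0.5, sigma=1.0, seed=0):
    """Draw n integer counts by rounding a transformed continuous variable.

    Assumptions (illustrative only): Gaussian latent variable, log
    transformation, floor rounding.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        z = rng.gauss(mu, sigma)            # continuous latent value
        y_star = math.exp(z)                # undo the assumed log transformation
        counts.append(math.floor(y_star))   # extra layer: round down to an integer
    return counts


counts = simulate_star_counts(10_000)
zero_share = counts.count(0) / len(counts)
print(f"share of zeros: {zero_share:.2f}, largest count: {max(counts)}")
```

In this sketch, any latent value that lands below 1 after the back-transformation rounds to zero, so the continuous model produces exact zeros without a separate mechanism.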
The most popular models for count data rely on the Poisson distribution, a foundational probability model for the number of events occurring in a fixed time interval. But the Poisson distribution has practical limitations. Because it forces the mean and the variance to be equal, it struggles with datasets that feature a large proportion of zeros or a sizable difference between the mean and the variance. These features are common in many fields, including epidemiology, ecology and insurance, so prediction and modeling remain a significant challenge.
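The Poisson limitation can be checked on any set of counts. The sketch below, in Python, compares the sample mean with the sample variance and compares the observed share of zeros with the share a Poisson distribution of the same mean would imply, which is exp(-mean). The data in `example_counts` are made up for illustration, and the helper name `poisson_fit_check` is hypothetical.

```python
import math
import statistics


def poisson_fit_check(counts):
    """Summarize two features a Poisson model ties to the mean.

    A Poisson distribution with mean m has variance m and puts
    probability exp(-m) on zero.
    """
    mean = statistics.fmean(counts)
    return {
        "mean": mean,
        "variance": statistics.pvariance(counts),
        "observed_zero_share": counts.count(0) / len(counts),
        "poisson_zero_share": math.exp(-mean),
    }


# Made-up counts: mostly zeros with a few large values.
example_counts = [0] * 70 + [5] * 20 + [12] * 10
summary = poisson_fit_check(example_counts)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

When the variance is far above the mean, or the observed share of zeros is far above the Poisson-implied share, a plain Poisson model is likely a poor fit for the data.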
Kowal published “Simultaneous transformation and rounding (STAR) models for integer-valued data” in the Electronic Journal of Statistics. His co-author is Antonio Canale, associate professor of STAT at the University of Padua, Italy.
Brian King, a third-year graduate student in STAT, has applied STAR techniques to a dataset consisting of the 311 calls received by the City of Houston from late 2011 to mid-2018. Residents can call 311 to report non-emergency issues. King focused on calls reporting illegal trash dumping, relying on data obtained through the Kinder Institute’s Urban Data Platform.
“The data have both a temporal and spatial component, so a choice of aggregation was necessary. We showed daily and monthly count aggregation divided into city council districts. We tested our method on the city-wide monthly counts, but in the future we hope to work on finer spatial and temporal resolutions,” King said.
Also working with Kowal on the project are Gabriel Dilly, a visiting student at Rice last year, now at the Instituto Militar de Engenharia in Brazil; Ryan Quach, a senior computer science and STAT major at Rice; and Bohan Wu, a Rice junior in STAT and mathematics.
“The bigger goal coming out of this work is more coherent statistical modeling,” said Kowal, who earned his M.S. and Ph.D. in STAT from Cornell University in 2015 and 2017, respectively, and joined the Rice faculty in 2017. “Statistical models should reflect reality to the best of our abilities. We’re trying to make that happen for integer-valued data.”