CIS@Penn... And Amy Gutmann Hall and the IDEAS Initiative
Exciting things are happening in Computer and Information Science at Penn, as we start revolutionary new projects, recruit new students and faculty, and continue our period of rapid growth. We have a new building on the way and a major commitment from the university towards expansion in AI for data science (with the new ASSET Center in Safe and Trustworthy AI as one of the first research initiatives). Since 2018 we have hired 25 new faculty, bringing us from a small to a midsized department with outstanding faculty in all areas. Want to know more about the great things happening in the department? Please check out our Highlights site, and follow me on Twitter (@zgives)! Please also see our faculty and lecturer ads.
Research
My research interests lie in building data science platforms, using techniques at the intersection of databases, machine learning, and distributed systems. I am interested in applications both to the Web and question answering, and to conducting data science. I often work with life scientists (especially in genetics and neuroscience) to evaluate our techniques with real data and real hypotheses. More details are generally available at the Penn Database Group web site.
The power of the Web and conventional search is limited, because Web search does not reason about relationships between facts. Question answering and data analysis systems need better techniques for integrating data from multiple sources, and for reasoning about certainty. Similarly, we are still in the early stages of building the "right" tools for data science: tools that let us link data, rapidly pose and evaluate hypotheses, and ensure we have trustworthy results. I'm interested in questions such as:
- How do we tie together the world's data to answer key scientific or policy questions, when the connections between the data are ambiguous?
- How do we facilitate and foster large-scale collaborative projects involving updates to data, code, and visualization?
- How do we know when we can trust a data analysis result or an answer to a question?
I am a member of the database and systems research groups, the Warren Center for Network and Data Science, and the Center for Health, Devices, and Technology at Penn. My research projects relate to making it easier to exchange, locate, and analyze networked information.
Automatically Structuring and Searching Data Lakes and Data Corpora.
As we collect large sets of related, multi-versioned data and documents --- what are the mechanisms by which we can
Understanding Claims, Quotes, and Discussion in Documents and Social Media.
When someone makes a statement in an article, what (in the article, in the Web at large) backs up that claim? In collaboration with Prof. Dan Roth and Dr. Yi Zhang, we are studying the questions of the provenance of claims, both in terms of sources and text, as well as in tabular data. Looking more broadly (with Dr. Wang-Chiew Tan at Meta), we are also interested in understanding the discussion revolving
Facilitating Data Management and Reuse in Data Science. Today the predominant mode of interacting with data has changed: rather than working with highly controlled, regularized databases, data scientists tend to work with a variety of different data sources within computational notebook software such as Jupyter Notebook and JupyterLab. Such software allows for ad hoc discovery as well as for the creation of sophisticated data analyses and machine learning models. A key issue becomes the management of the many data products (tables, dataframes, models) produced; and there is a key opportunity to help new users understand prior best-practices in using, importing, cleaning, extracting, and analyzing datasets. The Juneau project addresses these issues. Funded by NSF III-1910108.
Our collaborations with neuroscientists (esp. Profs. Brian Litt in Bioengineering and Neurology, and Joost Wagenaar in Biostatistics, Informatics, and Epidemiology) have received a good deal of notice for their impact on data science:
- Seizure prediction contest results (504 teams, 82% accuracy)
- NIH Director's blog
- American Epilepsy Society press release
- Announcement of winners
- Science Daily: Crowdsourcing advances epileptic seizure detection, prediction
- NPR, A Crowd of Scientists Finds a Better Way to Predict Seizures
Several prior projects have resulted in building blocks towards our ongoing work in supporting large-scale data integration and analysis. These projects are no longer directly active, but their core ideas (and code) are part of our more recent projects:
Trustworthy Data Science. For any type of data science computation, the "glue" that links results to how they were derived is data provenance. Provenance explains the steps involved in the results, as well as what facts went into which conclusion. However, we need to develop better tools for collecting provenance in a convenient way; for reasoning about data's value given its provenance; for recommending related data; and broadly to assess trustworthiness of data analysis results. Funded by NSF (CiCi) and NIH (BD2K Targeted Software) and in collaboration with biologists at Penn, clinicians at UCSF, and computer scientists and computer engineers at U Memphis, Georgia Tech, and UCLA.
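To make the idea of provenance concrete, here is a minimal sketch in Python. It is an illustration only, not the design of any of the systems named above: the `Tracked` class and the helpers `source`, `derive`, `base_facts`, and `steps` are hypothetical names invented for this example. Each derived value remembers the operation that produced it and the inputs it came from, so we can later ask which source facts fed a conclusion and which steps were applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value paired with a record of how it was derived (hypothetical)."""
    value: float
    op: str = "source"      # name of the step that produced this value
    inputs: tuple = ()      # the Tracked values this one was computed from
    label: str = ""         # name of the fact, set only for source values


def source(label, value):
    """Wrap a raw input fact so it can be traced later."""
    return Tracked(value=value, op="source", label=label)


def derive(op_name, fn, *inputs):
    """Apply fn to the input values and record the step and its inputs."""
    return Tracked(value=fn(*(i.value for i in inputs)),
                   op=op_name, inputs=inputs)


def base_facts(t):
    """Labels of every source fact that contributed to t."""
    if t.op == "source":
        return {t.label}
    facts = set()
    for i in t.inputs:
        facts |= base_facts(i)
    return facts


def steps(t):
    """Operations applied to reach t, listed after the steps they depend on."""
    if t.op == "source":
        return []
    out = []
    for i in t.inputs:
        out.extend(steps(i))
    out.append(t.op)
    return out


# Made-up readings, purely for illustration.
a = source("reading_a", 2.0)
b = source("reading_b", 4.0)
c = source("reading_c", 10.0)
mean_ab = derive("mean", lambda x, y: (x + y) / 2, a, b)
result = derive("difference", lambda x, y: x - y, c, mean_ab)

print(result.value)                # 7.0
print(sorted(base_facts(result)))  # ['reading_a', 'reading_b', 'reading_c']
print(steps(result))               # ['mean', 'difference']
```

A real provenance system must also handle collection at scale, storage, and reasoning about a result's value given its provenance, none of which this sketch attempts.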
Developing a Testbed for Data Science. The IEEG Web Portal, in collaboration with Prof. Brian Litt of Bioengineering and Neurology, and Prof. Greg Worrell at Mayo Clinic, seeks to enable community-scale data integration and cloud-hosted science for epileptic seizure prediction (and beyond). Beyond its scientific applications, IEEG serves as a testbed for technologies from the Q System and other data integration research. As of Oct 2014 we have over 1200 datasets and 450 users. We have also hosted competitions for epileptic seizure detection and epileptic seizure prediction. Funded by NIH as well as grants from Amazon.
Acknowledgments: I have also received grants from DARPA CSSG (#HRO011-06-1-0016 and HRO1107-1-0029), Penn ISTAR, the State of Pennsylvania, Amazon, Google, and Lockheed Martin, and software donations from MarkLogic, Electric Software, and IBM Corp.