Research Projects

Simplifying Data Integration with AI

Data integration is an arduous and time-consuming task. To address this, we developed tools that make integrating complex datasets easier. These include bdi-kit, a Python library that bundles modern tools for tasks like matching different table schemas and standardizing values. Building on that, we created Harmonia, an intelligent agent that uses Large Language Models (LLMs) to automatically work out how to combine datasets, collaborating interactively with the user to get the result right. This work helps researchers spend less time wrangling data and more time making discoveries, and it is part of the ARPA-H Biomedical Data Fabric and DARPA ASKEM initiatives.
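To give a flavor of the schema-matching task mentioned above, here is a minimal, illustrative sketch (this is not bdi-kit's actual API, and the function names are invented for the example): each source column is greedily mapped to the most string-similar column in the target schema. Real tools also exploit value distributions, embeddings, and domain knowledge.

```python
# Toy schema matching: map each source column name to the most
# string-similar target column name. Purely illustrative.
from difflib import SequenceMatcher

def match_schemas(source_cols, target_cols):
    """Greedily map each source column to its most similar target column."""
    mapping = {}
    for src in source_cols:
        best = max(
            target_cols,
            key=lambda tgt: SequenceMatcher(None, src.lower(), tgt.lower()).ratio(),
        )
        mapping[src] = best
    return mapping

print(match_schemas(["patient_age", "tumor_size_mm"],
                    ["age", "tumor_size", "gender"]))
# → {'patient_age': 'age', 'tumor_size_mm': 'tumor_size'}
```

Name similarity alone is brittle (e.g., "DOB" vs. "birth_date"), which is one reason LLM-based agents like Harmonia, which can reason about meaning rather than spelling, are appealing for this problem.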

Finding the Right Data for Better Machine Learning

Finding relevant data to improve machine learning models or data analysis can be tough. To tackle this, we designed and built Auctus, a search engine specifically for datasets. Unlike typical search tools that just look at metadata, Auctus can find datasets that can be joined or combined with your existing data, or datasets relevant to specific times and locations. Making this efficient required developing new, fast algorithms using techniques like sketching and hashing to quickly find relationships between huge amounts of data. This project was supported by DARPA and the National Science Foundation (NSF).
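One way to see how hashing makes joinability search feasible is the classic MinHash sketch (a simplified stand-in here, not Auctus's actual implementation): each key column is summarized by a small signature, and the overlap between two signatures estimates the Jaccard similarity of the full columns without ever comparing them directly.

```python
# MinHash sketching for joinability: summarize a column by, for each of
# k seeded hash functions, the minimum hash over its values. The fraction
# of matching slots between two signatures estimates Jaccard similarity.
import hashlib

def minhash_signature(values, num_hashes=64):
    """One slot per seed: the smallest hash of any value in the column."""
    return [
        min(int(hashlib.sha1(f"{seed}:{v}".encode()).hexdigest(), 16)
            for v in values)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of slots where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

left  = {"NYC", "Boston", "Chicago", "Austin"}
right = {"NYC", "Boston", "Chicago", "Seattle"}
sig_l, sig_r = minhash_signature(left), minhash_signature(right)
print(round(estimated_jaccard(sig_l, sig_r), 2))
```

The two columns share 3 of 5 distinct values (true Jaccard 0.6), and the estimate concentrates around that as the number of hashes grows. Because signatures are tiny and fixed-size, a search engine can precompute them for millions of columns and compare sketches instead of raw data.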

Efficient Algorithms for Large-Scale Data Analysis

A core part of my research focuses on creating algorithms that can handle the massive scale of modern data efficiently. I've developed and analyzed novel randomized methods (sketching algorithms) to quickly estimate important relationships within and between datasets. This includes techniques for finding correlated attributes across different tables, estimating how much information different attributes share (mutual information), and approximating inner products using sampling methods like weighted minwise hashing and priority sampling. These algorithms are crucial for making systems like the Auctus dataset search engine fast and scalable.
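As a concrete example of one of the primitives named above, here is a compact sketch of priority sampling for subset-sum estimation (a simplified textbook version, not the exact estimators from my papers): each weight w receives priority w/u for u uniform in (0, 1], the k largest priorities are kept, and every kept weight is estimated as max(w, tau), where tau is the (k+1)-th largest priority. The resulting estimator is unbiased for the total sum.

```python
# Priority sampling: a small weighted sample that yields an unbiased
# estimate of a large sum, using only k retained items.
import random

def priority_sample(weights, k, rng):
    """Return (weight, estimate) pairs for the k highest-priority items."""
    # 1 - rng.random() lies in (0, 1], avoiding division by zero.
    priorities = [(w / (1 - rng.random()), w) for w in weights]
    priorities.sort(reverse=True)
    tau = priorities[k][0] if k < len(priorities) else 0.0
    return [(w, max(w, tau)) for _, w in priorities[:k]]

rng = random.Random(0)
weights = [rng.expovariate(1.0) for _ in range(10_000)]
true_sum = sum(weights)
# A single 256-item sketch estimates the 10,000-term sum closely.
est = sum(e for _, e in priority_sample(weights, 256, rng))
print(round(true_sum), round(est))
```

The same idea extends to inner products: sketch each vector's weights independently, then combine the retained entries, which is the flavor of estimator that keeps dataset-relationship queries fast at scale.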

Making Machine Learning Accessible to Everyone

Machine learning shouldn't just be for experts. I helped create Visus, an interactive system designed to empower subject-matter experts—people who know their field but aren't necessarily data scientists—to build their own ML models without writing any code. The goal was to make model building faster and more accessible, boosting productivity for everyone involved. Visus was developed as part of DARPA's Data-Driven Discovery of Models (D3M) program and our work was featured in Science Magazine.

ACHE: A Smarter Web Crawler for Focused Discovery

Standard web crawlers collect all kinds of pages, but often you need specific information from particular corners of the web. I led the development of ACHE, an open-source web crawler designed to intelligently focus on specific topics or domains. It was a key technology developed during the DARPA Memex program, which aimed to improve domain-specific web search and indexing and was notably used in efforts against human trafficking. ACHE has been widely adopted since its release, used by various companies and downloaded thousands of times.
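The core loop of a focused crawler can be sketched in a few lines (a toy version for illustration; ACHE's real classifier, frontier, and fetcher are far more sophisticated): a priority queue orders candidate links by a topic-relevance score, so the crawler expands promising pages first instead of crawling breadth-first.

```python
# Toy focused crawler: a best-first search over links, prioritized by a
# pluggable relevance score. `fetch_links` stands in for real page fetching.
import heapq

def focused_crawl(seeds, relevance, fetch_links, budget):
    """Visit up to `budget` pages, always expanding the most relevant next."""
    frontier = [(-relevance(url), url) for url in seeds]
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier and len(collected) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        collected.append(url)
        for link in fetch_links(url):
            if link not in visited:
                heapq.heappush(frontier, (-relevance(link), link))
    return collected

# Tiny in-memory "web" and a keyword-based relevance score for the demo.
web = {
    "seed": ["a/forum", "b/news"],
    "a/forum": ["a/forum/thread1", "b/news/sports"],
    "b/news": ["b/news/sports"],
    "a/forum/thread1": [], "b/news/sports": [],
}
score = lambda url: 1.0 if "forum" in url else 0.1
print(focused_crawl(["seed"], score, lambda u: web.get(u, []), budget=3))
# → ['seed', 'a/forum', 'a/forum/thread1']
```

With the keyword score standing in for a learned page classifier, the crawler drills into the forum branch and never spends its budget on the irrelevant news pages, which is exactly the behavior that makes focused crawling effective for domain-specific discovery.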