Sushovan De :: Research

My research interests are in data cleaning in the context of information retrieval, and in probabilistic databases.

Data Cleaning

Try our system out: Download BayesWipe - a fully automatic database cleaner.

My thesis work is in data cleaning. Databases are increasingly generated by casual users rather than carefully curated by dedicated employees. Combined with the sheer volume of data being generated, this has made data cleaning very important, yet existing methods scale neither to the current volume of data nor to the variety of errors. My work describes a system that uses the dirty data itself to learn both a model of the clean data and a model of the errors in the data, and then cleans the data in a probabilistically principled manner.
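As a rough illustration of the idea (a toy sketch, not the actual BayesWipe implementation), the snippet below cleans a single column by scoring each candidate value with a prior learned from the dirty data itself, multiplied by a simple error model. The names `clean_value` and `typo_rate`, the edit-distance error model, and the sample data are all assumptions made for this example.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def clean_value(observed: str, column: list[str], typo_rate: float = 0.1) -> str:
    """Return the candidate maximizing prior(candidate) * error(observed | candidate).

    Assumptions for this sketch: the prior is the value's relative frequency in
    the (dirty) column, and the error model is an unnormalized score that
    shrinks by a factor of `typo_rate` for every character edit.
    """
    counts = Counter(column)
    total = sum(counts.values())

    def score(candidate: str) -> float:
        prior = counts[candidate] / total
        error = typo_rate ** edit_distance(observed, candidate)
        return prior * error

    return max(counts, key=score)


# Hypothetical dirty column: one misspelled entry among many correct ones.
makes = ["Honda"] * 20 + ["Hondx"] + ["Toyota"] * 10
print(clean_value("Hondx", makes))
print(clean_value("Toyota", makes))
```

The rare value `Hondx` is one edit away from the far more frequent `Honda`, so the frequent value wins; a value that is already common, such as `Toyota`, is left unchanged.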

Relevant publications:

Collective Entity Classification

Traditionally, the problem of identifying ambiguous entities in documents has been solved by looking at the immediate neighborhood of the entity within the document itself. In this work (done during an internship at IBM Research Labs, Bangalore, India), we looked at using signals from across documents to classify entities. The principal idea is that similar documents can often provide valuable clues to disambiguate entities that could not otherwise be classified.

Relevant publication:

Planning and Crowdsourcing

The best hole-in-the-wall restaurants in New York are known to New Yorkers, not to Yelp. Yet, when making a travel plan, local knowledge like this is easily overlooked because it is difficult to gather, organize, and tailor into a useful travel plan. We experimented with having an automated planner take travel requirements and constraints as input and ask humans for recommendations. We then made the planner automatically check constraints and guide the humans toward a better, more complete plan.

Relevant publications:

Social Networks Analytics

The avenues through which we express ourselves have shifted dramatically from the physical world to online channels, which lack the expressive power of direct human conversation. It has become harder to detect mental health issues such as depression and social anxiety. In this work, we study one particular social network, Reddit, for characteristics that indicate mental health issues. We also investigate how much the promise of online anonymity makes people more likely to share their true feelings.

Relevant publication:

Probabilistic Databases

Most data today comes with a degree of uncertainty, yet we continue to store it in deterministic databases because creating, querying, and doing the bookkeeping for probabilistic databases is a difficult problem. In this work, I extended the definitions of the various kinds of functional dependencies (deterministic, approximate, and conditional) to the realm of probabilistic databases. I proved certain properties of these dependencies and demonstrated efficient algorithms to compute their confidence in a given probabilistic database, and hence to mine them.
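To make the notion of the confidence of a dependency in a probabilistic database concrete, here is a small brute-force sketch. It rests on several assumptions that are not taken from the publication: tuples are independent and each exists with a given probability, the confidence of a dependency in one possible world is the largest fraction of tuples that can be kept so the dependency holds exactly, and the overall confidence is the expectation of that value over all possible worlds. The function names and sample rows are hypothetical, and the enumeration is exponential in the number of tuples, so it is not one of the efficient algorithms mentioned above.

```python
from collections import Counter, defaultdict
from itertools import product


def fd_confidence(rows, lhs, rhs):
    """Largest fraction of rows that can be kept so that lhs -> rhs holds exactly."""
    if not rows:
        return 1.0  # assumption: a dependency holds trivially in an empty world
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Within each lhs group, keep only the rows carrying the most common rhs value.
    kept = sum(max(rhs_counts.values()) for rhs_counts in groups.values())
    return kept / len(rows)


def expected_fd_confidence(prob_rows, lhs, rhs):
    """Expected confidence of lhs -> rhs over all possible worlds.

    prob_rows is a list of (row, probability) pairs; rows are assumed independent.
    """
    expected = 0.0
    for mask in product([False, True], repeat=len(prob_rows)):
        world_prob = 1.0
        world = []
        for present, (row, p) in zip(mask, prob_rows):
            world_prob *= p if present else 1.0 - p
            if present:
                world.append(row)
        expected += world_prob * fd_confidence(world, lhs, rhs)
    return expected


# Hypothetical table: two certain rows and one uncertain, conflicting row.
table = [
    ({"zip": "85281", "city": "Tempe"}, 1.0),
    ({"zip": "85281", "city": "Tempe"}, 1.0),
    ({"zip": "85281", "city": "Phoenix"}, 0.5),
]
print(expected_fd_confidence(table, "zip", "city"))
```

In this example the dependency zip -> city holds fully in the world where the uncertain row is absent and for two of three rows when it is present, so the expected confidence is 0.5 * 1 + 0.5 * 2/3 = 5/6.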

Relevant publication: