Using the crowd for top-k and group-by queries

Published

Conference Paper

Group-by and top-k are fundamental constructs in database queries. However, the criteria used for grouping and ordering certain types of data - such as unlabeled photos clustered by the same person ordered by age - are difficult to evaluate by machines. In contrast, these tasks are easy for humans to evaluate and are therefore natural candidates for being crowd-sourced. We study the problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions. Given two data elements, the answer to a type question is "yes" if the elements have the same type and therefore belong to the same group or cluster; the answer to a value question orders the two data elements. The assumption here is that there is an underlying ground truth, but that the answers returned by the crowd may sometimes be erroneous. We formalize the problems of top-k and group-by in the crowd-sourced setting, and give efficient algorithms that are guaranteed to achieve good results with high probability. We analyze the crowd-sourced cost of these algorithms in terms of the total number of type and value questions, and show that they are essentially the best possible. We also show that fewer questions are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases. Copyright 2013 ACM.

Full Text

Duke Authors

Cited Authors

  • Davidson, SB; Khanna, S; Milo, T; Roy, S

Published Date

  • January 1, 2013

Published In

  • Acm International Conference Proceeding Series

Start / End Page

  • 225 - 236

International Standard Book Number 13 (ISBN-13)

  • 9781450315982

Digital Object Identifier (DOI)

  • 10.1145/2448496.2448524

Citation Source

  • Scopus