Selecting data to clean for fact checking: Minimizing uncertainty vs. maximizing surprise


Journal Article

© VLDB Endowment. We study the optimization problem of selecting numerical quantities to clean in order to fact-check claims based on such data. Oftentimes, such claims are technically correct, but they can still mislead for two reasons. First, data may contain uncertainty and errors. Second, data can be “fished” to advance particular positions. In practice, fact-checkers cannot afford to clean all data and must choose to clean what “matters the most” to checking a claim. We explore alternative definitions of what “matters the most”: one is to ascertain claim qualities (by minimizing uncertainty in these measures), while an alternative is just to counter the claim (by maximizing the probability of finding a counterargument). We show whether the two objectives align with each other, with important implications on when fact-checkers should exercise care in selective data cleaning, to avoid potential bias introduced by their desire to counter claims. We develop efficient algorithms for solving the various variants of the optimization problem, showing significant improvements over naive solutions. The problem is particularly challenging because the objectives in the fact-checking context are complex, non-linear functions over data. We obtain results that generalize to a large class of functions, with potential applications beyond fact-checking.
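
The two objectives named in the abstract can be made concrete with a small toy model. The sketch below is not the paper's algorithm: it assumes, purely for illustration, that the claim quality is a weighted sum of the data values, that each value's true (cleaned) value is an independent Gaussian, and that a counterargument is found when the quality falls below a threshold after cleaning. All names (`remaining_variance`, `counter_probability`, `best_subset`) and all numbers are hypothetical, and the selection is a naive brute-force search over subsets of size `k`.

```python
from itertools import combinations
from math import erf, sqrt


def normal_cdf(x, mean, std):
    """P(N(mean, std^2) <= x); a zero std is treated as a point mass."""
    if std == 0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


def remaining_variance(weights, stds, cleaned):
    """Variance of the weighted sum left over after the cleaned items are known."""
    return sum((w * s) ** 2
               for i, (w, s) in enumerate(zip(weights, stds))
               if i not in cleaned)


def counter_probability(weights, current, means, stds, cleaned, threshold):
    """Probability that the claim quality drops below `threshold` once the
    cleaned items take their true values and the rest keep current values."""
    mean = sum(w * (means[i] if i in cleaned else current[i])
               for i, w in enumerate(weights))
    var = sum((weights[i] * stds[i]) ** 2 for i in cleaned)
    return normal_cdf(threshold, mean, sqrt(var))


def best_subset(n, k, score, maximize):
    """Naive exhaustive search over all size-k subsets of n items."""
    subsets = (frozenset(c) for c in combinations(range(n), k))
    return max(subsets, key=score) if maximize else min(subsets, key=score)


# Hypothetical data: three reported values, each with an assumed error model.
weights = [1.0, 1.0, 1.0]
current = [10.0, 10.0, 10.0]   # values as currently reported
means = [10.0, 10.0, 8.0]      # assumed mean of each true value
stds = [3.0, 2.0, 0.5]         # assumed standard deviation of each true value
threshold = 29.0               # a counter is found if the total falls below this
k = 1                          # cleaning budget: one item

# Objective 1: minimize the uncertainty that remains after cleaning.
min_var_choice = best_subset(
    len(weights), k,
    lambda s: remaining_variance(weights, stds, s),
    maximize=False)

# Objective 2: maximize the probability of finding a counterargument.
max_pr_choice = best_subset(
    len(weights), k,
    lambda s: counter_probability(weights, current, means, stds, s, threshold),
    maximize=True)

print(sorted(min_var_choice), sorted(max_pr_choice))
```

In this toy instance the two objectives select different items: the uncertainty objective cleans the noisiest value, while the counter objective cleans the value whose assumed true mean lies below its reported value. That divergence is the kind of selection bias the abstract cautions about. When every assumed mean equals the reported value, the two choices coincide in this model.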

Cited Authors

  • Sintos, S; Agarwal, PK; Yang, J

Published Date

  • January 1, 2020

Published In

  • Proceedings of the VLDB Endowment

Volume / Issue

  • 12 / 13

Start / End Page

  • 2408 - 2421

Electronic International Standard Serial Number (EISSN)

  • 2150-8097

Digital Object Identifier (DOI)

  • 10.14778/3358701.3358708

Citation Source

  • Scopus