Schemaless join for result set preferences

Publication, Conference

Gao, C; Pei, J; Wang, J; Chang, Y

Published in: Proceedings 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017

November 8, 2017

In many applications, such as data integration and big data analytics, data from multiple sources must be integrated without detailed and accurate schema information. The state of the art focuses on matching attributes among sources based on information derived from the data in those sources. However, the best join result according to a method's own predetermined criteria may not serve a user's best interest. In this paper, we tackle the challenge from a novel angle and investigate how to join schemaless tables so as to best meet a user preference. We identify a set of essential preferences that are useful in various scenarios, such as minimizing the number of tuples in outer join results and maximizing the entropy of the joining key's distribution. We also develop a systematic method to compute the best join predicate, that is, the one optimizing an objective function that represents a user preference. We conduct extensive experiments on 4 large datasets and compare against 4 baselines from the state of the art in schema matching and attribute clustering. The experimental results show that our algorithm significantly outperforms the baselines in accuracy in all cases while consuming comparable running time.
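The two preferences named in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not describe. It assumes tables are plain dictionaries mapping column names to value lists, that a join predicate is a single-column equality, and that the best predicate is found by exhaustive search. The names `outer_join_size`, `key_entropy`, and `best_join` are hypothetical, and measuring entropy over the matched tuples is one possible reading of "the joining key's distribution".

```python
from collections import Counter
from itertools import product
from math import log2

def outer_join_size(left_keys, right_keys):
    """Tuple count of the full outer join of two columns on equality."""
    lc, rc = Counter(left_keys), Counter(right_keys)
    matched = sum(lc[k] * rc[k] for k in lc.keys() & rc.keys())
    left_only = sum(n for k, n in lc.items() if k not in rc)
    right_only = sum(n for k, n in rc.items() if k not in lc)
    return matched + left_only + right_only

def key_entropy(left_keys, right_keys):
    """Shannon entropy (bits) of the join key over matched tuples (assumed reading)."""
    lc, rc = Counter(left_keys), Counter(right_keys)
    weights = [lc[k] * rc[k] for k in lc.keys() & rc.keys()]
    total = sum(weights)
    if total == 0:
        return 0.0  # no matching keys, so no distribution to measure
    return -sum(w / total * log2(w / total) for w in weights)

def best_join(left, right, objective, maximize=False):
    """Score every (left column, right column) pair and return the best one."""
    candidates = product(left, right)  # iterates over column names
    score = lambda pair: objective(left[pair[0]], right[pair[1]])
    return (max if maximize else min)(candidates, key=score)

# Hypothetical toy tables with uninformative column names.
left = {"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]}
right = {"c": [3, 4, 5], "d": ["x", "x", "z"]}

smallest = best_join(left, right, outer_join_size)
most_even = best_join(left, right, key_entropy, maximize=True)
```

Swapping the `objective` argument is what changes the user preference here; the search itself stays the same. Exhaustive search over column pairs is only practical for narrow tables and is used purely to keep the sketch short.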

Duke Scholars

Jian Pei (Author), Computer Science

Published In: Proceedings 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017

DOI: 10.1109/IRI.2017.26

Publication Date: November 8, 2017

Volume: 2017-January

Start / End Page: 569 / 578

Citation

APA: Gao, C., Pei, J., Wang, J., & Chang, Y. (2017). Schemaless join for result set preferences. In Proceedings 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017 (Vol. 2017-January, pp. 569–578). https://doi.org/10.1109/IRI.2017.26

Chicago: Gao, C., J. Pei, J. Wang, and Y. Chang. “Schemaless join for result set preferences.” In Proceedings 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017, 2017-January:569–78, 2017. https://doi.org/10.1109/IRI.2017.26.

ICMJE: Gao C, Pei J, Wang J, Chang Y. Schemaless join for result set preferences. In: Proceedings 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017. 2017. p. 569–78.

MLA: Gao, C., et al. “Schemaless join for result set preferences.” Proceedings 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017, vol. 2017-January, 2017, pp. 569–78. Scopus, doi:10.1109/IRI.2017.26.

NLM: Gao C, Pei J, Wang J, Chang Y. Schemaless join for result set preferences. Proceedings 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017. 2017. p. 569–578.
