Schemaless join for result set preferences
In many applications, such as data integration and big data analytics, one has to integrate data from multiple sources without detailed and accurate schema information. State-of-the-art methods focus on matching attributes across sources based on information derived from the data in those sources. However, the join result that is best according to a method's own predetermined criteria may not serve a user's best interest. In this paper, we tackle the challenge from a novel angle and investigate how to join schemaless tables so as to best meet a user preference. We identify a set of essential preferences that are useful in various scenarios, such as minimizing the number of tuples in the outer join result and maximizing the entropy of the joining key's distribution. We also develop a systematic method to compute the best join predicate, that is, the one optimizing an objective function that represents a user preference. We conduct extensive experiments on four large datasets and compare with four baselines from the state of the art in schema matching and attribute clustering. The experimental results show that our algorithm significantly outperforms the baselines in accuracy in all cases while consuming comparable running time.
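To make the two example preferences concrete, the following Python sketch scores candidate join predicates by full outer join size and by the entropy of the joining key's distribution. It is an illustration under stated assumptions and not the method developed in the paper: tables are modeled as lists of positional tuples without column names, candidate predicates are restricted to equality between one column from each table, the entropy is taken over key values weighted by the number of matched result tuples, and the search is plain brute force. The names `outer_join_size`, `join_key_entropy`, and `best_predicate` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def outer_join_size(left_keys, right_keys):
    """Number of tuples in the full outer join on key equality."""
    lc, rc = Counter(left_keys), Counter(right_keys)
    # Each shared key contributes (left count) x (right count) matched tuples.
    matched = sum(lc[k] * rc[k] for k in lc.keys() & rc.keys())
    # Unmatched rows on either side each appear once, padded with nulls.
    left_only = sum(n for k, n in lc.items() if k not in rc)
    right_only = sum(n for k, n in rc.items() if k not in lc)
    return matched + left_only + right_only


def join_key_entropy(left_keys, right_keys):
    """Shannon entropy (bits) of key values over matched join tuples.

    Assumption: the distribution is defined over the inner-join result.
    """
    lc, rc = Counter(left_keys), Counter(right_keys)
    weights = [lc[k] * rc[k] for k in lc.keys() & rc.keys()]
    total = sum(weights)
    if total == 0:
        return 0.0  # no matching keys, so no distribution to measure
    return -sum((w / total) * log2(w / total) for w in weights)


def best_predicate(left, right, objective, maximize):
    """Brute-force search over single-column equi-join predicates.

    Returns the (left column index, right column index) pair with the
    best objective value; ties go to the first pair enumerated.
    """
    candidates = product(range(len(left[0])), range(len(right[0])))

    def score(pair):
        i, j = pair
        return objective([row[i] for row in left], [row[j] for row in right])

    return (max if maximize else min)(candidates, key=score)


# Two small tables with no schema: columns are known only by position.
left = [(1, "a"), (2, "b"), (3, "c")]
right = [("x", 1), ("y", 2), ("z", 3)]

smallest_outer_join = best_predicate(left, right, outer_join_size, maximize=False)
highest_entropy = best_predicate(left, right, join_key_entropy, maximize=True)
print(smallest_outer_join, highest_entropy)  # both are (0, 1) on this data
```

On this toy input both preferences select the same column pair, but in general different objectives can favor different predicates, which is why the preference is treated as an input to the search rather than fixed in advance.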