Mining top-k sequential patterns in transaction database graphs: A new challenging problem and a sampling-based approach
In many real world networks, a vertex is usually associated with a transaction database that comprehensively describes the behaviour of the vertex. A typical example is a social network, where the behaviours of every user are depicted by a transaction database that stores her daily posted contents. Specifically, a transaction database consists of a collection of transactions, where each transaction corresponds to a piece of tweet. For each transaction, it consists of a set of items, where each item may correspond to a keyword or a piece of video clip contained in this tweet. To model such type of scenario, we propose the novel notion of the transaction database graph, where each vertex is associated with a transaction database. Every path of the graph is a sequence of vertices that induces multiple sequences of transactions. The sequences of transactions induced by all of the paths in the graph form an extremely large sequence database. Finding frequent sequential patterns from such sequence database discovers interesting subsequences that frequently appear in many paths of the network. Our goal is to find the top-k frequent sequential patterns in the sequence database induced from a transaction database graph. However, it is challenging since the sequence database induced by a transaction database graph is too large to be explicitly induced and stored, and finding the top-k frequent sequential patterns is #P-hard. To tackle this problem, we propose an efficient two-step sampling algorithm that approximates the top-k frequent sequential patterns with the provable quality guarantee. Extensive experimental results on synthetic and real-world data sets demonstrate the effectiveness and efficiency of our method.
Duke Scholars
Published In
DOI
EISSN
ISSN
Publication Date
Volume
Issue
Start / End Page
Related Subject Headings
- Information Systems
- 46 Information and computing sciences
- 0806 Information Systems
- 0805 Distributed Computing
- 0804 Data Format
Citation
Published In
DOI
EISSN
ISSN
Publication Date
Volume
Issue
Start / End Page
Related Subject Headings
- Information Systems
- 46 Information and computing sciences
- 0806 Information Systems
- 0805 Distributed Computing
- 0804 Data Format