Jian Pei

Arthur S. Pearse Distinguished Professor of Computer Science
Computer Science
308 Research Drive, Durham, NC 27708

Selected Publications


Ask Questions With Double Hints: Visual Question Generation With Answer-Awareness and Region-Reference.

Journal Article IEEE transactions on pattern analysis and machine intelligence · December 2024 The visual question generation (VQG) task aims to generate human-like questions from an image and potentially other side information (e.g., answer type). Previous works on VQG fall in two aspects: i) They suffer from one image to many questions mapping pro ...

The Fourth International Workshop on Smart Data for Blockchain and Distributed Ledger (SDBD'24)

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 25, 2024 With the advent of Bitcoin, a cryptographically-enabled peer-to-peer digital payment system, blockchain together with a whole package of distributed ledger technologies, which serve as the underlying foundation of all the crypto-currencies, have been gaini ...

Applications and Computation of the Shapley Value in Databases and Machine Learning

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 9, 2024 Recently, the Shapley value, a concept rooted in cooperative game theory, has found more and more applications in databases and machine learning. Due to its combinatoric nature, the computation of the Shapley value is #P-hard. To address this challenge, nu ...

Linear-Time Graph Neural Networks for Scalable Recommendations

Conference WWW 2024 - Proceedings of the ACM Web Conference · May 13, 2024 In an era of information explosion, recommender systems are vital tools to deliver personalized recommendations for users. The key of recommender systems is to forecast users' future behaviors based on previous user-item interactions. Due to their strong e ...

FairSample: Training Fair and Accurate Graph Convolutional Neural Networks Efficiently

Journal Article IEEE Transactions on Knowledge and Data Engineering · April 1, 2024 Fairness in Graph Convolutional Neural Networks (GCNs) becomes a more and more important concern as GCNs are adopted in many crucial applications. Societal biases against sensitive groups may exist in many real world graphs. GCNs trained on those graphs ma ...

Optimization of Graph Clustering Inspired by Dynamic Belief Systems

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2024 Graph clustering is essential to understand the nature and behavior of real world such as social network, technical network and transportation network. Different from the existing studies, we propose a new Markov clustering method inspired by belief dynami ...

Database Native Model Selection: Harnessing Deep Neural Networks in Database Systems

Conference Proceedings of the VLDB Endowment · January 1, 2024 The growing demand for advanced analytics beyond statistical aggregation calls for database systems that support effective model selection of deep neural networks (DNNs). However, existing model selection strategies are based on either training-based algor ...

Protecting Data Buyer Privacy in Data Markets

Journal Article IEEE Internet Computing · January 1, 2024 Data markets serve as crucial platforms facilitating data discovery, exchange, sharing, and integration among data users and providers. However, the paramount concern of privacy has predominantly centered on protecting privacy of data owners and third part ...

Shapley Value Approximation Based on Complementary Contribution

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2024 Shapley value provides a unique way to fairly assess each player's contribution in a coalition and has enjoyed many applications. However, the exact computation of Shapley value is #P-hard due to the combinatoric nature of Shapley value. Many existi ...

Position: TRUSTLLM: Trustworthiness in Large Language Models

Conference Proceedings of Machine Learning Research · January 1, 2024 Large language models (LLMs) have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. This paper introduces TRUSTLLM, a c ...

Counterfactual Explanation of Shapley Value in Data Coalitions

Journal Article Proceedings of the VLDB Endowment · January 1, 2024 The Shapley value is widely used for data valuation in data markets. However, explaining the Shapley value of an owner in a data coalition is an unexplored and challenging task. To tackle this, we formulate the problem of finding the counterfactual explana ...

Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

Conference SIGIR-AP 2023 - Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region · November 26, 2023 Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoder ...

RUEL: Retrieval-Augmented User Representation with Edge Browser Logs for Sequential Recommendation

Conference International Conference on Information and Knowledge Management, Proceedings · October 21, 2023 Online recommender systems (RS) aim to match user needs with the vast amount of resources available on various platforms. A key challenge is to model user preferences accurately under the condition of data sparsity. To address this challenge, some methods ...

Cost-Sensitive Learning for Medical Insurance Fraud Detection With Temporal Information

Journal Article IEEE Transactions on Knowledge and Data Engineering · October 1, 2023 Fraudulent activities within the U.S. healthcare system cost billions of dollars each year and harm the wellbeing of many qualifying beneficiaries. The implementation of an effective fraud detection method has become imperative to secure the welfare of the ...

DP2-Pub: Differentially Private High-Dimensional Data Publication With Invariant Post Randomization

Journal Article IEEE Transactions on Knowledge and Data Engineering · October 1, 2023 A large amount of high-dimensional and heterogeneous data appear in practical applications, which are often published to third parties for data analysis, recommendations, targeted advertising, and reliable predictions. However, publishing these data may di ...

Deep Learning on Graphs: Methods and Applications (DLG-KDD2023)

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 6, 2023 Deep Learning models are at the core of research in Artificial Intelligence research today. A tide in research for deep learning on graphs or graph neural networks. This wave of research at the intersection of graph theory and deep learning has also influe ...

Graph Neural Networks: Foundation, Frontiers and Applications

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 6, 2023 The field of graph neural networks (GNNs) has seen rapid and incredible strides over the recent years. Graph neural networks, also known as deep learning on graphs, graph representation learning, or geometric deep learning, have become one of the fastest-g ...

Serverless Federated AUPRC Optimization for Multi-Party Collaborative Imbalanced Data Mining

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 6, 2023 To address the big data challenges, serverless multi-party collaborative training has recently attracted attention in the data mining community, since they can cut down the communications cost by avoiding the server node bottleneck. However, traditional se ...

Decentralized Composite Optimization in Stochastic Networks: A Dual Averaging Approach with Linear Convergence

Journal Article IEEE Transactions on Automatic Control · August 1, 2023 Decentralized optimization, particularly the class of decentralized composite convex optimization (DCCO) problems, has found many applications. Due to ubiquitous communication congestion and random dropouts in practice, it is highly desirable to design dec ...

Identify Event Causality with Knowledge and Analogy

Conference Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 · June 27, 2023 Event causality identification (ECI) aims to identify the causal relationship between events, which plays a crucial role in deep text understanding. Due to the diversity of real-world causality events and difficulty in obtaining sufficient training data, e ...

A Graph Fusion Approach for Cross-Lingual Machine Reading Comprehension

Conference Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 · June 27, 2023 Although great progress has been made for Machine Reading Comprehension (MRC) in English, scaling out to a large number of languages remains a huge challenge due to the lack of large amounts of annotated training data in non-English languages. To address t ...

Multi-Behavior Sequential Recommendation With Temporal Graph Transformer

Journal Article IEEE Transactions on Knowledge and Data Engineering · June 1, 2023 Modeling time-evolving preferences of users with their sequential item interactions, has attracted increasing attention in many online applications. Hence, sequential recommender systems have been developed to learn the dynamic user interests from the hist ...

Permutation-Equivariant and Proximity-Aware Graph Neural Networks With Stochastic Message Passing

Journal Article IEEE Transactions on Knowledge and Data Engineering · June 1, 2023 Graph neural networks (GNNs) are emerging machine learning models on graphs. Permutation-equivariance and proximity-awareness are two important properties highly desirable for GNNs. Both properties are needed to tackle some challenging graph problems, such ...

Data-Driven Learning for Data Rights, Data Pricing, and Privacy Computing

Journal Article Engineering · June 1, 2023 In recent years, data has become one of the most important resources in the digital economy. Unlike traditional resources, the digital nature of data makes it difficult to value and contract. Therefore, establishing an efficient and standard data-transacti ...

Offline Policy Evaluation in Large Action Spaces via Outcome-Oriented Action Grouping

Conference ACM Web Conference 2023 - Proceedings of the World Wide Web Conference, WWW 2023 · April 30, 2023 Offline policy evaluation (OPE) aims to accurately estimate the performance of a hypothetical policy using only historical data, which has drawn increasing attention in a wide range of applications including recommender systems and personalized medicine. W ...

Tutorials at The Web Conference 2023

Conference Companion Proceedings of the ACM Web Conference 2023 · April 30, 2023

Efficiently Cleaning Structured Event Logs: A Graph Repair Approach

Journal Article ACM Transactions on Database Systems · March 14, 2023 Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or concealing interesting patterns from ev ...

Eigen-GNN: A Graph Structure Preserving Plug-in for GNNs

Journal Article IEEE Transactions on Knowledge and Data Engineering · March 1, 2023 Graph Neural Networks (GNNs) are emerging machine learning models on graphs. Although sufficiently deep GNNs are shown theoretically capable of fully preserving graph structures, most existing GNN models in practice are shallow and essentially feature-cent ...

Applications of Differential Privacy in Social Network Analysis: A Survey

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2023 Differential privacy provides strong privacy preservation guarantee in information sharing. As social network analysis has been enjoying many applications, it opens a new arena for applications of differential privacy. This article presents a comprehensive ...

Dynamic Shapley Value Computation

Conference Proceedings - International Conference on Data Engineering · January 1, 2023 With the prevalence of data-driven research, data valuation has attracted attention from the computer science field. How to appraise a single datum becomes an imperative problem, especially in the context of machine learning. Shapley value is widely used t ...

Disentangled Graph Social Recommendation

Conference Proceedings - International Conference on Data Engineering · January 1, 2023 Social recommender systems have drawn a lot of attention in many online web services, because of the incorporation of social information between users in improving recommendation results. Despite the significant progress made by existing solutions, we argu ...

Factual Observation Based Heterogeneity Learning for Counterfactual Prediction

Conference Proceedings of Machine Learning Research · January 1, 2023 Extant causal methods exclusively exploit the heterogeneity based on the observed covariates for heterogeneous outcome prediction. Even with nowadays big data, the collected covariates may not contain complete confounders. When some confounders are absent, ...

Preface: The 2023 ACM SIGKDD Workshop on Causal Discovery, Prediction and Decision

Conference Proceedings of Machine Learning Research · January 1, 2023

Data and AI Model Markets: Opportunities for Data and Model Sharing, Discovery, and Integration

Conference Proceedings of the VLDB Endowment · January 1, 2023 The markets for data and AI models are rapidly emerging and increasingly significant in the realm and the practices of data science and artificial intelligence. These markets are being studied from diverse perspectives, such as e-commerce, economics, machi ...

Alleviating Over-smoothing for Unsupervised Sentence Representation

Conference Proceedings of the Annual Meeting of the Association for Computational Linguistics · January 1, 2023 Currently, learning better unsupervised sentence representations is the pursuit of many natural language processing communities. Lots of approaches based on pre-trained language models (PLMs) and contrastive learning have achieved promising results on this ...

Structural Contrastive Pretraining for Cross-Lingual Comprehension

Conference Proceedings of the Annual Meeting of the Association for Computational Linguistics · January 1, 2023 Multilingual language models trained using various pre-training tasks like mask language modeling (MLM) have yielded encouraging results on a wide range of downstream tasks. Despite the promising performances, structural knowledge in cross-lingual corpus i ...

LazyGNN: Large-Scale Graph Neural Networks via Lazy Propagation

Conference Proceedings of Machine Learning Research · January 1, 2023 Recent works have demonstrated the benefits of capturing long-distance dependency in graphs by deeper graph neural networks (GNNs). But deeper GNNs suffer from the long-lasting scalability challenge due to the neighborhood explosion problem in large-scale ...

Clinical Assessment of Pneumocystosis with MIMIC Data

Conference Proceedings - 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023 · January 1, 2023 Pneumocystosis remains a life-threatening disease with a high mortality rate. It's critical to understand its clinical course and risk factors for better disease management. In this retrospective analysis, we aimed to elucidate the prognostic determinants ...

TrustLOG: The First Workshop on Trustworthy Learning on Graphs

Conference International Conference on Information and Knowledge Management, Proceedings · October 17, 2022 Learning on graphs (LOG) plays a pivotal role in various high-impact application domains. The past decades have developed tremendous theories, algorithms, and open-source systems in answering what/who questions on graphs. However, recent studies reveal tha ...

A Survey on Data Pricing: From Economics to Data Science

Journal Article IEEE Transactions on Knowledge and Data Engineering · October 1, 2022 Data are invaluable. How can we assess the value of data objectively, systematically and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, marketing, electron ...

Improving Social Network Embedding via New Second-Order Continuous Graph Neural Networks

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2022 Graph neural networks (GNN) are powerful tools in many web research problems. However, existing GNNs are not fully suitable for many real-world web applications. For example, over-smoothing may affect personalized recommendations and the lack of an explana ...

Deep Learning on Graphs: Methods and Applications (DLG-KDD2022)

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2022 Deep Learning models are at the core of research in Artificial Intelligence research today. A tide in research for deep learning on graphs or graph neural networks. This wave of research at the intersection of graph theory and deep learning has also influe ...

Graph Neural Networks: Foundation, Frontiers and Applications

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2022 The field of graph neural networks (GNNs) has seen rapid and incredible strides over the recent years. Graph neural networks, also known as deep learning on graphs, graph representation learning, or geometric deep learning, have become one of the fastest-g ...

Data pricing in machine learning pipelines

Journal Article Knowledge and Information Systems · June 1, 2022 Machine learning is disruptive. At the same time, machine learning can only succeed by collaboration among many parties in multiple steps naturally as pipelines in an eco-system, such as collecting data for possible machine learning applications, collabora ...

Multiple Choice Questions based Multi-Interest Policy Learning for Conversational Recommendation

Conference WWW 2022 - Proceedings of the ACM Web Conference 2022 · April 25, 2022 Conversational recommendation system (CRS) is able to obtain fine-grained and dynamic user preferences based on interactive dialogue. Previous CRS assumes that the user has a clear target item, which often deviates from the real scenario, that is for many ...

Robust Self-Supervised Structural Graph Neural Network for Social Network Prediction

Conference WWW 2022 - Proceedings of the ACM Web Conference 2022 · April 25, 2022 The self-supervised graph representation learning has achieved much success in recent web based research and applications, such as recommendation system, social networks, and anomaly detection. However, existing works suffer from two problems. Firstly, in ...

Heterogeneous global graph neural networks for personalized session-based recommendation

Conference WSDM 2022 - Proceedings of the 15th ACM International Conference on Web Search and Data Mining · February 11, 2022 Predicting the next interaction of a short-term interaction session is a challenging task in session-based recommendation. Almost all existing works rely on item transition patterns, and neglect user historical sessions while modeling user preference, whic ...

Two-Dimensional Functional Principal Component Analysis for Image Feature Extraction

Journal Article Journal of Computational and Graphical Statistics · January 1, 2022 Methodologies for functional principal component analysis are well established in the one-dimensional setting. However, for two-dimensional surfaces, for example, images, conducting functional principal component analysis is complicated and challenging, be ...

Mining Minority-Class Examples with Uncertainty Estimates

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2022 In the real world, the frequency of occurrence of objects is naturally skewed forming long-tail class distributions, which results in poor performance on the statistically rare classes. A promising solution is to mine tail-class examples to balance the tra ...

Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization

Journal Article Journal of Machine Learning Research · January 1, 2022 In the paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both nonconvex mini-optimization and minimax-optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method for black-box min ...

From good to best: Two-stage training for cross-lingual machine reading comprehension

Conference Proceedings of the AAAI Conference on Artificial Intelligence · 2022

Graph neural networks

Chapter · 2022

Fair and efficient contribution valuation for vertical federated learning

Journal Article arXiv preprint arXiv:2201.02658 · 2022

Cosine Model Watermarking Against Ensemble Distillation

Journal Article arXiv preprint arXiv:2203.02777 · 2022

Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling

Journal Article arXiv preprint arXiv:2204.05210 · 2022

Spatial-Temporal Hypergraph Self-Supervised Learning for Crime Prediction

Journal Article arXiv preprint arXiv:2204.08587 · 2022

Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding

Journal Article arXiv preprint arXiv:2205.03656 · 2022

Transformer-Empowered Content-Aware Collaborative Filtering

Journal Article arXiv preprint arXiv:2204.00849 · 2022

Trustworthy Graph Neural Networks: Aspects, Methods and Trends

Journal Article arXiv preprint arXiv:2205.07424 · 2022

Communication-Efficient Robust Federated Learning with Noisy Labels

Journal Article arXiv preprint arXiv:2206.05558 · 2022

Revealing Unfair Models by Mining Interpretable Evidence

Journal Article arXiv preprint arXiv:2207.05811 · 2022

On Shapley Value in Data Assemblage Under Independent Utility

Conference Proceedings of the VLDB Endowment · January 1, 2022 In many applications, an organization may want to acquire data from many data owners. Data marketplaces allow data owners to produce data assemblage needed by data buyers through coalition. To encourage coalitions to produce data, it is critical to allocat ...

Data Mining: Concepts and Techniques, Fourth Edition

Chapter · January 1, 2022 Data Mining: Concepts and Techniques, Fourth Edition introduces concepts, principles, and methods for mining patterns, knowledge, and models from various kinds of data for diverse applications. Specifically, it delves into the processes for uncovering patt ...

Combining Unstructured Content and Knowledge Graphs into Recommendation Datasets

Conference CEUR Workshop Proceedings · January 1, 2022 Popular book and movie recommendation datasets can be associated with Knowledge Graphs (KG) that enable the development of KG-based recommender systems. However, most of these approaches are based on Collaborative Filtering, leaving Content-based Filtering ...

Toward Unified Data and Algorithm Fairness via Adversarial Data Augmentation and Adaptive Model Fine-tuning

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2022 There is some recent research interest in algorithmic fairness for biased data. There are a variety of pre-, in-, and post-processing methods designed for this problem. However, these methods are exclusively targeting data unfairness and algorithmic unfair ...

Label-aware Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding

Conference Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 · January 1, 2022 Despite the great success of spoken language understanding (SLU) in high-resource languages, it remains challenging in low-resource languages mainly due to the lack of labeled training data. The recent multilingual code-switching approach achieves better a ...

Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval

Conference Findings of the Association for Computational Linguistics: EMNLP 2022 · January 1, 2022 Recent multilingual pre-trained models have shown better performance in various multilingual tasks. However, these models perform poorly on multilingual retrieval tasks due to lacking multilingual training data. In this paper, we propose to mine and genera ...

Graph Neural Networks: Foundations, Frontiers, and Applications

Book · January 1, 2022 Deep Learning models are at the core of artificial intelligence research today. It is well known that deep learning techniques are disruptive for Euclidean data, such as images or sequence data, and not immediately applicable to graph-structured data such ...

Revisiting Graph Contrastive Learning from the Perspective of Graph Spectrum

Conference Advances in Neural Information Processing Systems · January 1, 2022 Graph Contrastive Learning (GCL), learning the node representations by augmenting graphs, has attracted considerable attentions. Despite the proliferation of various graph augmentation strategies, some fundamental questions still remain unclear: what infor ...

Model complexity of deep learning: a survey

Journal Article Knowledge and Information Systems · October 1, 2021 Model complexity is a fundamental problem in deep learning. In this paper, we conduct a systematic overview of the latest studies on model complexity in deep learning. Model complexity of deep learning can be categorized into expressive capacity and effect ...

AsySQN: Faster Vertical Federated Learning Algorithms with Better Computation Resource Utilization

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021 Vertical federated learning (VFL) is an effective paradigm of training the emerging cross-organizational (e.g., different corporations, companies and organizations) collaborative learning with privacy preserving. Stochastic gradient descent (SGD) methods a ...

Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021 Named entity recognition (NER) is a fundamental component in many applications, such as Web Search and Voice Assistants. Although deep neural networks greatly improve the performance of NER, due to the requirement of large amounts of training data, deep ne ...

The Sixth International Workshop on Deep Learning on Graphs - Methods and Applications (DLG-KDD'21)

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021 Deep Learning models are at the core of research in Artificial Intelligence research today. A tide in research for deep learning on graphs or graph neural networks. This wave of research at the intersection of graph theory and deep learning has also influe ...

Data Pricing and Data Asset Governance in the AI Era

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021 Data is one of the most critical resources in the AI Era. While substantial research has been dedicated to training machine learning models using various types of data, much less efforts have been invested in the exploration of assessing and governing data ...

Towards Fair Federated Learning

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021 Federated learning has become increasingly popular as it facilitates collaborative training of machine learning models among multiple clients while preserving their data privacy. In practice, one major challenge for federated learning is to achieve fairnes ...

Auto-Split: A General Framework of Collaborative Edge-Cloud AI

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021 In many industry scale applications, large and resource consuming machine learning models reside in powerful cloud servers. At the same time, large amounts of input data are collected at the edge of cloud. The inference results are also communicated to use ...

The Third International Workshop on Smart Data for Blockchain and Distributed Ledger (SDBD2021): Joint Workshop with SIGKDD 2021 Trust Day

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021 Today's computing is characterized by an increasing degree of complexity, comprehensiveness and collaboration. The complexity can be observed by the wide application of gigantic models with a huge number of parameters and structures of an unprecedented lev ...

Language Scaling

Conference Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining · August 14, 2021

CalibreNet: Calibration Networks for Multilingual Sequence Labeling

Conference WSDM 2021 - Proceedings of the 14th ACM International Conference on Web Search and Data Mining · August 3, 2021 Lack of training data in low-resource languages presents huge challenges to sequence labeling tasks such as named entity recognition (NER) and machine reading comprehension (MRC). One major obstacle is the errors on the boundary of predicted answers. To ta ...

Group-Based Skyline for Pareto Optimal Groups

Journal Article IEEE Transactions on Knowledge and Data Engineering · July 1, 2021 Skyline computation, aiming at identifying a set of skyline points that are not dominated by any other point, is particularly useful for multi-criteria data analysis and decision making. Traditional skyline computation, however, is inadequate to answer que ...

Visually aware recommendation with aesthetic features

Journal Article VLDB Journal · July 1, 2021 Visual information plays a critical role in human decision-making process. Recent developments on visually aware recommender systems have taken the product image into account. We argue that the aesthetic factor is very important in modeling and predicting ...

Automating entity matching model development

Conference Proceedings - International Conference on Data Engineering · April 1, 2021 This paper seeks to answer one important but unexplored question for Entity Matching (EM): can we develop a good machine learning pipeline automatically for the EM task? If yes, to what extent the process can be automated? To answer this question, we find ...

Eclipse: Generalizing kNN and skyline

Conference Proceedings - International Conference on Data Engineering · April 1, 2021 k nearest neighbor (kNN) queries and skyline queries are important operators on multi-dimensional data points. Given a query point, kNN returns the k nearest neighbors based on a scoring function such as a weighted sum of the attributes, which requires pre ... Full text Cite

Influence Analysis in Evolving Networks: A Survey

Journal Article IEEE Transactions on Knowledge and Data Engineering · March 1, 2021 Influence analysis aims at detecting influential vertices in networks and utilizing them in cost-effective business strategies. Influence analysis in large-scale networks is a key technique in many important applications ranging from viral marketing and on ... Full text Cite

Early diagnosis of Alzheimer's disease on ADNI data using novel longitudinal score based on functional principal component analysis.

Journal Article Journal of medical imaging (Bellingham, Wash.) · March 2021 Methods: Alzheimer's disease (AD) is a worldwide prevalent age-related neurodegenerative disease with no available cure yet. Early prognosis is therefore crucial for planning proper clinical intervention. It is especially true for people diagnosed w ... Full text Cite

Robust Counterfactual Explanations on Graph Neural Networks

Conference Advances in Neural Information Processing Systems · January 1, 2021 Massive deployment of Graph Neural Networks (GNNs) in high-stake applications generates a strong demand for explanations that are robust to noise and align well with human intuition. Most existing methods generate explanations by identifying a subgraph of ... Cite

Personalized Cross-Silo Federated Learning on Non-IID Data

Conference 35th AAAI Conference on Artificial Intelligence, AAAI 2021 · January 1, 2021 Non-IID data present a tough challenge for federated learning. In this paper, we explore a novel idea of facilitating pairwise collaborations between clients with similar data. We propose FedAMP, a new method employing federated attentive message passing t ... Cite

Knowledge-Enhanced Hierarchical Graph Transformer Network for Multi-Behavior Recommendation

Conference 35th AAAI Conference on Artificial Intelligence, AAAI 2021 · January 1, 2021 Accurate user and item embedding learning is crucial for modern recommender systems. However, most existing recommendation techniques have thus far focused on modeling users’ preferences over singular type of user-item interactions. Many practical recommen ... Cite

Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding

Conference EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings · January 1, 2021 Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages. Although various data augmentation approaches have been proposed to synthesize training data in low-resource target languages, th ... Cite

Modeling Event-Pair Relations in External Knowledge Graphs for Script Reasoning

Conference Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 · January 1, 2021 Script reasoning infers subsequent events from a given event chain, which involves the ability to understand relations between events. A human-labeled script reasoning dataset is usually of small size with limited event relations, which highlights the nece ... Cite

Demonstration of Dealer: An end-to-end model marketplace with differential privacy

Conference Proceedings of the VLDB Endowment · January 1, 2021 Data-driven machine learning (ML) has witnessed great success across a variety of application domains. Since ML model training relies on a large amount of data, there is a growing demand for high-quality data to be collected for ML model training. Data mar ... Full text Cite

SlimChain: Scaling blockchain transactions through off-chain storage and parallel processing

Conference Proceedings of the VLDB Endowment · January 1, 2021 Blockchain technology has emerged as the cornerstone of many decentralized applications operating among otherwise untrusted peers. However, it is well known that existing blockchain systems do not scale well. Transactions are often executed and committed s ... Full text Cite

Reasoning over entity-action-location graph for procedural text understanding

Conference ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference · January 1, 2021 Procedural text understanding aims at tracking the states (e.g., create, move, destroy) and locations of the entities mentioned in a given paragraph. To effectively track the states and locations, it is essential to capture the rich semantic relations betw ... Cite

Comprehensible counterfactual explanation on Kolmogorov-Smirnov test

Conference Proceedings of the VLDB Endowment · January 1, 2021 The Kolmogorov-Smirnov (KS) test is popularly used in many applications, such as anomaly detection, astronomy, database security and AI systems. One challenge remained untouched is how we can obtain an explanation on why a test set fails the KS test. In th ... Full text Cite

Finding Representative Interpretations on Convolutional Neural Networks

Conference Proceedings of the IEEE International Conference on Computer Vision · January 1, 2021 Interpreting the decision logic behind effective deep convolutional neural networks (CNN) on images complements the success of deep learning models. However, the existing methods can only interpret some specific decision logic on individual or a small numb ... Full text Cite

Reinforced Multi-Teacher Selection for Knowledge Distillation

Conference 35th AAAI Conference on Artificial Intelligence, AAAI 2021 · January 1, 2021 In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation transfers knowledge ... Cite

Dealer: An end-to-end model marketplace with differential privacy

Journal Article Proceedings of the VLDB Endowment · January 1, 2021 Data-driven machine learning has become ubiquitous. A marketplace for machine learning models connects data owners and model buyers, and can dramatically facilitate data-driven machine learning applications. In this paper, we take a formal data marketplace ... Full text Cite

Skyline diagram: Efficient space partitioning for skyline queries

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2021 Skyline queries are important in many application domains. In this paper, we propose a novel structure Skyline Diagram, which given a set of points, partitions the plane into a set of regions, referred to as skyline polyominos. All query points in the same ... Full text Cite

Graph neural networks for natural language processing: A survey

Journal Article arXiv preprint arXiv:2106.06090 · 2021 Cite

Language Scaling: Applications, Challenges and Approaches

Conference Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining · 2021 Cite

Finding representative interpretations on convolutional neural networks

Conference Proceedings of the IEEE/CVF International Conference on Computer Vision · 2021 Cite

FedFair: Training fair models in cross-silo federated learning

Journal Article arXiv preprint arXiv:2109.05662 · 2021 Cite

Achieving Model Fairness in Vertical Federated Learning

Journal Article arXiv preprint arXiv:2109.08344 · 2021 Cite

Improving fairness for data valuation in federated learning

Journal Article arXiv preprint arXiv:2109.09046 · 2021 Cite

Recent Advances on Graph Analytics and Its Applications in Healthcare

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020 Graph is a natural representation encoding both the features of the data samples and relationships among them. Analysis with graphs is a classic topic in data mining and many techniques have been proposed in the past. In recent years, because of the rapid ... Full text Cite

Data Pricing - From Economics to Data Science

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020 Data are invaluable. How can we assess the value of data objectively and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, data management, data mining, elect ... Full text Cite

Mining Implicit Relevance Feedback from User Behavior for Web Question Answering

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020 Training and refreshing a web-scale Question Answering (QA) system for a multi-lingual commercial search engine often requires a huge amount of training examples. One principled idea is to mine implicit relevance feedback from user behavior recorded in sea ... Full text Cite

AM-GCN: Adaptive Multi-channel Graph Convolutional Networks

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020 Graph Convolutional Networks (GCNs) have gained great popularity in tackling various analytics tasks on graph and network data. However, some recent studies raise concerns about whether GCNs can optimally integrate node features and topological structures ... Full text Cite

Measuring Model Complexity of Neural Networks with Curve Activation Functions

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020 It is fundamental to measure model complexity of deep neural networks. A good model complexity measure can help to tackle many challenging problems, such as overfitting detection, model selection, and performance improvement. The existing literature on mod ... Full text Cite

Efficient Contour Computation of Group-Based Skyline

Journal Article IEEE Transactions on Knowledge and Data Engineering · July 1, 2020 Skyline, aiming at finding a Pareto optimal subset of points in a multi-dimensional dataset, has gained great interest due to its extensive use for multi-criteria analysis and decision making. The skyline consists of all points that are not dominated by an ... Full text Cite

On spatial keyword covering

Journal Article Knowledge and Information Systems · July 1, 2020 This article introduces and solves a spatial keyword cover problem (SK-Cover for short), which aims to identify the group of spatio-textual objects covering all the keywords in a query and minimizing a distance cost function that leads to fewer objects in ... Full text Cite

LightTrack: A generic framework for online top-down human pose tracking

Conference IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops · June 1, 2020 In this paper, we propose a simple yet effective framework, named LightTrack, for online human pose tracking. Existing methods usually perform human detection, pose estimation and tracking in sequential stages, where pose tracking is regarded as an offline ... Full text Cite

Continuous influence maximization

Journal Article ACM Transactions on Knowledge Discovery from Data · May 8, 2020 Imagine we are introducing a new product through a social network, where we know for each user in the network the function of purchase probability with respect to discount. Then, what discounts should we offer to those social network users so that, under a ... Full text Cite

VLDB SI 2018 editorial

Journal Article VLDB Journal · May 1, 2020 Full text Cite

Exact and consistent interpretation of piecewise linear models hidden behind APIs: A closed form solution

Conference Proceedings - International Conference on Data Engineering · April 1, 2020 More and more AI services are provided through APIs on cloud where predictive models are hidden behind APIs. To build trust with users and reduce potential application risk, it is important to interpret how such predictive models hidden behind APIs make th ... Full text Cite

Momentum-Based policy gradient methods

Conference 37th International Conference on Machine Learning, ICML 2020 · January 1, 2020 In the paper, we propose a class of efficient momentum-based policy gradient methods for the model-free reinforcement learning, which use adaptive learning rates and do not require any large batches. Specifically, we propose a fast important-sampling momen ... Cite

Sinkhorn regression

Conference IJCAI International Joint Conference on Artificial Intelligence · January 1, 2020 This paper introduces a novel Robust Regression (RR) model, named Sinkhorn regression, which imposes Sinkhorn distances on both loss function and regularization. Traditional RR methods target at searching for an element-wise loss function (e.g., Lp-norm) t ... Cite

Discrete model compression with resource constraint for deep neural networks

Conference Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition · January 1, 2020 In this paper, we target to address the problem of compression and acceleration of Convolutional Neural Networks (CNNs). Specifically, we propose a novel structural pruning method to obtain a compact CNN with strong discriminative power. To find such netwo ... Full text Cite

Online density bursting subgraph detection from temporal graphs

Conference Proceedings of the VLDB Endowment · January 1, 2020 Given a temporal weighted graph that consists of a potentially endless stream of updates, we are interested in finding density bursting subgraphs (DBS for short), where a DBS is a subgraph that accumulates its density at the fastest speed. Online DBS detec ... Full text Cite

Mining top-k sequential patterns in transaction database graphs: A new challenging problem and a sampling-based approach

Journal Article World Wide Web · January 1, 2020 In many real world networks, a vertex is usually associated with a transaction database that comprehensively describes the behaviour of the vertex. A typical example is a social network, where the behaviours of every user are depicted by a transaction data ... Full text Cite

VLDB SI 2018 editorial

Journal Article The VLDB Journal · 2020 Cite

Discrete model compression with resource constraint for deep neural networks

Conference Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition · 2020 Cite

Momentum-based policy gradient methods

Conference International conference on machine learning · 2020 Cite

Differential privacy and its applications in social network analysis: A survey

Journal Article arXiv preprint arXiv:2010.02973 · 2020 Cite

A graph representation of semi-structured data for web question answering

Journal Article arXiv preprint arXiv:2010.06801 · 2020 Cite

Comprehensible counterfactual explanation on Kolmogorov-Smirnov test

Journal Article arXiv preprint arXiv:2011.01223 · 2020 Cite

Optimal estimation of low-rank factors via feature level data fusion of multiplex signal systems

Journal Article IEEE Transactions on Knowledge and Data Engineering · 2020 Cite

Discrete model compression with resource constraint

Conference IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2020) · 2020 Cite

Practicing the art of data science

Conference International Conference on Information and Knowledge Management, Proceedings · November 3, 2019 Full text Cite

SkyRec: Finding Pareto optimal groups

Conference International Conference on Information and Knowledge Management, Proceedings · November 3, 2019 We present SkyRec (Skyline Recommender), a recommendation toolkit for finding optimal groups based on the notion of group skyline. Skyline computation, aiming at identifying a set of skyline points that are not dominated by any other point, is particularly ... Full text Cite

Tracking top-k influential users with relative errors

Conference International Conference on Information and Knowledge Management, Proceedings · November 3, 2019 Tracking influential users in a dynamic social network is a fundamental step in fruitful applications, such as social recommendation, network topology optimization, and blocking rumour spreading. The major obstacle in mining top influential users is that e ... Full text Cite

Classification with label noise: a Markov chain sampling framework

Journal Article Data Mining and Knowledge Discovery · September 1, 2019 The effectiveness of classification methods relies largely on the correctness of instance labels. In real applications, however, the labels of instances are often not highly reliable due to the presence of label noise. Training effective classifiers in the ... Full text Cite

AutoNE: Hyperparameter optimization for massive network embedding

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019 Network embedding (NE) aims to embed the nodes of a network into a vector space, and serves as the bridge between machine learning and network data. Despite their widespread success, NE algorithms typically contain a large number of hyperparameters for pre ... Full text Cite

Learning from networks: Algorithms, theory, and applications

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019 Arguably, every entity in this universe is networked in one way or another. With the prevalence of network data collected, such as social media and biological networks, learning from networks has become an essential task in many applications. It is well re ... Full text Cite

ProGAN: Network embedding via proximity generative adversarial network

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019 Network embedding has attracted increasing attention in recent few years, which is to learn a low-dimensional representation for each node of a network to benefit downstream tasks, such as node classification, link prediction, and network visualization. Es ... Full text Cite

Multi-horizon time series forecasting with temporal attention learning

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019 We propose a novel data-driven approach for solving multi-horizon probabilistic forecasting tasks that predicts the full distribution of a time series on future horizons. We illustrate that temporal patterns hidden in historical information play an importa ... Full text Cite

Tackle balancing constraint for incremental semi-supervised support vector learning

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019 Semi-Supervised Support Vector Machine (S3VM) is one of the most popular methods for semi-supervised learning. To avoid the trivial solution of classifying all the unlabeled examples to a same class, balancing constraint is often used with S3VM (denoted as ... Full text Cite

Conditional random field enhanced graph convolutional neural networks

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019 Graph convolutional neural networks have attracted increasing attention in recent years. Unlike the standard convolutional neural network, graph convolutional neural networks perform the convolutional operation on the graph data. Compared with the generic ... Full text Cite

Secure and Efficient Skyline Queries on Encrypted Data

Journal Article IEEE Transactions on Knowledge and Data Engineering · July 2019 Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the clo ... Full text Cite

SimRank*: effective and scalable pairwise similarity search based on graph topology

Journal Article VLDB Journal · June 1, 2019 Given a graph, how can we quantify similarity between two nodes in an effective and scalable way? SimRank is an attractive measure of pairwise similarity based on graph topologies. Its underpinning philosophy that “two nodes are similar if they are pointed ... Full text Cite

A Survey on Network Embedding

Journal Article IEEE Transactions on Knowledge and Data Engineering · May 1, 2019 Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure. Recently, a significant amount of progresses have been made toward this emerging network analysis paradigm. In this survey, we ... Full text Cite

Is there a data science and engineering brain drain? If so, how can we rebalance them?

Conference Proceedings - International Conference on Data Engineering · April 1, 2019 Full text Cite

Finding theme communities from database networks

Conference Proceedings of the VLDB Endowment · January 1, 2019 Given a database network where each vertex is associated with a transaction database, we are interested in finding theme communities. Here, a theme community is a cohesive subgraph such that a common pattern is frequent in all transaction databases associa ... Full text Cite

Detecting customer complaint escalation with recurrent neural networks and manually-engineered features

Conference NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference · January 1, 2019 Consumers dissatisfied with the normal dispute resolution process provided by an ecommerce company's customer service agents have the option of escalating their complaints by filing grievances with a government authority. This paper tackles the challenge o ... Cite

Demystifying dropout

Conference 36th International Conference on Machine Learning, ICML 2019 · January 1, 2019 Dropout is a popular technique to train large-scale deep neural networks to alleviate the overfitting problem. To disclose the underlying reason for its gain, numerous works have tried to explain it from different perspectives. In this paper, unlike existi ... Cite

Data mining techniques

Journal Article · 2019 Cite

Demystifying dropout

Conference International Conference on Machine Learning · 2019 Cite

Finding Maximal Significant Linear Representation between Long Time Series

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 27, 2018 In some applications on time series data, finding linear correlation between time series is important. However, it is meaningless to measure the global correlation between two long time series. Moreover, more often than not, two time series may be correlat ... Full text Cite

High-Order Proximity Preserved Embedding for Dynamic Networks

Journal Article IEEE Transactions on Knowledge and Data Engineering · November 1, 2018 Network embedding, aiming to embed a network into a low dimensional vector space while preserving the inherent structural properties of the network, has attracted considerable attention. However, most existing embedding methods focus on the static network ... Full text Cite

Skyline diagram: finding the Voronoi counterpart for skyline queries

Conference Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018 · October 24, 2018 Skyline queries are important in many application domains. In this paper, we propose a novel structure Skyline Diagram, which given a set of points, partitions the plane into a set of regions, referred to as skyline polyominos. All query points in the same ... Full text Cite

Mining density contrast subgraphs

Conference Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018 · October 24, 2018 Dense subgraph discovery is a key primitive in many graph mining applications, such as detecting communities in social networks and mining gene correlation from biological data. Most studies on dense subgraph mining only deal with one graph. However, in ma ... Full text Cite

Subspace multi-clustering: a review

Journal Article Knowledge and Information Systems · August 1, 2018 Clustering has been widely used to identify possible structures in data and help users to understand data in an unsupervised manner. Traditional clustering methods often provide a single partitioning of the data that groups similar data objects in one grou ... Full text Cite

Sketched follow-the-regularized-leader for online factorization machine

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 19, 2018 Factorization Machine (FM) is a supervised machine learning model for feature engineering, which is widely used in many real-world applications. In this paper, we consider the case that the data samples arrive sequentially. The existing convex formulation ... Full text Cite

Arbitrary-order proximity preserved network embedding

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 19, 2018 Network embedding has received increasing research attention in recent years. The existing methods show that the high-order proximity plays a key role in capturing the underlying structure of the network. However, two fundamental problems in preserving the ... Full text Cite

Exact and consistent interpretation for piecewise linear neural networks: A closed form solution

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 19, 2018 Strong intelligent machines powered by deep neural networks are increasingly deployed as black boxes to make decisions in risk-sensitive domains, such as finance and medical. To reduce potential risk and build trust with users, it is critical to interpret ... Full text Cite

Message from the general chairs: DSC 2018

Conference Proceedings - 2018 IEEE 3rd International Conference on Data Science in Cyberspace, DSC 2018 · July 16, 2018 Full text Cite

AQP++: Connecting approximate query processing with aggregate precomputation for interactive analytics

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · May 27, 2018 Interactive analytics requires database systems to be able to answer aggregation queries within interactive response times. As the amount of data is continuously growing at an unprecedented rate, this is becoming increasingly challenging. In the past, the ... Full text Cite

Online compact convexified factorization machine

Conference The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018 · April 10, 2018 Factorization Machine (FM) is a supervised learning approach with a powerful capability of feature engineering. It yields state-of-the-art performances in various batch learning tasks where all the training data is made available prior to the training. How ... Full text Cite

Cleaning crowdsourced labels using oracles for statistical classification

Conference Proceedings of the VLDB Endowment · January 1, 2018 Nowadays, crowdsourcing is being widely used to collect training data for solving classification problems. However, crowdsourced labels are often noisy, and there is a performance gap between classification with noisy labels and classification with ground- ... Full text Cite

TIMERS: Error-bounded SVD restart on dynamic networks

Conference 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 · January 1, 2018 Singular Value Decomposition (SVD) is a popular approach in various network applications, such as link prediction and network parameter characterization. Incremental SVD approaches are proposed to process newly changed nodes and edges in dynamic networks. ... Cite

Preface

Book · January 1, 2018 Cite

Tracking top-K influential vertices in dynamic networks

Journal Article arXiv preprint arXiv:1803.01499 · 2018 Cite

Letter from the Program Chairs

Other Proceedings of the VLDB Endowment · 2018 Cite

Correction to: Database Systems for Advanced Applications

Conference International Conference on Database Systems for Advanced Applications · 2018 Cite

Schemaless join for result set preferences

Conference Proceedings - 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017 · November 8, 2017 In many applications, such as data integration and big data analytics, one has to integrate data from multiple sources without detailed and accurate schema information. The state of the art focuses on matching attributes among sources based on the informat ... Full text Cite

JASIST special issue on biomedical information retrieval

Journal Article Journal of the Association for Information Science and Technology · November 1, 2017 Full text Cite

Tracking Influential Individuals in Dynamic Networks

Journal Article IEEE Transactions on Knowledge and Data Engineering · November 1, 2017 In this paper, we tackle a challenging problem inherent in a series of applications: tracking the influential nodes in dynamic networks. Specifically, we model a dynamic network as a stream of edge weight updates. This general model embraces many practical ... Full text Cite

Activity Maximization by Effective Information Diffusion in Social Networks

Journal Article IEEE Transactions on Knowledge and Data Engineering · November 1, 2017 In a social network, even about the same information the excitement between different users are different. If we want to spread a piece of new information and maximize the expected total amount of excitement, which seed users should we choose? This problem ... Full text Cite

Measuring in-network node similarity based on neighborhoods: a unified parametric approach

Journal Article Knowledge and Information Systems · October 1, 2017 In many applications, we need to measure similarity between nodes in a large network based on features of their neighborhoods. Although in-network node similarity based on proximity has been well investigated, surprisingly, measuring in-network node simila ... Full text Cite

Preference-driven similarity join

Conference Proceedings - 2017 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017 · August 23, 2017 Similarity join, which can find similar objects (e.g., products, names, addresses) across different sources, is powerful in dealing with variety in big data, especially web data. Threshold-driven similarity join, which has been extensively studied in the p ... Full text Cite

Principal pattern mining on graphs

Conference Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017 · July 31, 2017 Given a graph, can we find a set of patterns, of which the cost of storing these patterns is economic (or satisfying specific user needs) but their coverage includes the entire graph? We denote these patterns by principal patterns of the given graph since ... Full text Cite

Finding multiple stable clusterings

Journal Article Knowledge and Information Systems · June 1, 2017 Multi-clustering, which tries to find multiple independent ways to partition a data set into groups, has enjoyed many applications, such as customer relationship management, bioinformatics and healthcare informatics. This paper addresses two fundamental qu ... Full text Cite

Secure Skyline Queries on Cloud Platform

Conference Proceedings - International Conference on Data Engineering · April 2017 Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the clo ... Full text Cite

Message from the general chairs

Conference Proceedings - 2016 IEEE 1st International Conference on Data Science in Cyberspace, DSC 2016 · February 27, 2017 Full text Cite

Efficient mining of regional movement patterns in semantic trajectories

Conference Proceedings of the VLDB Endowment · January 1, 2017 Semantic trajectory pattern mining is becoming more and more important with the rapidly growing volumes of semantically rich trajectory data. Extracting sequential patterns in semantic trajectories plays a key role in understanding semantic behaviour of hu ... Full text Cite

Community preserving network embedding

Conference 31st AAAI Conference on Artificial Intelligence, AAAI 2017 · January 1, 2017 Network embedding, aiming to learn the low-dimensional representations of nodes in networks, is of paramount importance in many real applications. One basic requirement of network embedding is to preserve the structure and inherent properties of the networ ... Cite

Multidimensional benchmarking in data warehouses

Journal Article Intelligent Data Analysis · January 1, 2017 Benchmarking is among the most widely adopted practices in business today. However, to the best of our knowledge, conducting multidimensional benchmarking in data warehouses has not been explored from a technical efficiency perspective. In this paper, we f ... Full text Cite

Multidimensional business benchmarking analysis on data warehouses

Journal Article International Journal of Data Warehousing and Mining · January 1, 2017 Benchmarking analysis has been used extensively in industry for business analytics. Surprisingly, how to conduct benchmarking analysis efficiently over large data sets remains a technical problem untouched. In this paper, the authors formulate benchmark qu ... Full text Cite

Eclipse: Practicability Beyond kNN and Skyline

Journal Article arXiv preprint arXiv:1707.01223 · 2017 Cite

Scalable and accurate online feature selection for big data

Journal Article ACM Transactions on Knowledge Discovery from Data · December 1, 2016 Feature selection is important in many big data applications. Two critical challenges closely associate with big data. First, in many big data applications, the dimensionality is extremely high, in millions, and keeps growing. Second, big data applications ... Full text Cite

Tradeoffs between density and size in extracting dense subgraphs: A unified framework

Conference Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2016 · November 21, 2016 Extracting dense subgraphs is an important step in many graph related applications. There is a challenging struggle in exploring the tradeoffs between density and size in subgraphs extracted. More often than not, different methods aim at different specific ... Full text Cite

Online Visual Analytics of Text Streams.

Journal Article IEEE transactions on visualization and computer graphics · November 2016 We present an online visual analytics approach to helping users explore and understand hierarchical topic evolution in high-volume text streams. The key idea behind this approach is to identify representative topics in incoming documents and align them wit ... Full text Cite

Discovering outlying aspects in large datasets

Journal Article Data Mining and Knowledge Discovery · November 1, 2016 We address the problem of outlying aspects mining: given a query object and a reference multidimensional data set, how can we discover what aspects (i.e., subsets of features or subspaces) make the query object most outlying? Outlying aspects mining can be ... Full text Cite

Urban traffic prediction through the second use of inexpensive big data from buildings

Conference International Conference on Information and Knowledge Management, Proceedings · October 24, 2016 Traffic prediction, particularly in urban regions, is an important application of tremendous practical value. In this paper, we report a novel and interesting case study of urban traffic prediction in Central, Hong Kong, one of the densest urban areas in t ... Full text Cite

Preface

Journal Article Big Data Research · September 1, 2016 Full text Cite

Continuous similarity search for evolving queries

Journal Article Knowledge and Information Systems · September 1, 2016 In this paper, we study a novel problem of continuous similarity search for evolving queries. Given a set of objects, each being a set or multiset of items, and a data stream, we want to continuously maintain the top-k most similar objects using the last n ... Full text Cite

Finding gangs in war from signed networks

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 13, 2016 Given a signed network where edges are weighted in real number, and positive weights indicate cohesion between vertices and negative weights indicate opposition, we are interested in finding k-Oppositive Cohesive Groups (k-OCG). Each k-OCG is a group of k ... Full text Cite

When social influence meets item inference

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 13, 2016 Research issues and data mining techniques for product recommendation and viral marketing have been widely studied. Existing works on seed selection in social networks do not take into account the effect of product recommendations in e-commerce stores. In ... Full text Cite

Asymmetric transitivity preserving graph embedding

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 13, 2016 Graph embedding algorithms embed a graph into a vector space where the structure and the inherent properties of the graph are preserved. The existing graph embedding methods cannot preserve the asymmetric transitivity well, which is a critical property of ... Full text Cite

Using computer intelligence for depression diagnosis and crowdsourcing

Journal Article Computer · July 1, 2016 This installment of Computer's series highlighting the work published in IEEE Computer Society journals comes from IEEE Transactions on Affective Computing and IEEE Transactions on Knowledge and Data Engineering. ... Full text Cite

Preface

Journal Article Journal of Computer Science and Technology · July 1, 2016 Full text Cite

Continuous influence maximization: What discounts should we offer to social network users?

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 26, 2016 Imagine we are introducing a new product through a social network, where we know for each user in the network the purchase probability curve with respect to discount. Then, what discount should we offer to those social network users so that the adoption of ... Full text Cite

Finding the minimum spatial keyword cover

Conference 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016 · June 22, 2016 The existing works on spatial keyword search focus on finding a group of spatial objects covering all the query keywords and minimizing the diameter of the group. However, we observe that such a formulation may not address what users need in some applicati ... Full text Cite

Efficient discovery of contrast subspaces for object explanation and characterization

Journal Article Knowledge and Information Systems · April 1, 2016 We tackle the novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes C+ and C- and a query object o, we want to find the top-k subspaces ... Full text Cite

Finding multiple stable clusterings

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · January 5, 2016 Multi-clustering, which tries to find multiple independent ways to partition a data set into groups, has enjoyed many applications, such as customer relationship management, bioinformatics and healthcare informatics. This paper addresses two fundamental qu ... Full text Cite

EIC Editorial

Journal Article IEEE Transactions on Knowledge & Data Engineering · 2016 Cite

Mining outlying aspects on numeric data

Journal Article Data Mining and Knowledge Discovery · September 22, 2015 When we are investigating an object in a data set, which itself may or may not be an outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we identify the novel problem of mining outlying aspects on numeric data. Given a q ... Full text Cite

Mining multidimensional contextual outliers from categorical relational data

Journal Article Intelligent Data Analysis · September 8, 2015 A wide range of methods have been proposed for detecting different types of outliers in both the full attribute space and its subspaces. However, the interpretability of outliers, that is, explaining in what ways and to what extent an object is an outlier, ... Full text Cite

Welcome from the ASONAM 2015 program chairs

Conference Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2015 · August 25, 2015 Cite

Tornado forecasting with multiple Markov boundaries

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 10, 2015 Reliable tornado forecasting with a long-lead time can greatly support emergency response and is of vital importance for the economy and society. The large number of meteorological variables in spatiotemporal domains and the complex relationships among var ... Full text Cite

COSNET: Connecting heterogeneous social networks with local and global consistency

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 10, 2015 More often than not, people are active in more than one social network. Identifying users from multiple heterogeneous social networks and integrating the different networks is a fundamental issue in many applications. The existing methods tackle this probl ... Full text Cite

Proximity-aware local-recoding anonymization with MapReduce for scalable big data privacy preservation in cloud

Journal Article IEEE Transactions on Computers · August 1, 2015 Cloud computing provides promising scalable IT infrastructure to support various processing of a variety of big data applications in sectors such as healthcare and business. Data sets like electronic health records in such applications often contain privac ... Full text Cite

Preface

Journal Article Journal of Computer Science and Technology · July 1, 2015 Full text Cite

Classification with streaming features: An emerging-pattern mining approach

Journal Article ACM Transactions on Knowledge Discovery from Data · June 1, 2015 Many datasets from real-world applications have very high-dimensional or increasing feature space. It is a new research problem to learn and maintain a classifier to deal with very high dimensionality or streaming features. In this article, we adapt the we ... Full text Cite

Cleaning structured event logs: A graph repair approach

Conference Proceedings - International Conference on Data Engineering · May 26, 2015 Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause many serious damages to real applications, such as inaccurate provenance answers, poor profiling results or concealing interesting patterns fr ... Full text Cite

The impact of market competition on search advertising

Journal Article Journal of Interactive Marketing · May 1, 2015 Although search advertising has gained popularity in recent years, research on the content of search advertising is scarce. This study develops a conceptual framework to understand how market competition affects what a firm advertises in its search ads. Se ... Full text Cite

Message from the conference chairs

Conference IEEE International Conference on Data Mining Workshops, ICDMW · January 26, 2015 Full text Cite

ALID: Scalable dominant cluster detection

Conference Proceedings of the VLDB Endowment · January 1, 2015 Detecting dominant clusters is important in many analytic applications. The state-of-the-art methods find dense subgraphs on the affinity graph as dominant clusters. However, the time and space complexities of those methods are dominated by the constructio ... Full text Cite

Mining frequent co-occurrence patterns across multiple data streams

Conference EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings · January 1, 2015 This paper studies the problem of mining frequent co-occurrence patterns across multiple data streams, which has not been addressed by existing works. Co-occurrence pattern in this context refers to the case that the same group of objects appear consecutiv ... Full text Cite

Efficiently computing Top-K shortest path join

Conference EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings · January 1, 2015 Driven by many applications, in this paper we study the problem of computing the top-k shortest paths from one set of target nodes to another set of target nodes in a graph, namely the top-k shortest path join (KPJ) between two sets of target nodes. While ... Full text Cite

Message from the DSDIS2015 Chairs

Conference Proceedings - 2015 IEEE International Conference on Data Science and Data Intensive Systems; 8th IEEE International Conference Cyber, Physical and Social Computing; 11th IEEE International Conference on Green Computing and Communications and 8th IEEE International Conference on Internet of Things, DSDIS/CPSCom/GreenCom/iThings 2015 · January 1, 2015 Full text Cite

Finding pareto optimal groups: Group-based skyline

Chapter · January 1, 2015 Skyline computation, aiming at identifying a set of skyline points that are not dominated by any other point, is particularly useful for multi-criteria data analysis and decision making. Traditional skyline computation, however, is inadequate to answer que ... Full text Cite

Reliable early classification on multivariate time series with numerical and categorical attributes

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2015 Early classification on multivariate time series has recently emerged as a novel and important topic in data mining fields with wide applications such as early detection of diseases in healthcare domains. Most of the existing studies on this topic focused ... Full text Cite

Scalable outlying-inlying aspects discovery via feature ranking

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2015 In outlying aspects mining, given a query object, we aim to answer the question as to what features make the query most outlying. The most recent works tackle this problem using two different strategies. (i) Feature selection approaches select the features ... Full text Cite

In-network neighborhood-based node similarity measure: A unified parametric model

Journal Article arXiv preprint arXiv:1510.03814 · 2015 Cite

State of the Journal Editorial

Journal Article IEEE Transactions on Knowledge & Data Engineering · 2015 Cite

Within-network classification using radius-constrained neighborhood patterns

Conference CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management · November 3, 2014 Within-Network Classification (WNC) techniques are designed for applications where objects to be classified and those with known labels are interlinked. For WNC tasks like web page classification, the homophily principle succeeds by assuming that linked ob ... Full text Cite

An appliance-driven approach to detection of corrupted load curve data

Conference CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management · November 3, 2014 Load curve data in power systems refers to users' electrical energy consumption data periodically collected with meters. It has become one of the most important assets for modern power systems. Many operational decisions are made based on the information d ... Full text Cite

Malicious URL detection by dynamically mining patterns without pre-defined elements

Journal Article World Wide Web · November 1, 2014 Detecting malicious URLs is an essential task in network security intelligence. In this paper, we make two new contributions beyond the state-of-the-art methods on malicious URL detection. First, instead of using any pre-defined features or fixed delimiter ... Full text Cite

Do neighbor buddies make a difference in reblog likelihood? An analysis on SINA Weibo data

Conference ASONAM 2014 - Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining · October 10, 2014 Reblogging, also known as retweeting in Twitter parlance, is a major type of activities in many online social networks. Although there are many studies on reblogging behaviors and potential applications, whether neighbors who are well connected with each o ... Full text Cite

Email mining: Tasks, common techniques, and tools

Journal Article Knowledge and Information Systems · October 1, 2014 Email is one of the most popular forms of communication nowadays, mainly due to its efficiency, low cost, and compatibility of diversified types of information. In order to facilitate better usage of emails and explore business potentials in emailing, vari ... Full text Cite

Pattern-growth methods

Chapter · July 1, 2014 Mining frequent patterns has been a focused topic in data mining research in recent years, with the development of numerous interesting algorithms for mining association, correlation, causality, sequential patterns, partial periodicity, constraint-based fr ... Full text Cite

EIC editorial

Journal Article IEEE Transactions on Knowledge and Data Engineering · July 1, 2014 Full text Cite

Mining most frequently changing component in evolving graphs

Journal Article World Wide Web · May 1, 2014 Many applications see huge demands of finding important changing areas in evolving graphs. In this paper, given a series of snapshots of an evolving graph, we model and develop algorithms to capture the most frequently changing component (MFCC). Motivated ... Full text Cite

Efficient matching of substrings in uncertain sequences

Conference SIAM International Conference on Data Mining 2014, SDM 2014 · January 1, 2014 Substring matching is fundamental to data mining methods for sequential data. It involves checking the existence of a short subsequence within a longer sequence, ensuring no gaps within a match. Whilst a large amount of existing work has focused on substri ... Full text Cite

Towards Scalable and Accurate Online Feature Selection for Big Data

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2014 Feature selection is important in many big data applications. There are at least two critical challenges. Firstly, in many applications, the dimensionality is extremely high, in millions, and keeps growing. Secondly, feature selection has to be highly scal ... Full text Cite

Message from the Conference Chairs

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2014 Full text Cite

SNOC: Streaming Network Node Classification

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2014 Many real-world networks are featured with dynamic changes, such as new nodes and edges, and modification of the node content. Because changes are continuously introduced to the network in a streaming fashion, we refer to such dynamic networks as streaming ... Full text Cite

How can I index my thousands of photos effectively and automatically? An unsupervised feature selection approach

Conference SIAM International Conference on Data Mining 2014, SDM 2014 · January 1, 2014 Given a large photo collection without domain knowledge (e.g., tourism photos, conference photos, event photos, images wrapped from webpages), it is not easy for human beings to organize or only view them within a reasonable time. In this paper, we propose ... Full text Cite

Shortest unique queries on strings

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2014 Let D be a long input string of n characters (from an alphabet of size up to 2^w, where w is the number of bits in a machine word). Given a substring q of D, a shortest unique query returns a shortest unique substring of D that contains q. We present an opt ... Full text Cite

Distance metric learning using dropout: A structured regularization approach

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2014 Distance metric learning (DML) aims to learn a distance metric better than Euclidean distance. It has been successfully applied to various tasks, e.g., classification, clustering and information retrieval. Many DML algorithms suffer from the over-fitting p ... Full text Cite

A spatiotemporal compression based approach for efficient big data processing on Cloud

Journal Article Journal of Computer and System Sciences · January 1, 2014 It is well known that processing big graph data can be costly on Cloud. Processing big graph data introduces complex and multiple iterations that raise challenges such as parallel memory bottlenecks, deadlocks, and inefficiency. To tackle the challenges, w ... Full text Cite

Mining contrast subspaces

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2014 In this paper, we tackle a novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes C+ and C- and a query object o, we want to find top-k subspaces S that maximize the ratio of likelihood of o in C+ against that i ... Full text Cite

An iterative fusion approach to graph-based semi-supervised learning from multiple views

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2014 Often, a data object described by many features can be naturally decomposed into multiple "views", where each view consists of a subset of features. For example, a video clip may have a video view and an audio view. Given a set of training data objects wit ... Full text Cite

Structure-aware distance measures for comparing clusterings in graphs

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2014 Clustering in graphs aims to group vertices with similar patterns of connections. Applications include discovering communities and latent structures in graphs. Many algorithms have been proposed to find graph clusterings, but an open problem is the need fo ... Full text Cite

Editorial: Moving forward to respond to rapid changes of computer science and technology

Journal Article Journal of Computer Science and Technology · January 1, 2014 Full text Cite

Editorial

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2014 2013 marked a wonderful year for IEEE TKDE (Transactions on Knowledge and Data Engineering). While the statistics for November and December 2013 were not available when this editorial was written, TKDE received 822 submissions in the first 10 months of 201 ... Full text Cite

Consensus-based ranking of multivalued objects: A generalized borda count approach

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2014 In this paper, we tackle a novel problem of ranking multivalued objects, where an object has multiple instances in a multidimensional space, and the number of instances per object is not fixed. Given an ad hoc scoring function that assigns a score to a mul ... Full text Cite

Program committee chairs' welcome message

Conference International Conference on Information and Knowledge Management, Proceedings · December 11, 2013 Cite

Price information patterns in web search advertising: An empirical case study on accommodation industry

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2013 Unlike advertising in traditional media, web search advertising content can be easily customized with little cost. In this paper, we apply content analysis and regression models on 11,818 unique ads related to the accommodation industry to empirically inve ... Full text Cite

Mining statistically significant sequential patterns

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2013 Recent developments in the frequent pattern mining framework use additional measures of interest to reduce the set of discovered patterns. We introduce a rigorous and efficient approach to mine statistically significant, unexpected patterns in sequences o ... Full text Cite

Mining probabilistic frequent spatio-temporal sequential patterns with gap constraints from uncertain databases

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2013 Uncertainty is common in real-world applications, for example, in sensor networks and moving object tracking, resulting in much interest in item set mining for uncertain transaction databases. In this paper, we focus on pattern mining for uncertain sequenc ... Full text Cite

Towards cohesive anomaly mining

Conference Proceedings of the 27th AAAI Conference on Artificial Intelligence, AAAI 2013 · December 1, 2013 In some applications, such as bioinformatics, social network analysis, and computational criminology, it is desirable to find compact clusters formed by a (very) small portion of objects in a large data set. Since such clusters are comprised of a small num ... Cite

What distinguish one from its peers in social networks?

Conference Data Mining and Knowledge Discovery · December 1, 2013 Being able to discover the uniqueness of an individual is a meaningful task in social network analysis. This paper proposes two novel problems in social network analysis: how to identify the uniqueness of a given query vertex, and how to identify a group o ... Full text Cite

Parallel field alignment for cross media retrieval

Conference MM 2013 - Proceedings of the 2013 ACM Multimedia Conference · November 18, 2013 Cross media retrieval systems have received increasing interest in recent years. Due to the semantic gap between low-level features and high-level semantic concepts of multimedia data, many researchers have explored joint-model techniques in cross media r ... Full text Cite

Some new progress in analyzing and mining uncertain and probabilistic data for big data analytics

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · November 18, 2013 Uncertainty is ubiquitous in big data. Consequently, analyzing and mining uncertain and probabilistic data is important in big data analytics. In this short article, we review some recent progress in mining uncertain and probabilistic data in the hope that ... Full text Cite

Mining search and browse logs for web search: A survey

Journal Article ACM Transactions on Intelligent Systems and Technology · October 21, 2013 Huge amounts of search log data have been accumulated at Web search engines. Currently, a popular Web search engine may receive billions of queries and collect terabytes of records about user search behavior daily. Besides search log data, huge amounts of b ... Full text Cite

A vlHMM approach to context-aware search

Journal Article ACM Transactions on the Web · October 1, 2013 Capturing the context of a user's query from the previous queries and clicks in the same session leads to a better understanding of the user's information need. A context-aware approach to document reranking, URL recommendation, and query suggestion may su ... Full text Cite

Mining multidimensional contextual outliers from categorical relational data

Conference ACM International Conference Proceeding Series · August 30, 2013 A wide range of methods have been proposed for detecting different types of outliers in full space and subspaces. However, the interpretability of outliers, that is, explaining in what ways and to what extent an object is an outlier, remains a critical ope ... Full text Cite

On shortest unique substring queries

Conference Proceedings - International Conference on Data Engineering · August 15, 2013 In this paper, we tackle a novel type of interesting queries - shortest unique substring queries. Given a (long) string S and a query point q in the string, can we find a shortest substring containing q that is unique in S? We illustrate that shortest uniq ... Full text Cite

Clustering uncertain data based on probability distribution similarity

Journal Article IEEE Transactions on Knowledge and Data Engineering · March 11, 2013 Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges on both modeling similarity between uncertain objects and developing efficient computational methods. The previous methods extend traditional pa ... Full text Cite

Finding email correspondents in online social networks

Journal Article World Wide Web · March 1, 2013 Email correspondents play an important role in many people's social networks. Finding email correspondents in social networks accurately, though may seem to be straightforward at a first glance, is challenging. Most of the existing online social networking ... Full text Cite

Recommendations for two-way selections using skyline view queries

Journal Article Knowledge and Information Systems · February 1, 2013 We study a practical and novel problem of making recommendations between two parties such as applicants and job positions. We model the competent choices of each party using skylines. In order to make recommendations in various scenarios, we propose a seri ... Full text Cite

Skyline distance: A measure of multidimensional competence

Journal Article Knowledge and Information Systems · February 1, 2013 Skyline has been widely recognized as being useful for multi-criteria decision-making applications. While most of the existing work computes skylines in various contexts, in this paper, we consider a novel problem: how far away a point is from the skyline? ... Full text Cite

New EIC editorial

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 2, 2013 Full text Cite

Message from BDSE2013 Chairs

Conference Proceedings - 16th IEEE International Conference on Computational Science and Engineering, CSE 2013 · January 1, 2013 Full text Cite

Editorial

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2013 Full text Cite

Preface to the first IEEE ICDM workshop on causal discovery

Conference Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013 · January 1, 2013 Full text Cite

A data-adaptive and dynamic segmentation index for whole matching on time series

Journal Article Proceedings of the VLDB Endowment · January 1, 2013 Similarity search on time series is an essential operation in many applications. In the state-of-the-art methods, such as the R-tree based methods, SAX and iSAX, time series are by default divided into equi-length segments globally, that is, all time series a ... Full text Cite

More is simpler: Effectively and efficiently assessing nodepair similarities based on hyperlinks

Conference Proceedings of the VLDB Endowment · January 1, 2013 Similarity assessment is one of the core tasks in hyperlink analysis. Recently, with the proliferation of applications, e.g., web search and collaborative filtering, SimRank has been a well-studied measure of similarity between two nodes in a graph. It rec ... Full text Cite

Association rules

Conference Data Mining · 2013 Cite

Introduction to the Special Issue ACM SIGKDD 2012

Other ACM Transactions on Knowledge Discovery from Data (TKDD) · 2013 Cite

Household Electricity Consumption Data Cleansing

Journal Article arXiv preprint arXiv:1307.7757 · 2013 Cite

Editorial [2012 & 2013 Associate Editors]

Journal Article IEEE Transactions on Knowledge and Data Engineering · 2013 Cite

Editorial [State of the Transactions]

Journal Article IEEE Transactions on Knowledge and Data Engineering · 2013 Cite

On compressing weighted time-evolving graphs

Conference ACM International Conference Proceeding Series · December 19, 2012 Existing graph compression techniques mostly focus on static graphs. However for many practical graphs such as social networks the edge weights frequently change over time. This phenomenon raises the question of how to compress dynamic graphs while maintain ... Full text Cite

A practical method for estimating performance degradation on multicore processors, and its application to HPC workloads

Conference International Conference for High Performance Computing, Networking, Storage and Analysis, SC · December 1, 2012 When multiple threads or processes run on a multi-core CPU they compete for shared resources, such as caches and memory controllers, and can suffer performance degradation as high as 200%. We design and evaluate a new machine learning model that estimates ... Full text Cite

Community preserving lossy compression of social networks

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2012 Compression plays an important role in social network analysis from both practical and theoretical points of view. Although there are a few pioneering studies on social network compression, they mainly focus on lossless approaches. In this paper, we tackle ... Full text Cite

Mining frequent trajectory patterns for activity monitoring using radio frequency tag arrays

Journal Article IEEE Transactions on Parallel and Distributed Systems · October 16, 2012 Activity monitoring, a crucial task in many applications, is often conducted expensively using video cameras. Effectively monitoring a large field by analyzing images from multiple cameras remains a challenging issue. Other approaches generally require the ... Full text Cite

Efficient and effective aggregate keyword search on relational databases

Journal Article International Journal of Data Warehousing and Mining · October 1, 2012 Keyword search on relational databases is useful and popular for many users without technical background. Recently, aggregate keyword search on relational databases was proposed and has attracted interest. However, two important problems still remain. Firs ... Full text Cite

Mining query subtopics from search log data

Conference SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval · September 28, 2012 Most queries in web search are ambiguous and multifaceted. Identifying the major senses and facets of queries from search log data, referred to as query subtopic mining in this paper, is a very important issue in web search. Through search log analysis, we ... Full text Cite

Clustering in applications with multiple data sources-A mutual subspace clustering approach

Journal Article Neurocomputing · September 1, 2012 In many applications, such as bioinformatics and cross-market customer relationship management, there are data from multiple sources jointly describing the same set of objects. An important data mining task is to find interesting groups of objects that for ... Full text Cite

Guest editors' Introduction to the special section on the 27th international conference on data engineering (ICDE 2011)

Journal Article IEEE Transactions on Knowledge and Data Engineering · August 30, 2012 Full text Cite

Random error reduction in similarity search on time series: A statistical approach

Conference Proceedings - International Conference on Data Engineering · July 30, 2012 Errors in measurement can be categorized into two types: systematic errors that are predictable, and random errors that are inherently unpredictable and have null expected value. Random error is always present in a measurement. More often than not, reading ... Full text Cite

Aggregate queries on probabilistic record linkages

Conference ACM International Conference Proceeding Series · July 10, 2012 Record linkage analysis, which matches records referring to the same real world entities from different data sets, is an important task in data integration. Uncertainty often exists in record linkages due to incompleteness or ambiguity in data. Fortunately ... Full text Cite

Early classification on time series

Journal Article Knowledge and Information Systems · April 1, 2012 In this paper, we formulate the problem of early classification of time series data, which is important in some time-sensitive applications such as health informatics. We introduce a novel concept of MPL (minimum prediction length) and develop ECTS (early ... Full text Cite

Top-10 data mining case studies

Journal Article International Journal of Information Technology and Decision Making · March 1, 2012 We report on the panel discussion held at the ICDM'10 conference on the top 10 data mining case studies in order to provide a snapshot of where and how data mining techniques have made significant real-world impact. The tasks covered by 10 case studies ran ... Full text Cite

Probabilistic skylines on uncertain data: Model and bounding-pruning-refining methods

Journal Article Journal of Intelligent Information Systems · February 1, 2012 Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data r ... Full text Cite

Aggregate keyword search on large relational databases

Journal Article Knowledge and Information Systems · February 1, 2012 Keyword search has been recently extended to relational databases to retrieve information from text-rich attributes. However, all the existing methods focus on finding individual tuples matching a set of query keywords from one table or the join of multipl ... Full text Cite

Data Mining: Concepts and Techniques

Book · January 1, 2012 This is the third edition of the premier professional reference on the subject of data mining, expanding and updating the previous market leading edition. This was the first (and is still the best and most popular) of its kind. Combines sound theory with t ... Full text Cite

Multi-level relationship outlier detection

Journal Article International Journal of Business Intelligence and Data Mining · January 1, 2012 Relationship management is critical in business. Particularly, it is important to detect abnormal relationships, such as fraudulent relationships between service providers and consumers. Surprisingly, in the literature there is no systematic study on detec ... Full text Cite

Getting to know your data

Conference Data mining · 2012 Cite

Top-10 Data Mining Case Studies

Journal Article · 2012 Cite

Data Cube Technology

Conference Data Mining · 2012 Cite

Outlier Detection

Journal Article · 2012 Cite

Classification: advanced methods

Journal Article Data mining concepts and techniques · 2012 Cite

6-mining frequent patterns, associations, and correlations: Basic concepts and methods

Journal Article Data mining: concepts and techniques · 2012 Cite

Classification: basic concepts

Conference Data Mining · 2012 Cite

Outlier detection

Journal Article Data mining: concepts and techniques · 2012 Cite

Clustering analysis

Journal Article Data Mining: Concept and Technique, MK imprint of Elsevier, New York · 2012 Cite

Data mining: concepts and techniques, Waltham, MA

Journal Article Morgan Kaufmann Publishers · 2012 Cite

Data mining: Data mining concepts and techniques

Journal Article Data Min Concepts Tech · 2012 Cite

Publishing anonymous survey rating data

Journal Article Data Mining and Knowledge Discovery · November 1, 2011 We study the challenges of protecting privacy of individuals in the large public survey rating data in this paper. Recent study shows that personal information in supposedly anonymous movie rating records are de-identified. The survey rating data usually c ... Full text Cite

Mining concept sequences from large-scale search logs for context-aware query suggestion

Journal Article ACM Transactions on Intelligent Systems and Technology · October 1, 2011 Query suggestion plays an important role in improving usability of search engines. Although some recently proposed methods provide query suggestions by mining query patterns from search logs, none of them models the immediately preceding queries as context ... Full text Cite

Can the utility of anonymized data be used for privacy breaches?

Journal Article ACM Transactions on Knowledge Discovery from Data · August 1, 2011 Group based anonymization is the most widely studied approach for privacy-preserving data publishing. Privacy models/definitions using group based anonymization includes k-anonymity, ℓ-diversity, and t-closeness, to name a few. The goal of this article is ... Full text Cite

Ranking uncertain sky: The probabilistic top-k skyline operator

Journal Article Information Systems · July 1, 2011 Many recent applications involve processing and analyzing uncertain data. In this paper, we combine the feature of top-k objects with that of skyline to model the problem of top-k skyline objects against uncertain data. The problem of efficiently computing ... Full text Cite

Outlier detection on uncertain data: Objects, instances, and inferences

Conference Proceedings - International Conference on Data Engineering · June 6, 2011 This paper studies the problem of outlier detection on uncertain data. We start with a comprehensive model considering both uncertain objects and their instances. An uncertain object has some inherent attributes and consists of a set of instances which are ... Full text Cite

Multidimensional mining of large-scale search logs: A topic-concept cube approach

Conference Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 · March 14, 2011 In addition to search queries and the corresponding clickthrough information, search engine logs record multidimensional information about user search activities, such as search time, location, vertical, and search device. Multidimensional mining of search ... Full text Cite

Citation recommendation without author supervision

Conference Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 · March 14, 2011 Automatic recommendation of citations for a manuscript is highly valuable for scholarly activities since it can substantially improve the efficiency and quality of literature search. The prior techniques placed a considerable burden on users, who were requ ... Full text Cite

Ranking queries on uncertain data

Journal Article VLDB Journal · February 1, 2011 Uncertain data is inherent in a few important applications. It is far from trivial to extend ranking queries (also known as top-k queries), a popular type of queries on certain data, to uncertain data. In this paper, we cast ranking queries on uncertain da ... Full text Cite

Extracting interpretable features for early classification on time series

Conference Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011 · January 1, 2011 Early classification on time series data has been found highly useful in a few important applications, such as medical and health informatics, industry production management, safety and security management. While some classifiers have been proposed to achi ... Full text Cite

On pruning for top-k ranking in uncertain databases

Conference Proceedings of the VLDB Endowment · January 1, 2011 Top-k ranking for an uncertain database is to rank tuples in it so that the best k of them can be determined. The problem has been formalized under the unified approach based on parameterized ranking functions (PRFs) and the possible world semantics. Given ... Full text Cite

Privacy-aware data management in information networks

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2011 The proliferation of information networks, as a means of sharing information, has raised privacy concerns for enterprises who manage such networks and for individual users that participate in such networks. For enterprises, the main challenge is to satisfy ... Full text Cite

The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks

Journal Article Knowledge and Information Systems · January 1, 2011 Recently, more and more social network data have been published in one way or another. Preserving privacy in publishing social network data becomes an important concern. With some local knowledge about individuals in a social network, an adversary may atta ... Full text Cite

On k-skip shortest paths

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2011 Given two vertices s, t in a graph, let P be the shortest path (SP) from s to t, and P* a subset of the vertices in P. P* is a k-skip shortest path from s to t, if it includes at least a vertex out of every k consecutive vertices in P. In general, P* succi ... Full text Cite

Towards bounding sequential patterns

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2011 Given a sequence database, can we have a non-trivial upper bound on the number of sequential patterns? The problem of bounding sequential patterns is very challenging in theory due to the combinatorial complexity of sequences, even given some inspiring res ... Full text Cite

Mining frequent trajectory patterns for activity monitoring using radio frequency tag arrays

Journal Article IEEE Transactions on Parallel and Distributed Systems · 2011 Cite

Ranking queries on uncertain data

Journal Article The VLDB Journal · 2011 Cite

Early Classification: Problems

Journal Article Preliminary Results, and Opportunities · 2011 Cite

Probabilistic inference protection on anonymized data

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2010 Background knowledge is an important factor in privacy preserving data publishing. Probabilistic distribution-based background knowledge is a powerful kind of background knowledge which is easily accessible to adversaries. However, to the best of our knowl ... Full text Cite

Neighbor query friendly compression of social networks

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · September 7, 2010 Compressing social networks can substantially facilitate mining and advanced analysis of large social networks. Preferably, social networks should be compressed in a way that they still can be queried efficiently without decompression. Arguably, neighbor q ... Full text Cite

Context-aware ranking in web search

Conference SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval · September 1, 2010 The context of a search query often provides a search engine meaningful hints for answering the current query better. Previous studies on context-aware search were either focused on the development of context models or limited to a relatively small scale i ... Full text Cite

Logging every footstep: Quantile summaries for the entire history

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · July 23, 2010 Quantiles are a crucial type of order statistics in databases. Extensive research has been focused on maintaining a space-efficient structure for approximate quantile computation as the underlying dataset is updated. The existing solutions, however, are de ... Full text Cite

Web search/browse log mining: Challenges, methods, and applications

Conference Proceedings of the 19th International Conference on World Wide Web, WWW '10 · July 20, 2010 Huge amounts of search and browse log data has been accumulated in various search engines. Such massive search/browse log data, on the one hand, provides great opportunities to mine the wisdom of crowds and improve Web search as well as online advertisemen ... Full text Cite

Context-aware citation recommendation

Conference Proceedings of the 19th International Conference on World Wide Web, WWW '10 · July 20, 2010 When you write papers, how many times do you want to make some citations at a place but you are not sure which papers to cite? Do you wish to have a recommendation system which can recommend a small number of good candidates for every place that you want t ... Full text Cite

Hierarchical distributed data classification in wireless sensor networks

Conference Computer Communications · July 15, 2010 Wireless sensor networks promise an unprecedented opportunity to monitor physical environments via inexpensive wireless embedded devices. Given the sheer amount of sensed data, efficient classification of them becomes a critical task in many sensor network ... Full text Cite

Mining discriminative items in multiple data streams

Journal Article World Wide Web · July 12, 2010 How can we maintain a dynamic profile capturing a user's reading interest against the common interest? What are the queries that have been asked 1,000 times more frequently to a search engine from users in Asia than in North America? What are the keywords ... Full text Cite

Exploring disease association from the NHANES data: Data mining, pattern summarization, and visual analytics

Journal Article International Journal of Data Warehousing and Mining · July 1, 2010 Finding associations among different diseases is an important task in medical data mining. The NHANES data is a valuable source in exploring disease associations. However, existing studies analyzing the NHANES data focus on using statistical techniques to ... Full text Cite

Superseding nearest neighbor search on uncertain spatial databases

Journal Article IEEE Transactions on Knowledge and Data Engineering · June 4, 2010 This paper proposes a new problem, called superseding nearest neighbor search, on uncertain spatial databases, where each object is described by a multidimensional probability density function. Given a query point q, an object is a nearest neighbor (NN) ca ... Full text Cite

Probabilistic path queries in road networks: Traffic uncertainty aware path selection

Conference Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings · May 19, 2010 Path queries such as "finding the shortest path in travel time from my hotel to the airport" are heavily used in many applications of road networks. Currently, simple statistic aggregates such as the average travel time between two vertices are often used ... Full text Cite

Towards progressive and load balancing distributed computation: A case study on skyline analysis

Journal Article Journal of Computer Science and Technology · May 1, 2010 Many latest high performance distributed computational environments come with high bandwidth in communication. Such high bandwidth distributed systems provide unprecedented opportunities for analyzing huge datasets, but simultaneously posts new technical c ... Full text Cite

Probabilistic reverse nearest neighbor queries on uncertain data

Journal Article IEEE Transactions on Knowledge and Data Engineering · April 1, 2010 Uncertain data are inherent in various important applications and reverse nearest neighbor (RNN) query is an important query type for many applications. While many different types of queries have been studied on uncertain data, there is no previous work on ... Full text Cite

Document clustering of scientific texts using citation contexts

Journal Article Information Retrieval · April 1, 2010 Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the bag-of-words model to represent the content of a document. However, this representation is only e ... Full text Cite

A binary decision diagram based approach for mining frequent subsequences

Journal Article Knowledge and Information Systems · January 1, 2010 Sequential pattern mining is an important problem in data mining. State of the art techniques for mining sequential patterns, such as frequent subsequences, are often based on the pattern-growth approach, which recursively projects conditional databases. E ... Full text Cite

Computing closed skycubes

Journal Article Proceedings of the VLDB Endowment · January 1, 2010 In this paper, we tackle the problem of efficient skycube computation. We introduce a novel approach significantly reducing domination tests for a given subspace and the number of subspaces searched. Technically, we identify two types of skyline points tha ... Full text Cite

Threshold-based probabilistic top-k dominating queries

Journal Article VLDB Journal · January 1, 2010 Recently, due to intrinsic characteristics in many underlying data sets, a number of probabilistic queries on uncertain data have been investigated. Top-k dominating queries are very important in many applications including decision making in a multidimens ... Full text Cite

Special issue on the best papers of SDM'10

Journal Article Statistical Analysis and Data Mining · January 1, 2010 Full text Cite

Correlation hiding by independence masking

Conference Proceedings - International Conference on Data Engineering · January 1, 2010 Extracting useful correlation from a dataset has been extensively studied. In this paper, we deal with the opposite, namely, a problem we call correlation hiding (CH), which is fundamental in numerous applications that need to disseminate data containing s ... Full text Cite

A brief survey on sequence classification

Journal Article ACM Sigkdd Explorations Newsletter · 2010 Cite

Search and browse log mining for web information retrieval: challenges, methods, and applications

Conference Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval · 2010 Cite

Outlier Detection

Journal Article · 2010 Cite

MobileMiner: A real world case study of data mining in mobile communication

Conference SIGMOD-PODS'09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems · December 4, 2009 Mobile communication data analysis has been often used as a background application to motivate many data mining problems. However, very few data mining researchers have a chance to see a working data mining system on real mobile communication data. In this ... Full text Cite

On mining maximal pattern-based clusters

Chapter · December 1, 2009 Pattern-based clustering is important in many applications, such as DNA micro-array data analysis in bio-informatics, as well as automatic recommendation systems and target marketing systems in e-business. However, pattern-based clustering in large databas ... Full text Cite

News article extraction with template-independent wrapper

Conference WWW'09 - Proceedings of the 18th International World Wide Web Conference · December 1, 2009 We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen templa ... Full text Cite

Towards context-aware search by learning a very large variable length Hidden Markov Model from search logs

Conference WWW'09 - Proceedings of the 18th International World Wide Web Conference · December 1, 2009 Capturing the context of a user's query from the previous queries and clicks in the same session may help understand the user's information need. A context-aware approach to document re-ranking, query suggestion, and URL recommendation may improve users' s ... Full text Cite

Understanding importance of collaborations in co-authorship networks: A supportiveness analysis approach

Conference Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics · December 1, 2009 Co-authorship networks, an important type of social networks, have been studied extensively from various angles such as degree distribution analysis, social community extraction and social entity ranking. Most of the previous studies consider the co-author ... Cite

Detecting topic evolution in scientific literature: How can citations help?

Conference International Conference on Information and Knowledge Management, Proceedings · December 1, 2009 Understanding how topics in scientific literature evolve is an interesting and important problem. Previous work simply models each paper as a bag of words and also considers the impact of authors. However, the impact of one document on another as captured ... Full text Cite

Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, U'09 in Conjunction with KDD'09: Forward

Conference Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, U'09 in Conjunction with KDD'09 · November 30, 2009 Cite

Can we learn a template-independent wrapper for news article extraction from a single training site?

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · November 16, 2009 Automatic news extraction from news pages is important in many Web applications such as news aggregation. However, the existing news extraction methods based on template-level wrapper induction have three serious limitations. First, the existing methods can ... Full text Cite

OLAP on search logs: An infrastructure supporting data-driven applications in search engines

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · November 9, 2009 Search logs, which contain rich and up-to-date information about users' needs and preferences, have become a critical data source for search engines. Recently, more and more data-driven applications are being developed in search engines based on search log ... Full text Cite

Debt detection in social security by sequence classification using both positive and negative patterns

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · October 19, 2009 Debt detection is important for improving payment accuracy in social security. Since debt detection from customer transactional data can be generally modelled as a fraud detection problem, a straightforward solution is to extract features from transaction ... Full text Cite

Personalizing entity detection and recommendation with a fusion of web log mining techniques

Conference Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 · September 21, 2009 Given the proliferation of technology sites and the growing diversity of their readership, readers are more and more likely to encounter specialized language and terminology that they may lack the sufficient background to understand. Such sites may lose re ... Full text Cite

Efficiently indexing shortest paths by exploiting symmetry in graphs

Conference Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 · September 21, 2009 Shortest path queries (SPQ) are essential in many graph analysis and mining tasks. However, answering shortest path queries on-the-fly on large graphs is costly. To online answer shortest path queries, we may materialize and index shortest paths. However, ... Full text Cite

Answering aggregate keyword queries on relational databases using minimal group-bys

Conference Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 · September 21, 2009 Keyword search has been recently extended to relational databases to retrieve information from text-rich attributes. However, all the existing methods focus on finding individual tuples matching a set of query keywords from one table or the join of multipl ... Full text Cite

MAPO: mining and recommending API usage patterns

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · September 14, 2009 To improve software productivity, when constructing new software systems, programmers often reuse existing libraries or frameworks by invoking methods provided in their APIs. Those API methods, however, are often complex and not well documented. To get fam ... Full text Cite

Continuously monitoring top-k uncertain data streams: a probabilistic threshold method

Journal Article Distributed and Parallel Databases · August 1, 2009 Recently, uncertain data processing has become more and more important. Although a significant amount of previous research explores various continuous queries on data streams, continuous queries on uncertain data streams have seldom been investigated. In t ... Full text Cite

Distance-based representative skyline

Conference Proceedings - International Conference on Data Engineering · July 8, 2009 Given an integer k, a representative skyline contains the k skyline points that best describe the tradeoffs among different dimensions offered by the full skyline. Although this topic has been previously studied, the existing solution may sometimes produce ... Full text Cite

Online interval skyline queries on time series

Conference Proceedings - International Conference on Data Engineering · July 8, 2009 In many applications, we need to analyze a large number of time series. Segments of time series demonstrating dominating advantages over others are often of particular interest. In this paper, we advocate interval skyline queries, a novel type of time seri ... Full text Cite

Link spam target detection using page farms

Journal Article ACM Transactions on Knowledge Discovery from Data · July 1, 2009 Currently, most popular Web search engines adopt some link-based ranking methods such as PageRank. Driven by the huge potential benefit of improving rankings of Web pages, many tricks have been attempted to boost page rankings. The most common way, which i ... Full text Cite

Anonymization-based attacks in privacy-preserving data publishing

Journal Article ACM Transactions on Database Systems · June 1, 2009 Data publishing generates much concern over the protection of individual privacy. Recent studies consider cases where the adversary may possess different kinds of knowledge about the data. In this article, we show that knowledge of the mechanism or algorit ... Full text Cite

OrthoClusterDB: an online platform for synteny blocks.

Journal Article BMC bioinformatics · June 2009 Background: The recent availability of an expanding collection of genome sequences driven by technological advances has facilitated comparative genomics and in particular the identification of synteny among multiple genomes. However, the development ... Full text Cite

Top-k typicality queries and efficient query answering methods on large databases

Journal Article VLDB Journal · June 1, 2009 Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answerin ... Full text Cite

PADS: A simple yet effective pattern-aware dynamic search method for fast maximal frequent pattern mining

Journal Article Knowledge and Information Systems · January 1, 2009 While frequent pattern mining is fundamental for many data mining tasks, mining maximal frequent patterns efficiently is important in both theory and applications of frequent pattern mining. The fundamental challenge is how to search a large space of item ... Full text Cite

Online skyline analysis with dynamic preferences on nominal attributes

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2009 The importance of skyline analysis has been well recognized in multicriteria decision-making applications. All of the previous studies assume a fixed order on the attributes in question. However, in some applications, users may be interested in skylines wi ... Full text Cite

Continuous K-means monitoring with low reporting cost in sensor networks

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2009 In this paper, we study an interesting problem: continuously monitoring k-means clustering of sensor readings in a large sensor network. Given a set of sensors whose readings evolve over time, we want to maintain the k-means of the readings continuously. T ... Full text Cite

GenBlastA: enabling BLAST to identify homologous gene sequences.

Journal Article Genome research · January 2009 BLAST is an extensively used local similarity search tool for identifying homologous sequences. When a gene sequence (either protein sequence or nucleotide sequence) is used as a query to search for homologous sequences in a genome, the search results, rep ... Full text Cite

Early prediction on time series: A nearest neighbor approach

Conference IJCAI International Joint Conference on Artificial Intelligence · January 1, 2009 In this paper, we formulate the problem of early classification of time series data, which is important in some time-sensitive applications such as health-informatics. We introduce a novel concept of MPL (Minimum Prediction Length) and develop ECTS (Early ... Cite

Privacy preserving publishing on multiple quasi-identifiers

Conference Proceedings - International Conference on Data Engineering · January 1, 2009 In some applications of privacy preserving data publishing, a practical demand is to publish a data set on multiple quasi-identifiers for multiple users simultaneously, which poses several challenges. Can we generate one anonymized version of the data so t ... Full text Cite

Mining frequent cross-graph quasi-cliques

Journal Article ACM Transactions on Knowledge Discovery from Data · January 1, 2009 Joint mining of multiple datasets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in bioinformatics, jointly mining multiple gene expression datasets obtained by different la ... Full text Cite

Continuous privacy preserving publishing of data streams

Conference Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 · January 1, 2009 Recently, privacy preserving data publishing has received a lot of attention in both research and applications. Most of the previous studies, however, focus on static data sets. In this paper, we study an emerging problem of continuous privacy preserving p ... Full text Cite

Understanding importance of collaborations in co-authorship networks: A supportiveness analysis approach

Conference Proceedings of the 2009 SIAM International Conference on Data Mining · 2009 Cite

OSD: An online web spam detection system

Conference Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD · 2009 Cite

Anonymization with Worst-Case Distribution-Based Background Knowledge

Journal Article arXiv preprint arXiv:0909.1127 · 2009 Cite

Towards web search engine scale data mining

Conference Proceedings of the Eighth Australasian Data Mining Conference-Volume 101 · 2009 Cite

Debt detection in social security by sequence classification using both positive and negative patterns

Conference Joint European Conference on Machine Learning and Knowledge Discovery in Databases · 2009 Cite

Fast and quality-guaranteed data streaming in resource-constrained sensor networks

Conference Proceedings of the International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc) · December 15, 2008 In many emerging applications, data streams are monitored in a network environment. Due to limited communication bandwidth and other resource constraints, a critical and practical demand is to online compress data streams continuously with quality guarante ... Full text Cite

Ranking queries on uncertain data: A probabilistic threshold approach

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · December 10, 2008 Uncertain data is inherent in a few important applications such as environmental surveillance and mobile object tracking. Top-k queries (also known as ranking queries) are often natural and useful in analyzing uncertain data in those applications. In this ... Full text Cite

DiMaC: A system for cleaning disguised missing data

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · December 10, 2008 In some applications such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, whi ... Full text Cite

Publishing sensitive transactions for itemset utility

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2008 We consider the problem of publishing sensitive transaction data with privacy preservation. High dimensionality of transaction data poses unique challenges on data privacy and data utility. On one hand, re-identification attacks tend to use a subset of ite ... Full text Cite

Context-aware query suggestion by mining click-through and session data

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2008 Query suggestion plays an important role in improving the usability of search engines. Although some recently proposed methods can make meaningful query suggestions by mining query patterns from search logs, none of them are context-aware - they do not tak ... Full text Cite

Mining preferences from superior and inferior examples

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2008 Mining user preferences plays a critical role in many important applications such as customer relationship management (CRM), product and service recommendation, and marketing campaigns. In this paper, we identify an interesting and practical problem of min ... Full text Cite

DiMaC: A disguised missing data cleaning tool

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2008 In some applications such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, whi ... Full text Cite

Efficiently answering probabilistic threshold top-k queries on uncertain data

Conference Proceedings - International Conference on Data Engineering · October 1, 2008 In this paper, we propose a novel type of probabilistic threshold top-k queries on uncertain data, and give an exact algorithm. More details can be found in [4]. ... Full text Cite

Preserving privacy in social networks against neighborhood attacks

Conference Proceedings - International Conference on Data Engineering · October 1, 2008 Recently, as more and more social network data has been published in one way or another, preserving privacy in publishing social network data becomes an important concern. With some local knowledge about individuals in a social network, an adversary may at ... Full text Cite

Managing uncertain data: Probabilistic approaches

Conference Proceedings - The 9th International Conference on Web-Age Information Management, WAIM 2008 · September 22, 2008 Uncertain data are inherent in many important applications. Recently, considerable research efforts have been put into the field of managing uncertain data. In this paper, we summarize existing techniques to query and model uncertain data and systems that ... Full text Cite

PLEDS: A personalized entity detection system based on web log mining techniques

Conference Proceedings - The 9th International Conference on Web-Age Information Management, WAIM 2008 · September 22, 2008 With the expansion of the internet, many specialized, high-profile sites have become available that bring very technical subject matter to readers with non-technical backgrounds. While the theme of these sites may be of interest to these readers, the posts ... Full text Cite

Anonymization by local recoding in data with attribute hierarchical taxonomies

Journal Article IEEE Transactions on Knowledge and Data Engineering · September 1, 2008 Individual privacy will be at risk if a published data set is not properly deidentified. k-Anonymity is a major technique to deidentify a data set. Among a number of k-anonymization schemes, local recoding methods are promising for minimizing the distortio ... Full text Cite

Clustering by pattern similarity

Journal Article Journal of Computer Science and Technology · July 1, 2008 The task of clustering is to identify classes of similar objects among a set of objects. The definition of similarity varies from one clustering model to another. However, in most of these models the concept of similarity is often based on such metrics as ... Full text Cite

Anonymity for continuous data publishing

Conference Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings · May 16, 2008 k-anonymization is an important privacy protection mechanism in data publishing. While there has been a great deal of work in recent years, almost all considered a single static release. Such mechanisms only protect the data up to the first release or firs ... Full text Cite

OrthoCluster: A new tool for mining synteny blocks and applications in comparative genomics

Conference Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings · May 16, 2008 By comparing genomes among both closely and distally related species, comparative genomics analysis characterizes structures and functions of different genomes in both conserved and divergent regions. Synteny blocks, which are conserved blocks of genes on ... Full text Cite

Efficient skyline querying with variable user preferences on nominal attributes

Journal Article Proceedings of the VLDB Endowment · January 1, 2008 Current skyline evaluation techniques assume a fixed ordering on the attributes. However, dynamic preferences on nominal attributes are more realistic in known applications. In order to generate online response for any such preference issued by a user, one o ... Full text Cite

A spamicity approach to web spam detection

Conference Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130 · January 1, 2008 Web spam, which refers to any deliberate actions bringing to selected web pages an unjustifiable favorable relevance or importance, is one of the major obstacles for high quality information retrieval on the web. Most of the existing web spam detection met ... Full text Cite

Query answering techniques on uncertain and probabilistic data

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2008 Uncertain data are inherent in some important applications, such as environmental surveillance, market analysis, and quantitative economics research. Due to the importance of those applications and the rapidly increasing amount of uncertain data collected ... Full text Cite

Mining sequence classifiers for early prediction

Conference Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130 · January 1, 2008 Supervised learning on sequence data, also known as sequence classification, has been well recognized as an important data mining task with many significant applications. Since temporal order is important in sequence data, in many critical applications of ... Full text Cite

Query answering techniques on uncertain and probabilistic data: tutorial summary

Conference Proceedings of the 2008 ACM SIGMOD international conference on Management of data · 2008 Cite

Advances in information and knowledge management

Conference ACM SIGIR Forum · 2008 Cite

Mining uncertain and probabilistic data: problems, challenges, methods, and applications

Conference Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining · 2008 Cite

Mining favorable facets

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 14, 2007 The importance of dominance and skyline analysis has been well recognized in multi-criteria decision making applications. Most previous studies assume a fixed order on the attributes. In practice, different customers may have different preferences on nomin ... Full text Cite

Cleaning disguised missing data: A heuristic approach

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 14, 2007 In some applications such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, whi ... Full text Cite

IX-cubes: Iceberg cubes for data warehousing and OLAP on XML data

Conference International Conference on Information and Knowledge Management, Proceedings · December 1, 2007 With increasing amount of data being stored in XML format, OLAP queries over these data become important. OLAP queries have been well studied in the relational database systems. However, the evaluation of OLAP queries over XML data is not a trivial extensi ... Full text Cite

A system framework for web service semantic and automatic orchestration

Conference 2007 2nd International Conference on Pervasive Computing and Applications, ICPCA'07 · December 1, 2007 In this paper, we present the framework of Semantic and Automatic Service Orchestration (SASO) system for Web services modeling and composition. The SASO system has the following features: 1) it adopts a semantic approach to model Web services, and 2) it ... Full text Cite

Mining API patterns as partial orders from source code: From usage scenarios to specifications

Conference 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2007 · December 1, 2007 A software system interacts with third-party libraries through various APIs. Using these library APIs often needs to follow certain usage patterns. Furthermore, ordering rules (specifications) exist between APIs, and these rules govern the secure and robust ... Full text Cite

Maintaining K-anonymity against incremental updates

Conference Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM · December 1, 2007 K-anonymity is a simple yet practical mechanism to protect privacy against attacks of re-identifying individuals by joining multiple public data sources. All existing methods achieving k-anonymity assume implicitly that the data objects to be anonymized ar ... Full text Cite

TS-trees: A non-alterable search tree index for trustworthy databases on Write-Once-Read-Many (WORM) storage

Conference Proceedings - International Conference on Advanced Information Networking and Applications, AINA · September 25, 2007 Trustworthy data processing, which ensures the credibility and irrefutability of data, is crucial in many business applications. Recently, the Write-Once-Read-Many (WORM) devices have been used as trustworthy data storage. Nevertheless, how to efficiently ... Full text Cite

Mining software engineering data

Conference Proceedings - International Conference on Software Engineering · September 25, 2007 Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contains a wealth of information about a project's status, progress, and evolution. Using well-established data mining techniques, p ... Full text Cite

Computing compressed multidimensional skyline cubes efficiently

Conference Proceedings - International Conference on Data Engineering · September 24, 2007 Recently, the skyline computation and analysis have been extended from one single full space to multidimensional subspaces, which can lead to valuable insights in some applications. Particularly, compressed skyline cubes in the form of skyline groups and t ... Full text Cite

Efficient skyline and top-k retrieval in subspaces

Journal Article IEEE Transactions on Knowledge and Data Engineering · August 1, 2007 Skyline and top-k queries are two popular operations for preference retrieval. In practice, applications that require these operations usually provide numerous candidate attributes, whereas, depending on their interests, users may issue queries regarding d ... Full text Cite

An energy-efficient data collection framework for wireless sensor networks by exploiting spatiotemporal correlation

Journal Article IEEE Transactions on Parallel and Distributed Systems · July 1, 2007 Limited energy supply is one of the major constraints in wireless sensor networks. A feasible strategy is to aggressively reduce the spatial sampling rate of sensors, that is, the density of the measure points in a field. By properly scheduling, we want to ... Full text Cite

H-Mine: Fast and space-preserving frequent pattern mining in large databases

Journal Article IIE Transactions (Institute of Industrial Engineers) · June 1, 2007 In this study, we propose a simple and novel data structure using hyper-links, H-struct, and a new mining algorithm, H-mine, which takes advantage of this data structure and dynamically adjusts links in the mining process. A distinct feature of this method ... Full text Cite

Constraint-based sequential pattern mining: The pattern-growth methods

Journal Article Journal of Intelligent Information Systems · April 1, 2007 Constraints are essential for many sequential pattern mining applications. However, there is no systematic study on constraint-based sequential pattern mining. In this paper, we investigate this issue and point out that the framework developed for constrai ... Full text Cite

Probabilistic skylines on uncertain data

Conference 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings · January 1, 2007 Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data r ... Cite

Efficiently answering top-k typicality queries on large databases

Conference 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings · January 1, 2007 Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answerin ... Cite

Minimality attack in privacy preserving data publishing

Conference 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings · January 1, 2007 Data publishing generates much concern over the protection of individual privacy. Recent studies consider cases where the adversary may possess different kinds of knowledge about the data. In this paper, we show that knowledge of the mechanism or algorithm ... Cite

Classifying noisy and incomplete medical data by a differential latent semantic indexing approach

Chapter · January 1, 2007 It is well-recognized that medical datasets are often noisy and incomplete due to the difficulties in data collection and integration. Noise and incompleteness in medical data pose substantial challenges for accurate classification. A differential latent s ... Full text Cite

(α, k)-anonymity based privacy preservation by lossy join

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2007 Privacy-preserving data publication for data mining is to protect sensitive information of individuals in published data while the distortion to the data is minimized. Recently, it is shown that (α, k)-anonymity is a feasible technique when we are given so ... Full text Cite

Sketching landscapes of page farms

Conference Proceedings of the 7th SIAM International Conference on Data Mining · January 1, 2007 The Web is a very large social network. It is important and interesting to understand the "ecology" of the Web: the general relations of Web pages to their environment. The understanding of such relations has a few important applications, including Web com ... Full text Cite

Mining gene-sample-time microarray data: A coherent gene cluster discovery approach

Journal Article Knowledge and Information Systems · January 1, 2007 Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of gene-sample-time microarray data sets that records the expression levels of various ... Full text Cite

Mining frequent trajectory patterns for activity monitoring using radio frequency tag arrays

Conference Proceedings - Fifth Annual IEEE International Conference on Pervasive Computing and Communications, PerCom 2007 · January 1, 2007 Activity monitoring, a crucial task in many applications, is often conducted expensively using video cameras. Also, effectively monitoring a large field by analyzing images from multiple cameras remains a challenging problem. In this paper, we introduce a ... Full text Cite

Active rules termination analysis through conditional formula containing updatable variable

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2007 While active rules have been applied in many areas including active databases, XML documentation and Semantic Web, current methods remain largely uncertain of how to terminate active behaviors. Some existing methods have been provided in the form of a logi ... Full text Cite

Answering ad hoc aggregate queries from data streams using prefix aggregate trees

Journal Article Knowledge and Information Systems · January 1, 2007 In some business applications such as trading management in financial institutions, it is required to accurately answer ad hoc aggregate queries over data streams. Materializing and incrementally maintaining a full data cube or even its compression or appr ... Full text Cite

WAT: Finding top-K discords in time series database

Conference Proceedings of the 7th SIAM International Conference on Data Mining · January 1, 2007 Finding discords in time series database is an important problem in a great variety of applications, such as space shuttle telemetry, mechanical industry, biomedicine, and financial data analysis. However, most previous methods for this problem suffer from ... Full text Cite

PIKM 2007 foreword

Conference International Conference on Information and Knowledge Management, Proceedings · January 1, 2007 Cite

Introduction to the special issue on data mining for health informatics

Journal Article ACM SIGKDD Explorations Newsletter · 2007 Cite

Efficient skyline querying with variable user preferences on nominal attributes

Journal Article arXiv preprint arXiv:0710.2604 · 2007 Cite

(α, k)-anonymity based privacy preservation by lossy join

Conference Advances in Data and Web Management, Proceedings · 2007 Cite

Regression cubes with lossless compression and aggregation

Journal Article IEEE Transactions on Knowledge and Data Engineering · December 1, 2006 As OLAP engines are widely used to support multidimensional data analysis, it is desirable to support in data cubes advanced statistical measures, such as regression and filtering, in addition to the traditional simple measures such as count and average. S ... Full text Cite

Classification spanning correlated data streams

Conference International Conference on Information and Knowledge Management, Proceedings · December 1, 2006 In many applications, classifiers need to be built based on multiple related data streams. For example, stock streams and news streams are related, where the classification patterns may involve features from both streams. Thus instead of mining on a single ... Full text Cite

Towards multidimensional subspace skyline analysis

Journal Article ACM Transactions on Database Systems · December 1, 2006 The skyline operator is important for multicriteria decision-making applications. Although many recent studies developed efficient methods to compute skyline objects in a given space, none of them considers skylines in multiple subspaces simultaneously. Mo ... Full text Cite

An effective approach to entity resolution problem using Quasi-Clique and its application to digital libraries

Conference Proceedings of the ACM/IEEE Joint Conference on Digital Libraries · December 1, 2006 We study how to resolve entities that contain a group of related elements in them (e.g., an author entity with a list of citations or an intermediate result by GROUP BY SQL query). Such entities, named as grouped-entities, frequently occur in many applicat ... Full text Cite

MAPO: Mining API usages from open source repositories

Conference Proceedings - International Conference on Software Engineering · December 1, 2006 To improve software productivity, when constructing new software systems, developers often reuse existing class libraries or frameworks by invoking their APIs. Those APIs, however, are often complex and not well documented, posing barriers for developers t ... Full text Cite

Minimum description length principle: Generators are preferable to closed patterns

Conference Proceedings of the National Conference on Artificial Intelligence · November 13, 2006 The generators and the unique closed pattern of an equivalence class of itemsets share a common set of transactions. The generators are the minimal ones among the equivalent itemsets, while the closed pattern is the maximum one. As a generator is usually s ... Cite

Mining changing regions from access-constrained snapshots: A cluster-embedded decision tree approach

Journal Article Journal of Intelligent Information Systems · November 1, 2006 Change detection on spatial data is important in many applications, such as environmental monitoring. Given a set of snapshots of spatial objects at various temporal instants, a user may want to derive the changing regions between any two snapshots. Most o ... Full text Cite

Mining co-location patterns with rare events from spatial data sets

Journal Article GeoInformatica · September 1, 2006 A co-location pattern is a group of spatial features/events that are frequently co-located in the same region. For example, human cases of West Nile Virus often occur in regions with poor mosquito control and the presence of birds. For co-location pattern ... Full text Cite

Closed constrained gradient mining in retail databases

Journal Article IEEE Transactions on Knowledge and Data Engineering · June 1, 2006 Incorporating constraints into frequent itemset mining not only improves data mining efficiency, but also leads to concise and meaningful results. In this paper, a framework for closed constrained gradient itemset mining in retail databases is proposed by ... Full text Cite

Discovering frequent closed partial orders from strings

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2006 Mining knowledge about ordering from sequence data is an important problem with many applications, such as bioinformatics, Web mining, network management, and intrusion detection. For example, if many customers follow a partial order in their purchases of ... Full text Cite

Using High Dimensional Indexes to Support Relevance Feedback Based Interactive Images Retrieval

Conference VLDB 2006 - Proceedings of the 32nd International Conference on Very Large Data Bases · January 1, 2006 Image retrieval has found more and more applications. Due to the well recognized semantic gap problem, the accuracy and the recall of image similarity search are often still low. As an effective method to improve the quality of image retrieval, the relevan ... Cite

Suppressing model overfitting in mining concept-drifting data streams

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2006 Mining data streams of changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution. This poses a serious challenge. On the one hand, relying on historical da ... Full text Cite

Achieving k-anonymity by clustering in attribute hierarchical structures

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2006 Individual privacy will be at risk if a published data set is not properly de-identified. k-Anonymity is a major technique to de-identify a data set. A more general view of k-anonymity is clustering with a constraint of the minimum number of objects in eve ... Cite

Improving grouped-entity resolution using Quasi-Cliques

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2006 The entity resolution (ER) problem, which identifies duplicate entities that refer to the same real world entity, is essential in many applications. In this paper, in particular, we focus on resolving entities that contain a group of related elements in th ... Full text Cite

Utility-based anonymization using local recoding

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2006 Privacy becomes a more and more serious concern in applications involving microdata. Recently, efficient anonymization has attracted much research work. Most of the previous methods use global recoding, which maps the domains of the quasi-identifier attrib ... Full text Cite

SUBSKY: Efficient computation of skylines in subspaces

Conference Proceedings - International Conference on Data Engineering · January 1, 2006 Given a set of multi-dimensional points, the skyline contains the best points according to any preference function that is monotone on all axes. In practice, applications that require skyline analysis usually provide numerous candidate attributes, and vari ... Full text Cite

On privacy preservation against adversarial data mining

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2006 Privacy preserving data processing has become an important topic recently because of advances in hardware technology which have led to widespread proliferation of demographic and sensitive data. A rudimentary way to preserve privacy is to simply hide the ... Full text Cite

Granularity adaptive density estimation and on demand clustering of concept-drifting data streams

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2006 Clustering data streams has found a few important applications. While many previous studies focus on clustering objects arriving in a data stream, in this paper, we consider the novel problem of on demand clustering concept drifting data streams. In order ... Full text Cite

Subsky: Efficient computation of skylines in subspaces

Conference 22nd International Conference on Data Engineering (ICDE’06) · 2006 Cite

Utility-based anonymization for privacy preservation with less information loss

Journal Article ACM SIGKDD Explorations Newsletter · 2006 Cite

Multidimensional k-anonymization by linear clustering using space-filling curves

Journal Article Simon Fraser University School of Computing Science Technical Report · 2006 Cite

Data mining: concepts and techniques Morgan Kaufmann

Journal Article San Francisco · 2006 Cite

Online mining of data streams: Applications, techniques and progress

Conference Proceedings - International Conference on Data Engineering · December 12, 2005 Full text Cite

Mining cross-graph quasi-cliques in gene expression and protein interaction data

Conference Proceedings - International Conference on Data Engineering · December 12, 2005 Full text Cite

Mining the most general multidimensional summarization of "probable groups" in data warehouses

Conference Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM · December 1, 2005 Data summarization is an important data analysis task in data warehousing and online analytic processing. In this paper, we consider a novel type of summarization queries, probable group queries, such as "What are the groups of patients that have a 50% or ... Cite

Efficiently mining frequent closed partial orders (extended abstract)

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2005 Full text Cite

Catching the best views of skyline: A semantic approach based on decisive subspaces

Conference VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases · December 1, 2005 The skyline operator is important for multi-criteria decision making applications. Although many recent studies developed efficient methods to compute skyline objects in a specific space, the fundamental problem on the semantics of skylines remains open: W ... Cite

On mining cross-graph quasi-cliques

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2005 Joint mining of multiple data sets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in cross-market customer segmentation, a group of customers who behave similarly in multipl ... Full text Cite

A stratification-based approach to accurate and fast image annotation

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · December 1, 2005 Image annotation is an important research problem in content-based image retrieval (CBIR) and computer vision with broad applications. A major challenge is the so-called "semantic gap" between the low-level visual features and the high-level semantic conce ... Full text Cite

A dynamic clustering and scheduling approach to energy saving in data collection from wireless sensor networks

Conference 2005 Second Annual IEEE Communications Society Conference on Sensor and AdHoc Communications and Networks, SECON 2005 · December 1, 2005 Energy consumption is one of the major constraints in wireless sensor networks. A highly feasible strategy is to aggressively reduce the spatial sampling rate of sensors (i.e., the density of the measure points in a field). By properly scheduling, we want ... Full text Cite

GraphMiner: A structural pattern-mining system for large disk-based graph databases and its applications

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2005 Mining frequent structural patterns from graph databases is an important research problem with broad applications. Recently, we developed an effective index structure, ADI, and efficient algorithms for mining frequent patterns from large, disk-based graph ... Cite

An interactive approach to mining gene expression data

Journal Article IEEE Transactions on Knowledge and Data Engineering · October 1, 2005 Effective identification of coexpressed genes and coherent patterns in gene expression data is an important task in bioinformatics research and biomedical applications. Several clustering methods have recently been proposed to identify coexpressed genes th ... Full text Cite

Stream cube: An architecture for multi-dimensional analysis of data streams

Journal Article Distributed and Parallel Databases · September 1, 2005 Real-time surveillance systems, telecommunication systems, and other dynamic environments often generate tremendous (potentially infinite) volume of stream data: the volume is too huge to be scanned multiple times. Much of such data resides at rather low l ... Full text Cite

Preference-Based Frequent Pattern Mining

Journal Article International Journal of Data Warehousing and Mining (IJDWM) · January 1, 2005 Frequent pattern mining is an important data-mining problem with broad applications. Although there are many in-depth studies on efficient frequent pattern mining algorithms and constraint pushing techniques, the effectiveness of frequent pattern mining re ... Full text Cite

Cross table cubing: Mining iceberg cubes from data warehouses

Conference Proceedings of the 2005 SIAM International Conference on Data Mining, SDM 2005 · January 1, 2005 All of the existing (iceberg) cube computation algorithms assume that the data is stored in a single base table, however, in practice, a data warehouse is often organized in a schema of multiple tables, such as star schema and snowflake schema. In terms of ... Full text Cite

Mining succinct systems of minimal generators of formal concepts

Conference Lecture Notes in Computer Science · January 1, 2005 Formal concept analysis has become an active field of study for data analysis and knowledge discovery. A formal concept C is determined by its extent (the set of objects that fall under C) and its intent (the set of properties or attributes covered by C). ... Full text Cite

A general approach to mining quality pattern-based clusters from microarray data

Conference Lecture Notes in Computer Science · January 1, 2005 Pattern-based clustering has broad applications in microarray data analysis, customer segmentation, e-business data analysis, etc. However, pattern-based clustering often returns a large number of highly-overlapping clusters, which makes it hard for users ... Full text Cite

Pattern-based similarity search for microarray data

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2005 One fundamental task in near-neighbor search as well as other similarity matching efforts is to find a distance function that can efficiently quantify the similarity between two objects in a meaningful way. In DNA microarray analysis, the expression levels ... Full text Cite

A random method for quantifying changing distributions in data streams

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2005 In applications such as fraud and intrusion detection, it is of great interest to measure the evolving trends in the data. We consider the problem of quantifying changes between two datasets with class labels. Traditionally, changes are often measured by f ... Full text Cite

Stream cube: An architecture for multi-dimensional analysis of data streams

Journal Article Distributed and Parallel Databases · 2005 Cite

Efficiently mining frequent closed partial orders

Conference Fifth IEEE International Conference on Data Mining (ICDM’05) · 2005 Cite

Data mining: The next generation

Conference Dagstuhl Seminar Proceedings · 2005 Cite

Online mining data streams: Problems, applications and progress

Conference Proc. the 21st International Conference on Data Engineering, ICDE'05 · 2005 Cite

Rank sum method for related gene selection and its application to tumor diagnosis

Journal Article Chinese Science Bulletin · December 1, 2004 Tumor diagnosis by analyzing gene expression profiles becomes an interesting topic in bioinformatics and the main problem is to identify the genes related to a tumor. This paper proposes a rank sum method to identify the related genes based on the rank sum ... Full text Cite

Mining sequential patterns by pattern-growth: The prefixspan approach

Journal Article IEEE Transactions on Knowledge and Data Engineering · November 1, 2004 Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the pre ... Full text Cite

A fast algorithm for subspace clustering by pattern similarity

Conference Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM · October 25, 2004 Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends ... Cite

Mining constrained gradients in large databases

Journal Article IEEE Transactions on Knowledge and Data Engineering · August 1, 2004 Many data analysis tasks can be viewed as search or mining in a multidimensional space (MDS). In such MDSs, dimensions capture potentially important factors for given applications, and cells represent combinations of values for the factors. To systematical ... Full text Cite

Pushing convertible constraints in frequent itemset mining

Journal Article Data Mining and Knowledge Discovery · May 1, 2004 Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. Constraint pushing techniques h ... Full text Cite

Efficient pattern-growth methods for frequent tree pattern mining

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2004 Mining frequent tree patterns is an important research problem with broad applications in bioinformatics, digital library, ecommerce, and so on. Previous studies highly suggested that pattern-growth methods are efficient in frequent pattern mining. In this ... Full text Cite

A rank sum test method for informative gene discovery

Conference KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2004 Finding informative genes from microarray data is an important research problem in bioinformatics research and applications. Most of the existing methods rank features according to their discriminative capability and then find a subset of discriminative ge ... Full text Cite

Mining frequent patterns without candidate generation: A frequent-pattern tree approach

Journal Article Data Mining and Knowledge Discovery · January 1, 2004 Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. H ... Full text Cite

Mining coherent gene clusters from gene-sample-time microarray data

Conference KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2004 Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of gene-sample-time microarray data sets, which records the expression levels of vario ... Full text Cite

Scalable mining of large disk-based graph databases

Conference KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2004 Mining frequent structural patterns from graph databases is an interesting problem with broad applications. Most of the previous studies focus on pruning unfruitful search subspaces effectively, but few of them address the mining on large, disk-based datab ... Full text Cite

From sequential pattern mining to structured pattern mining: A pattern-growth approach

Journal Article Journal of Computer Science and Technology · January 1, 2004 Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Recent studie ... Full text Cite

Mining condensed frequent-pattern bases

Journal Article Knowledge and Information Systems · 2004 Cite

Data mining for intrusion detection: techniques, applications and systems

Conference Proceedings. 20th International Conference on Data Engineering · 2004 Cite

Mining coherent gene clusters from three-dimensional microarray data

Conference Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04) · 2004 Cite

GPX: interactive mining of gene expression data

Conference Proceedings of the Thirtieth international conference on Very large data bases-Volume 30 · 2004 Cite

Adatbányászat

Journal Article Koncepciók és technikák Panem Könyvkiadó, Budapest · 2004 Cite

GPX

Chapter · January 1, 2004 Discovering co-expressed genes and coherent expression patterns in gene expression data is an important data analysis task in bioinformatics research and biomedical applications. Although various clustering methods have been proposed, two tough challenges ... Full text Cite

QC-Trees: An Efficient Summary Structure for Semantic OLAP

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2003 Recently, a technique called quotient cube was proposed as a summary structure for a data cube that preserves its semantics, with applications for online exploration and visualization. The authors showed that a quotient cube can be constructed very efficie ... Cite

Mining phenotypes and informative genes from gene expression data

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2003 Mining microarray gene expression data is an important research topic in bioinformatics with broad applications. While most of the previous studies focus on clustering either genes or samples, it is interesting to ask whether we can partition the complete ... Full text Cite

Interactive exploration of coherent patterns in time-series gene expression data

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2003 Discovering coherent gene expression patterns in time-series gene expression data is an important task in bioinformatics research and biomedical applications. In this paper, we propose an interactive exploration framework for mining coherent expression pat ... Full text Cite

CLOSET+: Searching for the best strategies for mining frequent closed itemsets

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2003 Mining frequent closed itemsets provides complete and non-redundant results for frequent pattern analysis. Extensive studies have proposed various strategies for efficient frequent closed itemset mining, such as depth-first search vs. breadth-first search, ... Full text Cite

MaPle: A fast algorithm for maximal pattern-based clustering

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2003 Pattern-based clustering is important in many applications, such as DNA micro-array data analysis, automatic recommendation systems and target marketing systems. However, pattern-based clustering in large databases is challenging. On the one hand, there ca ... Cite

SOCQET: Semantic OLAP with Compressed Cube and Summarization

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2003 Cite

Efficacious Data Cube Exploration by Semantic Summarization and Compression

Chapter · January 1, 2003 This chapter discusses the efficacious data cube exploration by semantic summarization and compression. Data cube is the core operator in data warehousing and online analytical processing (OLAP). Its efficient computation, maintenance, and utilization for ... Full text Cite

DHC: A density-based hierarchical clustering method for time series gene expression data

Conference Proceedings - 3rd IEEE Symposium on BioInformatics and BioEngineering, BIBE 2003 · January 1, 2003 Clustering the time series gene expression data is an important task in bioinformatics research and biomedical applications. Recently, some clustering methods have been adapted or proposed. However, some concerns still remain, such as the robustness of the ... Full text Cite

Efficacious data cube exploration by semantic summarization and compression

Conference Proceedings - 29th International Conference on Very Large Data Bases, VLDB 2003 · January 1, 2003 Data cube is the core operator in data warehousing and OLAP. Its efficient computation, maintenance, and utilization for query answering and advanced analysis have been the subjects of numerous studies. However, for many applications, the huge size of the ... Full text Cite

A general model for online analytical processing of complex data

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2003 It has been well recognized that online analytical processing (OLAP) can provide important insights into huge archives of data. While the conventional OLAP model is capable of analyzing relational business data, it often cannot fit many kinds of complex da ... Full text Cite

Mining confident co-location rules without a support threshold

Conference Proceedings of the ACM Symposium on Applied Computing · January 1, 2003 Mining co-location patterns from spatial databases may reveal types of spatial features likely located as neighbors in space. In this paper, we address the problem of mining confident co-location rules without a support threshold. First, we propose a novel ... Full text Cite

Mining frequent patterns in data streams at multiple time granularities

Journal Article Next generation data mining · 2003 Cite

Online mining of changes from data streams: Research problems and preliminary results

Conference Proceedings of the 2003 ACM SIGMOD Workshop on Management and Processing of Data Streams · 2003 Cite

ApproxMAP: Approximate mining of consensus sequential patterns

Conference Proceedings of the 2003 SIAM International Conference on Data Mining · 2003 Cite

Towards interactive exploration of gene expression patterns

Journal Article ACM SIGKDD Explorations Newsletter · 2003 Cite

Vasodilation effect of puerarin on abdominal aortic artery in the rat and the underlying mechanism

Journal Article Journal of the Fourth Military Medical University · 2003 Cite

Recent Progress on Selected Topics in Database Research - A Report by Nine Young Chinese Researchers Working in the United States

Journal Article Journal of Computer Science and Technology · January 1, 2003 The study on database technologies, or more generally, the technologies of data and information management, is an important and active research field. Recently, many exciting results have been reported. In this fast growing field, Chinese researchers play ... Full text Cite

On computing condensed frequent pattern bases

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2002 Frequent pattern mining has been studied extensively. However, the effectiveness and efficiency of this mining is often limited, since the number of frequent patterns generated is often too large. In many applications it is sufficient to generate and exami ... Cite

COMMIX: Towards effective web information extraction, integration and query answering

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · September 17, 2002 As WWW becomes more and more popular and powerful, how to search information on the web in database way becomes an important research topic. COMMIX, which is developed in the DB group in Peking University (China), is a system towards building very large da ... Cite

CubeExplorer: Online exploration of data cubes

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · September 17, 2002 Data cube enables fast online analysis of large data repositories, which is attractive in many applications. Although there are several kinds of available cube-based OLAP products, users may still encounter challenges on effectiveness and efficiency in the ... Cite

COMMIX: Towards Effective Web Information Extraction, Integration and Query Answering

Conference Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002 · June 3, 2002 As WWW becomes more and more popular and powerful, how to search information on the web in database way becomes an important research topic. COMMIX, which is developed in the DB group in Peking University (China), is a system towards building very large da ... Full text Cite

CubeExplorer: Online Exploration of Data Cubes

Conference Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002 · June 3, 2002 Data cube enables fast online analysis of large data repositories which is attractive in many applications. Although there are several kinds of available cube-based OLAP products, users may still encounter challenges on effectiveness and efficiency in the ... Full text Cite

Mining sequential patterns with constraints in large databases

Conference International Conference on Information and Knowledge Management, Proceedings · January 1, 2002 Constraints are essential for many sequential pattern mining applications. However, there is no systematic study on constraint-based sequential pattern mining. In this paper, we investigate this issue and point out that the framework developed for constrai ... Full text Cite

Quotient cube: How to summarize the semantics of a data cube

Conference VLDB’02: Proceedings of the 28th International Conference on Very Large Databases · 2002 Cite

Constrained frequent pattern mining: a pattern-growth view

Journal Article ACM SIGKDD Explorations Newsletter · 2002 Cite

PATTERN-GROWTH METHODS FOR FREQUENT

Thesis Dissertation · 2002 Cite

Olaping stream data: Is it feasible

Conference Proc. Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM SIGMOD · 2002 Cite

Constraint-based sequential pattern mining in large databases

Conference Proc. 2002 Int'l Conf. Information and Knowledge Management (CIKM'02) · 2002 Cite

H-mine: Hyper-structure mining of frequent patterns in large databases

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2001 Methods for efficient mining of frequent patterns have been studied extensively by many researchers. However, the previously proposed methods still encounter some performance bottlenecks when mining databases with different data characteristics, such as de ... Cite

CMAR: Accurate and efficient classification based on multiple class-association rules

Conference Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2001 Previous studies propose that associative classification has high classification accuracy and strong flexibility at handling unstructured data. However, it still suffers from the huge set of mined rules and sometimes biased classification or overfitting si ... Cite

DNA-Miner: A system prototype for mining DNA sequences

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · September 29, 2001 Cite

DNA-Miner: A System Prototype for Mining DNA Sequences

Journal Article SIGMOD Record · January 1, 2001 Full text Cite

Mining multi-dimensional constrained gradients in data cubes

Conference VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases · January 1, 2001 Constrained gradient analysis (similar to the "cubegrade" problem posed by Imielinski, et al. [9]) is to extract pairs of similar cell characteristics associated with big changes in measure in a data cube. Cells are considered similar if they are related b ... Cite

Multi-dimensional sequential pattern mining

Conference International Conference on Information and Knowledge Management, Proceedings · January 1, 2001 Sequential pattern mining, which finds the set of frequent subsequences in sequence databases, is an important data-mining task and has broad applications. Usually, sequence patterns are associated with different circumstances, and such circumstances form ... Full text Cite

Efficient computation of iceberg cubes with complex measures

Journal Article SIGMOD Record (ACM Special Interest Group on Management of Data) · January 1, 2001 It is often too expensive to compute and materialize a complete high-dimensional data cube. Computing an iceberg cube, which contains only aggregates above certain thresholds, is an effective way to derive nontrivial multidimensional aggregations for OLAP ... Full text Cite

Mining frequent itemsets with convertible constraints

Journal Article Proceedings - International Conference on Data Engineering · January 1, 2001 Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. In this paper, we study constra ... Full text Cite

PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth

Conference Proceedings - International Conference on Data Engineering · January 1, 2001 Sequential pattern mining is an important data mining problem with broad applications. It is challenging since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequential pattern ... Cite

Efficient computation of iceberg cubes with complex measures

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2001 It is often too expensive to compute and materialize a complete high-dimensional data cube. Computing an iceberg cube, which contains only aggregates above certain thresholds, is an effective way to derive nontrivial multidimensional aggregations for OLAP ... Full text Cite

Scalable frequent-pattern mining methods: an overview

Conference Tutorial notes of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining · 2001 Cite

Pattern growth methods for sequential pattern mining: Principles and extensions

Conference Workshop on Temporal Data Mining, 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01). ACM Press · 2001 Cite

Data mining: concepts and technologies

Journal Article Data Mining Concepts Models Methods & Algorithms · 2001 Cite

FreeSpan: Frequent pattern-projected sequential pattern mining

Conference Proceeding of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2000 Sequential pattern mining is an important data mining problem with broad applications. It is also a difficult problem since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequen ... Cite

Mining Frequent Patterns without Candidate Generation

Conference SIGMOD 2000 - Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data · January 1, 2000 Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. H ... Full text Cite

Can we push more constraints into frequent pattern mining?

Conference Proceeding of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2000 Recent studies show that constraint pushing may substantially improve the performance of frequent pattern mining, and methods have been proposed to incorporate interesting constraints in frequent pattern mining. However, some popularly encountered constrai ... Full text Cite

Mining access patterns efficiently from web logs

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2000 With the explosive growth of data available on the World Wide Web, discovery and analysis of useful information from the World Wide Web becomes a practical necessity. Web access pattern, which is the sequence of accesses pursued by users frequently, is a ... Full text Cite

Mining frequent patterns without candidate generation

Journal Article SIGMOD Record (ACM Special Interest Group on Management of Data) · January 1, 2000 Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. H ... Full text Cite

CLOSET: An efficient algorithm for mining frequent closed itemsets.

Conference ACM SIGMOD workshop on research issues in data mining and knowledge discovery · 2000 Cite

Mining frequent patterns by pattern-growth: methodology and implications

Journal Article ACM SIGKDD explorations newsletter · 2000 Cite

Frequent pattern-projected sequential pattern mining

Journal Article Proc. of the ACM SIGKDD, 2000 · 2000 Cite

Algebra for online analytical processing data cube

Journal Article Acta Metallurgica Sinica (English Letters) · October 1, 1999 Data cube is the central mechanism in multi-dimensional data warehouse and online analytical processing (OLAP) based on multi-dimensional analysis. The algebra for OLAP data cube, including the basic conception, data logic model, important properties and o ... Cite

Genes

Journal Article · 1997 Cite

Online mining changes of clusters in data streams

Journal Article Submitted for publication Cite

Data Mining Techniques for Web Spam Detection

Journal Article Simon Fraser University Microsoft Ad Center Cite

CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets

Conference ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery Cite

DSS 2017

Journal Article Cite

PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth

Conference ICDE'01: Proceedings of the 2001 International Conference on Data Engineering Cite

DSAA 2019

Journal Article Cite

A Fast Algorithm for Subspace Clustering by Pattern Similarity

Conference Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM'04) Cite

BigDataSE 2021

Journal Article Cite