Journal Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · December 2024
The visual question generation (VQG) task aims to generate human-like questions from an image and potentially other side information (e.g., answer type). Previous works on VQG fall in two aspects: i) They suffer from one image to many questions mapping pro ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 25, 2024
With the advent of Bitcoin, a cryptographically-enabled peer-to-peer digital payment system, blockchain together with a whole package of distributed ledger technologies, which serve as the underlying foundation of all the crypto-currencies, have been gaini ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 9, 2024
Recently, the Shapley value, a concept rooted in cooperative game theory, has found more and more applications in databases and machine learning. Due to its combinatoric nature, the computation of the Shapley value is #P-hard. To address this challenge, nu ...
Conference · WWW 2024 - Proceedings of the ACM Web Conference · May 13, 2024
In an era of information explosion, recommender systems are vital tools to deliver personalized recommendations for users. The key of recommender systems is to forecast users' future behaviors based on previous user-item interactions. Due to their strong e ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · April 1, 2024
Fairness in Graph Convolutional Neural Networks (GCNs) becomes a more and more important concern as GCNs are adopted in many crucial applications. Societal biases against sensitive groups may exist in many real world graphs. GCNs trained on those graphs ma ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · January 1, 2024
Graph clustering is essential to understanding the nature and behavior of real-world networks such as social networks, technical networks, and transportation networks. Different from the existing studies, we propose a new Markov clustering method inspired by belief dynami ...
Conference · Proceedings of the VLDB Endowment · January 1, 2024
The growing demand for advanced analytics beyond statistical aggregation calls for database systems that support effective model selection of deep neural networks (DNNs). However, existing model selection strategies are based on either training-based algor ...
Journal Article · IEEE Internet Computing · January 1, 2024
Data markets serve as crucial platforms facilitating data discovery, exchange, sharing, and integration among data users and providers. However, the paramount concern of privacy has predominantly centered on protecting privacy of data owners and third part ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · January 1, 2024
Shapley value provides a unique way to fairly assess each player's contribution in a coalition and has enjoyed many applications. However, the exact computation of Shapley value is #P-hard due to the combinatoric nature of Shapley value. Many existi ...
Conference · Proceedings of Machine Learning Research · January 1, 2024
Large language models (LLMs) have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. This paper introduces TRUSTLLM, a c ...
Journal Article · Proceedings of the VLDB Endowment · January 1, 2024
The Shapley value is widely used for data valuation in data markets. However, explaining the Shapley value of an owner in a data coalition is an unexplored and challenging task. To tackle this, we formulate the problem of finding the counterfactual explana ...
Conference · SIGIR-AP 2023 - Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region · November 26, 2023
Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoder ...
Conference · International Conference on Information and Knowledge Management, Proceedings · October 21, 2023
Online recommender systems (RS) aim to match user needs with the vast amount of resources available on various platforms. A key challenge is to model user preferences accurately under the condition of data sparsity. To address this challenge, some methods ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · October 1, 2023
Fraudulent activities within the U.S. healthcare system cost billions of dollars each year and harm the wellbeing of many qualifying beneficiaries. The implementation of an effective fraud detection method has become imperative to secure the welfare of the ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · October 1, 2023
A large amount of high-dimensional and heterogeneous data appear in practical applications, which are often published to third parties for data analysis, recommendations, targeted advertising, and reliable predictions. However, publishing these data may di ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 6, 2023
Deep learning models are at the core of artificial intelligence research today. There has been a tide of research on deep learning on graphs, or graph neural networks. This wave of research at the intersection of graph theory and deep learning has also influe ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 6, 2023
The field of graph neural networks (GNNs) has seen rapid and incredible strides over the recent years. Graph neural networks, also known as deep learning on graphs, graph representation learning, or geometric deep learning, have become one of the fastest-g ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 6, 2023
To address the big data challenges, serverless multi-party collaborative training has recently attracted attention in the data mining community, since they can cut down the communications cost by avoiding the server node bottleneck. However, traditional se ...
Journal Article · IEEE Transactions on Automatic Control · August 1, 2023
Decentralized optimization, particularly the class of decentralized composite convex optimization (DCCO) problems, has found many applications. Due to ubiquitous communication congestion and random dropouts in practice, it is highly desirable to design dec ...
Conference · Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 · June 27, 2023
Event causality identification (ECI) aims to identify the causal relationship between events, which plays a crucial role in deep text understanding. Due to the diversity of real-world causality events and difficulty in obtaining sufficient training data, e ...
Conference · Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 · June 27, 2023
Although great progress has been made for Machine Reading Comprehension (MRC) in English, scaling out to a large number of languages remains a huge challenge due to the lack of large amounts of annotated training data in non-English languages. To address t ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · June 1, 2023
Modeling time-evolving preferences of users with their sequential item interactions, has attracted increasing attention in many online applications. Hence, sequential recommender systems have been developed to learn the dynamic user interests from the hist ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · June 1, 2023
Graph neural networks (GNNs) are emerging machine learning models on graphs. Permutation-equivariance and proximity-awareness are two important properties highly desirable for GNNs. Both properties are needed to tackle some challenging graph problems, such ...
Journal Article · Engineering · June 1, 2023
In recent years, data has become one of the most important resources in the digital economy. Unlike traditional resources, the digital nature of data makes it difficult to value and contract. Therefore, establishing an efficient and standard data-transacti ...
Conference · ACM Web Conference 2023 - Proceedings of the World Wide Web Conference, WWW 2023 · April 30, 2023
Offline policy evaluation (OPE) aims to accurately estimate the performance of a hypothetical policy using only historical data, which has drawn increasing attention in a wide range of applications including recommender systems and personalized medicine. W ...
Journal Article · ACM Transactions on Database Systems · March 14, 2023
Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or concealing interesting patterns from ev ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · March 1, 2023
Graph Neural Networks (GNNs) are emerging machine learning models on graphs. Although sufficiently deep GNNs are shown theoretically capable of fully preserving graph structures, most existing GNN models in practice are shallow and essentially feature-cent ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · January 1, 2023
Differential privacy provides strong privacy preservation guarantee in information sharing. As social network analysis has been enjoying many applications, it opens a new arena for applications of differential privacy. This article presents a comprehensive ...
Conference · Proceedings - International Conference on Data Engineering · January 1, 2023
With the prevalence of data-driven research, data valuation has attracted attention from the computer science field. How to appraise a single datum becomes an imperative problem, especially in the context of machine learning. Shapley value is widely used t ...
Conference · Proceedings - International Conference on Data Engineering · January 1, 2023
Social recommender systems have drawn a lot of attention in many online web services, because of the incorporation of social information between users in improving recommendation results. Despite the significant progress made by existing solutions, we argu ...
Conference · Proceedings of Machine Learning Research · January 1, 2023
Extant causal methods exclusively exploit the heterogeneity based on the observed covariates for heterogeneous outcome prediction. Even with nowadays big data, the collected covariates may not contain complete confounders. When some confounders are absent, ...
Conference · Proceedings of the VLDB Endowment · January 1, 2023
The markets for data and AI models are rapidly emerging and increasingly significant in the realm and the practices of data science and artificial intelligence. These markets are being studied from diverse perspectives, such as e-commerce, economics, machi ...
Conference · Proceedings of the Annual Meeting of the Association for Computational Linguistics · January 1, 2023
Currently, learning better unsupervised sentence representations is the pursuit of many natural language processing communities. Lots of approaches based on pre-trained language models (PLMs) and contrastive learning have achieved promising results on this ...
Conference · Proceedings of the Annual Meeting of the Association for Computational Linguistics · January 1, 2023
Multilingual language models trained using various pre-training tasks like mask language modeling (MLM) have yielded encouraging results on a wide range of downstream tasks. Despite the promising performances, structural knowledge in cross-lingual corpus i ...
Conference · Proceedings of Machine Learning Research · January 1, 2023
Recent works have demonstrated the benefits of capturing long-distance dependency in graphs by deeper graph neural networks (GNNs). But deeper GNNs suffer from the long-lasting scalability challenge due to the neighborhood explosion problem in large-scale ...
Conference · Proceedings - 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023 · January 1, 2023
Pneumocystosis remains a life-threatening disease with a high mortality rate. It's critical to understand its clinical course and risk factors for better disease management. In this retrospective analysis, we aimed to elucidate the prognostic determinants ...
Conference · International Conference on Information and Knowledge Management, Proceedings · October 17, 2022
Learning on graphs (LOG) plays a pivotal role in various high-impact application domains. The past decades have developed tremendous theories, algorithms, and open-source systems in answering what/who questions on graphs. However, recent studies reveal tha ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · October 1, 2022
Data are invaluable. How can we assess the value of data objectively, systematically and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, marketing, electron ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2022
Graph neural networks (GNN) are powerful tools in many web research problems. However, existing GNNs are not fully suitable for many real-world web applications. For example, over-smoothing may affect personalized recommendations and the lack of an explana ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2022
Deep learning models are at the core of artificial intelligence research today. There has been a tide of research on deep learning on graphs, or graph neural networks. This wave of research at the intersection of graph theory and deep learning has also influe ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2022
The field of graph neural networks (GNNs) has seen rapid and incredible strides over the recent years. Graph neural networks, also known as deep learning on graphs, graph representation learning, or geometric deep learning, have become one of the fastest-g ...
Journal Article · Knowledge and Information Systems · June 1, 2022
Machine learning is disruptive. At the same time, machine learning can only succeed by collaboration among many parties in multiple steps naturally as pipelines in an eco-system, such as collecting data for possible machine learning applications, collabora ...
Conference · WWW 2022 - Proceedings of the ACM Web Conference 2022 · April 25, 2022
Conversational recommendation system (CRS) is able to obtain fine-grained and dynamic user preferences based on interactive dialogue. Previous CRS assumes that the user has a clear target item, which often deviates from the real scenario, that is for many ...
Conference · WWW 2022 - Proceedings of the ACM Web Conference 2022 · April 25, 2022
The self-supervised graph representation learning has achieved much success in recent web based research and applications, such as recommendation system, social networks, and anomaly detection. However, existing works suffer from two problems. Firstly, in ...
Conference · WSDM 2022 - Proceedings of the 15th ACM International Conference on Web Search and Data Mining · February 11, 2022
Predicting the next interaction of a short-term interaction session is a challenging task in session-based recommendation. Almost all existing works rely on item transition patterns, and neglect user historical sessions while modeling user preference, whic ...
Journal Article · Journal of Computational and Graphical Statistics · January 1, 2022
Methodologies for functional principal component analysis are well established in the one-dimensional setting. However, for two-dimensional surfaces, for example, images, conducting functional principal component analysis is complicated and challenging, be ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2022
In the real world, the frequency of occurrence of objects is naturally skewed forming long-tail class distributions, which results in poor performance on the statistically rare classes. A promising solution is to mine tail-class examples to balance the tra ...
Journal Article · Journal of Machine Learning Research · January 1, 2022
In the paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both nonconvex mini-optimization and minimax-optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method for black-box min ...
Conference · Proceedings of the VLDB Endowment · January 1, 2022
In many applications, an organization may want to acquire data from many data owners. Data marketplaces allow data owners to produce data assemblage needed by data buyers through coalition. To encourage coalitions to produce data, it is critical to allocat ...
Chapter · January 1, 2022
Data Mining: Concepts and Techniques, Fourth Edition introduces concepts, principles, and methods for mining patterns, knowledge, and models from various kinds of data for diverse applications. Specifically, it delves into the processes for uncovering patt ...
Conference · CEUR Workshop Proceedings · January 1, 2022
Popular book and movie recommendation datasets can be associated with Knowledge Graphs (KG) that enable the development of KG-based recommender systems. However, most of these approaches are based on Collaborative Filtering, leaving Content-based Filtering ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2022
There is some recent research interest in algorithmic fairness for biased data. There are a variety of pre-, in-, and post-processing methods designed for this problem. However, these methods are exclusively targeting data unfairness and algorithmic unfair ...
Conference · Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 · January 1, 2022
Despite the great success of spoken language understanding (SLU) in high-resource languages, it remains challenging in low-resource languages mainly due to the lack of labeled training data. The recent multilingual code-switching approach achieves better a ...
Conference · Findings of the Association for Computational Linguistics: EMNLP 2022 · January 1, 2022
Recent multilingual pre-trained models have shown better performance in various multilingual tasks. However, these models perform poorly on multilingual retrieval tasks due to lacking multilingual training data. In this paper, we propose to mine and genera ...
Book · January 1, 2022
Deep Learning models are at the core of artificial intelligence research today. It is well known that deep learning techniques are disruptive for Euclidean data, such as images or sequence data, and not immediately applicable to graph-structured data such ...
Conference · Advances in Neural Information Processing Systems · January 1, 2022
Graph Contrastive Learning (GCL), learning the node representations by augmenting graphs, has attracted considerable attentions. Despite the proliferation of various graph augmentation strategies, some fundamental questions still remain unclear: what infor ...
Journal Article · Knowledge and Information Systems · October 1, 2021
Model complexity is a fundamental problem in deep learning. In this paper, we conduct a systematic overview of the latest studies on model complexity in deep learning. Model complexity of deep learning can be categorized into expressive capacity and effect ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021
Vertical federated learning (VFL) is an effective paradigm of training the emerging cross-organizational (e.g., different corporations, companies and organizations) collaborative learning with privacy preserving. Stochastic gradient descent (SGD) methods a ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021
Named entity recognition (NER) is a fundamental component in many applications, such as Web Search and Voice Assistants. Although deep neural networks greatly improve the performance of NER, due to the requirement of large amounts of training data, deep ne ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021
Deep learning models are at the core of artificial intelligence research today. There has been a tide of research on deep learning on graphs, or graph neural networks. This wave of research at the intersection of graph theory and deep learning has also influe ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021
Data is one of the most critical resources in the AI Era. While substantial research has been dedicated to training machine learning models using various types of data, much less efforts have been invested in the exploration of assessing and governing data ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021
Federated learning has become increasingly popular as it facilitates collaborative training of machine learning models among multiple clients while preserving their data privacy. In practice, one major challenge for federated learning is to achieve fairnes ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021
In many industry scale applications, large and resource consuming machine learning models reside in powerful cloud servers. At the same time, large amounts of input data are collected at the edge of cloud. The inference results are also communicated to use ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 14, 2021
Today's computing is characterized by an increasing degree of complexity, comprehensiveness and collaboration. The complexity can be observed by the wide application of gigantic models with a huge number of parameters and structures of an unprecedented lev ...
Conference · WSDM 2021 - Proceedings of the 14th ACM International Conference on Web Search and Data Mining · August 3, 2021
Lack of training data in low-resource languages presents huge challenges to sequence labeling tasks such as named entity recognition (NER) and machine reading comprehension (MRC). One major obstacle is the errors on the boundary of predicted answers. To ta ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · July 1, 2021
Skyline computation, aiming at identifying a set of skyline points that are not dominated by any other point, is particularly useful for multi-criteria data analysis and decision making. Traditional skyline computation, however, is inadequate to answer que ...
Journal Article · VLDB Journal · July 1, 2021
Visual information plays a critical role in human decision-making process. Recent developments on visually aware recommender systems have taken the product image into account. We argue that the aesthetic factor is very important in modeling and predicting ...
Conference · Proceedings - International Conference on Data Engineering · April 1, 2021
This paper seeks to answer one important but unexplored question for Entity Matching (EM): can we develop a good machine learning pipeline automatically for the EM task? If yes, to what extent the process can be automated? To answer this question, we find ...
Conference · Proceedings - International Conference on Data Engineering · April 1, 2021
k nearest neighbor (kNN) queries and skyline queries are important operators on multi-dimensional data points. Given a query point, kNN returns the k nearest neighbors based on a scoring function such as a weighted sum of the attributes, which requires pre ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · March 1, 2021
Influence analysis aims at detecting influential vertices in networks and utilizing them in cost-effective business strategies. Influence analysis in large-scale networks is a key technique in many important applications ranging from viral marketing and on ...
Journal Article · Journal of Medical Imaging (Bellingham, Wash.) · March 2021
Methods: Alzheimer's disease (AD) is a worldwide prevalent age-related neurodegenerative disease with no available cure yet. Early prognosis is therefore crucial for planning proper clinical intervention. It is especially true for people diagnosed w ...
Conference · Advances in Neural Information Processing Systems · January 1, 2021
Massive deployment of Graph Neural Networks (GNNs) in high-stake applications generates a strong demand for explanations that are robust to noise and align well with human intuition. Most existing methods generate explanations by identifying a subgraph of ...
Conference · 35th AAAI Conference on Artificial Intelligence, AAAI 2021 · January 1, 2021
Non-IID data present a tough challenge for federated learning. In this paper, we explore a novel idea of facilitating pairwise collaborations between clients with similar data. We propose FedAMP, a new method employing federated attentive message passing t ...
Conference · 35th AAAI Conference on Artificial Intelligence, AAAI 2021 · January 1, 2021
Accurate user and item embedding learning is crucial for modern recommender systems. However, most existing recommendation techniques have thus far focused on modeling users’ preferences over singular type of user-item interactions. Many practical recommen ...
Conference · EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings · January 1, 2021
Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages. Although various data augmentation approaches have been proposed to synthesize training data in low-resource target languages, th ...
Conference · Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 · January 1, 2021
Script reasoning infers subsequent events from a given event chain, which involves the ability to understand relations between events. A human-labeled script reasoning dataset is usually of small size with limited event relations, which highlights the nece ...
Conference · Proceedings of the VLDB Endowment · January 1, 2021
Data-driven machine learning (ML) has witnessed great success across a variety of application domains. Since ML model training relies on a large amount of data, there is a growing demand for high-quality data to be collected for ML model training. Data mar ...
Conference · Proceedings of the VLDB Endowment · January 1, 2021
Blockchain technology has emerged as the cornerstone of many decentralized applications operating among otherwise untrusted peers. However, it is well known that existing blockchain systems do not scale well. Transactions are often executed and committed s ...
Conference · ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference · January 1, 2021
Procedural text understanding aims at tracking the states (e.g., create, move, destroy) and locations of the entities mentioned in a given paragraph. To effectively track the states and locations, it is essential to capture the rich semantic relations betw ...
Conference · Proceedings of the VLDB Endowment · January 1, 2021
The Kolmogorov-Smirnov (KS) test is popularly used in many applications, such as anomaly detection, astronomy, database security and AI systems. One challenge that remains untouched is how we can obtain an explanation on why a test set fails the KS test. In th ...
Conference · Proceedings of the IEEE International Conference on Computer Vision · January 1, 2021
Interpreting the decision logic behind effective deep convolutional neural networks (CNN) on images complements the success of deep learning models. However, the existing methods can only interpret some specific decision logic on individual or a small numb ...
Conference · 35th AAAI Conference on Artificial Intelligence, AAAI 2021 · January 1, 2021
In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation transfers knowledge ...
Journal Article · Proceedings of the VLDB Endowment · January 1, 2021
Data-driven machine learning has become ubiquitous. A marketplace for machine learning models connects data owners and model buyers, and can dramatically facilitate data-driven machine learning applications. In this paper, we take a formal data marketplace ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · January 1, 2021
Skyline queries are important in many application domains. In this paper, we propose a novel structure Skyline Diagram, which given a set of points, partitions the plane into a set of regions, referred to as skyline polyominos. All query points in the same ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020
Graph is a natural representation encoding both the features of the data samples and relationships among them. Analysis with graphs is a classic topic in data mining and many techniques have been proposed in the past. In recent years, because of the rapid ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020
Data are invaluable. How can we assess the value of data objectively and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, data management, data mining, elect ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020
Training and refreshing a web-scale Question Answering (QA) system for a multi-lingual commercial search engine often requires a huge amount of training examples. One principled idea is to mine implicit relevance feedback from user behavior recorded in sea ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020
Graph Convolutional Networks (GCNs) have gained great popularity in tackling various analytics tasks on graph and network data. However, some recent studies raise concerns about whether GCNs can optimally integrate node features and topological structures ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 23, 2020
It is fundamental to measure model complexity of deep neural networks. A good model complexity measure can help to tackle many challenging problems, such as overfitting detection, model selection, and performance improvement. The existing literature on mod ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · July 1, 2020
Skyline, aiming at finding a Pareto optimal subset of points in a multi-dimensional dataset, has gained great interest due to its extensive use for multi-criteria analysis and decision making. The skyline consists of all points that are not dominated by an ...
Journal Article · Knowledge and Information Systems · July 1, 2020
This article introduces and solves a spatial keyword cover problem (SK-Cover for short), which aims to identify the group of spatio-textual objects covering all the keywords in a query and minimizing a distance cost function that leads to fewer objects in ...
Conference · IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops · June 1, 2020
In this paper, we propose a simple yet effective framework, named LightTrack, for online human pose tracking. Existing methods usually perform human detection, pose estimation and tracking in sequential stages, where pose tracking is regarded as an offline ...
Journal Article · ACM Transactions on Knowledge Discovery from Data · May 8, 2020
Imagine we are introducing a new product through a social network, where we know for each user in the network the function of purchase probability with respect to discount. Then, what discounts should we offer to those social network users so that, under a ...
Conference · Proceedings - International Conference on Data Engineering · April 1, 2020
More and more AI services are provided through APIs on cloud where predictive models are hidden behind APIs. To build trust with users and reduce potential application risk, it is important to interpret how such predictive models hidden behind APIs make th ...
Conference · 37th International Conference on Machine Learning, ICML 2020 · January 1, 2020
In the paper, we propose a class of efficient momentum-based policy gradient methods for the model-free reinforcement learning, which use adaptive learning rates and do not require any large batches. Specifically, we propose a fast important-sampling momen ...
Conference · IJCAI International Joint Conference on Artificial Intelligence · January 1, 2020
This paper introduces a novel Robust Regression (RR) model, named Sinkhorn regression, which imposes Sinkhorn distances on both the loss function and the regularization. Traditional RR methods aim to search for an element-wise loss function (e.g., Lp-norm) t ...
Conference · Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition · January 1, 2020
In this paper, we address the problem of compression and acceleration of Convolutional Neural Networks (CNNs). Specifically, we propose a novel structural pruning method to obtain a compact CNN with strong discriminative power. To find such netwo ...
Conference · Proceedings of the VLDB Endowment · January 1, 2020
Given a temporal weighted graph that consists of a potentially endless stream of updates, we are interested in finding density bursting subgraphs (DBS for short), where a DBS is a subgraph that accumulates its density at the fastest speed. Online DBS detec ...
Journal Article · World Wide Web · January 1, 2020
In many real world networks, a vertex is usually associated with a transaction database that comprehensively describes the behaviour of the vertex. A typical example is a social network, where the behaviours of every user are depicted by a transaction data ...
Conference · International Conference on Information and Knowledge Management, Proceedings · November 3, 2019
We present SkyRec (Skyline Recommender), a recommendation toolkit for finding optimal groups based on the notion of group skyline. Skyline computation, aiming at identifying a set of skyline points that are not dominated by any other point, is particularly ...
Conference · International Conference on Information and Knowledge Management, Proceedings · November 3, 2019
Tracking influential users in a dynamic social network is a fundamental step in fruitful applications, such as social recommendation, network topology optimization, and blocking rumour spreading. The major obstacle in mining top influential users is that e ...
Journal Article · Data Mining and Knowledge Discovery · September 1, 2019
The effectiveness of classification methods relies largely on the correctness of instance labels. In real applications, however, the labels of instances are often not highly reliable due to the presence of label noise. Training effective classifiers in the ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019
Network embedding (NE) aims to embed the nodes of a network into a vector space, and serves as the bridge between machine learning and network data. Despite their widespread success, NE algorithms typically contain a large number of hyperparameters for pre ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019
Arguably, every entity in this universe is networked in one way or another. With the prevalence of network data collected, such as social media and biological networks, learning from networks has become an essential task in many applications. It is well re ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019
Network embedding, which learns a low-dimensional representation for each node of a network to benefit downstream tasks such as node classification, link prediction, and network visualization, has attracted increasing attention in recent years. Es ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019
We propose a novel data-driven approach for solving multi-horizon probabilistic forecasting tasks that predicts the full distribution of a time series on future horizons. We illustrate that temporal patterns hidden in historical information play an importa ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019
Semi-Supervised Support Vector Machine (S3VM) is one of the most popular methods for semi-supervised learning. To avoid the trivial solution of classifying all the unlabeled examples into the same class, a balancing constraint is often used with S3VM (denoted as ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 25, 2019
Graph convolutional neural networks have attracted increasing attention in recent years. Unlike the standard convolutional neural network, graph convolutional neural networks perform the convolutional operation on the graph data. Compared with the generic ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · July 2019
Outsourcing data and computation to a cloud server provides a cost-effective way to support large-scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the clo ...
Journal Article · VLDB Journal · June 1, 2019
Given a graph, how can we quantify similarity between two nodes in an effective and scalable way? SimRank is an attractive measure of pairwise similarity based on graph topologies. Its underpinning philosophy that “two nodes are similar if they are pointed ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · May 1, 2019
Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure. Recently, significant progress has been made toward this emerging network analysis paradigm. In this survey, we ...
Conference · Proceedings of the VLDB Endowment · January 1, 2019
Given a database network where each vertex is associated with a transaction database, we are interested in finding theme communities. Here, a theme community is a cohesive subgraph such that a common pattern is frequent in all transaction databases associa ...
Conference · NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference · January 1, 2019
Consumers dissatisfied with the normal dispute resolution process provided by an ecommerce company's customer service agents have the option of escalating their complaints by filing grievances with a government authority. This paper tackles the challenge o ...
Conference · 36th International Conference on Machine Learning, ICML 2019 · January 1, 2019
Dropout is a popular technique to train large-scale deep neural networks to alleviate the overfitting problem. To disclose the underlying reason for its gain, numerous works have tried to explain it from different perspectives. In this paper, unlike existi ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · December 27, 2018
In some applications on time series data, finding linear correlation between time series is important. However, it is meaningless to measure the global correlation between two long time series. Moreover, more often than not, two time series may be correlat ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · November 1, 2018
Network embedding, aiming to embed a network into a low dimensional vector space while preserving the inherent structural properties of the network, has attracted considerable attention. However, most existing embedding methods focus on the static network ...
Conference · Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018 · October 24, 2018
Skyline queries are important in many application domains. In this paper, we propose a novel structure Skyline Diagram, which given a set of points, partitions the plane into a set of regions, referred to as skyline polyominos. All query points in the same ...
Conference · Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018 · October 24, 2018
Dense subgraph discovery is a key primitive in many graph mining applications, such as detecting communities in social networks and mining gene correlation from biological data. Most studies on dense subgraph mining only deal with one graph. However, in ma ...
Journal Article · Knowledge and Information Systems · August 1, 2018
Clustering has been widely used to identify possible structures in data and help users to understand data in an unsupervised manner. Traditional clustering methods often provide a single partitioning of the data that groups similar data objects in one grou ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 19, 2018
Factorization Machine (FM) is a supervised machine learning model for feature engineering, which is widely used in many real-world applications. In this paper, we consider the case that the data samples arrive sequentially. The existing convex formulation ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 19, 2018
Network embedding has received increasing research attention in recent years. The existing methods show that the high-order proximity plays a key role in capturing the underlying structure of the network. However, two fundamental problems in preserving the ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · July 19, 2018
Strong intelligent machines powered by deep neural networks are increasingly deployed as black boxes to make decisions in risk-sensitive domains, such as finance and medicine. To reduce potential risk and build trust with users, it is critical to interpret ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · May 27, 2018
Interactive analytics requires database systems to be able to answer aggregation queries within interactive response times. As the amount of data is continuously growing at an unprecedented rate, this is becoming increasingly challenging. In the past, the ...
Conference · The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018 · April 10, 2018
Factorization Machine (FM) is a supervised learning approach with a powerful capability of feature engineering. It yields state-of-the-art performances in various batch learning tasks where all the training data is made available prior to the training. How ...
Conference · Proceedings of the VLDB Endowment · January 1, 2018
Nowadays, crowdsourcing is being widely used to collect training data for solving classification problems. However, crowdsourced labels are often noisy, and there is a performance gap between classification with noisy labels and classification with ground- ...
Conference · 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 · January 1, 2018
Singular Value Decomposition (SVD) is a popular approach in various network applications, such as link prediction and network parameter characterization. Incremental SVD approaches are proposed to process newly changed nodes and edges in dynamic networks. ...
Conference · Proceedings - 2017 IEEE International Conference on Information Reuse and Integration, IRI 2017 · November 8, 2017
In many applications, such as data integration and big data analytics, one has to integrate data from multiple sources without detailed and accurate schema information. The state of the art focuses on matching attributes among sources based on the informat ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · November 1, 2017
In this paper, we tackle a challenging problem inherent in a series of applications: tracking the influential nodes in dynamic networks. Specifically, we model a dynamic network as a stream of edge weight updates. This general model embraces many practical ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · November 1, 2017
In a social network, different users may be excited to different degrees even by the same information. If we want to spread a piece of new information and maximize the expected total amount of excitement, which seed users should we choose? This problem ...
Journal Article · Knowledge and Information Systems · October 1, 2017
In many applications, we need to measure similarity between nodes in a large network based on features of their neighborhoods. Although in-network node similarity based on proximity has been well investigated, surprisingly, measuring in-network node simila ...
Conference · Proceedings - 2017 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017 · August 23, 2017
Similarity join, which can find similar objects (e.g., products, names, addresses) across different sources, is powerful in dealing with variety in big data, especially web data. Threshold-driven similarity join, which has been extensively studied in the p ...
Conference · Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017 · July 31, 2017
Given a graph, can we find a set of patterns, of which the cost of storing these patterns is economic (or satisfying specific user needs) but their coverage includes the entire graph? We denote these patterns by principal patterns of the given graph since ...
Journal Article · Knowledge and Information Systems · June 1, 2017
Multi-clustering, which tries to find multiple independent ways to partition a data set into groups, has enjoyed many applications, such as customer relationship management, bioinformatics and healthcare informatics. This paper addresses two fundamental qu ...
Conference · Proceedings - International Conference on Data Engineering · April 2017
Outsourcing data and computation to a cloud server provides a cost-effective way to support large-scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the clo ...
Conference · Proceedings of the VLDB Endowment · January 1, 2017
Semantic trajectory pattern mining is becoming more and more important with the rapidly growing volumes of semantically rich trajectory data. Extracting sequential patterns in semantic trajectories plays a key role in understanding semantic behaviour of hu ...
Conference · 31st AAAI Conference on Artificial Intelligence, AAAI 2017 · January 1, 2017
Network embedding, aiming to learn the low-dimensional representations of nodes in networks, is of paramount importance in many real applications. One basic requirement of network embedding is to preserve the structure and inherent properties of the networ ...
Journal Article · Intelligent Data Analysis · January 1, 2017
Benchmarking is among the most widely adopted practices in business today. However, to the best of our knowledge, conducting multidimensional benchmarking in data warehouses has not been explored from a technical efficiency perspective. In this paper, we f ...
Journal Article · International Journal of Data Warehousing and Mining · January 1, 2017
Benchmarking analysis has been used extensively in industry for business analytics. Surprisingly, how to conduct benchmarking analysis efficiently over large data sets remains a technical problem untouched. In this paper, the authors formulate benchmark qu ...
Journal Article · ACM Transactions on Knowledge Discovery from Data · December 1, 2016
Feature selection is important in many big data applications. Two critical challenges are closely associated with big data. First, in many big data applications, the dimensionality is extremely high, in millions, and keeps growing. Second, big data applications ...
Conference · Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2016 · November 21, 2016
Extracting dense subgraphs is an important step in many graph related applications. There is a challenging struggle in exploring the tradeoffs between density and size in subgraphs extracted. More often than not, different methods aim at different specific ...
Journal Article · IEEE Transactions on Visualization and Computer Graphics · November 2016
We present an online visual analytics approach to helping users explore and understand hierarchical topic evolution in high-volume text streams. The key idea behind this approach is to identify representative topics in incoming documents and align them wit ...
Journal Article · Data Mining and Knowledge Discovery · November 1, 2016
We address the problem of outlying aspects mining: given a query object and a reference multidimensional data set, how can we discover what aspects (i.e., subsets of features or subspaces) make the query object most outlying? Outlying aspects mining can be ...
Conference · International Conference on Information and Knowledge Management, Proceedings · October 24, 2016
Traffic prediction, particularly in urban regions, is an important application of tremendous practical value. In this paper, we report a novel and interesting case study of urban traffic prediction in Central, Hong Kong, one of the densest urban areas in t ...
Journal Article · Knowledge and Information Systems · September 1, 2016
In this paper, we study a novel problem of continuous similarity search for evolving queries. Given a set of objects, each being a set or multiset of items, and a data stream, we want to continuously maintain the top-k most similar objects using the last n ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 13, 2016
Given a signed network where edges are weighted by real numbers, and positive weights indicate cohesion between vertices and negative weights indicate opposition, we are interested in finding k-Oppositive Cohesive Groups (k-OCG). Each k-OCG is a group of k ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 13, 2016
Research issues and data mining techniques for product recommendation and viral marketing have been widely studied. Existing works on seed selection in social networks do not take into account the effect of product recommendations in e-commerce stores. In ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 13, 2016
Graph embedding algorithms embed a graph into a vector space where the structure and the inherent properties of the graph are preserved. The existing graph embedding methods cannot preserve the asymmetric transitivity well, which is a critical property of ...
Journal Article · Computer · July 1, 2016
This installment of Computer's series highlighting the work published in IEEE Computer Society journals comes from IEEE Transactions on Affective Computing and IEEE Transactions on Knowledge and Data Engineering. ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 26, 2016
Imagine we are introducing a new product through a social network, where we know for each user in the network the purchase probability curve with respect to discount. Then, what discount should we offer to those social network users so that the adoption of ...
Conference · 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016 · June 22, 2016
The existing works on spatial keyword search focus on finding a group of spatial objects covering all the query keywords and minimizing the diameter of the group. However, we observe that such a formulation may not address what users need in some applicati ...
Journal Article · Knowledge and Information Systems · April 1, 2016
We tackle the novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes (Formula presented.) and (Formula presented.) and a query object (Formula presented.) , we want to find the top- (Formula presented.) subspaces ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · January 5, 2016
Multi-clustering, which tries to find multiple independent ways to partition a data set into groups, has enjoyed many applications, such as customer relationship management, bioinformatics and healthcare informatics. This paper addresses two fundamental qu ...
Journal Article · Data Mining and Knowledge Discovery · September 22, 2015
When we are investigating an object in a data set, which itself may or may not be an outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we identify the novel problem of mining outlying aspects on numeric data. Given a q ...
Journal Article · Intelligent Data Analysis · September 8, 2015
A wide range of methods have been proposed for detecting different types of outliers in both the full attribute space and its subspaces. However, the interpretability of outliers, that is, explaining in what ways and to what extent an object is an outlier, ...
Conference · Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2015 · August 25, 2015
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 10, 2015
Reliable tornado forecasting with a long-lead time can greatly support emergency response and is of vital importance for the economy and society. The large number of meteorological variables in spatiotemporal domains and the complex relationships among var ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 10, 2015
More often than not, people are active in more than one social network. Identifying users from multiple heterogeneous social networks and integrating the different networks is a fundamental issue in many applications. The existing methods tackle this probl ...
Journal Article · IEEE Transactions on Computers · August 1, 2015
Cloud computing provides promising scalable IT infrastructure to support various processing of a variety of big data applications in sectors such as healthcare and business. Data sets like electronic health records in such applications often contain privac ...
Journal Article · ACM Transactions on Knowledge Discovery from Data · June 1, 2015
Many datasets from real-world applications have very high-dimensional or increasing feature space. It is a new research problem to learn and maintain a classifier to deal with very high dimensionality or streaming features. In this article, we adapt the we ...
Conference · Proceedings - International Conference on Data Engineering · May 26, 2015
Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results or concealing interesting patterns fr ...
Journal Article · Journal of Interactive Marketing · May 1, 2015
Although search advertising has gained popularity in recent years, research on the content of search advertising is scarce. This study develops a conceptual framework to understand how market competition affects what a firm advertises in its search ads. Se ...
Conference · Proceedings of the VLDB Endowment · January 1, 2015
Detecting dominant clusters is important in many analytic applications. The state-of-the-art methods find dense subgraphs on the affinity graph as dominant clusters. However, the time and space complexities of those methods are dominated by the constructio ...
Conference · EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings · January 1, 2015
This paper studies the problem of mining frequent co-occurrence patterns across multiple data streams, which has not been addressed by existing works. A co-occurrence pattern in this context refers to the case where the same group of objects appears consecutiv ...
Conference · EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings · January 1, 2015
Driven by many applications, in this paper we study the problem of computing the top-k shortest paths from one set of target nodes to another set of target nodes in a graph, namely the top-k shortest path join (KPJ) between two sets of target nodes. While ...
Conference · Proceedings - 2015 IEEE International Conference on Data Science and Data Intensive Systems; 8th IEEE International Conference Cyber, Physical and Social Computing; 11th IEEE International Conference on Green Computing and Communications and 8th IEEE International Conference on Internet of Things, DSDIS/CPSCom/GreenCom/iThings 2015 · January 1, 2015
Chapter · January 1, 2015
Skyline computation, aiming at identifying a set of skyline points that are not dominated by any other point, is particularly useful for multi-criteria data analysis and decision making. Traditional skyline computation, however, is inadequate to answer que ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2015
Early classification on multivariate time series has recently emerged as a novel and important topic in data mining fields with wide applications such as early detection of diseases in healthcare domains. Most of the existing studies on this topic focused ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2015
In outlying aspects mining, given a query object, we aim to answer the question as to what features make the query most outlying. The most recent works tackle this problem using two different strategies. (i) Feature selection approaches select the features ...
Conference · CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management · November 3, 2014
Within-Network Classification (WNC) techniques are designed for applications where objects to be classified and those with known labels are interlinked. For WNC tasks like web page classification, the homophily principle succeeds by assuming that linked ob ...
Conference · CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management · November 3, 2014
Load curve data in power systems refers to users' electrical energy consumption data periodically collected with meters. It has become one of the most important assets for modern power systems. Many operational decisions are made based on the information d ...
Journal Article · World Wide Web · November 1, 2014
Detecting malicious URLs is an essential task in network security intelligence. In this paper, we make two new contributions beyond the state-of-the-art methods on malicious URL detection. First, instead of using any pre-defined features or fixed delimiter ...
Conference · ASONAM 2014 - Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining · October 10, 2014
Reblogging, also known as retweeting in Twitter parlance, is a major type of activity in many online social networks. Although there are many studies on reblogging behaviors and potential applications, whether neighbors who are well connected with each o ...
Journal Article · Knowledge and Information Systems · October 1, 2014
Email is one of the most popular forms of communication nowadays, mainly due to its efficiency, low cost, and compatibility of diversified types of information. In order to facilitate better usage of emails and explore business potentials in emailing, vari ...
Chapter · July 1, 2014
Mining frequent patterns has been a focused topic in data mining research in recent years, with the development of numerous interesting algorithms for mining association, correlation, causality, sequential patterns, partial periodicity, constraint-based fr ...
Journal Article · World Wide Web · May 1, 2014
Many applications see huge demands of finding important changing areas in evolving graphs. In this paper, given a series of snapshots of an evolving graph, we model and develop algorithms to capture the most frequently changing component (MFCC). Motivated ...
Conference · SIAM International Conference on Data Mining 2014, SDM 2014 · January 1, 2014
Substring matching is fundamental to data mining methods for sequential data. It involves checking the existence of a short subsequence within a longer sequence, ensuring no gaps within a match. Whilst a large amount of existing work has focused on substri ...
ConferenceProceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2014
Feature selection is important in many big data applications. There are at least two critical challenges. Firstly, in many applications, the dimensionality is extremely high, in millions, and keeps growing. Secondly, feature selection has to be highly scal ...
ConferenceProceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2014
Many real-world networks are featured with dynamic changes, such as new nodes and edges, and modification of the node content. Because changes are continuously introduced to the network in a streaming fashion, we refer to such dynamic networks as streaming ...
ConferenceSIAM International Conference on Data Mining 2014, SDM 2014 · January 1, 2014
Given a large photo collection without domain knowledge (e.g., tourism photos, conference photos, event photos, images wrapped from webpages), it is not easy for human beings to organize or only view them within a reasonable time. In this paper, we propose ...
ConferenceLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2014
Let D be a long input string of n characters (from an alphabet of size up to 2^w, where w is the number of bits in a machine word). Given a substring q of D, a shortest unique query returns a shortest unique substring of D that contains q. We present an opt ...
ConferenceProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2014
Distance metric learning (DML) aims to learn a distance metric better than Euclidean distance. It has been successfully applied to various tasks, e.g., classification, clustering and information retrieval. Many DML algorithms suffer from the over-fitting p ...
Journal ArticleJournal of Computer and System Sciences · January 1, 2014
It is well known that processing big graph data can be costly on Cloud. Processing big graph data introduces complex and multiple iterations that raise challenges such as parallel memory bottlenecks, deadlocks, and inefficiency. To tackle the challenges, w ...
ConferenceLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2014
In this paper, we tackle a novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes C+ and C- and a query object o, we want to find top-k subspaces S that maximize the ratio of likelihood of o in C+ against that i ...
ConferenceLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2014
Often, a data object described by many features can be naturally decomposed into multiple "views", where each view consists of a subset of features. For example, a video clip may have a video view and an audio view. Given a set of training data objects wit ...
ConferenceLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2014
Clustering in graphs aims to group vertices with similar patterns of connections. Applications include discovering communities and latent structures in graphs. Many algorithms have been proposed to find graph clusterings, but an open problem is the need fo ...
Journal ArticleIEEE Transactions on Knowledge and Data Engineering · January 1, 2014
2013 marked a wonderful year for IEEE TKDE (Transactions on Knowledge and Data Engineering). While the statistics for November and December 2013 were not available when this editorial was written, TKDE received 822 submissions in the first 10 months of 201 ...
Journal ArticleIEEE Transactions on Knowledge and Data Engineering · January 1, 2014
In this paper, we tackle a novel problem of ranking multivalued objects, where an object has multiple instances in a multidimensional space, and the number of instances per object is not fixed. Given an ad hoc scoring function that assigns a score to a mul ...
ConferenceProceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2013
Unlike advertising in traditional media, web search advertising content can be easily customized with little cost. In this paper, we apply content analysis and regression models on 11,818 unique ads related to the accommodation industry to empirically inve ...
ConferenceProceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2013
Recent developments in the frequent pattern mining framework use additional measures of interest to reduce the set of discovered patterns. We introduce a rigorous and efficient approach to mine statistically significant, unexpected patterns in sequences o ...
ConferenceProceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2013
Uncertainty is common in real-world applications, for example, in sensor networks and moving object tracking, resulting in much interest in item set mining for uncertain transaction databases. In this paper, we focus on pattern mining for uncertain sequenc ...
ConferenceProceedings of the 27th AAAI Conference on Artificial Intelligence, AAAI 2013 · December 1, 2013
In some applications, such as bioinformatics, social network analysis, and computational criminology, it is desirable to find compact clusters formed by a (very) small portion of objects in a large data set. Since such clusters are comprised of a small num ...
ConferenceData Mining and Knowledge Discovery · December 1, 2013
Being able to discover the uniqueness of an individual is a meaningful task in social network analysis. This paper proposes two novel problems in social network analysis: how to identify the uniqueness of a given query vertex, and how to identify a group o ...
ConferenceMM 2013 - Proceedings of the 2013 ACM Multimedia Conference · November 18, 2013
Cross media retrieval systems have received increasing interest in recent years. Due to the semantic gap between low-level features and high-level semantic concepts of multimedia data, many researchers have explored joint-model techniques in cross media r ...
ConferenceLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · November 18, 2013
Uncertainty is ubiquitous in big data. Consequently, analyzing and mining uncertain and probabilistic data is important in big data analytics. In this short article, we review some recent progress in mining uncertain and probabilistic data in the hope that ...
Journal ArticleACM Transactions on Intelligent Systems and Technology · October 21, 2013
Huge amounts of search log data have been accumulated at Web search engines. Currently, a popular Web search engine may receive billions of queries and collect terabytes of records about user search behavior daily. Besides search log data, huge amounts of b ...
Journal ArticleACM Transactions on the Web · October 1, 2013
Capturing the context of a user's query from the previous queries and clicks in the same session leads to a better understanding of the user's information need. A context-aware approach to document reranking, URL recommendation, and query suggestion may su ...
ConferenceACM International Conference Proceeding Series · August 30, 2013
A wide range of methods have been proposed for detecting different types of outliers in full space and subspaces. However, the interpretability of outliers, that is, explaining in what ways and to what extent an object is an outlier, remains a critical ope ...
ConferenceProceedings - International Conference on Data Engineering · August 15, 2013
In this paper, we tackle a novel type of interesting queries - shortest unique substring queries. Given a (long) string S and a query point q in the string, can we find a shortest substring containing q that is unique in S? We illustrate that shortest uniq ...
Journal ArticleIEEE Transactions on Knowledge and Data Engineering · March 11, 2013
Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges in both modeling similarity between uncertain objects and developing efficient computational methods. The previous methods extend traditional pa ...
Journal ArticleWorld Wide Web · March 1, 2013
Email correspondents play an important role in many people's social networks. Finding email correspondents in social networks accurately, though it may seem straightforward at first glance, is challenging. Most of the existing online social networking ...
Journal ArticleKnowledge and Information Systems · February 1, 2013
We study a practical and novel problem of making recommendations between two parties such as applicants and job positions. We model the competent choices of each party using skylines. In order to make recommendations in various scenarios, we propose a seri ...
Journal ArticleKnowledge and Information Systems · February 1, 2013
Skyline has been widely recognized as being useful for multi-criteria decision-making applications. While most of the existing work computes skylines in various contexts, in this paper, we consider a novel problem: how far away a point is from the skyline? ...
Journal ArticleProceedings of the VLDB Endowment · January 1, 2013
Similarity search on time series is an essential operation in many applications. In the state-of-the-art methods, such as the R-tree based methods, SAX and iSAX, time series are by default divided into equi-length segments globally, that is, all time series a ...
ConferenceProceedings of the VLDB Endowment · January 1, 2013
Similarity assessment is one of the core tasks in hyperlink analysis. Recently, with the proliferation of applications, e.g., web search and collaborative filtering, SimRank has been a well-studied measure of similarity between two nodes in a graph. It rec ...
ConferenceACM International Conference Proceeding Series · December 19, 2012
Existing graph compression techniques mostly focus on static graphs. However, for many practical graphs such as social networks, the edge weights frequently change over time. This phenomenon raises the question of how to compress dynamic graphs while maintain ...
ConferenceInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC · December 1, 2012
When multiple threads or processes run on a multi-core CPU they compete for shared resources, such as caches and memory controllers, and can suffer performance degradation as high as 200%. We design and evaluate a new machine learning model that estimates ...
ConferenceProceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2012
Compression plays an important role in social network analysis from both practical and theoretical points of view. Although there are a few pioneering studies on social network compression, they mainly focus on lossless approaches. In this paper, we tackle ...
Journal ArticleIEEE Transactions on Parallel and Distributed Systems · October 16, 2012
Activity monitoring, a crucial task in many applications, is often conducted expensively using video cameras. Effectively monitoring a large field by analyzing images from multiple cameras remains a challenging issue. Other approaches generally require the ...
Journal ArticleInternational Journal of Data Warehousing and Mining · October 1, 2012
Keyword search on relational databases is useful and popular for many users without technical background. Recently, aggregate keyword search on relational databases was proposed and has attracted interest. However, two important problems still remain. Firs ...
ConferenceSIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval · September 28, 2012
Most queries in web search are ambiguous and multifaceted. Identifying the major senses and facets of queries from search log data, referred to as query subtopic mining in this paper, is a very important issue in web search. Through search log analysis, we ...
Journal ArticleNeurocomputing · September 1, 2012
In many applications, such as bioinformatics and cross-market customer relationship management, there are data from multiple sources jointly describing the same set of objects. An important data mining task is to find interesting groups of objects that for ...
ConferenceProceedings - International Conference on Data Engineering · July 30, 2012
Errors in measurement can be categorized into two types: systematic errors that are predictable, and random errors that are inherently unpredictable and have null expected value. Random error is always present in a measurement. More often than not, reading ...
ConferenceACM International Conference Proceeding Series · July 10, 2012
Record linkage analysis, which matches records referring to the same real world entities from different data sets, is an important task in data integration. Uncertainty often exists in record linkages due to incompleteness or ambiguity in data. Fortunately ...
Journal ArticleKnowledge and Information Systems · April 1, 2012
In this paper, we formulate the problem of early classification of time series data, which is important in some time-sensitive applications such as health informatics. We introduce a novel concept of MPL (minimum prediction length) and develop ECTS (early ...
Journal ArticleInternational Journal of Information Technology and Decision Making · March 1, 2012
We report on the panel discussion held at the ICDM'10 conference on the top 10 data mining case studies in order to provide a snapshot of where and how data mining techniques have made significant real-world impact. The tasks covered by 10 case studies ran ...
Journal ArticleJournal of Intelligent Information Systems · February 1, 2012
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data r ...
Journal ArticleKnowledge and Information Systems · February 1, 2012
Keyword search has been recently extended to relational databases to retrieve information from text-rich attributes. However, all the existing methods focus on finding individual tuples matching a set of query keywords from one table or the join of multipl ...
Book · January 1, 2012
This is the third edition of the premier professional reference on the subject of data mining, expanding and updating the previous market leading edition. This was the first (and is still the best and most popular) of its kind. Combines sound theory with t ...
Journal ArticleInternational Journal of Business Intelligence and Data Mining · January 1, 2012
Relationship management is critical in business. Particularly, it is important to detect abnormal relationships, such as fraudulent relationships between service providers and consumers. Surprisingly, in the literature there is no systematic study on detec ...
Journal ArticleData Mining and Knowledge Discovery · November 1, 2011
We study the challenges of protecting privacy of individuals in the large public survey rating data in this paper. Recent study shows that personal information in supposedly anonymous movie rating records are de-identified. The survey rating data usually c ...
Journal ArticleACM Transactions on Intelligent Systems and Technology · October 1, 2011
Query suggestion plays an important role in improving usability of search engines. Although some recently proposed methods provide query suggestions by mining query patterns from search logs, none of them models the immediately preceding queries as context ...
Journal ArticleACM Transactions on Knowledge Discovery from Data · August 1, 2011
Group based anonymization is the most widely studied approach for privacy-preserving data publishing. Privacy models/definitions using group based anonymization include k-anonymity, ℓ-diversity, and t-closeness, to name a few. The goal of this article is ...
Journal ArticleInformation Systems · July 1, 2011
Many recent applications involve processing and analyzing uncertain data. In this paper, we combine the feature of top-k objects with that of skyline to model the problem of top-k skyline objects against uncertain data. The problem of efficiently computing ...
ConferenceProceedings - International Conference on Data Engineering · June 6, 2011
This paper studies the problem of outlier detection on uncertain data. We start with a comprehensive model considering both uncertain objects and their instances. An uncertain object has some inherent attributes and consists of a set of instances which are ...
ConferenceProceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 · March 14, 2011
In addition to search queries and the corresponding clickthrough information, search engine logs record multidimensional information about user search activities, such as search time, location, vertical, and search device. Multidimensional mining of search ...
ConferenceProceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 · March 14, 2011
Automatic recommendation of citations for a manuscript is highly valuable for scholarly activities since it can substantially improve the efficiency and quality of literature search. The prior techniques placed a considerable burden on users, who were requ ...
Journal ArticleVLDB Journal · February 1, 2011
Uncertain data is inherent in a few important applications. It is far from trivial to extend ranking queries (also known as top-k queries), a popular type of queries on certain data, to uncertain data. In this paper, we cast ranking queries on uncertain da ...
ConferenceProceedings of the 11th SIAM International Conference on Data Mining, SDM 2011 · January 1, 2011
Early classification on time series data has been found highly useful in a few important applications, such as medical and health informatics, industry production management, safety and security management. While some classifiers have been proposed to achi ...
ConferenceProceedings of the VLDB Endowment · January 1, 2011
Top-k ranking for an uncertain database is to rank tuples in it so that the best k of them can be determined. The problem has been formalized under the unified approach based on parameterized ranking functions (PRFs) and the possible world semantics. Given ...
ConferenceProceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2011
The proliferation of information networks, as a means of sharing information, has raised privacy concerns for enterprises who manage such networks and for individual users that participate in such networks. For enterprises, the main challenge is to satisfy ...
Journal ArticleKnowledge and Information Systems · January 1, 2011
Recently, more and more social network data have been published in one way or another. Preserving privacy in publishing social network data becomes an important concern. With some local knowledge about individuals in a social network, an adversary may atta ...
ConferenceProceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2011
Given two vertices s, t in a graph, let P be the shortest path (SP) from s to t, and P* a subset of the vertices in P. P* is a k-skip shortest path from s to t, if it includes at least a vertex out of every k consecutive vertices in P. In general, P* succi ...
ConferenceProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2011
Given a sequence database, can we have a non-trivial upper bound on the number of sequential patterns? The problem of bounding sequential patterns is very challenging in theory due to the combinatorial complexity of sequences, even given some inspiring res ...
ConferenceProceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2010
Background knowledge is an important factor in privacy preserving data publishing. Probabilistic distribution-based background knowledge is a powerful kind of background knowledge which is easily accessible to adversaries. However, to the best of our knowl ...
ConferenceProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · September 7, 2010
Compressing social networks can substantially facilitate mining and advanced analysis of large social networks. Preferably, social networks should be compressed in a way that they still can be queried efficiently without decompression. Arguably, neighbor q ...
ConferenceSIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval · September 1, 2010
The context of a search query often provides a search engine meaningful hints for answering the current query better. Previous studies on context-aware search were either focused on the development of context models or limited to a relatively small scale i ...
ConferenceProceedings of the ACM SIGMOD International Conference on Management of Data · July 23, 2010
Quantiles are a crucial type of order statistics in databases. Extensive research has been focused on maintaining a space-efficient structure for approximate quantile computation as the underlying dataset is updated. The existing solutions, however, are de ...
ConferenceProceedings of the 19th International Conference on World Wide Web, WWW '10 · July 20, 2010
Huge amounts of search and browse log data have been accumulated in various search engines. Such massive search/browse log data, on the one hand, provides great opportunities to mine the wisdom of crowds and improve Web search as well as online advertisemen ...
ConferenceProceedings of the 19th International Conference on World Wide Web, WWW '10 · July 20, 2010
When you write papers, how many times do you want to make some citations at a place but you are not sure which papers to cite? Do you wish to have a recommendation system which can recommend a small number of good candidates for every place that you want t ...
ConferenceComputer Communications · July 15, 2010
Wireless sensor networks promise an unprecedented opportunity to monitor physical environments via inexpensive wireless embedded devices. Given the sheer amount of sensed data, efficient classification of them becomes a critical task in many sensor network ...
Journal ArticleWorld Wide Web · July 12, 2010
How can we maintain a dynamic profile capturing a user's reading interest against the common interest? What are the queries that have been asked 1,000 times more frequently to a search engine from users in Asia than in North America? What are the keywords ...
Journal ArticleInternational Journal of Data Warehousing and Mining · July 1, 2010
Finding associations among different diseases is an important task in medical data mining. The NHANES data is a valuable source in exploring disease associations. However, existing studies analyzing the NHANES data focus on using statistical techniques to ...
Journal ArticleIEEE Transactions on Knowledge and Data Engineering · June 4, 2010
This paper proposes a new problem, called superseding nearest neighbor search, on uncertain spatial databases, where each object is described by a multidimensional probability density function. Given a query point q, an object is a nearest neighbor (NN) ca ...
ConferenceAdvances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings · May 19, 2010
Path queries such as "finding the shortest path in travel time from my hotel to the airport" are heavily used in many applications of road networks. Currently, simple statistic aggregates such as the average travel time between two vertices are often used ...
Journal ArticleJournal of Computer Science and Technology · May 1, 2010
Many of the latest high-performance distributed computational environments come with high communication bandwidth. Such high-bandwidth distributed systems provide unprecedented opportunities for analyzing huge datasets, but simultaneously pose new technical c ...
Journal ArticleIEEE Transactions on Knowledge and Data Engineering · April 1, 2010
Uncertain data are inherent in various important applications and reverse nearest neighbor (RNN) query is an important query type for many applications. While many different types of queries have been studied on uncertain data, there is no previous work on ...
Journal ArticleInformation Retrieval · April 1, 2010
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the bag-of-words model to represent the content of a document. However, this representation is only e ...
Journal ArticleKnowledge and Information Systems · January 1, 2010
Sequential pattern mining is an important problem in data mining. State of the art techniques for mining sequential patterns, such as frequent subsequences, are often based on the pattern-growth approach, which recursively projects conditional databases. E ...
Journal ArticleProceedings of the VLDB Endowment · January 1, 2010
In this paper, we tackle the problem of efficient skycube computation. We introduce a novel approach significantly reducing domination tests for a given subspace and the number of subspaces searched. Technically, we identify two types of skyline points tha ...
Journal ArticleVLDB Journal · January 1, 2010
Recently, due to intrinsic characteristics in many underlying data sets, a number of probabilistic queries on uncertain data have been investigated. Top-k dominating queries are very important in many applications including decision making in a multidimens ...
ConferenceProceedings - International Conference on Data Engineering · January 1, 2010
Extracting useful correlation from a dataset has been extensively studied. In this paper, we deal with the opposite, namely, a problem we call correlation hiding (CH), which is fundamental in numerous applications that need to disseminate data containing s ...
ConferenceSIGMOD-PODS'09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems · December 4, 2009
Mobile communication data analysis has been often used as a background application to motivate many data mining problems. However, very few data mining researchers have a chance to see a working data mining system on real mobile communication data. In this ...
Chapter · December 1, 2009
Pattern-based clustering is important in many applications, such as DNA micro-array data analysis in bio-informatics, as well as automatic recommendation systems and target marketing systems in e-business. However, pattern-based clustering in large databas ...
ConferenceWWW'09 - Proceedings of the 18th International World Wide Web Conference · December 1, 2009
We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen templa ...
ConferenceWWW'09 - Proceedings of the 18th International World Wide Web Conference · December 1, 2009
Capturing the context of a user's query from the previous queries and clicks in the same session may help understand the user's information need. A context-aware approach to document re-ranking, query suggestion, and URL recommendation may improve users' s ...
ConferenceSociety for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics · December 1, 2009
Co-authorship networks, an important type of social networks, have been studied extensively from various angles such as degree distribution analysis, social community extraction and social entity ranking. Most of the previous studies consider the co-author ...
ConferenceInternational Conference on Information and Knowledge Management, Proceedings · December 1, 2009
Understanding how topics in scientific literature evolve is an interesting and important problem. Previous work simply models each paper as a bag of words and also considers the impact of authors. However, the impact of one document on another as captured ...
ConferenceProceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, U'09 in Conjunction with KDD'09 · November 30, 2009
ConferenceProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · November 16, 2009
Automatic news extraction from news pages is important in many Web applications such as news aggregation. However, the existing news extraction methods based on template-level wrapper induction have three serious limitations. First, the existing methods can ...
ConferenceProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · November 9, 2009
Search logs, which contain rich and up-to-date information about users' needs and preferences, have become a critical data source for search engines. Recently, more and more data-driven applications are being developed in search engines based on search log ...
ConferenceLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · October 19, 2009
Debt detection is important for improving payment accuracy in social security. Since debt detection from customer transactional data can be generally modelled as a fraud detection problem, a straightforward solution is to extract features from transaction ...
ConferenceProceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 · September 21, 2009
Given the proliferation of technology sites and the growing diversity of their readership, readers are more and more likely to encounter specialized language and terminology that they may lack the sufficient background to understand. Such sites may lose re ...
ConferenceProceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 · September 21, 2009
Shortest path queries (SPQ) are essential in many graph analysis and mining tasks. However, answering shortest path queries on-the-fly on large graphs is costly. To answer shortest path queries online, we may materialize and index shortest paths. However, ...
ConferenceProceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 · September 21, 2009
Keyword search has been recently extended to relational databases to retrieve information from text-rich attributes. However, all the existing methods focus on finding individual tuples matching a set of query keywords from one table or the join of multipl ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · September 14, 2009
To improve software productivity, when constructing new software systems, programmers often reuse existing libraries or frameworks by invoking methods provided in their APIs. Those API methods, however, are often complex and not well documented. To get fam ...
Journal Article · Distributed and Parallel Databases · August 1, 2009
Recently, uncertain data processing has become more and more important. Although a significant amount of previous research explores various continuous queries on data streams, continuous queries on uncertain data streams have seldom been investigated. In t ...
Conference · Proceedings - International Conference on Data Engineering · July 8, 2009
Given an integer k, a representative skyline contains the k skyline points that best describe the tradeoffs among different dimensions offered by the full skyline. Although this topic has been previously studied, the existing solution may sometimes produce ...
Conference · Proceedings - International Conference on Data Engineering · July 8, 2009
In many applications, we need to analyze a large number of time series. Segments of time series demonstrating dominating advantages over others are often of particular interest. In this paper, we advocate interval skyline queries, a novel type of time seri ...
Journal Article · ACM Transactions on Knowledge Discovery from Data · July 1, 2009
Currently, most popular Web search engines adopt some link-based ranking methods such as PageRank. Driven by the huge potential benefit of improving rankings of Web pages, many tricks have been attempted to boost page rankings. The most common way, which i ...
Journal Article · ACM Transactions on Database Systems · June 1, 2009
Data publishing generates much concern over the protection of individual privacy. Recent studies consider cases where the adversary may possess different kinds of knowledge about the data. In this article, we show that knowledge of the mechanism or algorit ...
Journal Article · BMC bioinformatics · June 2009
Background: The recent availability of an expanding collection of genome sequences driven by technological advances has facilitated comparative genomics and in particular the identification of synteny among multiple genomes. However, the development ...
Journal Article · VLDB Journal · June 1, 2009
Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answerin ...
Journal Article · Knowledge and Information Systems · January 1, 2009
While frequent pattern mining is fundamental for many data mining tasks, mining maximal frequent patterns efficiently is important in both theory and applications of frequent pattern mining. The fundamental challenge is how to search a large space of item ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · January 1, 2009
The importance of skyline analysis has been well recognized in multicriteria decision-making applications. All of the previous studies assume a fixed order on the attributes in question. However, in some applications, users may be interested in skylines wi ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · January 1, 2009
In this paper, we study an interesting problem: continuously monitoring k-means clustering of sensor readings in a large sensor network. Given a set of sensors whose readings evolve over time, we want to maintain the k-means of the readings continuously. T ...
Journal Article · Genome research · January 2009
BLAST is an extensively used local similarity search tool for identifying homologous sequences. When a gene sequence (either protein sequence or nucleotide sequence) is used as a query to search for homologous sequences in a genome, the search results, rep ...
Conference · IJCAI International Joint Conference on Artificial Intelligence · January 1, 2009
In this paper, we formulate the problem of early classification of time series data, which is important in some time-sensitive applications such as health-informatics. We introduce a novel concept of MPL (Minimum Prediction Length) and develop ECTS (Early ...
Conference · Proceedings - International Conference on Data Engineering · January 1, 2009
In some applications of privacy preserving data publishing, a practical demand is to publish a data set on multiple quasi-identifiers for multiple users simultaneously, which poses several challenges. Can we generate one anonymized version of the data so t ...
Journal Article · ACM Transactions on Knowledge Discovery from Data · January 1, 2009
Joint mining of multiple datasets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in bioinformatics, jointly mining multiple gene expression datasets obtained by different la ...
Conference · Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 · January 1, 2009
Recently, privacy preserving data publishing has received a lot of attention in both research and applications. Most of the previous studies, however, focus on static data sets. In this paper, we study an emerging problem of continuous privacy preserving p ...
Conference · Proceedings of the International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc) · December 15, 2008
In many emerging applications, data streams are monitored in a network environment. Due to limited communication bandwidth and other resource constraints, a critical and practical demand is to online compress data streams continuously with quality guarante ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · December 10, 2008
Uncertain data is inherent in a few important applications such as environmental surveillance and mobile object tracking. Top-k queries (also known as ranking queries) are often natural and useful in analyzing uncertain data in those applications. In this ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · December 10, 2008
In some applications such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, whi ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2008
We consider the problem of publishing sensitive transaction data with privacy preservation. High dimensionality of transaction data poses unique challenges on data privacy and data utility. On one hand, re-identification attacks tend to use a subset of ite ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2008
Query suggestion plays an important role in improving the usability of search engines. Although some recently proposed methods can make meaningful query suggestions by mining query patterns from search logs, none of them are context-aware - they do not tak ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2008
Mining user preferences plays a critical role in many important applications such as customer relationship management (CRM), product and service recommendation, and marketing campaigns. In this paper, we identify an interesting and practical problem of min ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2008
In some applications such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, whi ...
Conference · Proceedings - International Conference on Data Engineering · October 1, 2008
Recently, as more and more social network data has been published in one way or another, preserving privacy in publishing social network data becomes an important concern. With some local knowledge about individuals in a social network, an adversary may at ...
Conference · Proceedings - The 9th International Conference on Web-Age Information Management, WAIM 2008 · September 22, 2008
Uncertain data are inherent in many important applications. Recently, considerable research efforts have been put into the field of managing uncertain data. In this paper, we summarize existing techniques to query and model uncertain data and systems that ...
Conference · Proceedings - The 9th International Conference on Web-Age Information Management, WAIM 2008 · September 22, 2008
With the expansion of the internet, many specialized, high-profile sites have become available that bring very technical subject matter to readers with non-technical backgrounds. While the theme of these sites may be of interest to these readers, the posts ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · September 1, 2008
Individual privacy will be at risk if a published data set is not properly deidentified. k-Anonymity is a major technique to deidentify a data set. Among a number of k-anonymization schemes, local recoding methods are promising for minimizing the distortio ...
Journal Article · Journal of Computer Science and Technology · July 1, 2008
The task of clustering is to identify classes of similar objects among a set of objects. The definition of similarity varies from one clustering model to another. However, in most of these models the concept of similarity is often based on such metrics as ...
Conference · Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings · May 16, 2008
k-anonymization is an important privacy protection mechanism in data publishing. While there has been a great deal of work in recent years, almost all considered a single static release. Such mechanisms only protect the data up to the first release or firs ...
Conference · Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings · May 16, 2008
By comparing genomes among both closely and distally related species, comparative genomics analysis characterizes structures and functions of different genomes in both conserved and divergent regions. Synteny blocks, which are conserved blocks of genes on ...
Journal Article · Proceedings of the VLDB Endowment · January 1, 2008
Current skyline evaluation techniques assume a fixed ordering on the attributes. However, dynamic preferences on nominal attributes are more realistic in known applications. In order to generate online response for any such preference issued by a user, one o ...
Conference · Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130 · January 1, 2008
Web spam, which refers to any deliberate actions bringing to selected web pages an unjustifiable favorable relevance or importance, is one of the major obstacles for high quality information retrieval on the web. Most of the existing web spam detection met ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2008
Uncertain data are inherent in some important applications, such as environmental surveillance, market analysis, and quantitative economics research. Due to the importance of those applications and the rapidly increasing amount of uncertain data collected ...
Conference · Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130 · January 1, 2008
Supervised learning on sequence data, also known as sequence classification, has been well recognized as an important data mining task with many significant applications. Since temporal order is important in sequence data, in many critical applications of ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 14, 2007
The importance of dominance and skyline analysis has been well recognized in multi-criteria decision making applications. Most previous studies assume a fixed order on the attributes. In practice, different customers may have different preferences on nomin ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 14, 2007
In some applications such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, whi ...
Conference · International Conference on Information and Knowledge Management, Proceedings · December 1, 2007
With an increasing amount of data being stored in XML format, OLAP queries over these data become important. OLAP queries have been well studied in relational database systems. However, the evaluation of OLAP queries over XML data is not a trivial extensi ...
Conference · 2007 2nd International Conference on Pervasive Computing and Applications, ICPCA'07 · December 1, 2007
In this paper, we present the framework of the Semantic and Automatic Service Orchestration (SASO) system for Web services modeling and composition. The SASO system has the following features: 1) it adopts a semantic approach to model Web services, and 2) it ...
Conference · 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2007 · December 1, 2007
A software system interacts with third-party libraries through various APIs. Using these library APIs often needs to follow certain usage patterns. Furthermore, ordering rules (specifications) exist between APIs, and these rules govern the secure and robust ...
Conference · Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM · December 1, 2007
K-anonymity is a simple yet practical mechanism to protect privacy against attacks of re-identifying individuals by joining multiple public data sources. All existing methods achieving k-anonymity assume implicitly that the data objects to be anonymized ar ...
Conference · Proceedings - International Conference on Advanced Information Networking and Applications, AINA · September 25, 2007
Trustworthy data processing, which ensures the credibility and irrefutability of data, is crucial in many business applications. Recently, the Write-Once-Read-Many (WORM) devices have been used as trustworthy data storage. Nevertheless, how to efficiently ...
Conference · Proceedings - International Conference on Software Engineering · September 25, 2007
Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contains a wealth of information about a project's status, progress, and evolution. Using well-established data mining techniques, p ...
Conference · Proceedings - International Conference on Data Engineering · September 24, 2007
Recently, the skyline computation and analysis have been extended from one single full space to multidimensional subspaces, which can lead to valuable insights in some applications. Particularly, compressed skyline cubes in the form of skyline groups and t ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · August 1, 2007
Skyline and top-k queries are two popular operations for preference retrieval. In practice, applications that require these operations usually provide numerous candidate attributes, whereas, depending on their interests, users may issue queries regarding d ...
Journal Article · IEEE Transactions on Parallel and Distributed Systems · July 1, 2007
Limited energy supply is one of the major constraints in wireless sensor networks. A feasible strategy is to aggressively reduce the spatial sampling rate of sensors, that is, the density of the measure points in a field. By properly scheduling, we want to ...
Journal Article · IIE Transactions (Institute of Industrial Engineers) · June 1, 2007
In this study, we propose a simple and novel data structure using hyper-links, H-struct, and a new mining algorithm, H-mine, which takes advantage of this data structure and dynamically adjusts links in the mining process. A distinct feature of this method ...
Journal Article · Journal of Intelligent Information Systems · April 1, 2007
Constraints are essential for many sequential pattern mining applications. However, there is no systematic study on constraint-based sequential pattern mining. In this paper, we investigate this issue and point out that the framework developed for constrai ...
Conference · 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings · January 1, 2007
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data r ...
Conference · 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings · January 1, 2007
Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answerin ...
Conference · 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings · January 1, 2007
Data publishing generates much concern over the protection of individual privacy. Recent studies consider cases where the adversary may possess different kinds of knowledge about the data. In this paper, we show that knowledge of the mechanism or algorithm ...
Chapter · January 1, 2007
It is well-recognized that medical datasets are often noisy and incomplete due to the difficulties in data collection and integration. Noise and incompleteness in medical data pose substantial challenges for accurate classification. A differential latent s ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2007
Privacy-preserving data publication for data mining is to protect sensitive information of individuals in published data while the distortion to the data is minimized. Recently, it is shown that (α, k)-anonymity is a feasible technique when we are given so ...
Conference · Proceedings of the 7th SIAM International Conference on Data Mining · January 1, 2007
The Web is a very large social network. It is important and interesting to understand the "ecology" of the Web: the general relations of Web pages to their environment. The understanding of such relations has a few important applications, including Web com ...
Journal Article · Knowledge and Information Systems · January 1, 2007
Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of gene-sample-time microarray data sets that records the expression levels of various ...
Conference · Proceedings - Fifth Annual IEEE International Conference on Pervasive Computing and Communications, PerCom 2007 · January 1, 2007
Activity monitoring, a crucial task in many applications, is often conducted expensively using video cameras. Also, effectively monitoring a large field by analyzing images from multiple cameras remains a challenging problem. In this paper, we introduce a ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2007
While active rules have been applied in many areas including active databases, XML documentation and Semantic Web, current methods remain largely uncertain of how to terminate active behaviors. Some existing methods have been provided in the form of a logi ...
Journal Article · Knowledge and Information Systems · January 1, 2007
In some business applications such as trading management in financial institutions, it is required to accurately answer ad hoc aggregate queries over data streams. Materializing and incrementally maintaining a full data cube or even its compression or appr ...
Conference · Proceedings of the 7th SIAM International Conference on Data Mining · January 1, 2007
Finding discords in time series databases is an important problem in a great variety of applications, such as space shuttle telemetry, mechanical industry, biomedicine, and financial data analysis. However, most previous methods for this problem suffer from ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · December 1, 2006
As OLAP engines are widely used to support multidimensional data analysis, it is desirable to support in data cubes advanced statistical measures, such as regression and filtering, in addition to the traditional simple measures such as count and average. S ...
Conference · International Conference on Information and Knowledge Management, Proceedings · December 1, 2006
In many applications, classifiers need to be built based on multiple related data streams. For example, stock streams and news streams are related, where the classification patterns may involve features from both streams. Thus instead of mining on a single ...
Conference · ACM Transactions on Database Systems · December 1, 2006
The skyline operator is important for multicriteria decision-making applications. Although many recent studies developed efficient methods to compute skyline objects in a given space, none of them considers skylines in multiple subspaces simultaneously. Mo ...
Conference · Proceedings of the ACM/IEEE Joint Conference on Digital Libraries · December 1, 2006
We study how to resolve entities that contain a group of related elements in them (e.g., an author entity with a list of citations or an intermediate result by GROUP BY SQL query). Such entities, named as grouped-entities, frequently occur in many applicat ...
Conference · Proceedings - International Conference on Software Engineering · December 1, 2006
To improve software productivity, when constructing new software systems, developers often reuse existing class libraries or frameworks by invoking their APIs. Those APIs, however, are often complex and not well documented, posing barriers for developers t ...
Conference · Proceedings of the National Conference on Artificial Intelligence · November 13, 2006
The generators and the unique closed pattern of an equivalence class of itemsets share a common set of transactions. The generators are the minimal ones among the equivalent itemsets, while the closed pattern is the maximum one. As a generator is usually s ...
Conference · Journal of Intelligent Information Systems · November 1, 2006
Change detection on spatial data is important in many applications, such as environmental monitoring. Given a set of snapshots of spatial objects at various temporal instants, a user may want to derive the changing regions between any two snapshots. Most o ...
Journal Article · GeoInformatica · September 1, 2006
A co-location pattern is a group of spatial features/events that are frequently co-located in the same region. For example, human cases of West Nile Virus often occur in regions with poor mosquito control and the presence of birds. For co-location pattern ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · June 1, 2006
Incorporating constraints into frequent itemset mining not only improves data mining efficiency, but also leads to concise and meaningful results. In this paper, a framework for closed constrained gradient itemset mining in retail databases is proposed by ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · January 1, 2006
Mining knowledge about ordering from sequence data is an important problem with many applications, such as bioinformatics, Web mining, network management, and intrusion detection. For example, if many customers follow a partial order in their purchases of ...
Conference · VLDB 2006 - Proceedings of the 32nd International Conference on Very Large Data Bases · January 1, 2006
Image retrieval has found more and more applications. Due to the well recognized semantic gap problem, the accuracy and the recall of image similarity search are often still low. As an effective method to improve the quality of image retrieval, the relevan ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2006
Mining data streams of changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution. This poses a serious challenge. On the one hand, relying on historical da ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2006
Individual privacy will be at risk if a published data set is not properly de-identified. k-Anonymity is a major technique to de-identify a data set. A more general view of k-anonymity is clustering with a constraint of the minimum number of objects in eve ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · January 1, 2006
The entity resolution (ER) problem, which identifies duplicate entities that refer to the same real world entity, is essential in many applications. In this paper, in particular, we focus on resolving entities that contain a group of related elements in th ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2006
Privacy becomes a more and more serious concern in applications involving microdata. Recently, efficient anonymization has attracted much research work. Most of the previous methods use global recoding, which maps the domains of the quasi-identifier attrib ...
Conference · Proceedings - International Conference on Data Engineering · January 1, 2006
Given a set of multi-dimensional points, the skyline contains the best points according to any preference function that is monotone on all axes. In practice, applications that require skyline analysis usually provide numerous candidate attributes, and vari ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2006
Privacy preserving data processing has become an important topic recently because of advances in hardware technology which have led to widespread proliferation of demographic and sensitive data. A rudimentary way to preserve privacy is to simply hide the ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2006
Clustering data streams has found a few important applications. While many previous studies focus on clustering objects arriving in a data stream, in this paper, we consider the novel problem of on demand clustering concept drifting data streams. In order ...
Conference · Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM · December 1, 2005
Data summarization is an important data analysis task in data warehousing and online analytic processing. In this paper, we consider a novel type of summarization queries, probable group queries, such as "What are the groups of patients that have a 50% or ...
Conference · VLDB 2005 - Proceedings of 31st International Conference on Very Large Data Bases · December 1, 2005
The skyline operator is important for multi-criteria decision making applications. Although many recent studies developed efficient methods to compute skyline objects in a specific space, the fundamental problem on the semantics of skylines remains open: W ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2005
Joint mining of multiple data sets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in cross-market customer segmentation, a group of customers who behave similarly in multipl ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · December 1, 2005
Image annotation is an important research problem in content-based image retrieval (CBIR) and computer vision with broad applications. A major challenge is the so-called "semantic gap" between the low-level visual features and the high-level semantic conce ...
Conference · 2005 Second Annual IEEE Communications Society Conference on Sensor and AdHoc Communications and Networks, SECON 2005 · December 1, 2005
Energy consumption is one of the major constraints in wireless sensor networks. A highly feasible strategy is to aggressively reduce the spatial sampling rate of sensors (i.e., the density of the measure points in a field). By properly scheduling, we want ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2005
Mining frequent structural patterns from graph databases is an important research problem with broad applications. Recently, we developed an effective index structure, ADI, and efficient algorithms for mining frequent patterns from large, disk-based graph ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · October 1, 2005
Effective identification of coexpressed genes and coherent patterns in gene expression data is an important task in bioinformatics research and biomedical applications. Several clustering methods have recently been proposed to identify coexpressed genes th ...
Journal Article · Distributed and Parallel Databases · September 1, 2005
Real-time surveillance systems, telecommunication systems, and other dynamic environments often generate tremendous (potentially infinite) volume of stream data: the volume is too huge to be scanned multiple times. Much of such data resides at rather low l ...
Journal Article · International Journal of Data Warehousing and Mining (IJDWM) · January 1, 2005
Frequent pattern mining is an important data-mining problem with broad applications. Although there are many in-depth studies on efficient frequent pattern mining algorithms and constraint pushing techniques, the effectiveness of frequent pattern mining re ...
Conference · Proceedings of the 2005 SIAM International Conference on Data Mining, SDM 2005 · January 1, 2005
All of the existing (iceberg) cube computation algorithms assume that the data is stored in a single base table. However, in practice, a data warehouse is often organized in a schema of multiple tables, such as star schema and snowflake schema. In terms of ...
Conference · Lecture Notes in Computer Science · January 1, 2005
Formal concept analysis has become an active field of study for data analysis and knowledge discovery. A formal concept C is determined by its extent (the set of objects that fall under C) and its intent (the set of properties or attributes covered by C). ...
Conference · Lecture Notes in Computer Science · January 1, 2005
Pattern-based clustering has broad applications in microarray data analysis, customer segmentation, e-business data analysis, etc. However, pattern-based clustering often returns a large number of highly-overlapping clusters, which makes it hard for users ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2005
One fundamental task in near-neighbor search as well as other similarity matching efforts is to find a distance function that can efficiently quantify the similarity between two objects in a meaningful way. In DNA microarray analysis, the expression levels ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2005
In applications such as fraud and intrusion detection, it is of great interest to measure the evolving trends in the data. We consider the problem of quantifying changes between two datasets with class labels. Traditionally, changes are often measured by f ...
Journal Article · Chinese Science Bulletin · December 1, 2004
Tumor diagnosis by analyzing gene expression profiles has become an interesting topic in bioinformatics, and the main problem is to identify the genes related to a tumor. This paper proposes a rank sum method to identify the related genes based on the rank sum ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · November 1, 2004
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the pre ...
Conference · Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM · October 25, 2004
Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends ...
Journal Article · IEEE Transactions on Knowledge and Data Engineering · August 1, 2004
Many data analysis tasks can be viewed as search or mining in a multidimensional space (MDS). In such MDSs, dimensions capture potentially important factors for given applications, and cells represent combinations of values for the factors. To systematical ...
Journal Article · Data Mining and Knowledge Discovery · May 1, 2004
Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. Constraint pushing techniques h ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2004
Mining frequent tree patterns is an important research problem with broad applications in bioinformatics, digital library, e-commerce, and so on. Previous studies highly suggested that pattern-growth methods are efficient in frequent pattern mining. In this ...
Conference · KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2004
Finding informative genes from microarray data is an important research problem in bioinformatics research and applications. Most of the existing methods rank features according to their discriminative capability and then find a subset of discriminative ge ...
Journal Article · Data Mining and Knowledge Discovery · January 1, 2004
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. H ...
Conference · KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2004
Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of gene-sample-time microarray data sets, which records the expression levels of vario ...
Conference · KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2004
Mining frequent structural patterns from graph databases is an interesting problem with broad applications. Most of the previous studies focus on pruning unfruitful search subspaces effectively, but few of them address the mining on large, disk-based datab ...
Journal Article · Journal of Computer Science and Technology · January 1, 2004
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Recent studie ...
Chapter · January 1, 2004
Discovering co-expressed genes and coherent expression patterns in gene expression data is an important data analysis task in bioinformatics research and biomedical applications. Although various clustering methods have been proposed, two tough challenges ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2003
Recently, a technique called quotient cube was proposed as a summary structure for a data cube that preserves its semantics, with applications for online exploration and visualization. The authors showed that a quotient cube can be constructed very efficie ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2003
Mining microarray gene expression data is an important research topic in bioinformatics with broad applications. While most of the previous studies focus on clustering either genes or samples, it is interesting to ask whether we can partition the complete ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2003
Discovering coherent gene expression patterns in time-series gene expression data is an important task in bioinformatics research and biomedical applications. In this paper, we propose an interactive exploration framework for mining coherent expression pat ...
Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2003
Mining frequent closed itemsets provides complete and non-redundant results for frequent pattern analysis. Extensive studies have proposed various strategies for efficient frequent closed itemset mining, such as depth-first search vs. breadth-first search, ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2003
Pattern-based clustering is important in many applications, such as DNA micro-array data analysis, automatic recommendation systems and target marketing systems. However, pattern-based clustering in large databases is challenging. On the one hand, there ca ...
Chapter · January 1, 2003
This chapter discusses the efficacious data cube exploration by semantic summarization and compression. Data cube is the core operator in data warehousing and online analytical processing (OLAP). Its efficient computation, maintenance, and utilization for ...
Conference · Proceedings - 3rd IEEE Symposium on BioInformatics and BioEngineering, BIBE 2003 · January 1, 2003
Clustering the time series gene expression data is an important task in bioinformatics research and biomedical applications. Recently, some clustering methods have been adapted or proposed. However, some concerns still remain, such as the robustness of the ...
Conference · Proceedings - 29th International Conference on Very Large Data Bases, VLDB 2003 · January 1, 2003
Data cube is the core operator in data warehousing and OLAP. Its efficient computation, maintenance, and utilization for query answering and advanced analysis have been the subjects of numerous studies. However, for many applications, the huge size of the ...
Journal Article · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2003
It has been well recognized that online analytical processing (OLAP) can provide important insights into huge archives of data. While the conventional OLAP model is capable of analyzing relational business data, it often cannot fit many kinds of complex da ...
Conference · Proceedings of the ACM Symposium on Applied Computing · January 1, 2003
Mining co-location patterns from spatial databases may reveal types of spatial features likely located as neighbors in space. In this paper, we address the problem of mining confident co-location rules without a support threshold. First, we propose a novel ...
Journal Article · Journal of Computer Science and Technology · January 1, 2003
The study on database technologies, or more generally, the technologies of data and information management, is an important and active research field. Recently, many exciting results have been reported. In this fast growing field, Chinese researchers play ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2002
Frequent pattern mining has been studied extensively. However, the effectiveness and efficiency of this mining is often limited, since the number of frequent patterns generated is often too large. In many applications it is sufficient to generate and exami ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · September 17, 2002
As the WWW becomes more and more popular and powerful, how to search information on the web in a database way becomes an important research topic. COMMIX, which is developed in the DB group at Peking University (China), is a system towards building very large da ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · September 17, 2002
Data cube enables fast online analysis of large data repositories, which is attractive in many applications. Although there are several kinds of available cube-based OLAP products, users may still encounter challenges on effectiveness and efficiency in the ...
Conference · Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002 · June 3, 2002
As the WWW becomes more and more popular and powerful, how to search information on the web in a database way becomes an important research topic. COMMIX, which is developed in the DB group at Peking University (China), is a system towards building very large da ...
Conference · Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002 · June 3, 2002
Data cube enables fast online analysis of large data repositories, which is attractive in many applications. Although there are several kinds of available cube-based OLAP products, users may still encounter challenges on effectiveness and efficiency in the ...
Conference · International Conference on Information and Knowledge Management, Proceedings · January 1, 2002
Constraints are essential for many sequential pattern mining applications. However, there is no systematic study on constraint-based sequential pattern mining. In this paper, we investigate this issue and point out that the framework developed for constrai ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2001
Methods for efficient mining of frequent patterns have been studied extensively by many researchers. However, the previously proposed methods still encounter some performance bottlenecks when mining databases with different data characteristics, such as de ...
Conference · Proceedings - IEEE International Conference on Data Mining, ICDM · December 1, 2001
Previous studies propose that associative classification has high classification accuracy and strong flexibility at handling unstructured data. However, it still suffers from the huge set of mined rules and sometimes biased classification or overfitting si ...
Conference · VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases · January 1, 2001
Constrained gradient analysis (similar to the "cubegrade" problem posed by Imielinski, et al. [9]) is to extract pairs of similar cell characteristics associated with big changes in measure in a data cube. Cells are considered similar if they are related b ...
Conference · International Conference on Information and Knowledge Management, Proceedings · January 1, 2001
Sequential pattern mining, which finds the set of frequent subsequences in sequence databases, is an important data-mining task and has broad applications. Usually, sequence patterns are associated with different circumstances, and such circumstances form ...
Journal Article · SIGMOD Record (ACM Special Interest Group on Management of Data) · January 1, 2001
It is often too expensive to compute and materialize a complete high-dimensional data cube. Computing an iceberg cube, which contains only aggregates above certain thresholds, is an effective way to derive nontrivial multidimensional aggregations for OLAP ...
Journal Article · Proceedings - International Conference on Data Engineering · January 1, 2001
Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. In this paper, we study constra ...
Conference · Proceedings - International Conference on Data Engineering · January 1, 2001
Sequential pattern mining is an important data mining problem with broad applications. It is challenging since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequential pattern ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2001
It is often too expensive to compute and materialize a complete high-dimensional data cube. Computing an iceberg cube, which contains only aggregates above certain thresholds, is an effective way to derive nontrivial multidimensional aggregations for OLAP ...
Conference · Workshop on Temporal Data Mining, 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01). ACM Press · 2001
Conference · Proceeding of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · December 1, 2000
Sequential pattern mining is an important data mining problem with broad applications. It is also a difficult problem since one may need to examine a combinatorially explosive number of possible subsequence patterns. Most of the previously developed sequen ...
Conference · SIGMOD 2000 - Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data · January 1, 2000
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. H ...
Conference · Proceeding of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · January 1, 2000
Recent studies show that constraint pushing may substantially improve the performance of frequent pattern mining, and methods have been proposed to incorporate interesting constraints in frequent pattern mining. However, some popularly encountered constrai ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2000
With the explosive growth of data available on the World Wide Web, discovery and analysis of useful information from the World Wide Web becomes a practical necessity. Web access pattern, which is the sequence of accesses pursued by users frequently, is a ...
Journal Article · SIGMOD Record (ACM Special Interest Group on Management of Data) · January 1, 2000
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. H ...
Journal Article · Acta Metallurgica Sinica (English Letters) · October 1, 1999
Data cube is the central mechanism in multi-dimensional data warehouse and online analytical processing (OLAP) based on multi-dimensional analysis. The algebra for OLAP data cube, including the basic conception, data logic model, important properties and o ...