Jian Pei

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2026 Data augmentation is a series of techniques that generate high-quality artificial data by manipulating existing data samples. By leveraging data augmentation techniques, AI models can achieve significantly improved applicability in tasks involving scarce o ...

A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT

Journal Article International Journal of Machine Learning and Cybernetics · December 1, 2025 Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is trained on large-scale data, providing a solid parameter initialization for a wide range ...

Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey

Journal Article ACM Computing Surveys · October 6, 2025 Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), providing a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems i ...

The 2nd Workshop on Large Language Models for E-Commerce

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 3, 2025 Large Language Models (LLMs) are revolutionizing E-Commerce by enabling product recommendation, search, classification, question answering, and advertising applications. Their increasing adoption in real-world systems underscores their potential; however, ...

AI4DE: The 1st International Workshop on AI for Data Editing

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 3, 2025 Machine learning traditionally emphasizes developing models for given datasets, but real-world data is often messy, making model improvement insufficient for enhancing performance. AI for data editing (AI4DE) is an emerging field that systematically improv ...

A Survey on Small Language Models in the Era of Large Language Models: Architecture, Capabilities, and Trustworthiness

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 3, 2025 Large language models (LLMs) based on Transformer architecture are powerful but face challenges with deployment, inference latency, and costly fine-tuning. These limitations highlight the emerging potential of small language models (SLMs), which can either ...

Data and AI Markets in a Nutshell

Conference WWW Companion 2025: Companion Proceedings of the ACM Web Conference 2025 · May 23, 2025 Data and AI model services, often regarded as the driving force of the digital and AI economy, are powering a wide range of applications and creating significant business opportunities. Data and AI model service pipelines are frequently enabled by data and ...

Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters

Preprint · March 26, 2025

Finding Antagonistic Communities in Signed Uncertain Graphs

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2025 Many real-world networks are signed networks with positive and negative edge weights, such as social networks with positive (friend) or negative (foe) relationships between users, and gene interaction networks with positive (stimulatory) or negative (inhib ...

1122: AUTOMATED ABSTRACTION OF PULMONARY EMBOLISM CONCEPTS FROM CT REPORTS WITH LARGE LANGUAGE MODELS

Conference Critical Care Medicine · January 2025

Computing Shapley Values for Dynamic Data

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2025 Data valuation is a core function in data markets and cooperative data sharing. Shapley value is a widely used approach to fairly measure the contribution of data points towards a collective utility (e.g., a machine learning model trained from the data). H ...

EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models

Conference Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics · January 1, 2025 Electronic health records (EHRs) contain vast amounts of complex data, but harmonizing and processing this information remains a challenging and costly task requiring significant clinical expertise. While large language models (LLMs) have shown promise in ...

CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets

Conference Proceedings International Conference on Data Engineering · January 1, 2025 This paper introduces the novel concept of cost-sensitive data acquisition (CDA), a desirable addition to the data preparation process in a data science pipeline that focuses on strategically acquiring data from various priced sources, such as data markets ...

Computing Shapley Values in Preference Queries

Conference Proceedings International Conference on Data Engineering · January 1, 2025 This paper tackles the novel problem of computing Shapley values when multiple data owners collaborate to answer preference queries. Despite extensive existing research on preference queries and Shapley value computation separately, the evaluation of data ...

Ask Questions With Double Hints: Visual Question Generation With Answer-Awareness and Region-Reference.

Journal Article IEEE Transactions on Pattern Analysis and Machine Intelligence · December 2024 The visual question generation (VQG) task aims to generate human-like questions from an image and potentially other side information (e.g., answer type). Previous works on VQG fall in two aspects: i) They suffer from one image to many questions mapping pro ...

The Fourth International Workshop on Smart Data for Blockchain and Distributed Ledger (SDBD'24)

Conference Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 24, 2024 With the advent of Bitcoin, a cryptographically-enabled peer-to-peer digital payment system, blockchain together with a whole package of distributed ledger technologies, which serve as the underlying foundation of all the crypto-currencies, have been gaini ...

EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models

Preprint · June 28, 2024

Applications and Computation of the Shapley Value in Databases and Machine Learning

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 9, 2024 Recently, the Shapley value, a concept rooted in cooperative game theory, has found more and more applications in databases and machine learning. Due to its combinatoric nature, the computation of the Shapley value is #P-hard. To address this challenge, nu ...

Linear-Time Graph Neural Networks for Scalable Recommendations

Conference WWW 2024: Proceedings of the ACM Web Conference · May 13, 2024 In an era of information explosion, recommender systems are vital tools to deliver personalized recommendations for users. The key of recommender systems is to forecast users' future behaviors based on previous user-item interactions. Due to their strong e ...

FairSample: Training Fair and Accurate Graph Convolutional Neural Networks Efficiently

Journal Article IEEE Transactions on Knowledge and Data Engineering · April 1, 2024 Fairness in Graph Convolutional Neural Networks (GCNs) becomes a more and more important concern as GCNs are adopted in many crucial applications. Societal biases against sensitive groups may exist in many real world graphs. GCNs trained on those graphs ma ...

Optimization of Graph Clustering Inspired by Dynamic Belief Systems

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2024 Graph clustering is essential to understand the nature and behavior of real world such as social network, technical network and transportation network. Different from the existing studies, we propose a new Markov clustering method inspired by belief dynami ...

Database Native Model Selection: Harnessing Deep Neural Networks in Database Systems

Conference Proceedings of the VLDB Endowment · January 1, 2024 The growing demand for advanced analytics beyond statistical aggregation calls for database systems that support effective model selection of deep neural networks (DNNs). However, existing model selection strategies are based on either training-based algor ...

Protecting Data Buyer Privacy in Data Markets

Journal Article IEEE Internet Computing · January 1, 2024 Data markets serve as crucial platforms facilitating data discovery, exchange, sharing, and integration among data users and providers. However, the paramount concern of privacy has predominantly centered on protecting privacy of data owners and third part ...

Shapley Value Approximation Based on Complementary Contribution

Journal Article IEEE Transactions on Knowledge and Data Engineering · January 1, 2024 Shapley value provides a unique way to fairly assess each player's contribution in a coalition and has enjoyed many applications. However, the exact computation of Shapley value is #P-hard due to the combinatoric nature of Shapley value. Many existing appl ...

Position: TRUSTLLM: Trustworthiness in Large Language Models

Conference Proceedings of Machine Learning Research · January 1, 2024 Large language models (LLMs) have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. This paper introduces TRUSTLLM, a c ...

Counterfactual Explanation of Shapley Value in Data Coalitions

Journal Article Proceedings of the VLDB Endowment · January 1, 2024 The Shapley value is widely used for data valuation in data markets. However, explaining the Shapley value of an owner in a data coalition is an unexplored and challenging task. To tackle this, we formulate the problem of finding the counterfactual explana ...

RECALL: Membership Inference via Relative Conditional Log-Likelihoods

Conference EMNLP 2024: 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference · January 1, 2024 The rapid scaling of large language models (LLMs) has raised concerns about the transparency and fair use of the data used in their pretraining. Detecting such content is challenging due to the scale of the data and limited exposure of each instance during ...

Powering In-Database Dynamic Model Slicing for Structured Data Analytics

Conference Proceedings of the VLDB Endowment · January 1, 2024 Relational database management systems (RDBMS) are widely used for the storage of structured data. To derive insights beyond statistical aggregation, we typically have to extract specific subdatasets from the database using conventional database operations ...