
Scale-free and unbiased transformer with tokenization for cell type annotation from single-cell RNA-seq data
The exponential growth of high-throughput single-cell RNA sequencing (scRNA-seq) data enables finer-grained elucidation of gene-cell expression patterns across diverse species, making the development of efficient cell type annotation methods increasingly pressing. Although numerous high-performing annotation methods have been introduced, they still struggle with challenges such as dropout events and high-dimensional feature redundancy across more than 20,000 genes, and they are constrained by method-specific limitations that may introduce manual biases. To address these challenges, we developed a deep end-to-end model (scSFUT) that flexibly annotates single-cell datasets of varying scale in a purely data-driven manner, based on an accurate, bias-free attention mechanism that uses full-length gene expression. Specifically, scSFUT first tokenizes the gene expression vector and applies 1D convolution to integrate comprehensive intra- and inter-token gene pathway information. In addition, by coupling a self-supervised masking reconstruction strategy with the cell annotation task, scSFUT enables the shared encoder to learn representative latent features at the global cell level through joint optimization of the two corresponding losses. In rigorous evaluations across 5 real datasets from different species, scSFUT demonstrates competitive performance and broader applicability compared with state-of-the-art methods.
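The tokenization, masking, and joint-loss ideas described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the scSFUT implementation: the function names (tokenize, mask_tokens, reconstruction_loss, joint_loss), the zero-padding scheme, the mean-squared-error reconstruction term, and the weighting parameter are all assumptions made for illustration, and the 1D-convolution and attention layers are omitted.

```python
import math
import random


def tokenize(expression, token_len):
    """Split a full-length expression vector into fixed-length tokens.

    Assumption: the tail is zero-padded so every token has token_len genes.
    """
    pad = (-len(expression)) % token_len
    padded = list(expression) + [0.0] * pad
    return [padded[i:i + token_len] for i in range(0, len(padded), token_len)]


def mask_tokens(tokens, mask_ratio, rng):
    """Zero out a random subset of tokens; return the masked copy and the masked indices."""
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = [
        [0.0] * len(tok) if i in masked_idx else list(tok)
        for i, tok in enumerate(tokens)
    ]
    return masked, masked_idx


def reconstruction_loss(original, reconstructed, masked_idx):
    """Mean squared error computed over the masked tokens only (an assumed choice)."""
    sq = [
        (o - r) ** 2
        for i in masked_idx
        for o, r in zip(original[i], reconstructed[i])
    ]
    return sum(sq) / len(sq)


def cross_entropy(probs, label):
    """Negative log-likelihood of the true cell type under predicted probabilities."""
    return -math.log(probs[label])


def joint_loss(rec, cls, weight=1.0):
    """Combine annotation and reconstruction losses; weight is a hypothetical hyperparameter."""
    return cls + weight * rec


# Toy usage: 7 genes, tokens of 3 genes each, roughly one third of tokens masked.
expr = [0.0, 2.0, 0.0, 1.0, 3.0, 0.0, 0.0]
tokens = tokenize(expr, 3)
masked, idx = mask_tokens(tokens, 0.34, random.Random(0))
```

In a real model the encoder would see the masked tokens, a decoder head would produce the reconstruction, and a classifier head would produce the cell type probabilities; both losses would then be backpropagated through the shared encoder.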
Duke Scholars
Published In
DOI
ISSN
Publication Date
Volume
Related Subject Headings
- Artificial Intelligence & Image Processing
- 4611 Machine learning
- 4605 Data management and data science
- 4603 Computer vision and multimedia computation
- 0906 Electrical and Electronic Engineering
- 0806 Information Systems
- 0801 Artificial Intelligence and Image Processing
Citation
