Abstract 2255: Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines

Publication, Conference
Li, Y; Bingham, A; Li, Q-J; Zhuang, Y; Umbach, DM; Li, L
Published in: Cancer Research
July 1, 2018

Tumor purity is the percentage of cancer cells present in a sample of tumor tissue. The noncancerous cells (stromal cells) in a tumor are thought to have an important role in tumor growth, metastatic progression, and drug resistance. They also strongly influence genomic analyses of tumor samples. The Cancer Genome Atlas (TCGA) has extensive RNA-seq data from tumor tissue samples as well as assessments of tumor purity for the samples. Our goal is to select, for each tumor type, a subset of genes whose expression levels are predictive of tumor purity, as well as a subset of genes whose expression levels are predictive of tumor purity when samples from all tumor types are pooled together. We hope that the selected genes may provide insight into the cell-type composition of tumor samples and into the similarities and differences in tumor microenvironments. We used TCGA data covering 11 different tumor types, with genome-wide gene expression assessments on over 3,148 samples. To identify predictive genes, we used XGBoost, a supervised machine learning algorithm based on boosted regression tree ensembles. We carried out 100 repeated runs of 10-fold cross-validation (a total of 1,000 train-test partitions) for each tumor type and also for all tumor types combined. Using the training-set samples, XGBoost selects a set of genes to predict tumor purity levels; the selected genes are subsequently used to predict the purity levels of the test-set samples. Across the 1,000 train-test partitions for all 11 tumor types, the average root-mean-squared error ranged from 0.09 to 0.16 for the test sets. For each tumor type, we selected the top 250 genes based on their aggregated feature importance scores, a measure of each gene's contribution to tumor purity estimation.
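The evaluation protocol described above (100 repeats of 10-fold cross-validation, test-set root-mean-squared error, and ranking genes by aggregated importance) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `repeated_kfold`, `rmse`, and `top_genes` are hypothetical, the model-fitting step is left out, and summing per-run importance scores is an assumption, since the abstract does not state how the scores were aggregated.

```python
import math
import random
from collections import defaultdict


def repeated_kfold(n_samples, n_splits=10, n_repeats=100, seed=0):
    """Yield (train_idx, test_idx) pairs: n_repeats shuffles x n_splits folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin into n_splits folds.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def top_genes(importances_per_run, k=250):
    """Sum per-run importance scores by gene (an assumed aggregation rule)
    and return the k genes with the highest totals."""
    total = defaultdict(float)
    for run in importances_per_run:
        for gene, score in run.items():
            total[gene] += score
    return sorted(total, key=total.get, reverse=True)[:k]
```

With the default arguments, `repeated_kfold` produces 100 x 10 = 1,000 train-test partitions, matching the count given in the abstract. In a full pipeline, a boosted-tree regressor would be fit on each training split, `rmse` would be computed on the matching test split, and the model's per-gene importance scores would be collected into `importances_per_run`.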
No single gene was among the top 250 in all 11 tumor types; however, ACAP1, AMICA1, CSF2RB, CYTIP, GGT5, GLIPR1, IRF4, and PECAM1 were not only among the top 250 in more than 6 tumor types but also in the top 250 when all tumors were combined, suggesting those genes might serve as biomarkers for tumor purity. The most common pathways from gene ontology analysis of these top genes include various immune and signaling pathways. We used XGBoost to identify genes whose expression levels were associated with tumor purity levels in each tumor type. Our results suggest that assessed tumor purity levels in tumor samples can be faithfully recapitulated using certain subsets of genes. We believe that those genes selected for each tumor type by our unbiased approach might provide insight into the biology of the tumor microenvironment, e.g., the presence of cell type-specific marker genes would indicate the presence of specific cell types.

Citation Format: YuanYuan Li, Adrienna Bingham, Qi-Jing Li, Yuan Zhuang, David M. Umbach, Leping Li. Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 2255.
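The cross-tumor-type comparison reported in the abstract (genes in the top 250 for more than 6 tumor types that are also in the top 250 for the pooled analysis) amounts to a counting step over the per-type gene lists. A minimal sketch follows, assuming the lists are already available; the function name `recurrent_genes` and its argument names are hypothetical, and "more than 6" is expressed as a minimum of 7.

```python
from collections import Counter


def recurrent_genes(top_by_type, top_pooled, min_types=7):
    """Return genes that appear in at least min_types per-type top lists
    and also in the pooled top list, sorted alphabetically."""
    # Count each gene once per tumor type, even if a list repeats it.
    counts = Counter(g for genes in top_by_type.values() for g in set(genes))
    pooled = set(top_pooled)
    return sorted(g for g, n in counts.items() if n >= min_types and g in pooled)
```

Here `top_by_type` would map each of the 11 tumor-type labels to its list of top 250 genes, and `top_pooled` would be the top 250 list from the combined analysis.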

Duke Scholars

Published In

Cancer Research

DOI

10.1158/1538-7445.am2018-2255

EISSN

1538-7445

ISSN

0008-5472

Publication Date

July 1, 2018

Volume

78

Issue

13_Supplement

Start / End Page

2255 / 2255

Publisher

American Association for Cancer Research (AACR)

Related Subject Headings

  • Oncology & Carcinogenesis
  • 3211 Oncology and carcinogenesis
  • 3101 Biochemistry and cell biology
  • 1112 Oncology and Carcinogenesis
 

Citation

APA: Li, Y., Bingham, A., Li, Q.-J., Zhuang, Y., Umbach, D. M., & Li, L. (2018). Abstract 2255: Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines. In Cancer Research (Vol. 78, pp. 2255–2255). American Association for Cancer Research (AACR). https://doi.org/10.1158/1538-7445.am2018-2255

Chicago: Li, YuanYuan, Adrienna Bingham, Qi-Jing Li, Yuan Zhuang, David M. Umbach, and Leping Li. “Abstract 2255: Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines.” In Cancer Research, 78:2255–2255. American Association for Cancer Research (AACR), 2018. https://doi.org/10.1158/1538-7445.am2018-2255.

ICMJE: Li Y, Bingham A, Li Q-J, Zhuang Y, Umbach DM, Li L. Abstract 2255: Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines. In: Cancer Research. American Association for Cancer Research (AACR); 2018. p. 2255–2255.

MLA: Li, YuanYuan, et al. “Abstract 2255: Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines.” Cancer Research, vol. 78, no. 13_Supplement, American Association for Cancer Research (AACR), 2018, pp. 2255–2255. Crossref, doi:10.1158/1538-7445.am2018-2255.

NLM: Li Y, Bingham A, Li Q-J, Zhuang Y, Umbach DM, Li L. Abstract 2255: Using tumor sample gene expression data to infer tumor purity levels with stochastic gradient boosting machines. Cancer Research. American Association for Cancer Research (AACR); 2018. p. 2255–2255.
