Kernel-based logistic regression model for protein sequence without vectorialization.

Publication, Journal Article
Fong, Y; Datta, S; Georgiev, IS; Kwong, PD; Tomaras, GD
Published in: Biostatistics
July 2015

Protein sequence data arise more and more often in vaccine and infectious disease research. These types of data are discrete, high-dimensional, and complex. We propose to study the impact of protein sequences on binary outcomes using a kernel-based logistic regression model, which models the effect of protein through a random effect whose variance-covariance matrix is mostly determined by a kernel function. We propose a novel, biologically motivated, profile hidden Markov model (HMM)-based mutual information (MI) kernel. Hypothesis testing can be carried out using the maximum of the score statistics and a parametric bootstrap procedure. To improve the power of testing, we propose intuitive modifications to the test statistic. We show through simulation studies that the profile HMM-based MI kernel can be substantially more powerful than competing kernels, and that the modified test statistics bring incremental gains in power. We use these proposed methods to investigate two problems from HIV-1 vaccine research: (1) identifying segments of HIV-1 envelope (Env) protein that confer resistance to neutralizing antibody and (2) identifying segments of Env that are associated with attenuation of protective vaccine effect by antibodies of isotype A in the RV144 vaccine trial.
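The testing idea in the abstract (a maximum of score statistics calibrated by a parametric bootstrap) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an intercept-only null model, uses an unstandardized score-type statistic r'Kr on the null residuals, and substitutes a simple Hamming-distance kernel for the profile HMM-based MI kernel, which is not reproduced here. The function names, the grid of rho values, and the toy sequences are all hypothetical.

```python
import numpy as np

def hamming_kernel(seqs, rho):
    """Stand-in kernel exp(-rho * Hamming distance) for equal-length sequences.
    Assumption: this replaces the paper's profile HMM-based MI kernel."""
    codes = np.array([[ord(c) for c in s] for s in seqs])
    dist = (codes[:, None, :] != codes[None, :, :]).sum(axis=2)
    return np.exp(-rho * dist)

def max_score(y, kernels):
    """Maximum over candidate kernels of the score-type statistic r' K r,
    where r are residuals from an intercept-only null fit (assumption)."""
    r = y - y.mean()  # the intercept-only logistic MLE gives fitted mean(y)
    return max(float(r @ K @ r) for K in kernels)

def bootstrap_pvalue(y, kernels, n_boot=500, seed=0):
    """Parametric bootstrap: simulate outcomes from the fitted null model,
    refit the null, and recompute the maximum statistic each time."""
    rng = np.random.default_rng(seed)
    observed = max_score(y, kernels)
    mu_hat = y.mean()
    exceed = 0
    for _ in range(n_boot):
        y_star = rng.binomial(1, mu_hat, size=len(y)).astype(float)
        if max_score(y_star, kernels) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_boot)

# Toy data, made up purely for illustration.
seqs = ["ACDE", "ACDF", "ACDE", "GHIK", "GHIL", "GHIK", "ACDE", "GHIK"]
y = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=float)
kernels = [hamming_kernel(seqs, rho) for rho in (0.1, 0.5, 1.0)]
p_value = bootstrap_pvalue(y, kernels, n_boot=200, seed=0)
print(p_value)
```

Taking the maximum over a grid of kernel parameters avoids committing to a single value in advance, and the bootstrap accounts for that selection because the maximum is recomputed on every simulated data set.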

Published In

Biostatistics

DOI

10.1093/biostatistics/kxu056

EISSN

1468-4357

Publication Date

July 2015

Volume

16

Issue

3

Start / End Page

480 / 492

Location

England

Related Subject Headings

  • env Gene Products, Human Immunodeficiency Virus
  • Statistics & Probability
  • Sequence Analysis, Protein
  • Models, Statistical
  • Markov Chains
  • Logistic Models
  • Immunoglobulin G
  • Immunoglobulin A
  • Humans
  • HIV-1
 

Citation

APA
Fong, Y., Datta, S., Georgiev, I. S., Kwong, P. D., & Tomaras, G. D. (2015). Kernel-based logistic regression model for protein sequence without vectorialization. Biostatistics, 16(3), 480–492. https://doi.org/10.1093/biostatistics/kxu056

Chicago
Fong, Youyi, Saheli Datta, Ivelin S. Georgiev, Peter D. Kwong, and Georgia D. Tomaras. “Kernel-based logistic regression model for protein sequence without vectorialization.” Biostatistics 16, no. 3 (July 2015): 480–92. https://doi.org/10.1093/biostatistics/kxu056.

ICMJE
Fong Y, Datta S, Georgiev IS, Kwong PD, Tomaras GD. Kernel-based logistic regression model for protein sequence without vectorialization. Biostatistics. 2015 Jul;16(3):480–92.

MLA
Fong, Youyi, et al. “Kernel-based logistic regression model for protein sequence without vectorialization.” Biostatistics, vol. 16, no. 3, July 2015, pp. 480–92. Pubmed, doi:10.1093/biostatistics/kxu056.

NLM
Fong Y, Datta S, Georgiev IS, Kwong PD, Tomaras GD. Kernel-based logistic regression model for protein sequence without vectorialization. Biostatistics. 2015 Jul;16(3):480–492.