Kernel-based logistic regression model for protein sequence without vectorialization.

Published

Journal Article

Protein sequence data are increasingly common in vaccine and infectious disease research. Such data are discrete, high-dimensional, and complex. We propose to study the impact of protein sequences on binary outcomes using a kernel-based logistic regression model, which captures the effect of the protein sequence through a random effect whose variance-covariance matrix is largely determined by a kernel function. We propose a novel, biologically motivated, profile hidden Markov model (HMM)-based mutual information (MI) kernel. Hypothesis testing can be carried out using the maximum of the score statistics and a parametric bootstrap procedure. To improve the power of the test, we propose intuitive modifications to the test statistic. We show through simulation studies that the profile HMM-based MI kernel can be substantially more powerful than competing kernels, and that the modified test statistics bring incremental gains in power. We use the proposed methods to investigate two problems from HIV-1 vaccine research: (1) identifying segments of the HIV-1 envelope (Env) protein that confer resistance to neutralizing antibodies and (2) identifying segments of Env that are associated with attenuation of the protective vaccine effect by antibodies of isotype A in the RV144 vaccine trial.
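
The testing idea in the abstract (a maximum of kernel score statistics, calibrated by a parametric bootstrap) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an intercept-only null model, uses a simple Hamming-distance kernel as a stand-in for the profile HMM-based MI kernel, takes the maximum over a small grid of a hypothetical kernel parameter `rho` without standardizing the individual statistics, and expects aligned sequences of equal length. All function names are illustrative.

```python
import numpy as np

def hamming_kernel(seqs, rho):
    """Stand-in kernel (assumption): K[i, j] = exp(-rho * Hamming distance)."""
    arr = np.array([list(s) for s in seqs])  # n x L matrix of characters
    dist = (arr[:, None, :] != arr[None, :, :]).sum(axis=2)
    return np.exp(-rho * dist)

def max_score_stat(y, kernels):
    """Max over kernels of Q = (y - mu0)' K (y - mu0), intercept-only null."""
    resid = y - y.mean()  # null fit: mu0 is the sample mean of y
    return max(float(resid @ K @ resid) for K in kernels)

def bootstrap_pvalue(y, kernels, n_boot=500, seed=0):
    """Parametric bootstrap: resample outcomes from the fitted null model."""
    rng = np.random.default_rng(seed)
    observed = max_score_stat(y, kernels)
    mu0 = y.mean()
    null_stats = np.array([
        max_score_stat(rng.binomial(1, mu0, size=y.size).astype(float), kernels)
        for _ in range(n_boot)
    ])
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_boot)
    return observed, p_value

# Toy usage with made-up sequences and outcomes (illustration only).
seqs = ["ACDE", "ACDF", "GCDE", "GHIK", "AHIK", "GHIF"]
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
kernels = [hamming_kernel(seqs, rho) for rho in (0.5, 1.0, 2.0)]
observed, p_value = bootstrap_pvalue(y, kernels, n_boot=200)
```

In practice the null model would include covariates and each statistic would typically be put on a common scale before taking the maximum; both are omitted here for brevity.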

Cited Authors

  • Fong, Y; Datta, S; Georgiev, IS; Kwong, PD; Tomaras, GD

Published Date

  • July 2015

Published In

  • Biostatistics

Volume / Issue

  • 16 / 3

Start / End Page

  • 480 - 492

PubMed ID

  • 25532524

PubMed Central ID

  • 25532524

Electronic International Standard Serial Number (EISSN)

  • 1468-4357

International Standard Serial Number (ISSN)

  • 1465-4644

Digital Object Identifier (DOI)

  • 10.1093/biostatistics/kxu056

Language

  • eng