dc.contributor.advisor: Cai, Tianxi
dc.contributor.author: Chan, Stephanie F.
dc.date.accessioned: 2019-05-16T12:40:51Z
dc.date.created: 2018-11
dc.date.issued: 2018-08-17
dc.date.submitted: 2018
dc.identifier.uri: http://nrs.harvard.edu/urn-3:HUL.InstRepos:39947170
dc.description.abstract: Electronic health records (EHRs) are electronic versions of patient charts, created to improve patient care. The adoption of EHRs in the US has increased significantly in the last decade, making them a rich resource for conducting clinical research. The breadth of EHRs, with detailed longitudinal patient data and information on a wide range of disease conditions, opens new opportunities for different types of clinical research. The detailed phenotypic information on individual patients allows multiple phenotypes to be studied simultaneously. A useful tool for such simultaneous assessment is the phenome-wide association study (PheWAS), which relates a genomic or biological marker of interest to a wide spectrum of disease phenotypes, typically defined by diagnostic billing codes. One challenge arises when the biomarker of interest is expensive to measure on the entire EMR cohort. Performing a PheWAS based on supervised estimation using only subjects who have marker measurements may yield limited power. In chapter 1, we focus on the PheWAS setting in which the marker is measured on a small fraction of the patients while a few surrogate markers, such as historical measurements of the biomarker, are available on a large number of patients. We propose an efficient semi-supervised estimation procedure to estimate the covariance between the biomarker and the billing code, leveraging the surrogate marker information. We employ surrogate marker values to impute the missing outcome via a two-step semi-non-parametric approach and demonstrate that our proposed estimator is always more efficient than the supervised counterpart without requiring the imputation model to be correct.
We illustrate the proposed procedure by assessing the association between C-reactive protein (CRP) and some inflammatory diseases in an EMR study of inflammatory bowel disease performed with the Partners HealthCare EMR, where CRP was measured for only a small fraction of the patients due to budget constraints. In chapters 2 and 3, we focus on the challenges in using EHRs to build risk prediction models. One major challenge is that the timing of disease onset is not readily available. Extracting clinical event times for patients requires labor-intensive medical chart reviews. Additionally, since a significant proportion of clinical events may occur prior to patients' first EHR encounter or outside of the specific hospital system, the EHR may capture only partial information on the event time. For example, a domain expert would be able to determine whether a patient has experienced a clinical outcome by the end of EHR follow-up, but the exact timing may be unknown even after chart review. The time to the first ICD-9 billing code for the clinical condition or the first NLP mention of the condition in the notes can serve as a proxy for the true event time, but is subject to measurement error. In chapter 2, we propose a robust approach to developing a risk prediction model by synthesizing multiple imperfect sources of information on the event time of interest. Treating the partially observed outcomes as survival times subject to current status censoring and survival times measured with error, we construct an optimally combined estimator under a flexible semi-parametric transformation model for the survival time given baseline predictors and unspecified measurement errors. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by assessing the effects of genetic markers on coronary artery disease in an EHR study of rheumatoid arthritis patients performed with the Partners HealthCare EMR.
In chapter 3, we propose a maximum likelihood estimator to estimate the risk of developing a disease by combining only the multiple imperfect sources of information on the event time of interest. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by predicting the risk of developing type 2 diabetes based on an obesity genetic risk score in a cohort of patients from the Partners Biobank.
dc.description.sponsorship: Biostatistics
dc.format.mimetype: application/pdf
dc.language.iso: en
dash.license: LAA
dc.subject: Biology, Biostatistics
dc.title: Risk Assessment with Imprecise EHR Data
dc.type: Thesis or Dissertation
dash.depositing.author: Chan, Stephanie F.
dc.date.available: 2019-05-16T12:40:51Z
thesis.degree.date: 2018
thesis.degree.grantor: Graduate School of Arts & Sciences
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
dc.contributor.committeeMember: Liu, Jun
dc.contributor.committeeMember: Patel, Chirag
dc.type.material: text
thesis.degree.department: Biostatistics
dash.identifier.vireo: http://etds.lib.harvard.edu/gsas/admin/view/2421
dc.description.keywords: electronic health records; phenome wide association studies; risk prediction
dash.author.email: stepcie@gmail.com