Using Machine Learning to Predict Antidepressant Treatment Outcome From Electronic Health Records
Abstract
Objective
To evaluate if a machine learning approach can accurately predict antidepressant treatment outcome using electronic health records (EHRs) from patients with depression.
Method
This study examined 808 patients with depression at a New York City-based outpatient mental health clinic between June 13, 2016 and June 22, 2020. Antidepressant treatment outcome was defined based on the trend in depression symptom severity over time and was categorized as either "Recovering" or "Worsening" (i.e., non-Recovering), as measured by the slope of the individual-level Patient Health Questionnaire-9 (PHQ-9) score trajectory spanning the 6 months following treatment initiation. A patient was designated as "Recovering" if the slope was less than 0 and as "Worsening" otherwise. Multiple machine learning (ML) models, including L2 norm regularized Logistic Regression, Naive Bayes, Random Forest, and Gradient Boosting Decision Tree (GBDT), were used to predict treatment outcome based on additional data from the EHRs, including demographics and diagnoses. SHapley Additive exPlanations were applied to identify the most important predictors.
Results
The GBDT model achieved the best performance in predicting "Recovering" (AUC: 0.7654 ± 0.0227; precision: 0.6002 ± 0.0215; recall: 0.5131 ± 0.0336). When patients with low baseline PHQ-9 scores (<10) were excluded, performance in predicting "Recovering" was AUC 0.7254 ± 0.0218, precision 0.5392 ± 0.0437, and recall 0.4431 ± 0.0513. Prior diagnosis of anxiety, psychotherapy, recurrent depression, and baseline depression symptom severity were strong predictors.
Conclusions
The results demonstrate the potential utility of using ML in longitudinal EHRs to predict antidepressant treatment outcome. Our predictive tool holds the promise to accelerate personalized medical management in patients with psychiatric illnesses.
Highlights
Longitudinal questionnaire data were used to measure antidepressant treatment outcome.
Machine learning models were used to predict outcome from electronic health records.
The gradient boosting decision tree model achieved the best predictive results.
Diagnostic codes and baseline severity were strong predictors of treatment outcome.
Depression is one of the most prevalent psychiatric disorders, affecting approximately 14% of the global population (1). The economic costs of depression are staggering, and it has become the second leading contributor to global disease burden (2). While antidepressants are commonly prescribed to patients suffering from depression (3), owing to the complex etiology and heterogeneous symptomatology of depression, prior studies suggest that antidepressant treatment efficacy is usually low, with as few as 11–30% of depressed patients achieving remission after initial treatment (4). The use of prediction tools in areas of medicine such as oncology, cardiology, and radiology has played an important role in clinical decision-making (5, 6), suggesting the potential utility of such tools for predicting antidepressant treatment efficacy.
Recent studies have assessed the prediction of antidepressant treatment outcomes, such as response or the achievement of remission, using brain imaging, social status, and electronic health records (EHRs). In particular, in studies using functional magnetic resonance imaging (fMRI) data (6, 7, 8, 9), the mean activation and differential response were analyzed for case and control groups, and specific observations were found to be predictive of the response to antidepressant treatment. However, the cost and time associated with collecting and processing fMRI data may hinder the practical use of this approach (4). Previous studies have also used patient self-report data, including socioeconomic status (4, 10, 11), to evaluate whether patients would achieve symptomatic remission. However, self-reported social information may be less precise as an outcome measure and prone to nonresponse bias.
In addition, prior studies based on EHRs mainly extracted feature information from patients' medical records (10, 12, 13, 14), such as medication dose information, to predict treatment dropout or remission after receiving antidepressants. However, these studies did not consider patients' baseline depression severity or other clinical data such as diagnostic codes in prediction model development. Furthermore, most previous studies defined the outcome based on a behavioral assessment summarized in a numerical variable (e.g., Montgomery-Asberg Depression Rating Scale (MADRS) score or Patient Health Questionnaire-9 (PHQ-9) score) (6), considering only the baseline and final scores of a specified treatment period, without taking into account the variation in patients' self-reported scores and depression severity over that interval.
This study applies a data‐driven approach to address these limitations and seeks to evaluate trends in depressive symptom severity over time and evaluate the ability to use EHRs and machine learning methods to predict antidepressant treatment outcome. In particular,
1. We defined antidepressant treatment outcome by the slope of a linear regression fit to all PHQ-9 scores across a treatment period. Compared to the aforementioned studies, which considered only the difference between the PHQ-9 scores at baseline and at the end of a treatment period, our repeated-measures method takes into account the intermediate changes in PHQ-9 scores across multiple time points. In this way, we can capture fine-grained changes in PHQ-9 scores over time and measure treatment outcome more accurately.
2. We incorporated baseline depression severity information and a range of EHR-derived data types, including patient demographics, comorbidities, procedures, and prescription medications, to train the predictive models.
3. We investigated the impact of longitudinal EHR availability on predictive performance across different time periods. The most important features of the predictive models were also identified and examined.
MATERIALS AND METHODS
Dataset
Fully de-identified study data were acquired from an outpatient behavioral health clinic at a major academic medical center in New York City whose mission is to facilitate access to outpatient mental health services for managed care patients. Patients who receive care at this clinic require a mental health referral from a primary care provider and, upon intake, undergo a strict screening assessment by a psychiatrist to confirm that this closely monitored setting is appropriate for their care needs. There were 3380 patients with PHQ-9 scores in the dataset between June 13, 2016, and June 22, 2020. The PHQ-9 is a multipurpose instrument for screening, monitoring, and measuring the severity of depression. While not a diagnostic instrument, the PHQ-9 has been supported and used to define treatment outcome measures in prior studies (15, 16, 17). A total of 808 adult patients (≥18 years old) who received at least one antidepressant prescription (Supplementary eTable 1) were included in the final analysis (Cohort-A). An advantage of this dataset is that close monitoring ensures continuity of care, and antidepressants prescribed in this clinic are highly likely to be intended for the presenting mental health condition. In a sub-analysis, we built a separate cohort (Cohort-B) of 467 patients after removing patients with baseline PHQ-9 scores less than 10, to account for baseline depression severity (18). The exclusion cascade is shown in Supplementary eFigure 1. Additional details regarding the dataset are shown in the Supplemental files. IRB exemption was granted for use of this database for research.
Outcome
The primary outcome was established as the course of depression during the first 6 months following the prescription of an antidepressant, which for the benefit of clinical interpretability was categorized as either “Recovering” or “Worsening.” The outcome was measured using all PHQ‐9 scores recorded during a 6‐month period. The overall trend for each patient's PHQ‐9 score trajectory was represented by a slope, which was obtained using a linear regression model (19). This technique of using the slope on all PHQ‐9 scores to estimate the treatment outcome has the potential to capture the overall trend in a patient's symptom severity over time. A slope of a patient's PHQ‐9 score trajectory of less than 0 suggested a decline in the trajectory of depression over time. Thus, the treatment outcome was classified as “Recovering.” Otherwise, the outcome was classified as “Worsening.”
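As a minimal sketch of this labeling rule, the slope can be computed with an ordinary least-squares fit over a patient's (day, score) pairs; the helper and the trajectories below are hypothetical and for illustration only:

```python
def classify_outcome(days, scores):
    """Label a PHQ-9 trajectory as "Recovering" (negative OLS slope)
    or "Worsening" (slope >= 0), following the definition above."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(scores) / n
    # Ordinary least-squares slope: cov(x, y) / var(x)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(days, scores))
        / sum((x - mean_x) ** 2 for x in days)
    )
    return "Recovering" if slope < 0 else "Worsening"

# Hypothetical 6-month trajectories (days since index prescription).
print(classify_outcome([0, 60, 120, 180], [15, 12, 10, 7]))  # → Recovering
print(classify_outcome([0, 60, 120, 180], [8, 9, 11, 12]))   # → Worsening
```

Note that, under this rule, a perfectly flat trajectory (slope of exactly 0) is labeled "Worsening."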
Study Design
Antidepressant treatment outcome was predicted based on patient demographics, baseline PHQ‐9 scores, comorbidities, procedures, and prescription medication data derived from the EHRs at the time of the index antidepressant prescription, censoring all subsequent data (Figure 1). In particular, the information during the observation window before time “t” (the time of the first antidepressant prescription) was used to predict the “Recovering” or “Worsening” outcome during the next 6 months following the antidepressant prescription. The labels “Recovering” or “Worsening” were determined based on the PHQ‐9 scores extracted between t and the prediction time t + T. Three experiments were conducted by setting different durations of the observation window, including 1 year, 2 years, and all years.
Feature Derivation
Features for building risk prediction models were extracted from patient demographics (age, sex, and race), baseline PHQ‐9 scores, medical and psychiatric comorbidities, procedures (i.e., interventions), and prescription medication orders. The comorbidity, procedure, and medication codes were extracted from the EHRs by counting the number of times that each of these codes appeared in a patient's past medical history. We considered only the features that could be found in the medical histories of at least 10 patients. For each diagnostic code (e.g., ICD‐10 codes A60.02, A60.04, A60.09), the first three characters (e.g., A60) were used to aggregate similar disease diagnostic codes together. Each patient was represented by a count vector based on their past diagnosis, procedures, and prescription medication history, which was combined with demographic and baseline PHQ‐9 scores for training predictive models. The counts of these codes are shown in Supplementary eTable 2.
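A minimal sketch of this count-vector construction is shown below; the helper name, code strings, and demographic encoding are hypothetical and only illustrate the aggregation and counting steps described above:

```python
from collections import Counter

def build_feature_vector(diagnosis_codes, other_codes, feature_space,
                         demographics, baseline_phq9):
    # ICD-10 codes are truncated to their first three characters so that
    # related diagnoses (A60.02, A60.04, A60.09) collapse into one feature (A60).
    counts = Counter(code[:3] for code in diagnosis_codes)
    counts.update(other_codes)  # procedure and medication codes are counted as-is
    # feature_space: the fixed, ordered list of codes that appeared in the
    # histories of at least 10 patients; absent codes get a count of 0.
    code_counts = [counts.get(code, 0) for code in feature_space]
    return code_counts + demographics + [baseline_phq9]

vec = build_feature_vector(
    diagnosis_codes=["A60.02", "A60.04", "F41.1"],
    other_codes=["RX_SERTRALINE"],
    feature_space=["A60", "F41", "F33", "RX_SERTRALINE"],
    demographics=[35, 1, 0],  # e.g., age plus an encoded sex/race (hypothetical)
    baseline_phq9=12,
)
print(vec)  # → [2, 1, 0, 1, 35, 1, 0, 12]
```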
Prediction Models and Metrics
Popular machine learning (ML) models, including L2 norm regularized Logistic Regression (LR) (20), Naive Bayes (NB) (21), Random Forest (RF) (22), and Gradient Boosting Decision Tree (GBDT), were used to predict antidepressant treatment outcome. We used nested cross-validation for each model (23), in which an outer 5-fold cross-validation split the dataset into training and testing data, and an inner 5-fold cross-validation on the training data was used for tuning the hyperparameters. For LR, NB, and RF, we used the scikit-learn software library (24). For GBDT, we used the XGBoost software library (25). The area under the receiver operating characteristic curve (AUC) was used to evaluate model performance. Precision and recall were also calculated for reference.
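The nested cross-validation scheme can be sketched as follows; the splitting helper is illustrative, and model fitting and hyperparameter tuning are elided:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Nested 5x5 scheme: the outer loop estimates generalization performance,
# while the inner loop (run on outer-training data only) selects hyperparameters.
n_patients = 100
for outer_train, outer_test in kfold_indices(n_patients, k=5):
    for inner_train, inner_val in kfold_indices(len(outer_train), k=5):
        pass  # tune hyperparameters here (e.g., tree depth, learning rate)
    # refit on outer_train with the best setting, then evaluate AUC on outer_test
```

Keeping the test fold outside the tuning loop prevents hyperparameter selection from leaking information into the reported AUC.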
RESULTS
Study Cohort Characteristics
Data from 808 patients were analyzed, comprising 243 "Worsening" and 565 "Recovering" patients. The median age was 35 (IQR: 29.0–45.0) years. Females composed 61.39% of the study cohort, and 53.47% of patients were White. At baseline, the median PHQ-9 score was 11 (IQR: 7.0–16.0). On average, patients in the "Recovering" group scored three points higher at baseline than patients in the "Worsening" group. At baseline, 341 (42.20%) patients exhibited no or mild depression (PHQ-9 ≤ 9), 209 (25.87%) moderate depression (9 < PHQ-9 ≤ 14), and 258 (31.93%) moderately severe to severe depression (PHQ-9 > 14). The "Worsening" group included more patients with no or mild depression (n = 139, 57.20%) at baseline, whereas the "Recovering" group included more patients with moderately severe to severe depression (n = 206, 36.46%) at baseline. Additional details about the study cohort are shown in Table 1.
Table 1. Characteristics of the study cohort.

| Characteristic | All patients | "Recovering" group | "Worsening" group | p value |
|---|---|---|---|---|
| Age, median (Q1–Q3) | 35.0 (29.0–45.0) | 36.0 (30.0–45.0) | 34.0 (28.0–45.5) | 0.06 |
| Gender, n (%) | | | | 0.79 |
| Female | 496 (61.39%) | 349 (61.77%) | 147 (60.49%) | |
| Male | 312 (38.61%) | 216 (38.23%) | 96 (39.51%) | |
| Race, n (%) | | | | 0.98 |
| White | 432 (53.47%) | 302 (53.45%) | 130 (53.50%) | |
| Asian | 64 (7.92%) | 46 (8.14%) | 18 (7.41%) | |
| Black | 51 (6.31%) | 36 (6.37%) | 15 (6.17%) | |
| Other | 261 (32.30%) | 181 (32.04%) | 80 (32.92%) | |
| Baseline PHQ-9 score, median (Q1–Q3) | 11.0 (7.0–16.0) | 12.0 (8.0–17.0) | 9.0 (5.0–13.0) | <0.001 |
| Baseline PHQ-9 score category, n (%) | | | | <0.001 |
| No or mild | 341 (42.20%) | 202 (35.75%) | 139 (57.20%) | |
| Moderate | 209 (25.87%) | 157 (27.79%) | 52 (21.40%) | |
| Severe | 258 (31.93%) | 206 (36.46%) | 52 (21.40%) | |
Model Discrimination
Table 2 shows the AUC for the LR, NB, RF, and GBDT models across different data collection window lengths. We observed that incorporating more longitudinal data improved prediction performance, as demonstrated by increases in the AUC. Gradient Boosting Decision Tree outperformed LR, NB, and RF. The best prediction performance was obtained by GBDT (0.7654 ± 0.0227) when the data collection window was set to "all years"; this result was based on Cohort-A, which contained 341 (42.20%) patients with no or mild depression (PHQ-9 < 10) at baseline.
Table 2. AUC (mean ± SD) of each ML model for different observation window lengths.

| Observation window | LR | NB | RF | GBDT |
|---|---|---|---|---|
| 1 year's EHRs | 0.6881 ± 0.0408 | 0.6985 ± 0.0513 | 0.7082 ± 0.0221 | 0.7197 ± 0.0301 |
| 2 years' EHRs | 0.6971 ± 0.0512 | 0.7101 ± 0.0456 | 0.7197 ± 0.0131 | 0.7363 ± 0.0215 |
| All years' EHRs | 0.7204 ± 0.0463 | 0.7392 ± 0.0301 | 0.7496 ± 0.0307 | 0.7654 ± 0.0227 |
In clinical practice, patients with baseline PHQ-9 scores less than 10 (i.e., with no or mild depression) may not receive antidepressant treatment or may receive only low-intensity treatment (18). Furthermore, a lower initial PHQ-9 score leaves less room for a decreasing trend over time. Therefore, we excluded patients with baseline PHQ-9 scores less than 10 and conducted a sub-analysis of prediction performance on Cohort-B. The GBDT model was trained on "all years" of data, and an AUC of 0.7254 ± 0.0218 was obtained. More details about the number of patients in the "Recovering" and "Worsening" groups, as well as the prediction performance (AUC, precision, recall), are shown in Supplementary eTable 3. Because of its clinical significance, we conducted the following experiments on Cohort-B.
Prediction Performance Using Different Types of Information with Gradient Boosting Decision Tree
We evaluated the prediction performance of GBDT using different types of information extracted from the EHRs of Cohort-B. The AUC obtained with each type of information is shown in Figure 2. From these results, we observe that diagnostic code information yielded the best prediction performance relative to the other types of information. The prediction performance using procedure information was similar to that using medication information. Combining demographic and baseline PHQ-9 score information yielded better prediction results, perhaps because the baseline PHQ-9 score played an important role in predicting the outcomes.
Feature Importance
Each feature's importance for prediction was investigated on Cohort-B, and the top 10 features were selected using the feature importance scores output by GBDT. The positive or negative impact of these features on the prediction was derived using the SHapley Additive exPlanations (SHAP) tool (26), which explains the predicted outcome by computing the individual contribution of each feature to the overall prediction. This enabled us to fairly distribute the prediction among the top ten pre-treatment features important in the prediction, which included baseline PHQ-9 score, as well as anxiety, psychotherapy, fatigue, stress, lorazepam, and acetaminophen (see Figure 3). An interesting finding was that acetaminophen was among the top features predictive of PHQ-9 slopes, possibly because acetaminophen alters emotions and reduces emotional distress during physical pain. A previous study has shown that nonsteroidal anti-inflammatory drugs (NSAIDs) and paracetamol can yield effects similar to antidepressants, in particular selective serotonin reuptake inhibitors (SSRIs) (27).
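In practice, SHAP values for tree ensembles are computed efficiently by the SHAP library; the toy example below only illustrates the underlying principle of exact Shapley attribution, using a hypothetical value function over feature subsets:

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley attribution: each feature's average marginal
    contribution over all orders in which features can be added."""
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        included = set()
        for f in order:
            before = value(frozenset(included))
            included.add(f)
            phi[f] += value(frozenset(included)) - before
    return {f: total / len(orderings) for f, total in phi.items()}

# Hypothetical value function: model output given a subset of features.
def v(subset):
    score = 0.0
    if "baseline_phq9" in subset:
        score += 0.3
    if "anxiety_dx" in subset:
        score += 0.2
    if {"baseline_phq9", "anxiety_dx"} <= subset:
        score += 0.1  # interaction: the two features reinforce each other
    return score

res = shapley_values(["baseline_phq9", "anxiety_dx"], v)
print({f: round(x, 6) for f, x in res.items()})
# → {'baseline_phq9': 0.35, 'anxiety_dx': 0.25}
```

By construction, the attributions sum to the value of the full feature set, which is the "fair distribution" property referenced above.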
DISCUSSION
This study demonstrates the potential of using machine learning to identify clinically meaningful predictors of antidepressant treatment outcome, especially when a slope fit of all PHQ-9 scores represents longitudinal treatment recovery. In this investigation of 808 individuals with EHRs, predictive models including LR, NB, RF, and GBDT were built to discriminate outcome by integrating multiple types of clinical information, such as demographics, diagnostic codes, and medications. Combining multiple types of features builds more complete representations of patients, which can improve the predictive performance of ML models.
In this study, discrimination was modest, with an AUC of 0.7654 (SD: 0.0227) obtained by GBDT when considering EHRs from "all years." Discrimination differed when considering 1 year, 2 years, and all years of clinical data, and considering more longitudinal data tended to improve prediction performance. After removing individuals with low baseline PHQ-9 scores (<10), a lower AUC of 0.7254 (SD: 0.0218) was obtained by GBDT. This drop in AUC is expected, because the AUC measures how well the predictive model distinguishes patients who recovered from those who worsened, and excluding patients with low baseline PHQ-9 scores limits the model's ability to identify worsening individuals, since it is unusual for patients with severe baseline scores to worsen further after treatment initiation. In this sense, a broader range of PHQ-9 scores may hold more potential for distinguishing the two groups. We also observed that different types of clinical information played different roles in prediction. Diagnostic information such as anxiety, psychotherapy, and recurrent depression contributed the most. Demographic and baseline PHQ-9 score information combined played a more important role than procedure or medication codes separately, perhaps because the baseline PHQ-9 score was a vital contributor on its own. These findings corroborate a previous report (15). In addition, if a patient was more complex and had more severe comorbidities, such as anxiety and stress, prior to taking the antidepressant, the model tended to predict the outcome as "Worsening."
Our study differs substantially compared to previous work that investigated antidepressant treatment outcome (6, 10, 12, 28, 29, 30). To the best of our knowledge, no prior work has attempted to model antidepressant treatment outcome based on the slope of multiple continuous self‐report PHQ‐9 measurements over time, and very limited studies have utilized complete, longitudinal EHRs for predicting antidepressant treatment outcome. Previous literature only considers the difference between the final PHQ‐9 score and the baseline PHQ‐9 score for a certain treatment period (6, 10, 30, 31). Relying on these two timepoints alone can be misleading because the single difference in scores might suggest a worsening or improving outcome, when in fact the course of the outcome was mostly in the other direction. Our work fills an important gap by using the slope fit using a linear regression model based on all PHQ‐9 scores throughout a certain treatment period. This method is advantageous because it captures the evolution and intermediate oscillations in PHQ‐9 scores over time towards modeling the overall treatment outcome.
Pradier et al. attempted to predict treatment dropout after antidepressant initiation using EHRs (12). In their study, the primary outcome was treatment discontinuation following index prescription, defined as less than 90 days of prescription availability and no evidence of non‐pharmacologic psychiatric treatment. However, treatment discontinuation may not necessarily reflect a “recovering” or “worsening” treatment outcome and moreover is unable to account for variation among patients in terms of depression severity at the time of treatment. Common measures such as PHQ‐9 and HDRS (Hamilton Depression Rating Scale) are more robust in investigating antidepressant treatment outcome (15, 30), because they are acquired based on standardized questionnaires and provide more complete and objective information for estimating antidepressant treatment outcome (32).
Our study has important clinical implications. First, the machine-learned classification models predicted antidepressant treatment outcome using patients' medical history, which may encourage clinicians and patients to conduct more follow-up visits during the course of treatment (12). Additionally, the predictive results can aid clinicians in developing treatment plans that combine multiple elements in sequence or in parallel. Furthermore, predictive models using EHRs can contribute to personalized treatment management strategies in psychiatry (33). Beyond informing targeted treatments, these models may also contribute to the design of a new generation of EHR-linked clinical trials (34). For example, clinicians can stratify patients into "high-risk" and "low-risk" groups based on the predicted outcome ("Worsening" or "Recovering") and pay closer attention to the treatment and prognosis of the "high-risk" group (35).
There are several potential limitations to our study. This study considered data from a single academic medical center, which does not allow us to generalize the model predictions across different health systems. This may also have resulted in a sample that was selective in geography, payer, and patients, and may not fully represent the population at risk. As with most EHR-based studies, we are limited to visits captured within the EHR network, so clinical care sought outside this network may be missing. Also, when defining the outcome using the slope, a slope of zero may not always be the clinically meaningful cut-point for recovering versus worsening depression, as regression to the mean would be expected. Further investigation is necessary, as different thresholds may produce different "Recovering" and "Worsening" distributions and hence different prediction performance. In addition, it was unknown for which diagnosis a given patient was referred or receiving treatment, and we did not build predictive models for different classes and doses of antidepressants. Accounting for different antidepressant classes, extending the study period to explore longer-term outcomes, and conducting cross-site validation may further enhance the results and applicability of this study, and will be investigated in future work. Subsequent studies may also apply the techniques used in this analysis to information used in previous studies of antidepressant outcome prediction, for example, socioeconomic status, for which objective measures and structured EHR data are expanding. Additionally, natural language processing techniques could be applied in the future to process clinical text. With these limitations in mind, these results provide insights into personalizing antidepressant treatments and encourage researchers to pursue modeling of this simple but highly valuable outcome.
CONCLUSION
Using routinely collected longitudinal EHRs and ML algorithms, we predicted overall changes in depression severity after the start of antidepressant treatment. Antidepressant treatment outcome was defined based on multiple PHQ-9 scores using a linear regression model. Multiple types of EHR data were integrated to train models, and GBDT showed the best prediction performance. These investigations have the potential to drive the development of a clinical decision-making tool for the personalized management of depression.