Predictive models of long COVID

Abstract:

Background: The cause and symptoms of long COVID are poorly understood. It is challenging to predict whether a given COVID-19 patient will develop long COVID in the future.

Methods: We used electronic health record (EHR) data from the National COVID Cohort Collaborative to predict the incidence of long COVID. We trained two machine learning (ML) models – logistic regression (LR) and random forest (RF). Features used to train predictors included symptoms and drugs ordered during acute infection, measures of COVID-19 treatment, pre-COVID comorbidities, and demographic information. We assigned the ‘long COVID’ label to patients diagnosed with the U09.9 ICD10-CM code. The cohorts included patients with (a) EHRs reported from data partners using U09.9 ICD10-CM code and (b) at least one EHR in each feature category. We analysed three cohorts: all patients (n = 2,190,579; diagnosed with long COVID = 17,036), inpatients (149,319; 3,295), and outpatients (2,041,260; 13,741).
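The cohort counts above imply that the long COVID label is rare in this dataset. A minimal Python sketch of that arithmetic follows, using only the counts reported in the abstract; the names `COHORTS` and `prevalence_percent` are illustrative and do not come from the paper.

```python
# Cohort sizes and U09.9-labelled counts exactly as reported in the abstract:
# (total patients, patients diagnosed with long COVID).
COHORTS = {
    "all": (2_190_579, 17_036),
    "inpatients": (149_319, 3_295),
    "outpatients": (2_041_260, 13_741),
}


def prevalence_percent(total, positives):
    """Share of a cohort carrying the long COVID label, in percent."""
    return 100.0 * positives / total


for name, (total, positives) in COHORTS.items():
    print(f"{name}: {prevalence_percent(total, positives):.2f}%")
# all: 0.78%
# inpatients: 2.21%
# outpatients: 0.67%
```

With fewer than 1% positives overall, a classifier that always predicts "no long COVID" would be more than 99% accurate, which is one common reason to report a threshold-free metric such as AUROC instead of accuracy.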

Findings: LR and RF models yielded median AUROC of 0.76 and 0.75, respectively. Ablation study revealed that drugs had the highest influence on the prediction task. The SHAP method identified age, gender, cough, fatigue, albuterol, obesity, diabetes, and chronic lung disease as explanatory features. Models trained on data from one N3C partner and tested on data from the other partners had average AUROC of 0.75.
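To make the AUROC figures above concrete: AUROC can be read as the probability that a randomly chosen positive patient receives a higher model score than a randomly chosen negative one. The sketch below computes it directly from that definition. The labels and scores are made-up toy values, not N3C data, and the function is an illustration rather than the authors' evaluation code.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 1 = long COVID, 0 = no long COVID.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.5, 0.1, 0.7]
print(round(auroc(labels, scores), 3))  # 0.889 (8 of 9 positive/negative pairs ranked correctly)
```

On this reading, an AUROC of 0.76 means the model ranks a long COVID patient above a non-long-COVID patient in roughly three out of four random pairs. The pairwise loop is quadratic and is meant only for illustration; library implementations use a sort-based equivalent.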

Interpretation: ML-based classification using EHR information from the acute infection period is effective in predicting long COVID. SHAP methods identified important features for prediction. Cross-site analysis demonstrated the generalizability of the proposed methodology.

Source: Antony B, Blau H, Casiraghi E, Loomba JJ, Callahan TJ, Laraway BJ, Wilkins KJ, Antonescu CC, Valentini G, Williams AE, Robinson PN, Reese JT, Murali TM; N3C consortium. Predictive models of long COVID. EBioMedicine. 2023 Oct;96:104777. doi: 10.1016/j.ebiom.2023.104777. Epub 2023 Sep 4. PMID: 37672869; PMCID: PMC10494314. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10494314/ (Full text)

Unraveling Post-COVID-19 Immune Dysregulation Using Machine Learning-based Immunophenotyping

Abstract:

The COVID-19 pandemic has left a significant mark on global healthcare, with many individuals experiencing lingering symptoms long after recovering from the acute phase of the disease, a condition often referred to as “long COVID.” This study delves into the intricate realm of immune dysregulation that ensues in 509 post-COVID-19 patients across multiple Iraqi regions during the years 2022 and 2023.

Utilizing advanced machine learning techniques for immunophenotyping, this research aims to shed light on the diverse immune dysregulation patterns present in long COVID patients. By analyzing a comprehensive dataset encompassing clinical, immunological, and demographic information, the study provides valuable insights into the complex interplay of immune responses following COVID-19 infection.

The findings reveal that long COVID is associated with a spectrum of immune dysregulation phenomena, including persistent inflammation, altered cytokine profiles, and abnormal immune cell subsets. These insights highlight the need for personalized interventions and tailored treatment strategies for individuals suffering from long COVID-19.

This research represents a significant step forward in our understanding of the post-COVID-19 immune landscape and opens new avenues for targeted therapies and clinical management of long COVID patients. As the world grapples with the long-term implications of the pandemic, these findings offer hope for improving the quality of life for those affected by this enigmatic condition.

Source: Maitham G. Yousif, Ghizal Fatima and Hector J. Castro et al. Unraveling Post-COVID-19 Immune Dysregulation Using Machine Learning-based Immunophenotyping. 2023. https://arxiv.org/ftp/arxiv/papers/2310/2310.01428.pdf (Full text)

Unsupervised cluster analysis reveals distinct subtypes of ME/CFS patients based on peak oxygen consumption and SF-36 scores

Abstract:

Purpose: Myalgic encephalomyelitis, commonly referred to as chronic fatigue syndrome (ME/CFS), is a severe, disabling chronic disease, and an objective assessment of prognosis is crucial to evaluate the efficacy of future drugs. Attempts are ongoing to find a biomarker to objectively assess the health status of ME/CFS patients. This study therefore aims to demonstrate that oxygen consumption is a biomarker of ME/CFS and provides a method to classify patients diagnosed with ME/CFS based on their responses to the Short Form-36 (SF-36) questionnaire, which can predict oxygen consumption as measured by cardiopulmonary exercise testing (CPET).

Methods: Two datasets were used in the study. The first contained SF-36 responses from 2,347 validated records of ME/CFS diagnosed participants, and an unsupervised machine learning model was developed to cluster the data. The second dataset was used as a validation set and included the cardiopulmonary exercise test (CPET) results of 239 participants diagnosed with ME/CFS. Participants from this dataset were grouped by peak oxygen consumption according to Weber’s classification. The SF-36 questionnaire was correctly completed by only 92 patients, who were clustered using the machine learning model. Two categorical variables were then entered into a contingency table: the cluster, with values {0, 1}, and the Weber classification, with values {A, B, C, D}. Finally, the Chi-square test of independence was used to assess the statistical significance of the relationship between the two parameters.
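The contingency-table step described above can be sketched as follows. The counts are hypothetical, chosen only so that the table sums to the 92 patients mentioned in the abstract; the paper's actual table is not given here. The function computes the Pearson chi-square statistic by hand and compares it with 7.815, the standard 5% critical value for 3 degrees of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# HYPOTHETICAL counts: rows are clusters {0, 1}, columns are Weber classes {A, B, C, D}.
table = [
    [20, 15, 8, 3],
    [5, 10, 14, 17],
]
CRITICAL_VALUE_DF3_ALPHA_005 = 7.815  # chi-square upper 5% point for 3 degrees of freedom

stat, dof = chi_square_statistic(table)
print(f"chi2 = {stat:.2f}, dof = {dof}")
print("reject independence at 5%:", stat > CRITICAL_VALUE_DF3_ALPHA_005)
```

A 2 x 4 table always has (2 - 1) x (4 - 1) = 3 degrees of freedom, so the same critical value would apply to the study's real table regardless of its counts.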

Findings: The results indicate that the Weber classification is directly linked to the score on the SF-36 questionnaire. Furthermore, the 36-response matrix in the machine learning model was shown to give more reliable results than the subscale matrix (p-value < 0.05) for classifying patients with ME/CFS.

Implications: Low oxygen consumption on CPET can be considered a biomarker in patients with ME/CFS. Our analysis showed a close relationship between the cluster based on their SF-36 questionnaire score and the Weber classification, which was based on peak oxygen consumption during CPET. The dataset for the training model comprised raw responses from the SF-36 questionnaire, which is proven to better preserve the original information, thus improving the quality of the model.

Source: Lacasa M, Launois P, Prados F, Alegre J, Casas-Roma J. Unsupervised cluster analysis reveals distinct subtypes of ME/CFS patients based on peak oxygen consumption and SF-36 scores. Clin Ther. 2023 Oct 4:S0149-2918(23)00352-1. doi: 10.1016/j.clinthera.2023.09.007. Epub ahead of print. PMID: 37802746. https://www.clinicaltherapeutics.com/article/S0149-2918(23)00352-1/fulltext (Full text)

A retrospective cohort analysis leveraging augmented intelligence to characterize long COVID in the electronic health record: A precision medicine framework

Abstract:

Physical and psychological symptoms lasting months following an acute COVID-19 infection are now recognized as post-acute sequelae of COVID-19 (PASC). Accurate tools for identifying such patients could enhance screening capabilities for the recruitment for clinical trials, improve the reliability of disease estimates, and allow for more accurate downstream cohort analysis.

In this retrospective cohort study, we analyzed the EHR of hospitalized COVID-19 patients across three healthcare systems to develop a pipeline for better identifying patients with persistent PASC symptoms (dyspnea, fatigue, or joint pain) after their SARS-CoV-2 infection. We implemented distributed representation learning powered by the Machine Learning for modeling Health Outcomes (MLHO) to identify novel EHR features that could suggest PASC symptoms outside of typical diagnosis codes. MLHO applies an entropy-based feature selection and boosting algorithms for representation mining. These improved definitions were then used for estimating PASC among hospitalized patients.

30,422 hospitalized patients were diagnosed with COVID-19 across three healthcare systems between March 13, 2020 and February 28, 2021. The mean age of the population was 62.3 years (SD, 21.0 years) and 15,124 (49.7%) were female.

We implemented the distributed representation learning technique to augment PASC definitions. These definitions were found to have positive predictive values of 0.73, 0.74, and 0.91 for dyspnea, fatigue, and joint pain, respectively.
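For readers unfamiliar with the metric, positive predictive value (PPV) is the fraction of flagged patients who truly have the symptom. A minimal sketch, with hypothetical chart-review counts chosen to reproduce the 0.73 reported for dyspnea (the abstract does not give the underlying counts):

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP): share of flagged patients confirmed on review."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no patients were flagged")
    return true_positives / flagged


# HYPOTHETICAL review: 100 patients flagged for dyspnea, 73 confirmed.
print(positive_predictive_value(true_positives=73, false_positives=27))  # 0.73
```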

We estimated that 25 percent (CI 95%: 6-48), 11 percent (CI 95%: 6-15), and 13 percent (CI 95%: 8-17) of hospitalized COVID-19 patients will have dyspnea, fatigue, and joint pain, respectively, 3 months or longer after a COVID-19 diagnosis. We present a validated framework for screening and identifying patients with PASC in the EHR and then use the tool to estimate its prevalence among hospitalized COVID-19 patients.

Source: Strasser ZH, Dagliati A, Shakeri Hossein Abad Z, Klann JG, Wagholikar KB, Mesa R, Visweswaran S, Morris M, Luo Y, Henderson DW, Samayamuthu MJ; Consortium for Clinical Characterization of COVID-19 by EHR (4CE); Omenn GS, Xia Z, Holmes JH, Estiri H, Murphy SN. A retrospective cohort analysis leveraging augmented intelligence to characterize long COVID in the electronic health record: A precision medicine framework. PLOS Digit Health. 2023 Jul 25;2(7):e0000301. doi: 10.1371/journal.pdig.0000301. PMID: 37490472; PMCID: PMC10368277. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10368277/ (Full text)

A Proposed Explainable Artificial Intelligence-Based Machine Learning Model for Discriminative Metabolites for Myalgic Encephalomyelitis/Chronic Fatigue Syndrome

Abstract:

Background: Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a complex and debilitating disease with a significant global prevalence of over 65 million individuals. It affects various systems, including the immune, neurological, gastrointestinal, and circulatory systems. Studies have shown abnormalities in immune cell types, increased inflammatory cytokines, and brain abnormalities. Further research is needed to identify consistent biomarkers and develop targeted therapies. A multidisciplinary approach is essential for diagnosing, treating, and managing this complex disease.

The current study aims at employing explainable artificial intelligence (XAI) and machine learning (ML) techniques to identify discriminative metabolites for ME/CFS.

Material and Methods: The present study used a metabolomics dataset of CFS patients and healthy controls, including 26 healthy controls and 26 ME/CFS patients aged 22-72. The dataset encapsulated 768 metabolites, classified into nine metabolic super-pathways: amino acids, carbohydrates, cofactors, vitamins, energy, lipids, nucleotides, peptides, and xenobiotics.

Random forest-based feature selection and Bayesian approach-based hyperparameter optimization were implemented on the target data. Four different ML algorithms [Gaussian Naive Bayes (GNB), Gradient Boosting Classifier (GBC), Logistic regression (LR) and Random Forest Classifier (RFC)] were used to classify individuals as ME/CFS patients and healthy individuals. XAI approaches were applied to clinically explain the prediction decisions of the optimum model. Performance evaluation was performed using the indices of accuracy, precision, recall, F1 score, Brier score, and AUC.

Results: The metabolomics of C-glycosyltryptophan, oleoylcholine, cortisone, and 3-hydroxydecanoate were determined to be crucial for ME/CFS diagnosis.

The RFC learning model outperformed GNB, GBC, and LR in ME/CFS prediction using the 1000 iteration bootstrapping method, achieving 98% for each of accuracy, precision, recall, and F1 score, a Brier score of 0.01, and 99% AUC.
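Two of the evaluation ideas named above, the Brier score and 1000-iteration bootstrapping, can be illustrated with a short sketch. The labels and predicted probabilities are made-up toy values, and the abstract does not specify exactly how the authors applied bootstrapping, so the resampling scheme here (resampling predictions with replacement and taking percentile bounds) is an assumption.

```python
import random


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Fraction of cases where the thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y) for y, p in zip(labels, probs)) / len(labels)


def bootstrap_metric(labels, probs, metric, n_iter=1000, seed=0):
    """Mean and 95% percentile interval of a metric over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # sample indices with replacement
        values.append(metric([labels[i] for i in idx], [probs[i] for i in idx]))
    values.sort()
    mean = sum(values) / n_iter
    return mean, values[int(0.025 * n_iter)], values[int(0.975 * n_iter) - 1]


# Toy data (hypothetical): 1 = ME/CFS, 0 = healthy control.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.6, 0.2, 0.1, 0.4, 0.3, 0.7]

print(f"Brier score: {brier_score(labels, probs):.3f}")
print(f"Accuracy:    {accuracy(labels, probs):.2f}")
print("Bootstrap accuracy (mean, 2.5%, 97.5%):", bootstrap_metric(labels, probs, accuracy))
```

A Brier score near 0 means the predicted probabilities are close to the observed outcomes; with only 52 participants in the study, the width of a bootstrap interval is a useful check on how stable a headline figure is.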

Conclusion: The RFC model proposed in this study correctly classified and evaluated ME/CFS patients through the selected biomarker candidate metabolites. The methodology combining ML and XAI can provide a clear interpretation of risk estimation for ME/CFS, helping physicians intuitively understand the impact of key metabolomics features in the model.

Source: Yagin, F.H., Alkhateeb, A., Raza, A., Samee, N.A., Mahmoud, N.F., Colak, C., & Yagin, B. (2023). A Proposed Explainable Artificial Intelligence-Based Machine Learning Model for Discriminative Metabolites for Myalgic Encephalomyelitis/Chronic Fatigue Syndrome. Preprints. https://doi.org/10.20944/preprints202307.1585.v1 https://www.preprints.org/manuscript/202307.1585/v1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10706650/ (Full text of completed study)

De-black-boxing health AI: demonstrating reproducible machine learning computable phenotypes using the N3C-RECOVER Long COVID model in the All of Us data repository

Abstract:

Machine learning (ML)-driven computable phenotypes are among the most challenging to share and reproduce. Despite this difficulty, the urgent public health considerations around Long COVID make it especially important to ensure the rigor and reproducibility of Long COVID phenotyping algorithms such that they can be made available to a broad audience of researchers. As part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, researchers with the National COVID Cohort Collaborative (N3C) devised and trained an ML-based phenotype to identify patients highly probable to have Long COVID. Supported by RECOVER, N3C and NIH’s All of Us study partnered to reproduce the output of N3C’s trained model in the All of Us data enclave, demonstrating model extensibility in multiple environments. This case study in ML-based phenotype reuse illustrates how open-source software best practices and cross-site collaboration can de-black-box phenotyping algorithms, prevent unnecessary rework, and promote open science in informatics.

Source: Pfaff ER, Girvin AT, Crosskey M, Gangireddy S, Master H, Wei WQ, Kerchberger VE, Weiner M, Harris PA, Basford M, Lunt C, Chute CG, Moffitt RA, Haendel M; N3C and RECOVER Consortia. De-black-boxing health AI: demonstrating reproducible machine learning computable phenotypes using the N3C-RECOVER Long COVID model in the All of Us data repository. J Am Med Inform Assoc. 2023 May 22:ocad077. doi: 10.1093/jamia/ocad077. Epub ahead of print. PMID: 37218289. https://pubmed.ncbi.nlm.nih.gov/37218289/

Proteomics and cytokine analyses distinguish myalgic encephalomyelitis/chronic fatigue syndrome cases from controls

Abstract:

Background: Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a complex, heterogenous disease characterized by unexplained persistent fatigue and other features including cognitive impairment, myalgias, post-exertional malaise, and immune system dysfunction. Cytokines are present in plasma and encapsulated in extracellular vesicles (EVs), but there have been only a few reports of EV characteristics and cargo in ME/CFS. Several small studies have previously described plasma proteins or protein pathways that are associated with ME/CFS.

Methods: We prepared extracellular vesicles (EVs) from frozen plasma samples from a cohort of Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS) cases and controls with prior published plasma cytokine and plasma proteomics data. The cytokine content of the plasma-derived extracellular vesicles was determined by a multiplex assay and differences between patients and controls were assessed. We then performed multi-omic statistical analyses that considered not only this new data, but extensive clinical data describing the health of the subjects.

Results: ME/CFS cases exhibited greater size and concentration of EVs in plasma. Assays of cytokine content in EVs revealed IL2 was significantly higher in cases. We observed numerous correlations among EV cytokines, among plasma cytokines, and among plasma proteins from mass spectrometry proteomics. Significant correlations between clinical data and protein levels suggest roles of particular proteins and pathways in the disease. For example, higher levels of the pro-inflammatory cytokines Granulocyte-Monocyte Colony-Stimulating Factor (CSF2) and Tumor Necrosis Factor (TNFα) were correlated with greater physical and fatigue symptoms in ME/CFS cases. Higher serine protease SERPINA5, which is involved in hemostasis, was correlated with higher SF-36 general health scores in ME/CFS. Machine learning classifiers were able to identify a list of 20 proteins that could discriminate between cases and controls, with XGBoost providing the best classification with 86.1% accuracy and a cross-validated AUROC value of 0.947. Random Forest distinguished cases from controls with 79.1% accuracy and an AUROC value of 0.891 using only 7 proteins.

Conclusions: These findings add to the substantial number of objective differences in biomolecules that have been identified in individuals with ME/CFS. The observed correlations of proteins important in immune responses and hemostasis with clinical data further implicate a disturbance of these functions in ME/CFS.

Source: Giloteaux L, Li J, Hornig M, Lipkin WI, Ruppert D, Hanson MR. Proteomics and cytokine analyses distinguish myalgic encephalomyelitis/chronic fatigue syndrome cases from controls. J Transl Med. 2023 May 13;21(1):322. doi: 10.1186/s12967-023-04179-3. PMID: 37179299. https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-023-04179-3 (Full text)

Developing a blood cell-based diagnostic test for myalgic encephalomyelitis/chronic fatigue syndrome using peripheral blood mononuclear cells

Abstract:

A blood-based diagnostic test for myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) and multiple sclerosis (MS) would be of great value in both conditions, facilitating more accurate and earlier diagnosis, helping with current treatment delivery, and supporting the development of new therapeutics.

Here we use Raman micro-spectroscopy to examine differences between the spectral profiles of blood cells of ME/CFS, MS and healthy controls.

We were able to discriminate the three groups using ensemble classification models with high levels of accuracy (91%) with the additional ability to distinguish mild, moderate, and severe ME/CFS patients from each other (84%).

To our knowledge, this is the first research using Raman micro-spectroscopy to discriminate specific subgroups of ME/CFS patients on the basis of their symptom severity. Specific Raman peaks were linked with the different disease types, with the potential in further investigations to provide insights into biological changes associated with the different conditions.

Source: Jiabao Xu, Tiffany Lodge, Caroline Claire Kingdon, James W L Strong, John Maclennan, Eliana Lacerda, Slawomir Kujawski, Pawel Zalewski, Wei Huang, Karl J. Morten. Developing a blood cell-based diagnostic test for myalgic encephalomyelitis/chronic fatigue syndrome using peripheral blood mononuclear cells. medRxiv [Preprint] medRxiv 2023.03.18.23286575; doi: https://doi.org/10.1101/2023.03.18.23286575 https://www.medrxiv.org/content/10.1101/2023.03.18.23286575v1.full-text (Full text)

Investigating brain cortical activity in patients with post-COVID-19 brain fog

Abstract:

Brain fog is a kind of mental problem, similar to chronic fatigue syndrome, and appears about 3 months after the infection with COVID-19 and lasts up to 9 months. The maximum magnitude of the third wave of COVID-19 in Poland was in April 2021.

The research reported here aimed to carry out an electrophysiological analysis of patients who suffered from COVID-19 and had symptoms of brain fog (sub-cohort A), patients who suffered from COVID-19 and did not have symptoms of brain fog (sub-cohort B), and a control group that had no COVID-19 and no symptoms (sub-cohort C). The aim of this article was to examine whether there are differences in the brain cortical activity of these three sub-cohorts and, if possible, to differentiate and classify them using machine-learning tools. A dense array electroencephalographic amplifier with 256 electrodes was used for recordings.

The event-related potentials were chosen as we expected to find the differences in the patients’ responses to three different mental tasks arranged in the experiments commonly known in experimental psychology: face recognition, digit span, and task switching. These potentials were plotted for all three patients’ sub-cohorts and all three experiments. The cross-correlation method was used to find differences, and, in fact, such differences manifested themselves in the shape of event-related potentials on the cognitive electrodes.

A discussion of such differences is presented; however, an explanation of such differences would require the recruitment of a much larger cohort. For the classification problem, avalanche analysis was used for feature extraction from the resting-state signal, and linear discriminant analysis was used for classification. Differences between sub-cohorts were expected in such signals. Machine-learning tools were used because finding the differences by eye seemed impossible. The A&B vs. C, B&C vs. A, A vs. B, A vs. C, and B vs. C classification tasks were performed, and an efficiency of around 60-70% was achieved.

In the future, there will probably be pandemics again due to the imbalance in the natural environment, resulting in a decreasing number of species, temperature increase, and climate change-generated migrations. The research can help to predict brain fog after COVID-19 recovery and prepare patients for better convalescence. Shortening the time of brain fog recovery will be beneficial not only for the patients but also for social conditions.

Source: Wojcik GM, Shriki O, Kwasniewicz L, Kawiak A, Ben-Horin Y, Furman S, Wróbel K, Bartosik B, Panas E. Investigating brain cortical activity in patients with post-COVID-19 brain fog. Front Neurosci. 2023 Feb 9;17:1019778. doi: 10.3389/fnins.2023.1019778. PMID: 36845422; PMCID: PMC9947499. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9947499/ (Full text)

Organ and cell-specific biomarkers of Long-COVID identified with targeted proteomics and machine learning

Abstract:

Background: Survivors of acute COVID-19 often suffer prolonged, diffuse symptoms post-infection, referred to as “Long-COVID”. A lack of Long-COVID biomarkers and pathophysiological mechanisms limits effective diagnosis, treatment and disease surveillance. We performed targeted proteomics and machine learning analyses to identify novel blood biomarkers of Long-COVID.

Methods: A case-control study comparing the expression of 2925 unique blood proteins in Long-COVID outpatients versus COVID-19 inpatients and healthy control subjects. Targeted proteomics was accomplished with proximity extension assays, and machine learning was used to identify the most important proteins for identifying Long-COVID patients. Organ system and cell type expression patterns were identified with Natural Language Processing (NLP) of the UniProt Knowledgebase.

Results: Machine learning analysis identified 119 relevant proteins for differentiating Long-COVID outpatients (Bonferroni corrected P < 0.01). Protein combinations were narrowed down to two optimal models, with nine and five proteins each, and with both having excellent sensitivity and specificity for Long-COVID status (AUC = 1.00, F1 = 1.00). NLP expression analysis highlighted the diffuse organ system involvement in Long-COVID, as well as the involved cell types, including leukocytes and platelets, as key components associated with Long-COVID.
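The Bonferroni correction mentioned above multiplies each raw p-value by the number of tests performed and caps the result at 1. The sketch below assumes one test per measured protein (2925); the abstract does not state the exact number of tests used, and the protein names and raw p-values are hypothetical.

```python
def bonferroni_adjust(p_values, n_tests=None):
    """Bonferroni-adjusted p-values: min(1, p * number_of_tests)."""
    m = n_tests if n_tests is not None else len(p_values)
    return [min(1.0, p * m) for p in p_values]


N_PROTEINS = 2925  # number of unique proteins measured; ASSUMED to equal the number of tests
ALPHA = 0.01       # significance threshold quoted in the abstract

# HYPOTHETICAL raw p-values for three placeholder proteins.
raw = {"protein_a": 1e-7, "protein_b": 2e-5, "protein_c": 0.003}
adjusted = dict(zip(raw, bonferroni_adjust(list(raw.values()), n_tests=N_PROTEINS)))

significant = [name for name, p in adjusted.items() if p < ALPHA]
print(adjusted)
print("significant after correction:", significant)  # ['protein_a']
```

Under this assumption, a raw p-value must fall below roughly 0.01 / 2925, or about 3.4e-6, to survive the correction, which shows how demanding the threshold is when thousands of proteins are screened.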

Conclusions: Proteomic analysis of plasma from Long-COVID patients identified 119 highly relevant proteins and two optimal models with nine and five proteins, respectively. The identified proteins reflected widespread organ and cell type expression. Optimal protein models, as well as individual proteins, hold the potential for accurate diagnosis of Long-COVID and targeted therapeutics.

Source: Patel MA, Knauer MJ, Nicholson M, Daley M, Van Nynatten LR, Cepinskas G, Fraser DD. Organ and cell-specific biomarkers of Long-COVID identified with targeted proteomics and machine learning. Mol Med. 2023 Feb 21;29(1):26. doi: 10.1186/s10020-023-00610-z. PMID: 36809921; PMCID: PMC9942653. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9942653/ (Full text)