Working Paper Information

Clinical characteristics and prognostic factors for ICU admission of patients with COVID-19 using machine learning and natural language processing

Jose L. Izquierdo, J. Ancochea, S. Lumbreras, et al., Joan B. Soriano


Many unknowns remain regarding the natural history, onset, distribution, and individual and population burden of the ongoing COVID-19 pandemic associated with the spread of the SARS-CoV-2 virus. Here, we used a combination of classic epidemiological methods, natural language processing (NLP), and machine learning (for predictive modelling) to analyse the clinical information in the electronic health records (EHRs) of patients with COVID-19. This approach has the potential to better define the disease and its associated outcomes, most notably ICU admission.


This is a multicentre, non-interventional, retrospective study using the unstructured free-text clinical information captured in the EHRs of the participating hospital sites within the SESCAM Healthcare Network (Castilla-La Mancha, Spain, with 2.035 million inhabitants). We collected clinical information from the entire population with available EHRs (1,364,924 patients) for the period between January 1, 2020 and March 29, 2020. After identifying all COVID-19 cases seen in hospitals and primary care settings (all departments), we extracted related information at diagnosis (including demographic characteristics, symptoms, and other clinical information) and during disease progression and outcome (admission, discharge, and ICU admission). A data-driven analysis explored the minimum set of clinical variables associated with requiring ICU admission.


A total of 10,504 patients with a clinical or PCR-confirmed diagnosis of COVID-19 were identified, 52.5% males, with a mean±SD age of 58.2±19.7 years and an age distribution ranging from <2 to 80 years and older. Upon admission, the most common symptoms were cough, fever, and dyspnoea, but each was present in fewer than half of cases; similarly, the most frequent comorbidity was cardiovascular disease, mainly arterial hypertension. Overall, 6% of hospitalized patients required ICU admission. Using a data-driven machine-learning algorithm (i.e., random forest), we identified a combination of age, fever, and tachypnoea as the most parsimonious predictor of ICU admission: patients younger than 56 years, without tachypnoea, and with a temperature <39°C/102°F (or >39°C/102°F without respiratory crackles) were free of ICU admission. Conversely, COVID-19 patients aged 40 to 79 years were likely to be admitted to the ICU if they had tachypnoea and delayed their visit to the ER after being seen in primary care.
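As an illustration only, the low-risk profile reported above can be written as a small rule. This is a minimal sketch, not the authors' random forest model: the names `Presentation` and `matches_low_risk_profile` are hypothetical, and because the abstract does not say how a temperature of exactly 39°C is handled, the sketch assumes it falls in the higher-temperature branch.

```python
from dataclasses import dataclass


@dataclass
class Presentation:
    """Hypothetical container for the variables named in the abstract."""
    age_years: float
    tachypnoea: bool
    temperature_c: float
    respiratory_crackles: bool


def matches_low_risk_profile(p: Presentation) -> bool:
    """Return True if the patient fits the profile reported as free of ICU admission.

    Profile from the abstract: younger than 56 years, no tachypnoea, and either
    temperature below 39 C, or temperature above 39 C without respiratory crackles.
    Assumption: exactly 39.0 C is treated like the "above 39 C" branch.
    """
    if p.age_years >= 56 or p.tachypnoea:
        return False
    if p.temperature_c < 39.0:
        return True
    # Higher temperature: the profile still holds only without crackles.
    return not p.respiratory_crackles


if __name__ == "__main__":
    patient = Presentation(age_years=40, tachypnoea=False,
                           temperature_c=39.5, respiratory_crackles=True)
    print(matches_low_risk_profile(patient))
```

A `False` result only means the patient falls outside the reported low-risk profile; the abstract does not state that such patients were admitted to the ICU.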


Our results show that a combination of easily obtained clinical variables (age, fever, and tachypnoea with/without respiratory crackles) predicts which COVID-19 patients require ICU admission.

Keywords: artificial intelligence; big data; coronavirus; electronic health records; SARS-CoV-2; tachypnea

Registration date: 2020-04-14

