Electronic health records (EHRs) are an important data resource containing a wealth of patient information, such as images, vitals, and text. Clinical notes, usually associated with hospital visits, describe a patient’s journey from admission to discharge.
Despite their richness, leveraging clinical text for predicting patient outcomes presents challenges, including the prevalence of medical abbreviations and the considerable length of the notes. Nevertheless, using notes is not only promising but is reaching new heights thanks to recent advances in natural language processing and large language models.
But how can this information be used in practice? In this article, we’ll explore the paper Revisiting Clinical Outcome Prediction for MIMIC-IV by Röhr et al., 2024, which details text-based outcome predictions based on the notes from one of the most popular EHR datasets, MIMIC.
Background on MIMIC
MIMIC is a large, freely available database of deidentified health data from over 40,000 patients in critical care at Beth Israel Deaconess Medical Center.
Its latest version is MIMIC-IV, which features several updates over MIMIC-III. For example, it now includes Emergency Department (ED) data in addition to Intensive Care Unit (ICU) data. It also adopts a newer medical coding standard, ICD-10. This increases the dataset’s complexity, as MIMIC-IV now contains both ICD-9 and ICD-10 codes.
- ICD stands for International Classification of Diseases and organizes diseases and procedures through a coding system; for instance, a patient with diabetes is assigned the code E0800 (ICD-10) or 250.0x (ICD-9).
Finally, the anonymization scheme has changed: the earlier HIPAA-compliant approach, which was based on random identifiers, has been replaced by censoring markers such as “___”.
Preparing admission notes
When patients are admitted to the hospital, doctors often want to assess various factors, such as their risk of mortality and likely diagnoses. When using notes as an input for predictions, it’s crucial to preprocess the clinical notes carefully to prevent data leakage.
- For instance, MIMIC-IV contains discharge notes that document the entire patient journey in the hospital.
- To obtain admission notes, it’s necessary to filter the notes to include only the relevant sections present at admission time, namely: “Chief complaint, (History of) Present illness, Medical history, Admission medications, Allergies, Physical exam, Family history, and Social history.”
- The admission notes are used as input for all the tasks.
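The section filtering described above could be sketched roughly as follows. This is a minimal illustration and not the authors’ preprocessing code: the header spellings in `ADMISSION_HEADERS`, the assumption that every section starts with a `Title:` at the beginning of a line, and the function name `extract_admission_note` are all hypothetical. Real MIMIC-IV notes vary in formatting and would need a more robust parser.

```python
import re

# Hypothetical header spellings (an assumption); real notes use many variants.
ADMISSION_HEADERS = frozenset({
    "chief complaint",
    "history of present illness",
    "present illness",
    "past medical history",
    "medical history",
    "medications on admission",
    "admission medications",
    "allergies",
    "physical exam",
    "family history",
    "social history",
})

# Assumption: a section header is a short "Title:" at the start of a line.
HEADER_RE = re.compile(r"^([A-Za-z][A-Za-z /()]{1,60}):[ \t]*", re.MULTILINE)


def extract_admission_note(discharge_note: str) -> str:
    """Keep only the sections whose header is on the admission allowlist."""
    matches = list(HEADER_RE.finditer(discharge_note))
    kept = []
    for i, match in enumerate(matches):
        title = match.group(1).strip()
        # A section runs until the next detected header (or the end of the note).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(discharge_note)
        if title.lower() in ADMISSION_HEADERS:
            body = discharge_note[match.end():end].strip()
            kept.append(f"{title}: {body}")
    return "\n".join(kept)


if __name__ == "__main__":
    # Toy note written for this example; it is not taken from MIMIC-IV.
    note = (
        "Chief Complaint: chest pain\n"
        "Brief Hospital Course: Patient expired on day 3.\n"
        "Social History: ___\n"
    )
    print(extract_admission_note(note))
```

One limitation of this sketch: any line inside a section that itself looks like `Word:` is treated as a new header, so sub-headings inside a kept section would be dropped unless they are added to the allowlist.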
In this paper, the admission notes were encoded with encoder-only Transformer models. These models transform text into numerical, vectorized representations that can be used for downstream predictions.
Models are then trained to predict the following patient outcomes from the encoded admission notes:
Patient routing (PR)
This task predicts the hospital unit to which a patient is transferred from the Emergency Department (ED). It uses routing logs and aims to classify patients into one of 18 possible hospital units, such as surgery or obstetrics. It reflects real-world ED operations, where timely and accurate patient transfers are crucial. However, models face challenges when predicting rare unit destinations.
Diagnoses (DIA)
This task aims to map clinical notes to ICD-10 codes. It is multi-label, since patients may have multiple diagnoses. MIMIC-IV has over 1,600 unique ICD-10 diagnosis codes, and the imbalanced label distribution remains a significant challenge, as most diagnoses are infrequent.
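To make the multi-label setup concrete, here is a small sketch of how each admission’s set of codes can be turned into a multi-hot target vector. The helper names `build_label_index` and `multi_hot` and the toy admissions are assumptions made for illustration; the paper’s actual label pipeline is not described here.

```python
def build_label_index(code_lists):
    """Map every distinct code to a fixed column position (sorted for stability)."""
    codes = sorted({code for codes in code_lists for code in codes})
    return {code: position for position, code in enumerate(codes)}


def multi_hot(codes, label_index):
    """Return a 0/1 vector with a 1 for each code present in this admission."""
    vector = [0] * len(label_index)
    for code in codes:
        if code in label_index:  # codes unseen at index-building time are ignored
            vector[label_index[code]] = 1
    return vector


if __name__ == "__main__":
    # Toy admissions with illustrative ICD-10 codes (not real patient data).
    admissions = [["E0800", "I10"], ["I10"], ["N179"]]
    index = build_label_index(admissions)
    for codes in admissions:
        print(codes, multi_hot(codes, index))
```

With over 1,600 diagnosis codes, each target vector is mostly zeros, which is one way to see why rare codes give a model very few positive examples to learn from.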
Procedures (PRO)
Like diagnoses, this task involves predicting medical procedures based on ICD-10 codes, which include over 4,000 procedure labels. The large label space, together with sparse data for many procedures, makes rare procedures especially hard to predict.
Length-of-stay (LOS)
This task aims to predict the duration of a patient’s stay in the ICU across four categories: ≤3 days, 3–7 days, 7–14 days, and >14 days. LOS differs from the other tasks in that it does not include ED data, focusing solely on ICU admissions. The authors highlight that factors unrelated to patient health, such as hospital capacity and administrative regulations, may influence length-of-stay predictions, adding to the task’s complexity.
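The four categories can be expressed as a simple bucketing function. The sketch below assumes that each upper boundary is inclusive (so exactly 7 days falls into the 3–7 bucket); the text only states that ≤3 days is inclusive, so the handling of 7 and 14 days is an assumption, as is the function name `los_bucket`.

```python
LOS_LABELS = ["<=3 days", "3-7 days", "7-14 days", ">14 days"]


def los_bucket(days: float) -> int:
    """Map an ICU stay length in days to a class index from 0 to 3."""
    if days < 0:
        raise ValueError("length of stay cannot be negative")
    if days <= 3:
        return 0
    if days <= 7:   # assumption: upper boundary is inclusive
        return 1
    if days <= 14:  # assumption: upper boundary is inclusive
        return 2
    return 3


if __name__ == "__main__":
    for stay in (1.5, 3, 5, 10, 21):
        print(stay, LOS_LABELS[los_bucket(stay)])
```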
In-hospital mortality
While this task was included in earlier studies, the authors decided to exclude it for MIMIC-IV, primarily because of the complexities involved in preprocessing. Despite filtering the notes, they still observed mentions of death, which lead to data leakage and ultimately result in overly confident models.
Results
The authors observed that models pre-trained on MIMIC-III data did not generalize well to MIMIC-IV. PubMedBERT demonstrated superior performance across tasks due to its domain-specific tokenization and pre-training on biomedical text.
All models struggled with rare labels, particularly in the DIA and PRO tasks. MIMIC-IV has a pronounced long-tail distribution, with only about 6% of labels (around 100 codes) representing 67% of the data. The remaining 94% (1,517 codes) are sparse, accounting for only 33%.
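The percentages above follow directly from the code counts, and the same calculation can be applied to any label-frequency table. The sketch below is illustrative only: the function names are hypothetical, and the toy frequency list is made up for the example rather than taken from MIMIC-IV.

```python
def head_label_share(n_head: int, n_tail: int) -> float:
    """Fraction of all distinct labels that belong to the head."""
    return n_head / (n_head + n_tail)


def top_k_coverage(label_counts, k: int) -> float:
    """Fraction of all label assignments covered by the k most frequent labels."""
    ordered = sorted(label_counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("label_counts must contain at least one assignment")
    return sum(ordered[:k]) / total


if __name__ == "__main__":
    # About 100 head codes and 1,517 tail codes, as quoted in the text.
    print(f"head share of labels: {head_label_share(100, 1517):.1%}")

    # Toy frequencies (not MIMIC-IV data) showing the coverage calculation.
    toy_counts = [2, 50, 5, 30, 3, 10]
    print(f"top-2 coverage: {top_k_coverage(toy_counts, 2):.0%}")
```

Here 100 / (100 + 1,517) is roughly 6.2%, which matches the “about 6%” figure for the head labels.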
Even PubMedBERT has difficulty achieving a high PR-AUC for these rare labels, and its gains come primarily from the more prevalent head labels.
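For readers unfamiliar with PR-AUC, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve, for a single label. Whether the paper computes PR-AUC in exactly this way is not stated here, so treat this as an assumption; the function name and toy scores are also hypothetical, and tied scores are not handled specially.

```python
def average_precision(y_true, y_score) -> float:
    """Mean of the precision values at the rank of each positive example."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from the highest to the lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / n_pos


if __name__ == "__main__":
    # Toy example: a rare label with one positive that the model ranks last.
    print(average_precision([0, 0, 0, 1], [0.9, 0.8, 0.7, 0.6]))
```

Because the score depends on where the few positives land in the ranking, a label with only a handful of positive examples can have a very low value even when overall accuracy looks high.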
Conclusion
Despite the richness of notes, challenges related to data imbalance, annotation quality, and task complexity remain for text-based prediction. For more details on results and future research, refer to the original article.