- Researchers have developed a new machine learning model to help prevent a symptomatic infection before it starts.
- The model is based on L2-regularized logistic regression.
- For half of the infected patients, the algorithm made an accurate prediction 5 days before diagnostic samples were taken.
Around 29,000 U.S. citizens die every year from a symptomatic infection called Clostridium difficile infection (CDI). The bacterium is present in the air, soil, and water, and in the feces of animals. It spreads mostly in nursing homes and hospitals, where employees are likely to come into contact with it and then with residents or patients.
Now researchers at MIT and the University of Michigan have developed a machine learning model that can accurately predict which patients/workers are likely to develop CDI. This could help doctors prevent the infection before it takes hold.
CDI is one of the major healthcare-associated bugs. Despite numerous efforts, we’ve had only limited success in reducing infections. However, according to the researchers, the new tool can identify the patients at highest risk much earlier than they would be diagnosed with existing techniques.
The New Machine Learning Approach
The problem with existing CDI risk models is that they follow a ‘one size fits all’ methodology and include only a few risk factors. Different hospitals have different testing and treatment protocols and different record-keeping systems, all of which affect a model’s performance. Since existing models ignore crucial hospital-specific factors, their efficiency and usability are limited.
Individual C. difficile bacteria seen through scanning electron microscopy | Wikimedia
The researchers focused on a generalized approach for developing facility-specific models. They used ‘big data’ handling tools to analyze the entire EHR (electronic health record) of Massachusetts General Hospital and University of Michigan Hospitals.
This allowed them to deal efficiently with patient records of varying size, multiple EHR systems, and factors specific to each healthcare facility.
Extracting Data
Extracting data from University of Michigan Hospitals
They took EHR data from 65,718 adult admissions to Massachusetts General Hospital and 191,014 adult admissions to University of Michigan Hospitals. Then they extracted patient information, including admission details, history, treatment provided and demographics.
They split the variables into two major groups: time-varying and time-invariant. All data were structured; some variables were categorical, while others were continuous. They mapped all categorical data, like medications, to binary features. For continuous features like white blood cell count and glucose level, they used the reference ranges recorded in the EHR.
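The mapping described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' pipeline: the medication names, the field names, and the reference-range values are hypothetical placeholders, and encoding each continuous value as below, within, or above its range is an assumed scheme.

```python
# Hypothetical vocabulary and reference ranges, for illustration only;
# none of these names or numbers come from the study.
MEDICATIONS = ["drug_a", "drug_b", "drug_c"]
REFERENCE_RANGES = {"wbc": (4.0, 11.0), "glucose": (70.0, 100.0)}


def range_bins(value, low, high):
    """Encode a continuous value as three binary flags:
    below, within, or above its reference range."""
    return [int(value < low), int(low <= value <= high), int(value > high)]


def encode(record):
    """Turn one patient record (a dict) into a flat list of 0/1 features."""
    features = []
    # Categorical data: one binary indicator per known medication.
    meds = set(record["medications"])
    features += [int(m in meds) for m in MEDICATIONS]
    # Continuous data: compare each value against its reference range.
    for name, (low, high) in REFERENCE_RANGES.items():
        features += range_bins(record[name], low, high)
    return features


example = {"medications": ["drug_b"], "wbc": 13.2, "glucose": 85.0}
print(encode(example))  # [0, 1, 0, 0, 0, 1, 0, 1, 0]
```

Every feature ends up as a 0 or 1, which keeps records from different EHR systems comparable once they share a feature list.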
This resulted in 1,837 features for patients at Massachusetts General Hospital and 4,836 for patients at University of Michigan Hospitals. They applied L2-regularized logistic regression to learn a separate model for each hospital. Finally, they calculated each model’s discriminative performance.
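To show what L2-regularized logistic regression does, here is a minimal pure-Python sketch trained with batch gradient descent. It is not the researchers' code: the toy data, the function names, and the values of `lam`, `lr`, and `epochs` are all assumptions chosen for illustration. The penalty `lam` pulls the weights toward zero, which matters when there are thousands of binary features.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l2_logistic(X, y, lam=0.1, lr=0.5, epochs=2000):
    """Minimize mean log-loss + (lam / 2) * ||w||^2 by batch gradient descent.
    The bias term b is not penalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Data gradient (averaged) plus the L2 penalty gradient lam * w.
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy binary features; the label simply follows the first feature.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0]
w, b = fit_l2_logistic(X, y)
print(predict_proba(w, b, [1, 0]))  # predicted risk for a new record
```

Raising `lam` shrinks the weights and lowers the risk of overfitting, at the cost of a less flexible model.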
Reference: Cambridge University Press | doi:10.1017/ice.2018.16 | Source Code
Results
Discriminative performance
The researchers found that these models accurately predicted which patients would ultimately be diagnosed with CDI. For 50% of infected patients, the algorithm made an accurate prediction 5 days before diagnostic samples were taken.
More specifically, the models achieved an area under the receiver operating characteristic curve (AUROC) of 0.82 at Massachusetts General Hospital and 0.75 at University of Michigan Hospitals. Only a few predictive factors were common to the two models; the rest, including the major ones, were different.
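For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen infected patient receives a higher risk score than a randomly chosen uninfected one. The sketch below computes it directly from that definition; the labels and scores are made-up toy values, not study data.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive case
    has the higher score; tied scores count as half a correct pair."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = later diagnosed, 0 = not diagnosed.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]
print(round(auroc(labels, scores), 3))  # 8 of 9 pairs ranked correctly: 0.889
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect ranking, so 0.82 and 0.75 both indicate useful, though imperfect, discrimination.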
This new technique could be used to develop hospital-specific models for other pathogens, like methicillin-resistant Staphylococcus aureus, and for other outcomes where both institution-specific and patient-specific factors play a significant role.
Furthermore, the resulting models could be deployed in different configurations to serve different purposes. A good model could allow investigators to focus recruitment on high-risk patient populations, which could in turn make clinical studies less costly and more efficient.