Evaluating the machine learning model
It is important not to train the model on the whole dataset: to guard against overfitting, a subset should be held back for testing how well the model performs. A common approach is to use 70% of the data for training and 30% for testing, or to split the data three ways into train/validation/test, with 60% for training, 20% for validation and 20% for testing. The validation set is used to evaluate the model as it is being built, in order to find the best parameters. The test set is then used to measure how well the model performs on data it has never seen before.
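As a minimal sketch of such a split using scikit-learn's train_test_split, here producing a 60/20/20 train/validation/test split on one of its built-in datasets (the dataset and ratios are purely illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First split off 40% of the data, then halve that portion into
# validation and test sets, giving a 60/20/20 split overall.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
```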
When evaluating the model for bias, a good test is to take a subset of the test data (only including women, for example) and check whether the percentages of false positives and false negatives are higher than those for the test data as a whole.
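A rough sketch of such a subgroup check, assuming binary labels and a boolean `is_female` mask over the test set (the arrays below are invented for illustration; in practice they would come from the held-out test data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn), fn / (fn + tp)

# Illustrative stand-ins for the real test labels, predictions and mask.
y_test = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])
is_female = np.array([True, True, False, True, False, True, False, True])

fpr_all, fnr_all = error_rates(y_test, y_pred)
fpr_f, fnr_f = error_rates(y_test[is_female], y_pred[is_female])
print(f"overall FPR/FNR: {fpr_all:.2f} / {fnr_all:.2f}")
print(f"female  FPR/FNR: {fpr_f:.2f} / {fnr_f:.2f}")
```

A large gap between the subgroup rates and the overall rates is a signal that the model may be treating that group differently.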
Regression metrics are quite different from classification metrics, as regression models predict a continuous value instead of a discrete class. Common metrics for regression models are explained variance, mean squared error and R².
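All three metrics are available directly in scikit-learn; a minimal illustration with made-up values:

```python
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Illustrative true values and model predictions for a regression task.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("explained variance:", explained_variance_score(y_true, y_pred))
print("mean squared error:", mean_squared_error(y_true, y_pred))
print("R²:", r2_score(y_true, y_pred))
```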
It is also important to evaluate the model in a clinical setting different from the one in which the training data was gathered. A research group at the Icahn School of Medicine at Mount Sinai developed an algorithm to identify pneumonia in lung X-rays. The model performed with greater than 90% accuracy on X-rays at Mount Sinai, but when it was tested at other institutions it was far less accurate. The researchers later found that the model had learned to exploit how common pneumonia was at each institution, a shortcut it should never have relied on. It is easy to miss subtle factors the model is picking up when it is trained on data gathered at one location, from one group, at one time. This leads us back to the importance of understanding the data, the problem and the solution being implemented, which brings us to explainability in AI-based medical devices.
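A minimal sketch of this kind of external check, on synthetic data standing in for two sites whose feature distributions differ: train on one site, then score held-out data from each site side by side. All names and distributions here are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

# Synthetic stand-ins for two sites with shifted feature distributions
# (e.g. different scanners, protocols or disease prevalence).
X_site_a = rng.normal(0.0, 1.0, size=(600, 4))
X_site_b = rng.normal(0.7, 1.3, size=(600, 4))
w = np.array([1.5, -1.0, 0.5, 2.0])
y_site_a = (X_site_a @ w + rng.normal(size=600) > 0).astype(int)
y_site_b = (X_site_b @ w + rng.normal(size=600) > 0).astype(int)

# Train on site A only, then report accuracy on held-out data from each site.
model = LogisticRegression(max_iter=1000).fit(X_site_a[:500], y_site_a[:500])
print("site A accuracy:", accuracy_score(y_site_a[500:], model.predict(X_site_a[500:])))
print("site B accuracy:", accuracy_score(y_site_b, model.predict(X_site_b)))
```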
AI in medical devices: explainability
As well as being accurate, the model needs to be explainable: clinicians, regulators and most likely patients will want to understand the basis on which AI-based medical devices make their decisions.
An example of the need for explainability came when Rich Caruana and other researchers were working on a project to better understand pneumonia. The project used a machine learning system to decide whether patients should be kept in hospital overnight or sent home when first diagnosed. One of the models trained on the data found a relationship between having asthma and being low risk, which would have led to asthma patients being treated as outpatients. The model arrived at this relationship because asthma patients were considered such a high risk that they were immediately admitted to the ICU and given critical care, making them less likely to die from pneumonia. This potentially life-threatening rule that the algorithm had learned could be hard to find or correct, and it highlighted that other harmful relationships found by the algorithm could be going undetected.
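One lightweight way to surface this kind of suspicious learned relationship is to inspect an interpretable model's coefficients. The sketch below uses synthetic data and a plain logistic regression purely for illustration (Caruana's group used more expressive intelligible models, but the idea is the same): because treated asthma patients die less often in the data, the model assigns asthma a negative weight, exactly the counterintuitive "asthma implies low risk" rule a clinician would need to catch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic illustration: asthma patients receive aggressive care, so their
# observed mortality is lower, even though asthma itself raises risk.
asthma = rng.integers(0, 2, size=n)
age = rng.normal(60, 10, size=n)
risk = 0.03 * (age - 60) - 1.0 * asthma  # treatment effect masks asthma risk
died = (risk + rng.normal(0, 1, size=n) > 0.5).astype(int)

model = LogisticRegression(max_iter=1000).fit(np.column_stack([asthma, age]), died)
print(dict(zip(["asthma", "age"], model.coef_[0].round(2))))
# A negative asthma coefficient here reproduces the dangerous shortcut:
# the model has learned that asthma predicts lower mortality.
```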