Developing machine learning and AI for medical devices

31 Mar 2023 15min read

Machine learning (ML), which is a subset of artificial intelligence (AI), is a powerful process where algorithms are trained to make statements, decisions or predictions based on the data they are given.

From medical imaging to augmenting digital pathology, there is a huge amount of potential for AI in healthcare and diagnostics. The continued exploration of AI as a medical device will require a marked shift in regulatory approaches, alongside careful consideration of the risks and potential consequences of inaccurate models. While the industry continues to shift to respond to this rising and promising technology, this article aims to describe what needs to be considered when developing and releasing ML/AI-based medical devices to market under current regulations. It will also consider how future regulations are needed to leverage the full power and benefits of continuously learning AI algorithms for healthcare.

A quick history of AI in healthcare

The term artificial intelligence was first coined at the Dartmouth Summer Project in 1956 where researchers from different fields gathered and produced ideas, papers and concepts on their views of AI at that time. Funding for AI research was sporadic, from highs in the late 1950s and early 1960s to lows during the 1970s.

As early as the 1970s, researchers saw the benefits of AI in healthcare. During this era four different AI systems emerged from Pittsburgh, Rutgers, Stanford and MIT:

INTERNIST 1 was a system for differential diagnosis in internal medicine.
MYCIN was a system for infectious disease therapy assistance. During an independent assessment it was found to score highly on both accuracy of diagnosis and effectiveness of treatment.
CASNET was used in consultation of glaucoma.
PIP was an early diagnostic tool used in the evaluation of patients with oedema.

Already at that time, the potential of AI in healthcare had been realised and specifically how it could benefit diagnostics. Fast forward 50 years, and we have AI technology that can interpret a medical image up to 10,000 times faster than the average radiologist, machine learning models that can assess the severity of Covid-19 from chest X-rays, and AI that can improve patient diagnosis and care by augmenting digital pathology. There is still a long way to go before this technology is fully optimised and able to handle the full complexities of human health. For example, IBM’s Watson Health, launched in 2015 to diagnose patients and recommend treatment options, was reported by customers to spit out erroneous cancer treatments. With the huge impact that AI in medical devices can have on patient lives, it is important that these technologies are developed and regulated with great care to ensure patient safety.

Key considerations in the design of machine learning in medical applications

Design considerations

With the recent hype around AI, it is important to consider when AI should be applied to a problem. This is especially the case in the safety critical and highly regulated world of healthcare, where the technology must be accepted by multiple stakeholders, from clinicians to patients, regulators and payers. Any new AI tool will need to align to all of these stakeholder requirements. For example, it may need to be able to integrate well into a clinical setting, be easy to use for patients, and have safety critical algorithms that are transparent and explainable to regulators.

Some useful checks are provided below to help inform the AI design process:

1. Does it solve a real problem – something that all stakeholders (clinicians, patients, regulators and payers) identify with?
2. Does it use the principle of ‘Design Thinking’ to understand feasibility, viability and desirability?
3. Does it use many of the research techniques of front end innovation to confirm that AI is the right solution for a problem, rather than a technique without a specific purpose?

Gathering and accessing data for the machine learning model

A key consideration for the development process is the strategy for gathering and curating data. If the dataset is bad quality (e.g. not inclusive or too small) it will affect the performance of the ML model significantly. In the healthcare setting, data can be sensitive, difficult to obtain and potentially distributed across multiple different digital systems.

Steps for developing a machine learning model

It is possible either to access a publicly available dataset or to put the controls and agreements in place to gather the data as part of the development process. As an example, the National Institute for Health Research (NIHR), which invests over £1 billion annually into health and social care research, makes its funded datasets publicly available. Gathering good quality data as part of ML development can take longer, but the resulting dataset might be more representative of the problem the ML tool is trying to solve.

After the data has been collected, it is important to consider the following aspects:

How was the data collected? For example, data collected in a clinical hospital in England may not be appropriate if the application will be used in remote areas in Africa.
Is the dataset representative of the population of the target audience? Is there equal distribution of data from different ethnic groups, gender and age for example? If a public dataset was used which doesn’t include these parameters, the model must be evaluated against these different groups to verify it does not perform better for certain groups than others.
Does the dataset have enough coverage? If the application takes different features as inputs, while the outputs are classes representing different diseases, is each disease covered equally or is there a class imbalance? For example, are there more datapoints for a certain disease than others?
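The coverage question can be checked with a few lines of code. The following is a minimal sketch rather than a prescribed method: the function names, the example labels and the 0.5 warning threshold are all assumptions made for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of datapoints per class, e.g. per disease."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the rarest class count to the most common class count.

    1.0 means perfectly balanced; values near 0 mean strong imbalance.
    """
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Hypothetical labels, for illustration only.
labels = ["pneumonia"] * 6 + ["asthma"] * 3 + ["healthy"] * 1
distribution = class_distribution(labels)
ratio = imbalance_ratio(labels)

# The 0.5 threshold is an arbitrary assumption, not a regulatory value.
if ratio < 0.5:
    print(f"Possible class imbalance (ratio {ratio:.2f}): {distribution}")
```

The same two functions could be applied to demographic attributes such as age band or gender to check how well each group is represented.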

The next step is preparing the data that has been collected for training. This includes handling invalid or missing data or removing duplicate data, which must be done carefully to make sure important data isn’t accidentally removed. Data preparation can also include converting the data from continuous to discrete values, or normalising it to suit the selected algorithm better. If the dataset needs to be labelled and the outcome is not a ground truth but an opinion, this opinion should be deduced by several people to reduce bias and ensure the outcome is as representative as possible.
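As a rough illustration of these preparation steps, the sketch below removes incomplete and duplicate rows, normalises a feature to the range 0 to 1, and combines several annotators' opinions by majority vote. The function names and example values are hypothetical, and dropping incomplete rows (rather than imputing them) is an assumption that will not suit every dataset.

```python
from collections import Counter


def drop_incomplete_and_duplicates(rows):
    """Remove rows with missing values (None) and exact duplicates.

    Rows must be tuples (hashable). Original order is preserved.
    """
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # assumption: incomplete rows are dropped, not imputed
        if row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


def min_max_normalise(values):
    """Scale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def majority_label(opinions):
    """Combine several annotators' labels into one by majority vote.

    Returns None on a tie so the case can be escalated for review.
    """
    ranked = Counter(opinions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical (age, systolic blood pressure) rows.
rows = [(40, 120), (40, 120), (65, None), (80, 160), (60, 140)]
cleaned = drop_incomplete_and_duplicates(rows)
ages = min_max_normalise([age for age, _ in cleaned])
label = majority_label(["cancerous", "cancerous", "non-cancerous"])
```

Logging which rows were removed, and why, makes it easier to confirm afterwards that important data was not discarded by accident.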

Training the machine learning model

After collecting and pre-processing the data, you will start to have a clearer picture of the problem and possible solutions, allowing you to select an appropriate ML model. The selection of the ML model will depend on the data and the application. Is the application predicting a value (regression) or a class (classification)? Does the data contain the expected outcome given input features (supervised learning) or not (unsupervised)? For example, if the dataset contains images of cancer cells where each image has been labelled cancerous or non-cancerous, the task would be considered supervised learning and classification.

The next step is selecting the appropriate model that can be fitted to the data and solve the task at hand. There are numerous algorithms that can be used to solve the same task, however there are several important factors that need to be considered to choose the right one.

These factors could include interpretability, such as how well the machine learning model explains its predictions. This should be considered a huge factor when choosing a model for machine learning in medical applications. The amount of time it takes to train the model can also be an important factor if the model needs to be repeatedly trained when receiving new data. Is it important that the application makes a quick prediction, or is accuracy more important than time? Performance and interpretability will probably have the biggest impact on which algorithm should be chosen. A good way to decide on the best algorithm would be to use the validation part of the dataset to determine which model is best suited for the application.
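Comparing candidates on validation data can be sketched as follows. This is a simplified illustration: the candidate "models" are hypothetical threshold rules on a single made-up biomarker value, and accuracy is used as the only criterion, whereas in practice interpretability and training time would also be weighed.

```python
def accuracy(model, samples):
    """Fraction of (feature, label) samples the model classifies correctly."""
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)


def select_model(candidates, validation_samples):
    """Return the name and validation accuracy of the best candidate."""
    scores = {
        name: accuracy(model, validation_samples)
        for name, model in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical candidates: simple threshold rules on one biomarker value.
candidates = {
    "threshold_3": lambda x: int(x > 3),
    "threshold_5": lambda x: int(x > 5),
    "threshold_7": lambda x: int(x > 7),
}
# Hypothetical validation samples: (biomarker value, true class).
validation_samples = [(1, 0), (2, 0), (4, 0), (6, 1), (8, 1), (9, 1)]

best_name, best_score = select_model(candidates, validation_samples)
```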

Evaluating the machine learning model

It is important not to train the model on the whole dataset: a subset should be held back to test how well the model performs and to detect overfitting. A common approach is to use 70% of the data for training and 30% for testing, or to split the data three ways and use 60% for training, 20% for testing and 20% for validation. The validation dataset is used to evaluate the model as it is being built to find the best parameters. The testing data is then used to test how well the model performs on data it hasn’t seen before.
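A 60/20/20 split could be implemented along the following lines. This is a minimal sketch: the function name, the default fractions and the fixed seed are assumptions, and the integer records stand in for real patient data.

```python
import random


def split_dataset(records, train_fraction=0.6, test_fraction=0.2, seed=42):
    """Shuffle records and split them into train, test and validation lists.

    Whatever remains after the train and test portions becomes validation.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train_fraction)
    n_test = round(len(shuffled) * test_fraction)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


records = list(range(100))  # stand-in for 100 patient records
train, test, validation = split_dataset(records)
```

If one patient contributes several records, splitting by patient rather than by record helps to avoid the same patient appearing in both the training and testing data.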

When evaluating the model to check for bias, a good test is to take a subset of the test data (only including women, for example) and check whether the percentage of false positives and false negatives is higher than for the entire test data.
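The subgroup check can be sketched as below. The function names, the group codes and the test results are hypothetical, and the example assumes binary labels where 1 means the disease is present.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def subgroup_error_rates(y_true, y_pred, groups, group):
    """Error rates restricted to test samples belonging to one group."""
    pairs = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
    return error_rates([t for t, _ in pairs], [p for _, p in pairs])


# Hypothetical test results: 1 = disease present, 0 = absent.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

overall = error_rates(y_true, y_pred)
women = subgroup_error_rates(y_true, y_pred, groups, "F")
```

In this made-up example both error rates are 25% overall but 50% for women, which is the kind of gap the check is meant to surface.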

Regression metrics are quite different from classification metrics as regression models are predicting a continuous range instead of a discrete class. Common metrics for regression models are explained variance, mean squared error and R².
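To make the three regression metrics concrete, the sketch below implements them from their standard definitions using plain Python. The helper names and the example values are assumptions for illustration; an established library would normally be used in practice.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def mean_squared_error(y_true, y_pred):
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)


# Hypothetical values: every prediction is exactly 1.0 too high.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [4.0, 6.0, 8.0, 10.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
ev = explained_variance(y_true, y_pred)
```

With a constant offset like this one, explained variance stays at 1.0 while R² drops to 0.8, so comparing the two can hint at a systematic bias in the predictions.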

It is also important to evaluate the model in a clinical setting different from the one in which the training data was gathered. A research group at the Icahn School of Medicine at Mount Sinai developed an algorithm to identify pneumonia in lung X-rays. The model performed with greater than 90% accuracy on X-rays at Mount Sinai, but when it was tested at different institutes it was far less accurate. The group later found that the model had learned to factor in how common pneumonia was at each institution, which it should not have done. It can be easy to miss small factors that the model is picking up when it is trained on data gathered at one location, using one group, at one time. This leads us back to the importance of understanding the data, the problem and the solution being implemented, which brings us on to explainability in AI-based medical devices.

AI in medical devices: explainability

As well as being accurate, the model needs to be explainable, as clinicians, regulators and most likely patients will want to understand the basis on which AI-based medical devices are making their decision.

An example of the need for explainability was demonstrated when Rich Caruana, with other researchers, was working on a project to better understand pneumonia. The project involved using a machine learning system to decide whether patients should be kept in the hospital overnight or sent home after they are first diagnosed. One of the models used on the data had found a relationship between having asthma and being low risk, resulting in the patient being treated as an outpatient. The model arrived at this relationship because asthma patients were considered such a high risk that they were immediately put into the ICU and critical care, resulting in them being less likely to die from pneumonia. This potentially life-threatening rule that the machine learning algorithm had developed could be hard to find or correct. It also highlighted that the algorithm could have learned other harmful relationships that were going undetected.

Regulators already require explainability: the GDPR requires that if AI makes a sufficiently important decision about a person, an explanation is provided when that decision is made. Those working in AI medical device regulation should therefore work with other regulatory stakeholders, such as legal teams, before releasing AI tools to the market. Developers should also work closely with clinicians to integrate new AI tools into a clinical setting, to make them more understandable for everyone.

For further information on good machine learning practices (GMLP) for developing ML and AI-based medical devices, the FDA and MHRA have identified 10 guiding principles that can be found on the UK Government website.

AI medical device regulations

The benefits of AI in healthcare have already been realised by the research community, with over 12,000 life science papers describing AI/ML already published. However, making the most of the benefits of ML and AI-based medical devices seems to be lagging, largely due to the strict regulations that apply to medical devices.

The FDA has put together a list of AI/ML enabled medical devices that are marketed in the United States. At the time of writing, there are 343 medical devices with AI/ML capabilities on this list, with 93.5% having arrived on the market between 2015 and 2021.

Around 70% of those AI-based medical devices are used for diagnostic imaging, indicating that AI in diagnostics is a real trend. The FDA states that the AI/ML devices that have received approval involved “locked” algorithms, meaning that given the same input, the model will always give the same output. However, with this surge of AI/ML enabled medical devices hitting the market, there still seems to be reluctance towards adopting these solutions, possibly due to regulatory framework hindrance or lack of trust in these technologies by patients or practitioners.

Current ML/AI medical device regulations

Currently, there are no harmonised standards that regulate the use of machine learning in medical applications and devices, and companies are not required to classify their technology as AI/ML based. Despite this, many AI-based medical devices that use “locked” algorithms (algorithms that will not change beyond the point of regulatory submission) have been approved for use in the EU and USA by adhering to the medical software regulations already in place.

Regulations for AI

What to keep in mind when regulating an AI/ML enabled medical device:

Data gathering:
How was it determined that enough data was used to train the model?
How was bias eliminated?
How was invalid or missing data handled?
How was confidentiality and integrity of the data, as is required by the GDPR, handled?
Software developed for gathering data, labelling, training and testing must be validated under computerized system validation (CSV) according to ISO 13485:2016, clause 4.1.6.
Machine learning libraries used to pre-process the data and train the model are usually not part of the medical device, in which case IEC 62304 does not apply to them. However, any machine learning libraries, Software of Unknown Provenance (SOUP) or off-the-shelf software (OTSS) used to train the model need to be validated and proved to fit the best parameters for the selected model.
The trained model is considered part of the medical device and is responsible for making the correct prediction. The model needs to be verified to demonstrate that the trained model and the machine learning library used (for example a model with its parameters and prediction function) give the correct output mathematically. The validation of these machine learning libraries then includes, for example, defining the requirements of the library functions used and verifying that the corresponding functions meet those requirements.
The validation of the medical device includes determining the accuracy of the device in predicting outcomes, e.g. correctly detecting cancer in an image.
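As an illustration of verifying that a trained model gives the correct output mathematically, the sketch below compares a prediction function against a value derived by hand and checks that repeated calls give identical results. Everything here is hypothetical: the logistic regression model, its parameters, the function names and the tolerance are assumptions, not requirements taken from any standard.

```python
import math


def predict_probability(weights, bias, features):
    """Hypothetical locked model: logistic regression with fixed parameters."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def verify_prediction(weights, bias, features, expected, tolerance=1e-9):
    """Check the model output against an independently derived value."""
    actual = predict_probability(weights, bias, features)
    return math.isclose(actual, expected, abs_tol=tolerance)


# Hypothetical parameters and input chosen so the result is easy to derive
# by hand: z = -1.0 + 0.5 * 2.0 + 0.25 * 0.0 = 0.0, and sigmoid(0) = 0.5.
weights, bias = [0.5, 0.25], -1.0
features = [2.0, 0.0]

matches_hand_calculation = verify_prediction(weights, bias, features, 0.5)

# A locked algorithm must return the same output for the same input.
is_deterministic = (
    predict_probability(weights, bias, features)
    == predict_probability(weights, bias, features)
)
```

A real verification suite would cover many such reference cases, each traced back to a documented requirement.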

Proposed action plan

The FDA proposed a regulatory framework for modifications to AI/ML-based software as a medical device, as well as an action plan for AI/ML software as a medical device, in 2019 and 2020 respectively. As current regulatory frameworks do not account for adaptive or continuously learning algorithms, the proposed plan is intended to describe a total product lifecycle-based regulatory framework. This would allow modifications to be made to the algorithm while it is learning in the field, without compromising the safety or the effectiveness of the ML or AI-based medical device.

Such adaptive AI could have great benefits in healthcare, if regulated effectively to ensure safety. An adaptive AI has the ability to continually train itself on real data, instead of being limited to historical data. Such learning could help improve diagnosis of rarer diseases for example, as more patients are analysed in real time. With a clearer pathway for FDA approval outlined, it is only a matter of time before the first medical device with adaptive AI software is FDA and/or EU approved.

The future of AI in diagnostics

By improving current regulations for AI/ML based software as a medical device, as well as providing a regulatory framework for technologies with continuously learning algorithms, we can facilitate the development and adoption of these increasingly popular technologies, which will have the power to learn from real-world experience and improve healthcare. Having a clear regulatory framework and good development practices for AI in medical devices will help to ensure the transparency of these algorithms and gain the trust of patients and practitioners to adopt them.
