Steps for developing a machine learning model
You can either use a publicly available dataset or put the controls and agreements in place to gather data as part of the development process. For example, the National Institute for Health Research (NIHR), which invests over £1 billion annually in health and social care research, makes its funded datasets publicly available. Gathering good-quality data as part of ML development can take longer, but the resulting dataset may be more representative of the problem the ML tool is trying to solve.
After the data has been collected, it is important to consider the following aspects:
- How was the data collected? For example, data collected in a hospital in England may not be appropriate if the application will be used in remote areas of Africa.
- Is the dataset representative of the target population? Is the data distributed evenly across ethnic groups, genders and ages, for example? If a public dataset that does not record these parameters was used, the model must be evaluated against these different groups to verify that it does not perform better for some groups than for others.
- Does the dataset have enough coverage? If the application takes different features as inputs and outputs classes representing different diseases, is each disease covered equally, or is there a class imbalance? For example, are there more datapoints for one disease than for the others?
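The representativeness and coverage questions above can be checked with a few lines of code. The following is a minimal sketch rather than a prescribed method: the function names, the group labels and the sample records are all hypothetical and made up for illustration. It reports the share of datapoints per class, and the accuracy of a model's predictions separately for each group.

```python
from collections import Counter, defaultdict


def class_proportions(labels):
    """Return the share of datapoints belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def accuracy_by_group(groups, y_true, y_pred):
    """Return accuracy computed separately for each group (e.g. age band)."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        seen[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / seen[group] for group in seen}


# Made-up illustrative records, not real patient data.
labels = ["disease_a", "disease_a", "disease_a", "disease_b"]
print(class_proportions(labels))  # disease_a dominates: a class imbalance

groups = ["18-40", "18-40", "65+", "65+"]
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
print(accuracy_by_group(groups, y_true, y_pred))  # lower accuracy for "65+"
```

A large gap between classes or between groups does not by itself say what to do next, but it flags where more data or a closer evaluation may be needed.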
The next step is preparing the collected data for training. This includes handling invalid or missing data and removing duplicates, which must be done carefully to make sure important data isn’t accidentally removed. Data preparation can also include converting the data from continuous to discrete values, or normalising it to better suit the selected algorithm. If the dataset needs to be labelled and the outcome is not a ground truth but an opinion, the label should be provided by several people to reduce bias and ensure the outcome is as representative as possible.
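The preparation steps described above could be sketched as follows. This is an illustration under stated assumptions, not a recommended pipeline: all function names are hypothetical, dropping rows with missing values is only one of several ways to handle them, min-max scaling is only one form of normalisation, and majority voting is only one way to combine several people's labels. Removed rows are returned rather than discarded so that they can be reviewed before anything is lost.

```python
from collections import Counter


def drop_duplicates_and_missing(rows):
    """Split rows (tuples) into kept rows and removed rows.

    A row is removed if it contains None or exactly repeats an earlier row.
    The removed rows are returned so they can be reviewed by a person.
    """
    seen = set()
    kept, removed = [], []
    for row in rows:
        if None in row or row in seen:
            removed.append(row)
        else:
            seen.add(row)
            kept.append(row)
    return kept, removed


def min_max_normalise(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretise(value, thresholds):
    """Map a continuous value to a bin index, given ascending thresholds."""
    return sum(value >= t for t in thresholds)


def majority_label(opinions):
    """Combine several annotators' labels by majority vote; None on a tie."""
    ranked = Counter(opinions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no majority: refer the case for further review
    return ranked[0][0]


# Made-up illustrative rows of (age, measurement), not real patient data.
rows = [(61, 120.0), (61, 120.0), (45, None), (30, 95.0)]
kept, removed = drop_duplicates_and_missing(rows)
print(kept, removed)
print(min_max_normalise([30, 45, 60]))
print(discretise(45, thresholds=[40, 60]))
print(majority_label(["cancerous", "cancerous", "non-cancerous"]))
```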
Training the machine learning model
After collecting and pre-processing the data, you will start to have a clearer picture of the problem and possible solutions, allowing you to select an appropriate ML model. The selection of the ML model will depend on the data and the application. Is the application predicting a value (regression) or a class (classification)? Does the data contain the expected outcome given input features (supervised learning) or not (unsupervised)? For example, if the dataset contains images of cancer cells where each image has been labelled cancerous or non-cancerous, the task would be considered supervised learning and classification.
The next step is selecting a model that can be fitted to the data and solve the task at hand. Numerous algorithms can solve the same task; however, several important factors need to be considered to choose the right one.
These factors include interpretability, that is, how well the machine learning model explains its predictions. This should be a major consideration when choosing a model for medical applications. Training time can also matter if the model needs to be retrained repeatedly as new data arrives. Prediction speed is another factor: is it important that the application makes a quick prediction, or is accuracy more important than time? Performance and interpretability will probably have the biggest impact on which algorithm should be chosen. A good way to decide is to train each candidate model on the training part of the dataset and compare their performance on the validation part to determine which is best suited for the application.
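Comparing candidates on a validation split might look like the sketch below. It is a toy illustration under assumptions: the two candidates (a majority-class baseline and a single-threshold rule on one feature), the function names and the data are all hypothetical and chosen only because both models are easy to interpret. Accuracy is used as the score here, although another metric may suit a given application better.

```python
from collections import Counter


def fit_majority(train_x, train_y):
    """Baseline: always predict the most common training class."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common


def fit_threshold(train_x, train_y):
    """One-feature rule: predict 1 when x >= t, with t chosen on training data."""
    def n_correct(t):
        return sum((x >= t) == bool(y) for x, y in zip(train_x, train_y))
    best_t = max(sorted(set(train_x)), key=n_correct)
    return lambda x: int(x >= best_t)


def accuracy(model, xs, ys):
    """Fraction of inputs for which the model's prediction matches the label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


def select_model(candidates, train, validation):
    """Fit each candidate on the training split; rank by validation accuracy."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = accuracy(model, *validation)
    best = max(scores, key=scores.get)
    return best, scores


# Made-up one-feature data with binary labels, not real measurements.
train = ([1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1, 1])
validation = ([2.5, 4.0, 6.5, 8.5], [0, 0, 1, 1])

candidates = {"majority_baseline": fit_majority, "threshold_rule": fit_threshold}
best, scores = select_model(candidates, train, validation)
print(best, scores)
```

The validation score is only one input to the decision. Interpretability, training time and prediction speed still need to be weighed alongside it.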