Tomasz Grzywalski, head of AI solutions at StethoMe, writes about the potential flaws in research involving AI.
Artificial intelligence is revolutionising many areas of our lives, and medicine is no exception. Many research teams around the globe are independently working on solving the same problems. Although we do come across news of a spectacular breakthrough from time to time, most progress is actually made slowly, through many cycles of improving existing solutions one step at a time. To make this progress possible, there is an increasing need for a reliable, repeatable way to evaluate different solutions.
Unfortunately, in my experience, most papers, especially in the medical domain, are written and presented in a way that makes the proposed solutions impossible to compare with one another. Even when authors test their solution on a publicly available dataset, there is usually some form of “but”, and the quality and value of the research can suffer as a result. When reading medical journals or papers it is wise to read between the lines when considering competing algorithms: an unfair (skewed) methodology can produce very different test outcomes.
These are the four sins most often committed by authors:
Evaluating the solution using a small dataset
It’s not uncommon to see research based on a tiny sample of 10-20 patients. Drawing any conclusions from such a small batch is, at best, a stretch. Imagine someone stating that their solution has 80% accuracy and is 5% better than the baseline algorithm. If their test set included only 20 patients, then the “improvement” is actually just one case, so including one or two more patients could completely change the results.
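One way to see why is to put a confidence interval around an accuracy estimated from 20 patients. A minimal sketch in Python, using the Wilson score interval (one standard choice for a binomial proportion) and the hypothetical numbers from above:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# "80% accuracy" on 20 patients means 16 correct cases out of 20
low, high = wilson_interval(16, 20)
print(f"95% CI: {low:.1%} .. {high:.1%}")  # roughly 58% .. 92%
```

With an interval this wide, a claimed 5% improvement over a baseline is statistically meaningless.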
Rejecting some data from the test set
Authors of medical papers often like to analyse the data at a very granular level, often on a case-by-case basis. For various reasons (sometimes explained, sometimes not) researchers like to exclude some cases from the test. This is not such a crime if the authors provide a complete list of excluded cases, but they very seldom do. This of course makes it impossible for other researchers to reproduce the experiment and therefore significantly lowers the reliability of the research. The authors should at least report results for both datasets: with and without the exclusions.
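A minimal sketch of what such dual reporting could look like (the `cases` records and their field names are hypothetical):

```python
def accuracy(predictions, labels):
    """Fraction of cases where the prediction matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def report_with_and_without_exclusions(cases, excluded_ids):
    """Report accuracy on the full test set and on the filtered one,
    together with the exact list of excluded case IDs, so the
    experiment can be reproduced either way."""
    kept = [c for c in cases if c["id"] not in excluded_ids]
    return {
        "accuracy_all": accuracy([c["pred"] for c in cases],
                                 [c["label"] for c in cases]),
        "accuracy_filtered": accuracy([c["pred"] for c in kept],
                                      [c["label"] for c in kept]),
        "excluded": sorted(excluded_ids),
    }
```

Publishing the `excluded` list alongside both numbers lets readers judge how much the exclusions flattered the result.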
Using custom splits between training and testing sets
Machine learning applications often require one set of data for training the algorithm and a separate one for evaluation. Sometimes this split is provided by the data itself, which makes things much easier. If not, the authors often decide to make their own split – and don’t publish it explicitly. Why is this so important? We like to think that if an algorithm achieves 80% accuracy on one random subset of the data, it would display similar performance on any other. But unless our dataset consists of thousands of cases, the split will have a huge effect on the measured performance. If the data doesn’t come with an explicit split, researchers should at least use a cross-validation procedure so that each case is used exactly once for testing. Using a single, random, unpublished split makes the results worthless.
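A cross-validation split of the kind described above can be sketched in a few lines of Python (the seed and fold count here are arbitrary; publishing both would let other researchers reproduce the exact split):

```python
import random

def kfold_splits(n_cases, k=5, seed=42):
    """Deterministic k-fold cross-validation split.
    Every case appears in exactly one test fold, and the fixed seed
    makes the whole procedure reproducible by other researchers."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test folds
    return [
        (sorted(set(idx) - set(test_fold)), sorted(test_fold))
        for test_fold in folds
    ]

splits = kfold_splits(100, k=5)  # 5 (train, test) index pairs
```

Libraries such as scikit-learn provide equivalent utilities (e.g. `KFold` with a fixed `random_state`); the essential point is that the split is deterministic and disclosed.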
Using the same data for training and testing
This is the gravest of sins. Fortunately it rarely occurs, but when it does, it renders the whole piece of research useless. Machine learning models show a surprising ability to overfit the training data, that is, to memorise all the training examples. Additionally, when working with machine learning models we usually don’t have control over which features a model uses to make a decision, and it often chooses the wrong ones. This happens even to the simplest models, and even when we try to prevent it with data augmentation or batch normalisation techniques. But this rule is not limited to the same data instance that we work with, e.g. an image; it extends to people as a whole. It is not OK to use one CT scan of a person to train the model and then take another scan of the same person to evaluate its performance.
These two scans will show many similarities even if they are taken days or weeks apart. It is likely that the model learned to use a feature that is common to both scans but doesn’t generalise over the whole population. Unfortunately, it is commonplace to read articles in respectable journals that make exactly this mistake: for example, a single recording made with an electronic stethoscope (say, 20 seconds long) is split into smaller chunks, some of which are used for training and some for testing. This shouldn’t be done; you shouldn’t even use different recordings from the same person for both training and testing.
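The remedy is to split the data at the patient level rather than at the recording or chunk level. A minimal sketch, assuming each recording carries a (hypothetical) `patient_id` field:

```python
import random

def split_by_patient(recordings, test_frac=0.2, seed=0):
    """Train/test split at the patient level: every recording from a
    given patient ends up on the same side of the split, so no
    patient-specific feature can leak from training into testing."""
    patient_ids = sorted({r["patient_id"] for r in recordings})
    random.Random(seed).shuffle(patient_ids)
    n_test = max(1, int(len(patient_ids) * test_frac))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in recordings if r["patient_id"] not in test_ids]
    test = [r for r in recordings if r["patient_id"] in test_ids]
    return train, test
```

The same idea generalises to any grouping that could leak identity, e.g. splitting scans by patient, or recordings by recording session; scikit-learn's `GroupKFold` implements the cross-validated version.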
Imagine you’re attending a maths course in school and during the final exam you are faced only with the exact same exercises that were practised during lesson time. You can easily attain a good grade, but how much does it tell you about your understanding of the subject?