Sounds impressive. But a laboratory accuracy assessment goes only so far. It says nothing about how the AI will perform in the chaos of a real-world environment, and that is what the Google Health team wanted to find out. For several months, they watched nurses perform eye exams and interviewed them about their experiences using the new system. The feedback was not entirely positive.
When it worked well, the AI sped things up. But it sometimes failed to deliver a result at all. Like most image recognition systems, the deep learning model had been trained on high-quality scans; to ensure accuracy, it was designed to reject images that fell below a certain quality threshold. With nurses scanning dozens of patients an hour and often taking photos in poor lighting conditions, more than a fifth of the images were rejected.
Patients whose images were rejected by the system were told they would have to visit a specialist at another clinic on another day. For those who found it hard to take time off work or had no car, this was obviously inconvenient. Nurses felt frustrated, especially when they believed the rejected scans showed no signs of disease and the follow-up appointments were unnecessary. They sometimes wasted time trying to retake or edit an image that the AI had rejected.
Because the system had to upload images to the cloud for processing, poor internet connections at several clinics also caused delays. “Patients like the instant results, but the internet is slow and patients then complain,” said one nurse. “They’ve been waiting here since six in the morning, and for the first two hours we could only screen 10 patients.”
The Google Health team is now working with local medical staff to design new workflows. For example, nurses could be trained to use their own judgment in borderline cases. The model itself could also be tweaked to handle imperfect images better.
Risking a backlash
“This is a crucial study for anyone interested in getting their hands dirty and actually implementing AI solutions in real-world settings,” says Hamid Tizhoosh of the University of Waterloo, Canada, who works on AI for medical imaging. Tizhoosh is highly critical of what he sees as a rush to announce new AI tools in response to covid-19. In some cases, he says, tools are being developed and models released by teams with no healthcare expertise. He sees the Google study as a timely reminder that establishing accuracy in a laboratory is only the first step.
Michael Abramoff, an ophthalmologist and computer scientist at the University of Iowa Hospitals and Clinics, has been developing AI for diagnosing retinal disease for several years and is CEO of a spinoff startup called IDx Technologies, which has collaborated with IBM Watson. Abramoff was once an AI healthcare cheerleader, but he too cautions against a rush, warning of a backlash if people have bad experiences with AI. “I’m so glad Google is showing it’s willing to look at the actual workflow in clinics,” he says. “There is much more to healthcare than algorithms.”
Abramoff also questions the usefulness of comparing AI tools with human experts when it comes to accuracy. Obviously, we don’t want an AI to make a bad call. But human doctors disagree all the time, he says, and that’s fine. An AI system needs to fit into a process where sources of uncertainty are discussed rather than simply dismissed.
Get it right and the benefits could be huge. When it worked well, Beede and her colleagues saw how the AI made people who were good at their jobs even better. “There was one nurse who screened 1,000 patients on her own, and with this tool she’s unstoppable,” she says. “The patients didn’t really care that it was an AI rather than a human reading their images. They cared more about what their experience was going to be.”
Correction: The opening line has been changed to make it clear that not all countries are being overloaded.