Welcome back! If you’re new to ctrl-alt-operate, we do the work of keeping up with AI, so you don’t have to. We’re grounded in our clinical-first context, so you can be a discerning consumer and developer. We’ll help you decide when you’re ready to bring A.I. into the clinic, hospital or O.R.
Sam Altman testified on The Hill, Google announced a new medical LLM, and some big bets are being made that A.I. will revolutionize medicine. We’ll unpack all of that in The News this week.
In our Deep Dive, we break down the difference between model performance and team performance, and why only the latter matters.
Table of Contents
📰 The News: Where There’s Smoke, There’s Fire… Right? Big Healthcare Bets
🤿 Deep Dive: We Need Team, Not Model, Benchmarks
🪦 Best of Twitter: :(
📰 The News: Big Healthcare Bets with AI
Sam Altman, CEO of OpenAI, testified before Congress this week. Hearing our representatives discuss technology gets more painful every time (here’s your reminder that the TikTok CEO testified earlier this year). A few important kernels emerged, including confirmation that Altman holds zero equity in OpenAI, that he encourages regulation of the space, and, most notably:
“My worst fear is we cause significant harm to the world.”
Yikes. Small stakes. The regulatory space is worth keeping an eye on for us clinicians, because it’s increasingly clear that healthcare is becoming a big bet for A.I. moguls… (keep reading)
Google announced Med-PaLM 2, a large language model fine-tuned on medical data that scored an impressive 87% on the MedQA dataset. Physicians also rated Med-PaLM 2’s answers as higher quality than answers written by their own colleagues to the same questions!
One important caveat, though, even as these models keep improving, from Peter Lee (a research executive at Microsoft), who tweeted:
Getting the “right answer” is only one part of the clinical problem. And considering how many problems in medicine have multiple answers or no clear answers, getting the right answer may not even be the most important part!
In another big bet, Silicon Valley VCs invested $50M in a seed round for Hippocratic AI, a company focused on large language models for healthcare. They’ve built a model that they claim outperforms GPT-4 on a wide variety of medical examinations and is pre-trained on evidence-based content. A $50M seed round is a large bet on a single company, and with no public availability of the models or any real use cases published yet, the jury is still very much out.
In other news, Apple banned ChatGPT at work, citing concerns about leaking proprietary data. And a startup out of Japan launched a robotic arm attachment that, quite honestly, reminds us all of the Spider-Man villain Doctor Octopus. But the potential for this kind of cyborg tech to make a real difference in the operating room shouldn’t be underestimated. Even something as simple as holding a tool nearby would be an immediate win.
🤿Deep Dive: We Need Team, Not Model, Benchmarks
The A.I. hype train is at full throttle, and clinical practice is not immune to potential disruption. Clinicians are a funny bunch: we lament the lack of change or progress in healthcare, yet rarely work with any new technology long enough to let it grow into its role.
This matters for A.I. models. With Med-PaLM 2 achieving spectacular scores on a standardized question bank and Hippocratic AI raising $50M in seed funding, the models are getting all the attention. That’s reasonable, given the rise and fame of ChatGPT and GPT-4. However, let’s not lose sight of the bigger picture: the point is to have real clinical impact. That involves a clinic or hospital, a nurse, a physician, an EMR, a patient portal… the list goes on.
We must focus on A.I.’s ability to improve the delivery of care, not on its ability to post ever-higher scores on a test.
This introduces the concept of human-A.I. teams. It is of particular importance in healthcare, where even human-human teams have complex hierarchies, liability structures, and lines of ultimate responsibility. Adding a non-human, A.I. component into the fold is a very hard problem.
Take the following paper, which showed that employee performance improved more with A.I. feedback than with human feedback, but fell once the feedback was disclosed as coming from A.I. In fact, the largest performance gain came when A.I.-generated feedback was presented as coming from human managers. (h/t Ethan Mollick)
Clearly, there is a complex interaction between what is said, who says it, and how the end user receives it. Others have echoed this point: as far back as 2019, DARPA had an active call for projects on human-A.I. teaming.
Just because a model gets a series of questions right does not mean it deserves to be trialed in the real world. We must spend more time adjudicating how it fits into the clinical environment, how its users interact with it, and how it influences their decision-making.
A fantastic slide made the rounds following RSNA this year: very experienced, moderately experienced, and inexperienced radiologists were all influenced by A.I. recommendations, even when the A.I. system was deliberately incorrect (part of the study design).
As we have postulated here before, inexperienced users were more likely to follow incorrect A.I. guidance than experienced ones. Hospitals are struggling with this too:
One problem with putting so much emphasis on statistical performance…is that it disregards so many other factors that may bear on an AI’s impact, such as a clinician’s judgment or the individual values and preferences of patients.
Yes! A.I. is one piece of the clinical puzzle, and we as a clinical community must decide how, and in which scenarios, it fits in. If you follow this newsletter, odds are you’re bullish that there are many such scenarios, so let’s find them.
So this raises the question: how do we test these teams? Of course, the randomized-controlled-trial cohort of academicians will scream for RCTs for every A.I. tool. But an autonomous A.I. chatbot that triages patient acuity (high risk) and an A.I. chatbot that simply translates medical jargon into plain language (lower risk), for example, should likely have different thresholds to meet.
Here’s a three-part framework we might use as we evaluate these tools moving forward. The mechanisms by which they should be dissected (randomized trials, pilot studies, etc.) are likely use-case-specific, but this framework should carry us through most of these evaluations. A minimal sketch of how it might be captured in practice follows the framework below.
1. Define Target Point of Interaction
Who will be interacting with the A.I. model, where, and in what format? Is it in the EMR, or on a screen during a surgical case? Requiring a surgeon to scrub out to rotate a three-dimensional image, for instance, may not be the best implementation practice.
This is why clinical input is necessary even in the design phase of these products. The end users are ultimately the ones delivering care to patients, and they are crucial both to getting the right data into the model and to getting timely insights back to the team.
2. Define the Behavioral Change A Priori
This, I think, is an under-described aspect of human-A.I. teams. A.I. is meant to modify human behavior, either by changing decisions or by removing humans from tasks entirely.
Predefining the behavioral change we expect to see is crucial to the trust and safety of these models. Take our radiology example. If the goal of the model was to increase radiologist efficiency, with radiologists using it only for triage, then the result looks wonderful: radiologists at all experience levels benefited from the model.
But if the intended behavioral change was a change in the read itself, then there is clearly a gradation in effect across experience levels. That is not inherently good or bad, simply another factor to consider.
3. Create Clinical Endpoints
At this point, it's more than reasonable to discuss clinical endpoints. This gets back to medical school and the Phase 3 clinical trial paradigm: effects on provider efficiency, patient outcomes, clinical efficiency, provider burnout, and so on should all be quantified before an A.I. system is implemented and again afterward. We have the tools to accomplish this.
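To make this concrete, here’s a minimal sketch, in Python, of how a team might record all three parts of the framework alongside pre- and post-implementation endpoint measurements. The class and field names are entirely hypothetical, our own illustration rather than any published standard or library.

```python
# Hypothetical sketch: capture the three-part evaluation framework as a
# structured record. Names and numbers are illustrative only.
from dataclasses import dataclass, field


@dataclass
class HumanAITeamEvaluation:
    # 1. Target point of interaction: who touches the model, where, and how
    end_users: list[str]            # e.g., ["radiologist", "triage nurse"]
    interaction_point: str          # e.g., "EMR inbox", "PACS worklist"
    interaction_format: str         # e.g., "ranked worklist", "draft reply"

    # 2. Behavioral change defined a priori: what humans are expected to do
    #    differently once the model is in the loop
    expected_behavior_change: str   # e.g., "reorder reads; do not alter them"

    # 3. Clinical endpoints: quantified before and after implementation
    baseline_endpoints: dict[str, float] = field(default_factory=dict)
    post_implementation_endpoints: dict[str, float] = field(default_factory=dict)

    def endpoint_deltas(self) -> dict[str, float]:
        """Change in each clinical endpoint after the A.I. was introduced."""
        return {
            name: self.post_implementation_endpoints[name] - baseline
            for name, baseline in self.baseline_endpoints.items()
            if name in self.post_implementation_endpoints
        }


# Hypothetical usage for a radiology triage model:
evaluation = HumanAITeamEvaluation(
    end_users=["radiologist"],
    interaction_point="PACS worklist",
    interaction_format="ranked worklist",
    expected_behavior_change="prioritize flagged studies; reads themselves unchanged",
    baseline_endpoints={"median_report_turnaround_min": 55.0},
    post_implementation_endpoints={"median_report_turnaround_min": 41.0},
)
print(evaluation.endpoint_deltas())  # {'median_report_turnaround_min': -14.0}
```

The code itself is beside the point; what matters is that each of the three questions gets an explicit, recorded answer before the model ever touches a patient, and a quantified answer afterward.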
Whenever people bring up the “black box” nature of A.I., I like to remind them that we still do not know exactly how Tylenol works. But we have so much data on its efficacy that it hardly matters.
That’s where we need to be with A.I.
🪦Best of Twitter: Hiatus
Feeling inspired? Drop us a line and let us know what you liked.
Like all surgeons, we are always looking to get better. Send us your M&M-style roastings or favorable Press Ganey ratings by email at ctrl.alt.operate@gmail.com.