Discover more from Ctrl-Alt-Operate
What's New In Surgical AI: 4/30/23 Edition
Vol 23: Medical AI hits the Front Page of the WSJ, Non-Medical Grab Bag
Welcome back! If you’re new to ctrl-alt-operate, we do the work of keeping up with AI, so you don’t have to. We’re grounded in our clinical-first context, so you can be a discerning consumer and developer. We’ll help you decide when you’re ready to bring A.I. into the clinic, hospital or O.R.
This week’s newsletter is brought to you by AI Radio (language warning; most songs are pop/hip-hop).
For us non-hip-hop experts, these might sound eerily real, but they were all made without a real artist singing. We’ll guide you through these and other controversies below.
This week, as always, we bring you all that’s happened in A.I. this week and a deep dive into one topic. If you’ve been with us before, send this to one clinician who considers themselves “techie” - those are our people 😁
We launched a few months before our faithful friend chatGPT took the world by storm. Since then, we’ve been flooded with friends, colleagues, and new friends from all over the world asking how they can get involved with A.I. in surgery or more broadly in medicine. Today, we try to answer some of these questions in our deep dive. Let’s get started.
Table of Contents
📰 News of the Week:
🤿 Deep Dive: ChatGPT, MD?
🪦 Tweets of the Week: On Hiatus
News of the Week: Non-Medical Grab Bag
A Pew Research poll found that most Americans believe AI will affect jobholders, and only 7% favored using AI to make final hiring decisions. What about in screening applications - like residency applications? (Many job applicants are using GPT to write their personal statements.) Reading many hundreds of applications is extremely time-consuming and typically done by clinicians (read: Dan) with numerous competing demands on their time. We can use GPT to read those voluminous ERAS packets and letters of recommendation, and pull out the gems that are most likely to sway our decision-making. By 2026, maybe our interviews will be between an applicant’s GPT agent and the program’s GPT. Regulators are getting hip to this, so stay tuned. Yes, we all worry that GPTs will make horrible errors, but having been on the other side, I think we need to understand the human baseline rates of these problems as well.
Harvey, the “GPT for JDs” startup, recently raised from Sequoia and is becoming increasingly integrated into BigLaw.
The freelance content-writing market has been thoroughly disrupted: everyone* is using GPT, secondary freelancers have emerged to clean up other freelancers’ GPT outputs, and some poorly thought-out GPT bans are in place. These debates are important to follow, serving as bellwethers for larger markets, including medicine.
The AI from the movie “Her” (ScarJo / Joaquin Phoenix / Spike Jonze) is now available as an app on your iPhone (ScarJo voice coming soon). I’m not a big fan of the impact of these apps on humanity writ large. But no one is listening to me.
LLMs, including Khan Academy’s KhanMigo, are augmenting classroom performance. This private-tutor model could dramatically improve educational achievement. Will we see this in medicine, this year, as part of Amboss’s pivot from a flashcard company into a $1B unicorn?
A recent Semafor article highlighted tech folks who have transitioned from CS to healthtech/biotech, including friends and college classmates of ours. It’s a good time to be recruiting!
GOOG 0.00%↑ released Rapsai, a no-code ML tool that allows the generation of custom ML pipelines in 15 minutes. It’s like Figma and Colab had a baby. These tools are a big part of the commodification of ML implementation.
Meanwhile, in Medicine …
AAPL 0.00%↑ health is expected to release a machine learning-enabled health coach in September that uses your data to provide customized insights to improve your health. You can get a sneak preview of what this might look like at the HealthGPT Github repo, which provides chat functionality against your own personal iOS data.
ChatGPT got access to the internet, the front page of the WSJ, and has continued to captivate the public imagination. Admittedly, the performance in today’s build leaves much to be desired. But with these technologies and chain-of-thought prompting reaching consumers, be prepared for another explosion of “advances” in GPT performance this summer.
Structured-light navigation is coming to an operating theater near you. Proprio Vision, led by friend-of-the-substack entrepreneurial maven Dr. Sam Browd, recently announced it received FDA 510(k) clearance for a structured-light-based, computer-vision-powered surgical navigation system. These camera-based systems help surgeons “see through” the body to accurately place implants and identify critical structures. Navigation techniques register a three-dimensional map of the surgical field to preoperative imaging, and there is a whole host of structured-light-based technologies currently available (7D Surgical) and in development (Advanced Scanners, among others).
Urology leads the way. The annual AUA urology conference is ongoing this weekend, heavily featuring AI topics.
Remember More? Transformer-based neural networks might be getting longer memories. I won’t get too far into the nerdery of neural network architectures, but it is important to understand that memory is one of the most important limitations on the current transformer-based architectures that underlie GPT, many of the computer vision models we use, and much else (see our prior newsletters). The problem is that for a context of n tokens, a vanilla self-attention network’s compute requirement grows as n². You should care about memory and context length because most of the really “cool” things happen over long contexts: models get much “smarter,” can “remember” longer conversations, etc.
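To make that n² scaling concrete, here is a toy back-of-the-envelope sketch. It only counts the attention score and weighted-sum multiply-adds; real models add softmax, multiple heads, and projection layers, so treat the numbers as illustrative, not as the cost of any actual GPT.

```python
def naive_attention_flops(n_tokens: int, d_model: int) -> int:
    """Rough multiply-add count for one vanilla self-attention layer."""
    # Scores: Q @ K^T compares every token with every other token,
    # building an n x n matrix -> n * n * d multiply-adds (the n^2 term).
    score_flops = n_tokens * n_tokens * d_model
    # Output: weights @ V is another n x n x d multiply-adds.
    output_flops = n_tokens * n_tokens * d_model
    return score_flops + output_flops

base = naive_attention_flops(1024, 512)
# Doubling the context quadruples the cost; 4x the context is 16x the cost.
assert naive_attention_flops(2048, 512) == 4 * base
assert naive_attention_flops(4096, 512) == 16 * base
```

That quadratic blow-up is exactly why longer-memory architectures are an active research target.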
🤿 Deep Dive: Can ChatGPT answer patient questions better than doctors?
Run a little thought experiment with me:
Premise 1: Your neighbor has a health concern, but they don’t know you well enough to ask you.
Premise 2: Your neighbor will post on the internet about their problem.
Question: Would you prefer your neighbor receive advice from a random internet doctor, or from chatGPT?
This week’s deep dive is on an article published Friday in JAMA Internal Medicine (hot off the presses here at Ctrl-Alt-Operate) that has gone pretty viral on Twitter. The paper is Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum by Ayers et al., led out of UCSD.
a.k.a. “chatGPT and MDs answer questions from patients - who does better?” The results suggested chatGPT robustly outperformed clinicians, which even got a shoutout from Greg Brockman (OpenAI President):
So… did chatGPT beat out MDs? 🤔
Sort of, but not in the way you’re thinking. The study took 200 questions from Reddit’s r/AskDocs, where users can submit anonymous healthcare questions and verified clinicians can answer them. Clinicians are verified through an internal mechanism, display their qualifications (Doctor, EMT, Nurse, etc.) in their bios, and likely represent an anonymous grab bag of physicians from around the world.
The study pitted chatGPT against the Reddit doctors on those 200 questions, which ranged from “What do I do if I swallowed a toothpick?” to “I have a lump on my penis.”
Sounds like a dinner party where all of a sudden someone shows you their new rash…
The group then compared chatGPT’s answers to the answers delivered on r/AskDocs by verified physicians, asking a group of nurses and doctors to evaluate which responses were better and to score them on Quality and Empathy.
ChatGPT was the preferred output in 79% of evaluations. ChatGPT outperformed doctors on Quality of response (3.6x higher prevalence of good or very good responses) and on Empathy (9.8x higher prevalence of empathetic or very empathetic responses).
The results from the abstract are pasted below:
Of the 195 questions and responses, evaluators preferred chatbot responses to physician responses in 78.6% (95% CI, 75.0%-81.8%) of the 585 evaluations. Mean (IQR) physician responses were significantly shorter than chatbot responses (52 [17-62] words vs 211 [168-245] words; t = 25.4; P < .001). Chatbot responses were rated of significantly higher quality than physician responses (t = 13.3; P < .001). The proportion of responses rated as good or very good quality (≥4), for instance, was higher for chatbot than physicians (chatbot: 78.5%, 95% CI, 72.3%-84.1%; physicians: 22.1%, 95% CI, 16.4%-28.2%). This amounted to 3.6 times higher prevalence of good or very good quality responses for the chatbot. Chatbot responses were also rated significantly more empathetic than physician responses (t = 18.9; P < .001). The proportion of responses rated empathetic or very empathetic (≥4) was higher for chatbot than for physicians (chatbot: 45.1%, 95% CI, 38.5%-51.8%; physicians: 4.6%, 95% CI, 2.1%-7.7%). This amounted to 9.8 times higher prevalence of empathetic or very empathetic responses for the chatbot.
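A quick sanity check on where the headline multipliers come from, using only the proportions quoted in the abstract: the prevalence ratio is just the chatbot’s rate of top-rated responses divided by the physicians’.

```python
# Proportions rated >=4 ("good or very good" / "empathetic or very empathetic"),
# taken directly from the abstract above.
quality_chatbot, quality_physician = 0.785, 0.221
empathy_chatbot, empathy_physician = 0.451, 0.046

# Prevalence ratios, rounded the way the paper reports them.
assert round(quality_chatbot / quality_physician, 1) == 3.6   # "3.6 times higher"
assert round(empathy_chatbot / empathy_physician, 1) == 9.8   # "9.8 times higher"
```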
Now, as with all things in life, the internet took this article by storm as a chance to 1) disparage physicians, 2) AI-doomsday and fear-monger, and 3) talk about how A.I. will replace every job.
First off, kudos to the team - getting discussions on A.I. x medicine out into the open is a commendable accomplishment, and one that our profession and society will grapple with in the coming years.
The study design was obviously limited: it used internet doctors, verified via an internal Reddit system and answering questions submitted to an anonymous forum, to simulate a clinician answering an in-basket message. This is of course not a real clinical scenario, and the results cannot be generalized as such. The authors clearly state this in their discussion (but who reads the paper, anyway?).
But their study design was smart. It allowed a pseudo-control group (internet doctors) without actually putting patients in the loop.
Why is this key? As we keep building these systems faster than we can write IRBs, we need to find a way to test A.I. systems without necessarily deploying them in real clinical scenarios.
Some questions to think about:
Given these results, would you be comfortable with your institution running a clinical trial comparing MD in-basket message responses with chatGPT’s?
What if another member of your team double checked its work?
What if I told you your team answered questions incorrectly 8% (made up number) of the time? How might that change your perspective?
What if it were used only as a triaging tool?
Some more takeaways:
It’s clear people are turning to the internet for healthcare advice. How do we make sure that advice is good? The solution cannot be “go see a doctor” when patients are increasingly dissatisfied with both their care and its cost.
People continue to discuss chatGPT’s tendency to hallucinate information (i.e., “sometimes wrong, never in doubt”). This is true, and guardrails must be put in place. However, I’ve yet to see an analysis of correct vs. incorrect information delivered by real care teams versus by chatGPT (either in a clinical scenario or a simulation). This is necessary.
If anything this study forces us to ask: how much of our advice is ignored by patients because responses lack empathy? Do your in-basket responses look more like chatGPT’s or the physician responding? How much better could we be if our responses were augmented by these capabilities?
If anyone is conducting research in this area or running a pseudo-trial, please let us know.
Best of Twitter (paused until mom and dad stop fighting)
We’re torn because we love both of our social media parents equally, but the formatting and maneuvering has sucked the joy out of this section. So I’m leaving it on perma-pause.
Feeling inspired? Drop us a line and let us know what you liked.
Like all surgeons, we are always looking to get better. Send us your M&M style roastings or favorable Press-Gainey ratings by email at email@example.com