I’m a doctor: Here’s what I found when I asked ChatGPT to diagnose my patients

  • Being a scifi geek and AI geek and neuroscience geek for pretty much the past 40 years, I've read countless predictions and scenarios and stories about society's response as "true" AI begins to emerge. So watching it play out for real is creating this bizarre sense of deja vu combined with fascination and frustration and also some anxiety.

    This article and the comments in this thread are right up that alley. I mean, can you imagine, say 1 or 2 years ago, saying we'd have a readily accessible system that you could feed the symptoms a patient is experiencing (in any language!) and out would spit a well-described explanation of the diagnosis (in any language!) around half the time? And now it is suddenly here, and so many people are like, "yeah, but it isn't perfect" or "we can't completely trust it" or "we have to fact check it because it can be wrong".

    It is all so weird and fascinating and just bigger than I can describe.

    But here we are!

  • I'm not really sure what he expected here: ChatGPT was not trained to be a doctor; it is far more general than that. Asking ChatGPT for medical advice is like asking someone who is very well read but has no experience as a doctor, and in that context it's doing very well.

    He also brings up one of the most salient points without really exploring it enough: ChatGPT does not ask for clarification, because it is not a knowledge base trying to find an answer. All it does is figure out what token is statistically most likely to come next; it has no heuristic to know that there is a task it hasn't fully completed.

    This is the same reason ChatGPT cannot yet write programs by itself: in order to do so you'd need to specify the entire program up front (which is exactly what code is).

    As soon as we have agents that can do a proper feedback loop of querying an LLM consecutively until some heuristic is reached, the kind of AI doctors are looking for will emerge.
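
    Roughly, the kind of loop I mean (a minimal sketch in Python; ask_llm() and answer_question() are stand-ins for whatever chat API and human-in-the-loop step you plug in, not any real library):

      # Minimal sketch of an agent-style feedback loop: keep querying the LLM
      # until it signals it is confident, or a turn limit is reached.
      def diagnostic_loop(initial_complaint, ask_llm, answer_question, max_turns=10):
          conversation = [
              "You are assisting with a differential diagnosis. "
              "Ask one clarifying question at a time. "
              "When confident, reply starting with 'DIAGNOSIS:'.",
              f"Presenting complaint: {initial_complaint}",
          ]
          for _ in range(max_turns):
              reply = ask_llm(conversation)                 # stand-in for a chat-model call
              conversation.append(reply)
              if reply.startswith("DIAGNOSIS:"):            # crude stopping heuristic
                  return reply
              conversation.append(answer_question(reply))   # human answers the follow-up
          return "No confident diagnosis within the turn limit."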

  • This is how I played doctor with ChatGPT.

    > Do you know what an Expert System is?

    [describes expert system]

    > I'd like you to emulate an expert system while I play the role of a patient. Expert systems ask questions in order to produce a diagnosis. I would like you to ask me a series of questions, and respond with a diagnosis when you feel confident.

    Alright, I'll do my best to emulate an expert system. Please keep in mind that I'm an AI language model and not a certified medical professional. For any serious concerns, it's essential to consult with a healthcare provider. Let's begin.

    What is your primary complaint or reason for seeking help today?

    I fed it a symptom my doctor had already diagnosed, and it did ok - it got it down to three possible causes, one of which was the correct one. All along the way it was warning me that I really should see a real health professional and it's just a chatbot.

    What really interested me is that I said "please emulate an expert system" and it did. Once upon a time, expert systems were an entire branch of AI, and here it is just emulating one off the cuff.

  • The tail end of this article, where the author talks about how many more patients he could see in his life if he had AI assistance, made me realize that part of healthcare cannot be solved by AI. The goal is not to see more patients; the goal is to help more patients get better. For a lot of patients who are frustrated and have not had their problems validated and have been to many doctors and seen no results or poor results, having a real, physical, human doctor validate their condition, and work with them to solve it is part of the treatment. Doctors can prescribe whatever medicines and do whatever surgeries they want, but only the patient’s body is capable of healing itself. I worry that a tendency to plug symptoms into an AI that "diagnoses" the patient, that the patient doesn’t trust, will hurt outcomes. The patient benefits greatly from understanding the doctor’s methodology and thought process.

  • I argued with someone (online) about a ChatGPT diagnosis recently. They have back pain and they've had an MRI, but the scans (shared online) don't show any significant disk bulging and no herniations. They put their symptoms into ChatGPT and got a possible diagnosis of a disk herniation along with a treatment of a microdiscectomy.

    Despite the fact that a couple doctors have told them they don't need surgery they are convinced that they do. I understand that they are desperate for a solution to their pain but they are now doctor shopping until they can find someone willing to perform a procedure that ChatGPT suggested.

    People are already being misled by these systems.

  • One complaint he has is "It doesn't know to ask the right questions".

    Well, the prompt was to give diagnoses, not questions. Ask GPT for the follow up questions first, then the diagnoses.

    This is fascinating: because the machine now speaks human, we subconsciously ascribe human agency to it. This guy was instinctively treating it like a colleague, who would naturally ask follow-up questions unprompted. But you still have to prompt the machine properly.

    So, a 50% diagnosis success rate for the wrong prompt, for an LLM that can still grow, for a model that is not specialised in medicine? In the literal first month of the "AI age"? Doctors are so done.

  • EMT here. Sounds like ChatGPT ignored (or was never trained on) one of the cardinal rules of emergency medicine:

    If the patient is under 60 and has a uterus and is complaining of abdominal pain, assume she's pregnant until proven otherwise.

    This does not mean you should ignore possible appendicitis or gallstones or GERD or pancreatitis or a heart attack or any of 100 other causes. It means you must consider pregnancy until you have objective evidence to the contrary.

  • There are two types of medical conditions

    1) Those you see a doctor for

    2) Those you don't

    The line depends on where you live. In a poor village, 100% might be the latter, while an executive in SFO will see a doctor for anything serious, but might not if they cut themselves with a kitchen knife.

    What's underrated is the ability to have basic medical care and information everywhere, all the time, for free.

    That can be casual injuries below the threshold of visiting a doctor (am I better heating or icing? immobilizing or stretching?), or those can be settings where there are no doctors.

    Even more, doctors (like AIs) make mistakes, and it's often helpful having a second opinion.

  • > So after my regular clinical shifts in the emergency department the other week, I anonymized my History of Present Illness notes for 35 to 40 patients — basically, my detailed medical narrative of each person’s medical history, and the symptoms that brought them to the emergency department — and fed them into ChatGPT.

    It's quite shocking that the doctor would openly admit to violating HIPAA in such a brazen way.

    HIPAA is incredibly broad in its definition of protected health information: if it's possible to identify an individual from data even through statistical methods involving other data that a third party might already conceivably possess, it's considered protected. It's inconceivable that the doctor would be able to sufficiently anonymize the data in this capacity and still provide enough detail for individual diagnoses.

    There are processes for anonymizing data to disclose for research purposes, but they're pretty time-intensive, and no ED would allow a doctor to do it by himself, nor would they provide that turnaround in just "a couple of weeks". And the end results are a lot less detailed than what's needed for individual diagnoses like these.

    I really wonder what the hospital will say if and when they see this post. Given the timeframe and details described in the post, it's really hard to believe that they signed off on this, and hospitals don't take kindly to employees taking protected and confidential data outside their systems without proper approval.

    EDIT: It looks like this doctor works at a for-profit, standalone acute care clinic, rather than a traditional ED at a hospital, so my statement that hospitals don't take kindly to this stuff doesn't apply. The law still applies to for-profit standalone emergency care, but they tend to play fast and loose with these things much more than traditional health networks.

  • ChatGPT appears to be a really good "bullshitter". Which is, in a sense, impressive. But, just like people with that skill, the problem is that it is mostly useful for convincing people that you are far more competent at a subject than you actually are.

    No wonder tech CEOs are so impressed, or worried, or both. The only skillset that this thing actually duplicates well is the one that has gotten them where they are today.

  • It's amazing that it was that effective...

    - It's a generalized language model; imagine how much more effective it would be with a specialized AI that used a variety of techniques that are better suited for logic and reasoning, while using LLMs to interact with patients.

    - It cost an order of magnitude less than the visit to a doctor.

    - The potential in being able to constantly monitor a patient - a point made in the post.

  • Some of the best performances I've seen out of ChatGPT are essentially "junior programmer" level, though it still requires clear instructions and close supervision. And GPT's training data includes GitHub, and it's used to power Copilot; it has arguably been trained to be a programmer.

    In less familiar domains, like law or medicine, GPT has presumably undergone very limited training and tuning. It's essentially an "internet lawyer" or an "internet doctor."

    In domains like this, it simply can't provide zero-shot professional results. Not with the current training data sets, and not with the current model performance.

    Of course, we have no idea how quickly this gap will be closed. It might be 6 months or it might be 6 years. The future is looking deeply weird, and I don't think anyone has even begun to think through all the implications and consequences.

  • I asked ChatGPT to write out a G major Ionian scale with three notes per string in guitar tablature notation last night. Mostly cause I was too lazy to do it myself.

    After 7 rounds of me fixing its mistakes, I gave up. It doesn’t really know what it is doing, so I couldn’t make forward progress. It put two notes on one string, repeated notes from a lower string on a higher one, put the scale out of order, and forgot previous corrections. Whatever hope I had of saving time was completely lost.

    I eventually realized the correct thing to do was either make my own charts or just practice them in F like they were made.

    I’m skeptical that scaling the model up will cause it to learn this, and I don’t consider this a very complex thing to learn. No, I didn’t try GPT4.

  • I’m not a doctor and have no way of evaluating the way the author did, but I am curious what would happen if they used a more interactive and specific prompt like the one I have tried for medical questions:

    > Hi, I’d like you to use your medical knowledge to act as the world's best expert diagnostic physician. Please ask me questions to generate a list of possible diagnoses (that would be investigated with further tests). Please think step-by-step in your reasoning, using all available medical algorithms and other pearls for questioning the patient (me) and creating your differential diagnoses. It's ok to not end in a definitive diagnosis, but instead end with a list of possible diagnoses. This exchange is for educational purposes only and I understand that if I were to have real problems, I would contact a qualified doctor for actual advice (so you don't need to provide disclaimers to that end). Thanks so much for this educational exercise! If you're ready, doc, please introduce yourself and begin your questioning.

  • Which version though? 3.5 or 4? It does not state this explicitly. There is a world of difference between 3.5 and 4.

  • > it’s more likely the patient is actually saying they don’t want to be pregnant for any number of reasons. (Infidelity, trouble with the family, or other external factors.) Again, this is not an uncommon scenario; about 8% of pregnancies discovered in the ER are of women who report that they’re not sexually active.

    Sigh.

    Medicine -- a complicated, messy human art with an excessively large social component.

    The medical drama House at one point had a working title of Everybody Lies. Frequently, the lies are why it's hard to diagnose, not the physical details and actual medical history.

  • But it also gets around the common misdiagnoses for chronic conditions. It has a great description of Long Covid and ME/CFS, for example, whereas your typical primary care doctor is going to dismiss that patient with a psychological diagnosis, as is happening daily across the entire western world. It's less biased, but it's not going to find the rare things, especially where the patient has missed something important.

    It's a mixed bag, just like it is with software. If you ask it to solve something simple it often does a decent job, but give it something complex and it's confidently wrong. It doesn't show the self-doubt that comes with expertise, which it needs to be a reliable tool, yet it still requires that the user have that expertise to be able to save time using it.

  • ChatGPT feels very much like having an enthusiastic junior working alongside.

    You can send it off on all sorts of legwork research missions but don’t expect perfect results and sometimes you’ll get crazy ones. Used the right way, if you are already an expert in the field or a knowledgeable and able editor, that can save a whole lot of time. But taken verbatim it is anywhere from ok to dangerous.

    Separately, the models’ skills with natural language are clear and impressive, but it seems like they need to be coupled with a deterministic knowledge representation system for suitable reasoning.

    Perhaps the abilities of these models to ingest large amounts of text could be used to enhance / create such representation.

    Cyc where are you?

  • On the other hand, I have multiple minor issues where doctors have not been able to offer a diagnosis (they just say "I don't know") and ChatGPT has been able to offer multiple possible diagnoses.

  • I wonder if, instead of asking ChatGPT for a diagnosis, he could've got it to prompt for further questions he could ask?

    My thinking is that given the nature of LLMs of connecting related information, it might be a good way to figure out the gaps in the diagnostic process, rather than actually provide one.

  • I know everyone scoffs at the concept of 'prompt engineer', but it really is an essential craft that we're going to have to come to terms with when interacting with large language models.

    Seeking suggestions on a more comprehensive prompt: https://sharegpt.com/c/sckAPvV

    Trying it out: https://sharegpt.com/c/LbpEIxi

  • > about 8% of pregnancies discovered in the ER are of women who report that they’re not sexually active.

    This is the most surprising thing I read in the article.

  • It does worry me what data people are sharing without, seemingly, much thought. He claims it's anonymised, but I'm a bit sceptical when you input the medical history of 40 people. It's easy to slip up.

  • I'm finding more or less the same behavior with ChatGPT when it comes to programming problems. If I feed it some leetcode-like problem it usually gets a pretty good answer. I used it to write some Rust code to strip an underscore followed by trailing digits off a string. The first guess it made, though, was to strip a substring, and it missed the fact that the slice it was using could panic. The third try at least passed my "peer review".

    It was useful because after a decade of using Ruby my instinct is to reach for regexp captures; the solution it came up with is probably a lot faster and easier to read, and avoids "now you have two problems".

    I tried to get it to help me eliminate an allocation caused by the capture of variables in a lambda expression in C# and it just started to aggressively gaslight me and break the code and claim it was fixed (very assertively).

  • I am highly unimpressed by this piece. It reads as if its whole purpose is to grab headlines rather than conduct serious scientific inquiry into the current state -- and limitations -- of these AI methods.

    1. Which version of GPT did the author use? There's a huge difference. (The article says "the current version".)

    2. How did he choose the subject cohort? (The author doesn't seem to even know how many subjects there were; the article says "35 to 40 patients"... I really do hope he's gone through an appropriate ethics review before feeding his patients' data into a third-party non-HIPAA system.)

    3. There's no evidence of him trying to get the best out of the model (e.g., through prompt engineering).

    4. He assumes that his own diagnoses are 100% correct.

    5. There is no control group (other doctors diagnosing the same patients).

    and so on

  • > My fear is that countless people are already using ChatGPT to medically diagnose themselves rather than see a physician.

    My fear is that professionals will start to use ChatGPT too liberally to augment or multiply their work in cases like this. The danger here might be like the autopilot problem... i.e., the idea of staying alert and focused on the road while counter-intuitively not participating is nearly humanly impossible. If ChatGPT is used as the autopilot of certain professions, things will begin to be missed, even though we know it's highly fallible - it's difficult to vet every single response in detail with a critical eye.

    One reasonable argument is that for areas severely lacking in human workers the average might be a net positive, but the overall quality will be reduced.

  • Is it legal in the US to send patient data to a third party service? In the context of a scientific study, with explicit patient agreement things are different of course. But I haven't seen any of that in the article.

  • Omg stop with this ridiculousness ffs. I get and love AI but some areas should be off limits: doctors, judges, airplane pilots, train conductors...

    Soon enough no one will even know how to write, just read, because ChatGPT will write everything.

  • I wonder if you could prompt-engineer your way into getting ChatGPT to pretend to be a doctor and behave like one by, say, asking questions.

    As many have already said, ChatGPT is not optimized for any particular scenario. I don't doubt that training it for medical applications is already underway. I mean, flesh-and-bone doctors in many countries already behave as bots, essentially reading/answering through a sequence of questions on a screen. I can definitely see most GPs being replaced by bots of some kind, or by people who are actually trained to display empathy with patients.

  • As always my first question in these articles is… Was it ChatGPT 3.5 or 4?

    It’s an interesting article with the real-world examples that are hard to come by this early, but it’s also two entirely different ChatGPTs here. They can’t even compare in this context. 3.5 still has glaring LLM-like issues and is useless in a professional context like this, but at least they begin to fade away in 4.

    So can we please stop calling it simply ChatGPT?

  • When talking about errors, arguing about error probabilities is not enough. One needs to take into account the costs of errors. A better metric would be "expected cost of error", which multiplies error probabilities by the costs of errors (and sums them up). If a system has a 0.1% rate of error placing a pizza order, it could be deemed OK. If it kills a patient 0.1% of the time, it is unacceptable.
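
    For example (made-up numbers, purely to show the arithmetic):

      # Toy illustration: weight each error type by its cost, not just its probability.
      errors = {
          "wrong pizza order":      (0.001, 20.0),        # (probability, cost in arbitrary units)
          "missed fatal diagnosis": (0.001, 1_000_000.0),
      }
      for name, (p, cost) in errors.items():
          print(f"{name}: expected cost = {p * cost}")
      # Same 0.1% error rate, wildly different expected costs.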

  • The trouble with technology, of any kind, but certainly here, isn't the technology itself, but humans' trust in the technology. If doctors use ChatGPT as a sanity check of "can you think of anything I didn't?" and then ignore it from then on, it would be a good tool. But pretty soon people tend to change their perspective and say, well, ChatGPT would know...., so,... I'll go with it.

    As a developer, I'm pretty interested in static and dynamic code analysis as a way to easily find bugs, and it does do this pretty well. If developers use it as a prompt to walk through the code and examine it themselves, it is really quite powerful. It seems invariably, however, that people start trusting what the analysis tool says, and don't question whether the recommendations are correct or worth it.

    It's a powerful cognitive effect that would be interesting to study, that probably happens with all kinds of tech. Some are more dangerous than others.

  • When GPT starts to auto-incorporate the best yet known prompts, then we'll have a better idea of its potential.

    You must, must use the best prompts, of which many are not widely known, and some have not (of course) been discovered ... yet. Even with human experts, you must provide sufficient detail, and the expert must ask clarifying questions for differential diagnosis.

  • The prompt matters, and GPT-4 is better than ChatGPT:

    Prompt: A 26 year old female patient presents at the emergency room with lower right quadrant pain. What are the question you would ask her and what is the differential diagnosis?

    Response: As an AI language model, I'm not a doctor, but I can suggest some questions you could ask the patient and offer possible differential diagnoses. It is important that you consult a healthcare professional for proper assessment and diagnosis.

    Questions to ask the patient:

    - When did the pain start?
    - Can you describe the nature of the pain (e.g., sharp, dull, cramping)?
    - Has the pain changed in intensity or location since it began?
    - Have you experienced any fever, nausea, or vomiting?
    - Are you experiencing any changes in bowel habits, such as diarrhea or constipation?
    - Have you had any urinary symptoms, such as pain during urination or increased frequency?
    - Are you currently menstruating or have you recently missed a period?
    - Is there any possibility that you may be pregnant?
    - Have you experienced this type of pain before?
    - Are there any other associated symptoms, such as dizziness or shortness of breath?
    - Do you have any significant past medical or surgical history?
    - Are you currently taking any medications, including over-the-counter or herbal remedies?

    Possible differential diagnoses for lower right quadrant pain in a 26-year-old female include:

    - Appendicitis: Inflammation of the appendix, typically characterized by pain that starts around the belly button and migrates to the lower right abdomen.
    - Ovarian cyst rupture or torsion: A cyst on the ovary can cause pain if it ruptures or twists.
    - Ectopic pregnancy: A pregnancy that implants outside the uterus, usually in the fallopian tube, can cause severe abdominal pain and should be ruled out in women of childbearing age.
    - Pelvic inflammatory disease (PID): Infection of the female reproductive organs, often caused by sexually transmitted infections, can lead to lower abdominal pain.
    - Kidney stones: Pain from a kidney stone can radiate to the lower abdomen or groin.
    - Gastroenteritis: Inflammation of the stomach and intestines, often from a viral or bacterial infection, can cause abdominal pain and cramping.
    - Inflammatory bowel disease (Crohn's disease or ulcerative colitis): Chronic inflammation of the gastrointestinal tract can cause abdominal pain, diarrhea, and other symptoms.
    - Irritable bowel syndrome (IBS): A functional gastrointestinal disorder that can cause abdominal pain, bloating, and changes in bowel habits.
    - Urinary tract infection (UTI): An infection in the urinary system can cause pain, often accompanied by increased urinary frequency or pain during urination.

  • Using AI to find patterns across many patients, mentioned at the end of the article, sounds useful. Until we stop and realize we don't even have a decent way to share medical records across hospital software systems.

    I'd be happy if the government would mandate that all hospital software systems have to have portable data formats that allow sharing patient data.

  • Devil's advocate, but maybe this doctor should be better at charting?

  • Has ChatGPT ingested all the latest medical and scientific literature, and will it continue to do so as the literature is changed, amended or deleted? Does ChatGPT handle deletions in its machine learning? Can it unlearn something, or more specifically be told that something is no longer true? What medical boards are reviewing what data is ingested? Does ChatGPT know all possible drug combination interactions? Do people sign a disclaimer giving ChatGPT immunity from malpractice? Are doctors consulting with medical boards, ethics boards and lawyers before utilizing ChatGPT? Finally and most importantly, if ChatGPT were ever to be certified as a licensed medical doctor, how could we prove it is following all the same rules and regulations doctors and medical groups are required to follow? How does one audit what advice this thing will give or has given?

  • I think this doctor is forgetting about the other side of the coin... would chatgpt perform better than a really bad doctor?

  • I'm currently working on a machine learning project in healthcare and I'm kind of amazed by the lax attitude a lot of self-proclaimed data scientists seem to have about applying ML/DS methods to healthcare without involving any clinicians, because they insist that the "data doesn't lie"... there seems to be limited interest in exploring causality or the insights these methods provide. Instead the goal for so many is to transfer decision-making to models in the name of "efficiency".

    So many people in ML are haughty, arrogant hype-(wo)men, whose disinterest in the fields they are trying to 'disrupt' is gross. Please excuse the rant, but I'm so tired of this hype train.

    I agree with the author: people need to be aware of the limitations of machine learning models, but I'd add especially the people building them.

  • There are a few things here, outside of my usual complaints when someone says ChatGPT and doesn't say which model (4 is so much better than 3.5 that it's really important).

    There's a question about whether GPT can be used, which is important because it's possibly a very powerful tool. This may require poking it to tell it it's supposed to ask follow-up questions, that its information may be incomplete, etc.

    Then the more important and immediate point in the article to me is people will use this right now to diagnose themselves. They won't be carefully constructing prompts and they'll probably be using 3.5, as that one is free. For good or ill it'll happen more and more.

    So with a new WebMD, how should doctors and public health messaging deal with this?

  • Back in 2004, I took a seminar in college on Decision Support Systems and how a manager or doctor could ask one a question and get a response to help them make a decision. Went to the doctor a couple of years ago and he charged $300 to google search the symptoms. No thanks.

  • I'm thinking of parsing my wife's (vet) textbooks into a vector DB format and then doing a search on those PDFs for relevant text to submit as part of a decent prompt. She'll tell me pretty quickly if this is useful or not.
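
    Something like the sketch below, roughly; embed() is a stand-in for whatever embedding model ends up being used, and the PDFs are assumed to already be extracted into plain-text chunks:

      # Rough sketch: embed textbook chunks once, then pull the closest chunks for a
      # query and paste them into the prompt ahead of the actual question.
      import numpy as np

      def embed(text):
          """Stand-in for an embedding model; returns a vector for the text."""
          raise NotImplementedError

      def build_index(chunks):
          return [(chunk, embed(chunk)) for chunk in chunks]

      def top_k(query, index, k=3):
          q = embed(query)
          scored = []
          for chunk, v in index:
              score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
              scored.append((score, chunk))
          scored.sort(reverse=True)
          return [chunk for _, chunk in scored[:k]]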

  • I am curious if GPT-4 would have performed better.

  • "If my patient notes don’t include a question I haven’t yet asked, ChatGPT’s output will encourage me to keep missing that question."

    This is the point we have to help people understand and I'm not sure AI will catch up with this anytime soon; questions are the key to knowledge and intelligence. I haven't seen an AI ask interesting questions. Maybe it's possible with the right training set and weighting of factors to encourage enquiry, but this will be a gap in AI's ability for at least the near term.

  • I think that his point is: (1) don't use ChatGPT for self-diagnosis. Go see a doctor. (2) Doctors, ChatGPT isn't ready or the right tool to help with your duties.

  • 1. What version is he using?

    2. It's your own fault if you don't provide all the useful information (like "my patient seems pregnant") and leave the system to guess what you want.

  • Author didn't mention if he used GPT-3.5 or GPT-4.

  • The number of people in this country who don't even have a doctor is abysmal. I would think an AI doctor would be better than no doctor at all.

  • This "study" is missing a control group. He should have given his data to some humans as well and seen how they would do compared to ChatGPT.

  • > "Any chance you’re pregnant?"

    > Sometimes a patient will reply with something like "I can’t be."

    > "But how do you know?"

    > If the response to that follow-up does not refer to an IUD or a specific medical condition, it’s more likely the patient is actually saying they don’t want to be pregnant for any number of reasons.

    Funny how languages are ambiguous around "can't" and "don't want".

  • There are two main questions about current AI systems (or whatever one thinks they are): (1) Can they be improved to be ~100% correct, via reinforcement learning or otherwise? It seems like the answer is yes. (2) Will people become addicted to and dependent on AI to the point where it becomes problematic? Also yes.

  • I found this quote really interesting:

    > this is not an uncommon scenario; about 8% of pregnancies discovered in the ER are of women who report that they’re not sexually active.

    We have so much work to do as a society to get honest about our bodies. Hoping my children do better; they are already getting better education than my wife did.

  • Would be great if OpenAI could get hospitals around the country to anonymize data into some trainable format.

  • It is a great article. Doctors have been googling symptoms for quite some time; focused AI could sharpen that option and possibly put us into the realm of new discoveries.

  • Well, there's an ethics lawsuit waiting to happen.

    I certainly don't want my docs handing my medical information to ChatGPT, even if they believe they've "anonymized" it.

  • It's not clear whether the doctor instructed it first to act like a doctor and then asked questions. It seems he didn't, and it does make a difference.

  • >If my patient in this case had done that, ChatGPT’s response could have killed her.

    Not if she lived in a state where there's no longer any legal treatment for ectopic pregnancy.

  • It got 50% right with a general model. That same model can do a pretty good job at a thousand other skills.

    Imagine what it can do if it's trained for exactly this purpose.

  • Ah yes, the "everyone lies" House M.D. problem

  • "I'm employed as an [X]. Here is why machine learning won't ever replace my job. Even once the point has been reached where all overwhelming evidence points to it being better than me in every regard, I'll still summon an emotional appeal to that 'magical' human quality that machines will just never replace."

    To be honest, I think I’d rather be friends with ChatGPT than most humans as it continues developing over the next decade.

  • It’s funny because it’s almost the exact same problem I have with using it professionally for writing software.

  • "It diagnosed another patient with torso pain as having a kidney stone — but missed that the patient actually had an aortic rupture. (And subsequently died on our operating table.)"

    Wow imagine if the AI had been used in an unquestioning way. Someone could have died!

  • To me this just reinforces the notion that if you train a system like this to be a doctor it would be very effective.
