A 44-year-old man sends an online consultation. He describes three weeks of fatigue, unintentional weight loss, and drenching night sweats. He says he has been feeling generally run down and wonders if he needs a blood test.
You are running behind. You have twenty online consultations to process before lunch. For a moment, you think about typing the symptoms into ChatGPT to get a quick differential.
Do not.
That scenario, right there, is the line between using AI safely and using it dangerously. And in this lesson, I want to make that line as clear as possible.
The amber zone
Amber means proceed with extreme caution. These are tasks where AI might offer some value, but the stakes are higher and the safeguards need to be stronger.
Let me give you a concrete example.
You saw a patient this morning with an unusual combination of symptoms. Joint pain, mouth ulcers, and a photosensitive rash. You are thinking connective tissue disease, but you want to make sure you are not missing an important differential.
You type into ChatGPT, without any patient identifiers: “A 40-year-old female presents with joint pain affecting small and large joints, recurrent oral ulcers, and a photosensitive rash on the face and arms. What are the key differentials to consider?”
The AI suggests systemic lupus erythematosus, Behçet’s disease, reactive arthritis, dermatomyositis, and drug reaction. It prompts you to consider antinuclear antibodies, complement levels, and a dermatology referral.
This is the amber zone. You used AI as a thinking partner — the way you might discuss a case with a colleague over coffee. You described the clinical picture in general terms, without identifying the patient. And you treated the output as a list of possibilities to consider, not a diagnosis to act on.
Two things made this amber rather than green. First, the output could influence a clinical decision. Second, if the AI missed a key differential that you did not think of independently, a diagnosis could be delayed.
The safeguard? You still do your own clinical reasoning. You check the differentials against your knowledge. You investigate appropriately. The AI is a prompt to your thinking, not a replacement for it.
Here is another amber example. You need to write a referral letter, but you do not want to paste anything from the clinical record into the AI. So you type the clinical details from memory: “40-year-old female, six-month history of inflammatory joint pain and photosensitive rash, ANA positive, requesting rheumatology assessment for possible lupus.”
The AI produces a well-structured referral letter. But when you read it, you notice it has added a sentence about complement levels being low. You did not say that. The AI invented a plausible clinical detail.
This is why amber is amber. AI will sometimes add things that sound right but are not true. If you did not catch that invented detail, you would have sent a referral containing information that was not in the patient’s record. In an amber task, you must read every word and verify every fact.
The red zone
Red means stop. Do not use AI for this task. The risk is too high, the safeguards are insufficient, and the consequences of getting it wrong directly affect patient safety.
Let me go back to our opening scenario. A man with fatigue, weight loss, and night sweats. If you type those symptoms into ChatGPT, it might tell you the most common causes are viral illness, iron deficiency, and stress. All plausible. All reassuring.
But you are a GP. You know that those three symptoms together are a red flag for lymphoma. You know to ask about lumps, about itching, about alcohol-related pain. You know to examine for lymphadenopathy and hepatosplenomegaly. And you know that when you bring him in, the way he says he feels “run down”, the slight hesitation, the way his wife sits a little too upright in the chair beside him will all inform your clinical judgement.
AI processes text. You assess people. That is not the same thing, and it is not a small difference.
Here is another red zone scenario. You are prescribing for a patient with chronic kidney disease and heart failure. They are already on ramipril, bisoprolol, furosemide, and spironolactone. You wonder about adding dapagliflozin. Could you ask ChatGPT whether it is safe?
You could. And it would probably give you a reasonable-sounding answer. But reasonable-sounding is not the same as safe. Drug interactions, contraindications, renal dosing adjustments, hyperkalaemia risk with the combination of ramipril and spironolactone — these require precision. They require the patient’s actual blood results, their actual fluid status, their actual clinical picture.
AI does not have those. And even if you gave it those details, the hallucination rate means you cannot trust the answer without checking it yourself anyway. At that point, you might as well check the BNF directly.
And one more. A patient contacts the surgery describing sudden-onset chest pain radiating to the left arm. This is not a moment for AI. This is a moment for clinical assessment, examination, and an emergency pathway. Red flag symptoms require a clinician, not an algorithm.
The test that always works
If you are ever unsure whether a task is green, amber, or red, ask yourself one question:
If something went wrong, would I be comfortable explaining to the GMC exactly what I did and why?
I used AI to draft a practice protocol, then verified it against NICE. The GMC would have no issue with that. Green.
I described a clinical picture to AI in anonymised terms to help me think through differentials, then did my own clinical reasoning. The GMC would want to know that you verified the output independently, but the approach is defensible. Amber.
I pasted a patient’s symptoms into ChatGPT and followed its diagnostic suggestion. The GMC would have serious questions about that. Red.
The GMC holds you responsible for the decisions you make, regardless of what tools you used to inform them. AI does not share your professional accountability. You cannot delegate clinical judgement to a word prediction machine.
The principle
A stethoscope is invaluable for a chest examination. It is useless for diagnosing depression. You do not throw it away because it cannot do everything. You use it for what it is good at.
AI is the same. Powerful for the right tasks. Dangerous for the wrong ones. And your job, as a clinician who now understands how this technology works, is to know the difference.
In the next lesson, we are going to shift gears. You know what you can and cannot use AI for. Now let’s talk about how to get good results from it — because the quality of what you get out depends enormously on how you ask.
Key Takeaway
Use the GMC test: if something went wrong, could you explain what you did and why? Green tasks are defensible. Amber tasks need extreme caution and verification. Red tasks — clinical decisions, patient data, emergencies — are off-limits for AI.