Module 1: Getting Started
Lesson 2 of 5 · ~7 min read

How Language Models Actually Work

What ChatGPT is really doing when it talks to you


“The patient presented with a three-day history of…” What comes next? Your brain just predicted several plausible words. That is essentially what ChatGPT does.

At its core, ChatGPT is a word prediction machine. An extraordinarily powerful one, trained on more text than any human could read in a thousand lifetimes. But still, fundamentally, a word prediction machine.

The world’s most well-read registrar

Imagine you had a GP registrar who had read every medical textbook ever published. Every journal article. Every clinical guideline from every country. Every patient forum. Every health website. Every Wikipedia article. Every novel, newspaper, and blog post on the internet.

This registrar has an astonishing memory for patterns. They can write fluently about any medical topic. They can match the style of any journal, any guideline, any patient leaflet. They sound incredibly knowledgeable.

But here’s the catch. This registrar has never seen a patient. They’ve never felt the resistance of a rigid abdomen under their hands. They’ve never watched a patient’s face when you deliver bad news. They’ve never smelled the ketotic breath of diabetic ketoacidosis.

They learned medicine entirely from text. They know what words tend to appear near other words. They know that “chest pain” is often followed by words like “troponin,” “ECG,” and “referral.” But they don’t understand what any of those things actually are.

And critically — this registrar is completely incapable of saying “I don’t know.”

That is ChatGPT. That is Claude. That is Gemini. That is every large language model.

How the prediction works

Here’s a simplified version of what happens when you ask ChatGPT a question.

You type: “What are the side effects of ramipril?”

The system looks at those words. Based on the billions of documents it was trained on, it calculates which word is most likely to come next. It might predict “common.” Then after “common,” it predicts “side.” Then “effects.” Then “include.” Then “dry.” Then “cough.”

Word by word, it builds a response. Each word is chosen based on probability — what word is most likely to follow, given everything that came before?

This is why the same question can produce different answers. There’s a small element of randomness in the selection process. Think of it like asking three different registrars the same question. They’ve all read the same textbooks. They’ll give you broadly similar answers. But the phrasing will be different each time.
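For readers who like to see the mechanics, here is a deliberately toy sketch of that loop in Python. The probability table is invented for illustration — a real model computes these probabilities from billions of learned parameters — but the word-by-word sampling, and the randomness that makes repeated answers differ, work on the same principle.

```python
import random

# Toy illustration (NOT a real language model): a hand-made table of
# next-word probabilities, standing in for what a trained model computes.
NEXT_WORD_PROBS = {
    "side":    {"effects": 1.0},
    "effects": {"include": 0.7, "of": 0.3},
    "include": {"dry": 0.5, "dizziness": 0.3, "cough": 0.2},
    "dry":     {"cough": 0.9, "mouth": 0.1},
}

def generate(start, steps, seed=None):
    """Build a phrase word by word, sampling each next word by probability."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(steps):
        options = NEXT_WORD_PROBS.get(words[-1])
        if not options:
            break
        # Weighted random choice: this sampling step is why the same
        # prompt can produce differently worded answers each time.
        choices, weights = zip(*options.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("side", 4, seed=1))
print(generate("side", 4, seed=2))
```

Run it twice with different seeds and you get broadly similar but not identical phrases — the three-registrars effect in miniature.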

Where it learned

These models were trained on enormous amounts of text — hundreds of billions of words. Books, websites, research papers, clinical guidelines, forum discussions, news articles. Essentially a snapshot of the written internet, plus digitised books and academic literature.

During training, the model read all of this text and learned patterns. It learned that the word “penicillin” often appears near “allergy.” That “metformin” appears near “type 2 diabetes.” That “NICE” appears near “guideline” and “recommendation.”

But — and this is critical — it learned these as patterns of language, not as facts about the world. It doesn’t have a database of drug interactions that it checks. It generates text that looks like the kind of text that discusses drug interactions.

Most of the time, those patterns are accurate. But sometimes the patterns lead it astray.
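The “patterns, not facts” point can be made concrete with a hugely simplified sketch of training: just counting which word follows which in a scrap of text. Real models learn vastly richer statistics than this, but the principle is the same — the result is a table of language regularities, not a database of medical facts that gets checked.

```python
from collections import Counter, defaultdict

# Hugely simplified "training": count which word follows which in a
# scrap of text. The model's "knowledge" of penicillin is nothing more
# than the statistics of the words that appear near it.
training_text = (
    "penicillin allergy is common . "
    "metformin treats type 2 diabetes . "
    "penicillin allergy must be documented ."
)

follows = defaultdict(Counter)
words = training_text.split()
for current, nxt in zip(words, words[1:]):
    follows[current][nxt] += 1

# What it "learned" about penicillin is just an adjacency count.
print(follows["penicillin"])                        # Counter({'allergy': 2})
print(follows["penicillin"].most_common(1)[0][0])   # allergy
```

Nothing in that table knows what an allergy is, or whether the association is true — it only knows that the words co-occurred in the training text.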

Why it gets things wrong

This is where the term “hallucination” comes in. What’s actually happening is simpler than it sounds: the model generates text that follows plausible patterns, but the patterns don’t always lead to truth.

A real example: a language model was asked about a specific drug interaction. The explanation it gave was beautifully written. It cited a mechanism of action. It referenced pharmacokinetics. It sounded exactly like something you’d read in the BNF.

It was completely wrong. The interaction it described doesn’t exist. But the language patterns made it look perfect.

Studies of medical question-answering have found that roughly 8–20% of responses contain inaccuracies. The terrifying part? The wrong answers sound exactly as confident as the right ones.

The family of language models

You’ll hear many names: ChatGPT from OpenAI, Claude from Anthropic, Gemini from Google, Copilot from Microsoft (which is built on the same underlying technology as ChatGPT).

They have different strengths. Some are better at reasoning. Some are better at following instructions. Some are faster.

But at their core, they’re all doing the same thing — predicting the next word based on patterns learned from text. They’re all word prediction engines with the same fundamental limitations.

What this means for your practice

You now understand something that most people — including many people making policy decisions about AI in the NHS — don’t fully grasp.

These tools don’t “know” things. They generate text that resembles knowing things. That’s an enormous difference.

It doesn’t mean they’re useless. Far from it. A registrar who has read everything ever written about medicine is still remarkably useful — as long as you verify their work.

And that’s the key principle: use these tools as a starting point, never as a final answer. Treat them the way you would treat a confident but inexperienced colleague. Listen to what they say. Then check.

Key Takeaway

Language models are word prediction machines — extraordinarily well-read but with zero clinical experience and no ability to say “I don’t know.” Use them as a starting point, never as a final answer.