The patient is talking. You are listening, examining, thinking. Your phone is on the desk, screen off, microphone on. At the end of the consultation, there is a structured note waiting for you to review.
That is the promise. And it is, broadly speaking, how it works. But the gap between “broadly speaking” and “in clinical detail” is exactly where the risks live.
In this lesson, I want to walk you through what happens at each stage of the ambient scribing process — from the moment you press record to the moment a note lands in the clinical system. Because once you understand the mechanics, you will understand why certain types of errors happen and what you need to check.
Step 1: Activation
The process begins when you start the recording. Depending on the tool, this might be pressing a button on your screen, tapping an icon on your phone, or activating a feature within your clinical system.
Some tools are integrated into the clinical system itself — a button within EMIS or SystmOne that starts listening. Others are standalone applications running on a phone, tablet, or separate device on the desk. The integration matters because it affects how the finished note gets into the patient’s record.
At this point, something important is happening. You are recording a clinical consultation. The patient needs to know. NHS England’s guidance is clear: patients should be informed that AI-assisted documentation is being used, and they should have the opportunity to opt out.
Most practices using ambient scribing have signage in the waiting room and the consultation room explaining that AI documentation may be used. Many clinicians also mention it verbally at the start of the consultation: “I use an AI tool to help me write my notes — is that all right with you?” If the patient declines, you turn it off and document manually. It is the patient’s choice.
Step 2: Recording
While the consultation is happening, the tool is capturing audio. The quality of this step determines the quality of everything that follows.
Several factors affect recording quality. Background noise — a fan, traffic, conversation from the next room. Distance from the microphone — if the patient speaks quietly or turns away. Multiple speakers — if a family member is present, or an interpreter. Accents and dialects — speech recognition models are trained predominantly on standard accents and may struggle with strong regional accents or speakers for whom English is an additional language.
The tool records everything that is said. Not just the clinical content. The social pleasantries at the start. The tangential story about the patient’s daughter. The moment you apologise for the phone ringing. All of it goes in.
Importantly, the tool records only what is said. It does not see the patient. It does not know that you examined the chest. It does not know that you noticed the patient was limping as they walked in. It does not capture the examination findings you observed with your eyes and hands but did not speak aloud.
Step 3: Transcription
Once the consultation ends — or continuously, in tools that transcribe while the consultation is still running — the audio is converted to text. This is the speech-to-text step, a well-established technology that predates the current generation of AI language models.
The transcription engine converts the audio stream into a written record of everything that was said. It attempts to identify different speakers — clinician versus patient — and label them accordingly. This is called speaker diarisation, and it is not always accurate, particularly when speakers talk over each other or when there are more than two people in the room.
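If you are curious what that speaker labelling looks like in practice, here is a minimal Python sketch of one common approach: give each transcript segment the label of whichever diarised speaker turn overlaps it most. The timestamps, speaker labels, and overlap rule are illustrative assumptions for teaching, not any vendor's actual pipeline.

```python
# Minimal sketch: merging a speech-to-text transcript with diarisation
# output. All data and the overlap rule are illustrative assumptions.

# Transcript segments from speech-to-text: (start_sec, end_sec, text)
transcript = [
    (0.0, 3.2, "What can I do for you today?"),
    (3.4, 9.8, "I've had this cough for about three weeks now."),
    (9.9, 12.5, "Any blood when you cough?"),
]

# Speaker turns from a diarisation model: (start_sec, end_sec, label)
turns = [
    (0.0, 3.3, "SPEAKER_00"),
    (3.3, 9.9, "SPEAKER_01"),
    (9.9, 12.6, "SPEAKER_00"),
]

def label_segment(seg_start, seg_end, turns):
    """Assign the speaker whose turn overlaps this segment the most.
    When two people talk over each other, the overlaps become
    ambiguous; that is exactly where diarisation errors creep in."""
    best_label, best_overlap = "UNKNOWN", 0.0
    for t_start, t_end, label in turns:
        overlap = min(seg_end, t_end) - max(seg_start, t_start)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label

for start, end, text in transcript:
    print(f"{label_segment(start, end, turns)}: {text}")
```

Real products do this alignment in more sophisticated ways, but the failure mode is the same: when turns overlap, or a third voice joins, the labels become guesses.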
The raw transcription is a complete record of the conversation, but it is not a clinical note. It contains everything: every “um,” every repetition, every tangent, every moment where the patient talked about their holiday before getting to the reason for their appointment. If you have ever read a verbatim transcript of a consultation, you know how messy and long they are.
This is where the AI language model comes in.
Step 4: Structuring
The language model takes the raw transcript and transforms it into a structured clinical note. This is the step where the real intelligence — and the real risk — lies.
The model does several things simultaneously. It extracts the clinical content from the conversation, separating it from social chat and administrative discussion. It organises that content into a structure — typically a SOAP format (Subjective, Objective, Assessment, Plan) or a practice-specific template. It summarises the patient’s account rather than quoting it verbatim. And it infers what belongs in each section based on the context of the conversation.
Let me give you a concrete example. The patient says: “I’ve had this cough for about three weeks now. It’s worse at night. I’m not bringing anything up. No blood or anything like that. My wife thinks it’s because of my new blood pressure tablet — I started that one about a month ago. The ramipril.”
The AI might generate: “Presenting complaint: Three-week history of dry cough, worse at night. No haemoptysis. Temporal association with ramipril started approximately one month ago. ACE inhibitor-induced cough suspected.”
Notice what happened. The AI did not just transcribe. It interpreted. It converted the patient’s words into clinical language. It added the term “haemoptysis” where the patient said “no blood.” It made an association between the cough and the ramipril and labelled it as a suspected ACE inhibitor-induced cough. That clinical interpretation is useful — but it is also the point where errors can creep in. The AI is making a judgement about what the patient meant and what is clinically relevant. It might be right. But it might miss nuance, add assumptions, or draw a connection that was not clinically appropriate.
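To make the structuring step tangible, here is a minimal Python sketch of how a raw transcript might be turned into a SOAP note using a general-purpose language model API. I am using the OpenAI chat API purely as an illustration; the model name, the prompt wording, and the instruction not to invent findings are my own assumptions, and commercial scribing products use their own models, clinical templates, and safety layers.

```python
# Minimal sketch of the structuring step, assuming an OpenAI-style
# chat API. Not any scribing vendor's actual prompt or pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a clinical documentation assistant. Rewrite the "
    "consultation transcript below as a SOAP note (Subjective, "
    "Objective, Assessment, Plan). Include only information stated "
    "in the transcript. Do not invent findings or diagnoses."
)

transcript = (
    "Patient: I've had this cough for about three weeks now. It's "
    "worse at night. I'm not bringing anything up. No blood or "
    "anything like that. My wife thinks it's because of my new blood "
    "pressure tablet. I started that one about a month ago. The ramipril."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name for illustration
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)
```

Notice that even with an explicit instruction not to invent findings, the model is still interpreting: the prompt constrains it, but it does not guarantee accuracy. That is why Step 5 exists.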
Step 5: Review and sign-off
The structured note is presented to you for review. This is the critical step — the one that separates safe use from risky use.
You read the note. You check it against your memory of the consultation. You verify that the history is accurate, the examination findings are correct (or that you need to add them), the assessment is reasonable, and the plan matches what you actually discussed with the patient.
You may need to add things the AI missed — examination findings you did not dictate, observations you made visually, clinical reasoning that was in your head but not in the conversation. You may need to remove things the AI included incorrectly — a detail it misheard, an interpretation it got wrong, a plan element that was discussed but rejected.
Once you are satisfied that the note is accurate, you approve it. It is saved to the patient’s clinical record. Your name is on it. From that point forward, it is your documentation, indistinguishable from a note you typed yourself.
This is why I keep saying that AI scribing is a tool, not a replacement. The tool produces a draft. The clinician produces the record.
What it captures and what it misses
Understanding the limitations of ambient scribing is just as important as understanding how it works. Let me be specific about what the technology cannot do.
It cannot see. If you examine the patient’s chest and hear bilateral basal crepitations, the AI does not know that unless you say it aloud. Some clinicians develop a habit of narrating their examination findings: “Chest is clear. No wheeze. Good air entry bilaterally.” This is a practical workaround, but it requires a conscious change in your consultation style.
It cannot read non-verbal cues. The patient who says “I’m fine” while looking at the floor. The partner who glances nervously when you ask about alcohol. The child who clings to their parent when you walk into the room. All of this clinical information is invisible to the AI.
It cannot capture what was nearly said. Sometimes the most important part of a consultation is the thing the patient almost mentioned but pulled back from. You noticed. You gently followed up. The AI only captured the words, not the hesitation.
It cannot document your clinical reasoning. Why you chose one diagnosis over another, why you decided to watch and wait rather than investigate, why you thought this presentation was benign rather than sinister — unless you verbalised that reasoning, it is not in the transcript.
“Ambient” does not mean “autonomous.” The tool is ambient in that it listens in the background. But it is not autonomous in that it cannot independently assess, examine, or reason. It is a sophisticated note-taker. The clinical work remains entirely yours.
Face-to-face, telephone, and video
A quick note on consultation modes, because they affect how well scribing works.
Face-to-face consultations generally produce the best results, because the audio quality is usually good and the conversation follows a natural clinical structure. The main limitation is examination findings, which need to be spoken to be captured.
Telephone consultations can work well for scribing, because the entire consultation is spoken. However, audio quality depends on the phone connection, and the AI may struggle to distinguish between clinician and patient if the call is on speakerphone.
Video consultations are similar to telephone in terms of what the AI captures — it works from the audio stream. Any visual cues you observe during a video call need to be spoken to be documented.
In all three modes, the same principle applies: the AI can only work with what it hears. Everything else is your responsibility to add.
In the next lesson, we are going to focus on the most important practical skill in AI-assisted documentation: how to review an AI-generated note quickly, thoroughly, and safely.
Key Takeaway
Ambient scribing is a recording, transcription, and summarisation tool that produces a draft for you to review. It does not examine the patient. It does not make clinical decisions. It does not replace your professional responsibility to document accurately. It captures spoken words — everything else is yours to add.