How to use Generative AI and Prompt Engineering for Clinicians
Generative AI has been making waves ever since it burst into public consciousness in November 2022. The seemingly endless capabilities of tools such as ChatGPT, Copilot, and Google’s Gemini heralded predictions of a transformed world, where work and leisure would be unrecognisable.
A year and a half later things look pretty familiar, but some early adopters are finding ways in which AI can help them in their personal and professional lives. As AI enters the healthcare domain, we need to ask: Is it welcome? Is it safe? Will it make our lives easier or harder?
Join Dr Keith Grimes, GP and Digital Health Consultant, as he explains everything you wanted to know about Large Language Models but were afraid to ask. From how they are built and trained, through to their strengths and weaknesses, and how they are being used in Healthcare, you’ll learn about how to safely use Generative AI and generate instructions that will help you get the most out of your AI assistant.
Expert: Keith Grimes, Digital Health & Innovation Consultant, Curistica
Transcription:
Hello, and a warm welcome to this webinar to all of you, wherever you're joining us from. Ciarán Walters is my name. I'm clinical director at BMJ, and it's my pleasure to be your moderator for today's session. Thank you for joining us. The topic for today is how to use generative AI and prompt engineering for clinicians.
And we're delighted to be joined by our expert, Keith Grimes. BMJ Future Health is a new community that consists of a webinar series, podcasts, and a live event in November in London. It's all about innovation through digital solutions to create health systems that are stable and conducive to a supported workforce, with resultant better patient outcomes and healthier populations.
At the end of this presentation, we'll have a few minutes of Q and A. Please do add your questions to the Q and A box. A reminder that this session is being recorded and will be made available to watch again afterwards. Without further ado, it's my pleasure to introduce Keith Grimes, Digital Health and Innovation Consultant at Curistica.
And Keith will be exploring everything you might want to know about large language models, how they're built and trained, strengths and weaknesses, and how they're being used in healthcare. And you'll learn about how to safely use generative AI, how to generate instructions or prompts that will help you to get the most out of your AI assistant.
Keith. You're very welcome. And over to yourself. Great. Hello. Thank you, Kieran. And hello, everyone calling from all over the place. So wonderful to see such a nice diverse group. And I'm just going to press ahead. If there's any technical problems, someone shout out, but we've got a packed agenda ahead of us.
So I'm just going to get going. So this is me. My name is Keith. I'm a GP. I've been a doctor for 28 years and a general practitioner for 24 of them. But I've always been really passionate about technology. I've always tried to do it within my clinical work and back in the noughties and 2010s, it was quite tricky to do that within the NHS.
There was a little bit of an appetite for it, but my passion for it exceeded that. And as I tried to transition into technology, I eventually made it into the industrial sector, and I worked for a company called Babylon for four and a half years as the director of digital health and innovation, where I learned a great deal about working with AI and bringing AI products to market.
I also learned a lot about product management, and my passion is now combining those two: being a hybrid doctor, product person, technology specialist and leader, talking about how to get technology into healthcare. And it's left me a very happy person running a consultancy of my own called Curistica, which I might mention a bit at the end.
If there are any particular conflicts as I go through, I'll raise them. I do advisory work for some companies, one of which I'll mention within here. But beyond that we should be free and ready to proceed. And as we heard from Kieran a moment ago, we're going to cover a few things. We're going to give you a
baseline understanding of generative AI and large language models: what you really need to know to make the most of them. Then we're going to go into a bit more depth about the strengths, the weaknesses and the risks of using artificial intelligence, particularly generative AI, in the healthcare space, and then look more specifically at generative AI (or gen AI, or just simply AI, as I might call it through the talk).
What role do they have to play in healthcare now and going forward? And then the last bit, which is actually really going to be a sort of taster is how do you use AI? Because people could put AI into products and services, or you can use it yourself. And I'm a big proponent of using it yourself, but you need to know how to use it safely and effectively and when you might not want to use it.
And the shortcut here is that it's probably not ready for you to be using directly on patients just yet, but there are lots of things you can do to help your work. So we'll cover some of that and then answer some questions. So at this point, can we have the poll, please, so we can understand a little bit more about the people on the call?
So in front of you, you should see a poll, and if you would be so kind, just say where you are on this one. It's just three boxes. Have you used it? Have you not used it? It'd be useful to know. We'll just run it for a little bit and see. Here we go, 77 people. That's great. I'll hold on and encourage those people.
It's only three questions. Don't be shy. If you haven't used AI before at all, or even if you really rarely used it, still very happy to hear from you on this one. I think a couple of people have raised their hand there. I'm not sure why those people have raised their hands. If there's any problems on your end.
Then just speak up. I'm we've got a team in the background that can help with the technical stuff.
I might roll it up just now and let's see where we are. And the results are in. I'm going to read it out before I click share. 64% of you are already using AI in your current role and 36% are unsure about how to use AI and what benefits there are. Thankfully, no one has said they would never use AI here.
You're in the wrong place if you said that, because I'm going to be encouraging you to do so. Those are the results. And with that, I am going to, let's see if I can click that, and then stop sharing, and then bang, back in the room. Okay, given that I'm not entirely sure what everyone has been doing with regards to using generative AI, why don't we just do a very quick demo
of generative AI. So some of you have already seen this, but to best understand it, I'll quickly show you. This is ChatGPT, running GPT-4o, their latest model. And if you use ChatGPT, you might be presented with something like this. So what does generative AI do? Why don't I ask it? So we'll just say: hello, I am speaking to some clinicians interested in LLMs in healthcare.
Introduce yourself. And lay out what I should cover. Let's see what it does.
It introduced me, interestingly. Anyway, talking about me in the background. And then the role of large language models, the benefits of large language models in healthcare, challenges. Yeah, that's all pretty good. Closing thoughts in there. That's interesting. Okay, but that's too long. Just bullet points, please.
Let's see how it's going in here. So it's shortened up a little bit. Still a little bit too long maybe, but I'm getting there. I understand that large language models can do a bit of translation. That's great. Could you tell me this again in the voice of, what language will we go for, a pirate?
Because we should all be able to speak a little bit of pirate, I imagine. And there we go. It's speaking to you in Pirate, there! Ye must consider, guard it well, need for validation, and constant watch, and future advancements. Be key, yeah, keep your curiosity alive, that's great. And then one final bit before we come back into the presentation.
Why do you think I asked you to do this?
And here we go. And this is something that I find particularly interesting about generative AI. It's able to do a reasonable impression of reasoning and inference from this as well. Yes, I was trying to be a bit ice-breaky and playful in this as well, to lay it out there. So yeah, you've seen a little bit of what large language models can do.
Some of you have probably used it for way more than that, but it is important that you see at least a little bit of it before I talk about the rest of this. Generative AI, how does it all tie together? Tradition dictates that when you talk about AI in any form, you have to talk about definitions.
So I will start by saying there are many definitions of artificial intelligence, a field that has been around since the early 1950s. Artificial intelligence at its simplest is when you ask machines to do something that a human brain would otherwise do. There are more technical ways of doing this, but it's a very useful way to think of it that broadly because artificial intelligence is very broad.
It covers a lot of things, and it covers a lot of things that these days seem rather simple: rule-based systems, flows and so on. And so there's a lot of technology out there that would technically be known as artificial intelligence, but you might be expecting slightly more. And that's okay, because over the last 70-odd years we've seen some real advancements in artificial intelligence, starting out with very simple rules and then moving into something like machine learning, where you're using more advanced models that allow computers to learn from, and make decisions and predictions based on, data, often labeled data or structured data that's been prepared.
So we have some indication of what the data means, and from that we're able to do something useful. So that's machine learning, and I sometimes describe it as statistics on steroids, going from linear regression and multilinear regression into other more complex forms. And machine learning is used quite extensively these days across a number of different areas.
But we start getting close to what people think about with artificial intelligence when we talk about neural networks and deep learning. And here we're talking about algorithms and models that are modeled on the neuron, the basic unit of thought, perhaps. And with that, you build a digital representation of these nerve cells, these neurons, that connect to each other.
And they will sum up their inputs and, based on the weightings inside them, pass on an output which goes through to the next layer and the next layer and the next layer, much as you might find in a biological system. This has been around, again, for some time, but it's only the recent advances in data and technology and processing that allow us to do very impressive things.
And so what we're getting is a network which can take input, give you an output, make predictions and do useful things. But the important thing to bear in mind is that at this point it starts to get difficult to understand how it arrived at that output. And so neural networks and deep learning can work with labeled and unlabeled data, but sometimes it's difficult to work out exactly why it's come up
with the output. So when we start talking about deep learning, you might start hearing phrases like black boxes, and they can cause particular problems. So what about generative AI? With those models, we're talking about taking an input and creating an output, often going from a large amount of data to a small output. Generative AI does the opposite: by learning the patterns in that data, it can create content based on the training that it's had.
And this area generative AI is what we're going to be talking about today. And it's where things like ChatGPT sit. So what's the difference? A predictive model might be something like this. You might get lots and lots of pictures of cats and lots of pictures of not cats. And you'll label them and train a model and say, here's a bunch of pictures of cats.
Here's a bunch of pictures of not cats. And over many training iterations, it will identify features in these pictures to become very good at identifying cats. And so you might see this with image processing. What generative AI does is the opposite, and you might have done this yourself with Midjourney or DALL·E: it will have learned those patterns through its training, and then you provide it with a description, here a photographic image of a white cat meowing while seated at a table, etc.
And you can see the rather scary picture of a yowling cat in front of a plate of vegetables here on the right. So again, the principle is you go from a small amount of data to a large amount of data based on patterns that were learned through training. And you can do this for any data that you can digitize or serialize as well.
And that could be sound, vision, video, genetic sequences, or text. And that's very handy because the use of text is so abundant throughout how we operate as humans and as clinicians around the world. And we're going to focus on large language models for that. So what is a large language model? A large language model is a type of generative AI.
It's based on a deep learning architecture. And what has been done is it's been trained on textual data. At its very simplest, and it is helpful to understand this simply, all these models do is learn the relationship between words and their positions, and how to predict the next one and the next one and the next one until the model predicts that it should stop.
That is basically what a large language model does. It takes text as an input and creates text as an output, predicting it based on what it's learned through training. And we'll come back to training in a minute. Now, this kind of technology has been around for quite a while. And in fact, some of the people here will remember predictive text on their phones before you had iPhones and the like, where you would actually get a prediction of what the next word would be.
And that's using a similar kind of model behind it. But up until about 2017 or thereabouts, it would start to fall down after a while. You'd predict one word, but as you started to try and predict further and further, the predictions would get worse and worse. And it was deemed to be a very difficult problem to solve until 2017, when this paper came out:
Attention Is All You Need. And it's a real foundation stone. Within this, the researchers identified that if you are trying to do this next-word prediction, it's not just the meaning of the words that's important, but also how they relate to each other. And, very importantly, which word is the most important one to pay attention to?
Because you're not going to pay attention to all of them equally. Once this was put into place, it started us on the path to where we are now. There was a rapid improvement in the performance of these models. And that's based on something called the transformer architecture. And to illustrate this, if I said to you, "the cat sat on the", you would probably, or hopefully, be thinking of something like "mat".
You wouldn't be thinking of a submarine or helicopter or anything like that. If you think about how you made those predictions, you'd be thinking what are the important words? Cat, sat, and on, the other bits aren't quite so important. So you're performing some of this activity too. This activity is modeled in these transformer models.
This is as technical as it's going to get. A transformer is on the right, not the left. This is from within that first paper. And it's useful to have a bit of an understanding, because you might see this elsewhere. But a transformer is composed of two parts: on the left, an encoder, and on the right, a decoder.
So what does the encoder do? So you have some data in the form of words for the model to be able to process this. What it needs to do is it needs to convert it into numbers or a numerical representation. And this is the process of tokenizing where these words or this text is broken into slightly smaller chunks and becomes like a vocabulary, which is then embedded.
It's actually converted into a numeric representation of each of those in multi-dimensional space. What I mean by that: imagine a graph with not two, not three, but one and a half thousand axes. And each of these tokens is then mapped onto these axes depending on how similar it is to other parts of words or tokens.
And each of these axes quantifies some aspect or characteristic of this token. So this is the way in which these words are converted into basically large vector representations. And then the second thing that happens in the encoder is that it encodes the position of these tokens in relation to each other.
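To make those first encoder steps a little more concrete, here is a minimal sketch of tokenizing and then looking up vectors. It assumes the open-source tiktoken tokenizer purely for illustration, and the tiny random embedding table is nothing like a real model's learned one.

```python
# A minimal sketch of the encoder's first steps: tokenize, then embed.
# Assumes the open-source "tiktoken" tokenizer is installed (pip install tiktoken);
# the 8-dimensional random embedding table below is purely illustrative.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The cat sat on the mat")
print(tokens)  # a short list of integer token IDs

# In a real model each token ID indexes a learned vector with hundreds or
# thousands of dimensions; here we fake a tiny table to show the shape of the idea.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(enc.n_vocab, 8))
vectors = embedding_table[tokens]  # one 8-dimensional vector per token
print(vectors.shape)               # (number_of_tokens, 8)
```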
So this process is the model understanding what the meaning of the text is as it comes in. The decoder portion, the bit on the right, is when the model then uses this, and all the information that's been trained on, to then look at what's gone before, pay attention to the important parts, and then predict the likelihood of any of the tokens being most likely next.
And it will pick one of those, often one of the more likely ones, and then it will feed the whole lot back into the start and do it again. And again, until it predicts that it should stop. And that, my friends, is how these models work. There is a lot more detail to it, and it starts to get quite complex.
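That feed-it-back-in loop is easier to see in code. Here is a toy sketch of autoregressive generation; predict_next_token_probs stands in for the whole trained transformer and is an assumption for illustration, not any real API.

```python
# A toy sketch of the decoder loop described above: predict a distribution over
# the vocabulary, pick one token, append it, and repeat until a stop token.
# predict_next_token_probs() is a stand-in for the trained model itself.
import random

STOP_TOKEN = "<stop>"

def predict_next_token_probs(tokens):
    # In reality this is the transformer: attention over everything so far,
    # producing a probability for every token in the vocabulary.
    vocabulary = ["the", "cat", "sat", "on", "mat", STOP_TOKEN]
    return {tok: 1.0 / len(vocabulary) for tok in vocabulary}  # placeholder

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_next_token_probs(tokens)
        # Usually one of the more likely tokens is picked (sampling / temperature).
        next_token = random.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == STOP_TOKEN:
            break
        tokens.append(next_token)  # feed the whole lot back in and go again
    return tokens

print(generate(["the", "cat", "sat", "on"]))
```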
You're free to go and look into it, but it's important to understand that basic principle. So that's how the models work. The next question is large language model. Language I get, model I get. What do we mean by large? In a large language model, we're talking about two ways of thinking about it. A large language model has a lot of parameters, and the parameters are essentially the connections between each of these little digital neurons, these little interfaces as well.
There's a large number of them. And there's also a large amount of data used to train it. So in this sort of description I'm going to do, you can think of the human brain, how many kind of connections does the human brain have, when you think about synapses. There's some references saying there's about 100 trillion synapses in the human brain.
Okay. And then what about the amount of data? To help you visualize how much data it is trained on, imagine a piece of paper in front of you. A piece of paper has about 500 words, roughly 650 tokens. And 100 sheets of A4 is about a centimeter high. So you've got a brain as a reference and you've got a stack of paper as a reference.
So the transformer model is released. OpenAI, formed beforehand, starts to build GPTs, generative pre-trained transformers. And GPT-1 came out in 2018, and at that point it was trained on about four and a half gigabytes' worth of text, taken from about 7,000 unpublished books. That amount of data, if you represented it as a stack of paper like I described, would be about 365 meters tall, about the height of the Empire State Building.
And then the size of that network, the parameters in it, is about 120 million. That's about the same as a fruit fly in terms of synapses. Okay, so that was June 2018, and then not very long afterwards, in February 2019, they had another go: GPT-2, at which point they used a large amount more data: 40 gigabytes of web text, about 8 million documents
from 45 million webpages, some of which came from Reddit. As a result, if you stack that data up, it's about 3.24 kilometers high, the height of Cusco in Peru, the sixth highest city in the world, or three times the height of Snowdon. So again, very substantial. At that point, you've got 1.5 billion parameters, about 1.5 honeybees' worth of synapses as well.
So it's still getting bigger. What about GPT-3? GPT-3, June 2020: at this point, they really cranked up the amount of data. 570 gigabytes of plain text from the internet, 300 billion tokens, big chunks of the internet, web text, Wikipedia, two corpora of books, and a number of parameters of about 175 billion.
That's about 0.2 percent of the human brain in terms of synapses. But look at the amount of data. If you stacked it up as A4, it would be 46.2 kilometers high; imagine that, five times higher than the highest commercial airliner flies. Now, at some point after this we got GPT-3.5, which was a version of this as well, but GPT-4 is the frontier model used by OpenAI at the moment.
And that was trained back in March 2023. It was trained on 13 trillion tokens' worth of information. That is a substantial chunk of the public internet, and we'll come back to that again in a little bit. That would be 2,000 kilometers of paper of data stored there. That's the border between low and medium Earth orbit.
And in terms of the number of parameters, that's about 2 percent of the size of the human brain in terms of these parameters, these interconnections. Now it's not modeling the human brain. It is important to say that, but it just gives you an idea of this sort of pace. So why on earth does that matter at all?
It turns out that when you're making large language models, size really matters. This graph here on the left shows some evaluations of the performance of large language models at different stages: GPT-1, GPT-2 and GPT-3. It doesn't matter what the evaluations are specifically, but what you can see is that as the number of parameters rises, and these models are trained on more data, not much seems to happen.
And then all of a sudden you get this kind of takeoff, this exponential takeoff. Now, this was predicted mathematically by the manufacturers and the researchers working on this. But it is very interesting, because what it means is that at this point you start to see the things that make us most interested in large language models.
That it's scaling in a nonlinear fashion. That it's actually beginning to be able to do tasks that it wasn't explicitly trained to do. Or, if you get it operating on new tasks that it hasn't been trained on, it only takes a small amount of additional data to pick them up. At this point, large language models start to look really promising for doing some useful work.
Now, there are some challenges. No one's entirely clear why this is happening. No one can explain exactly how this is happening either. And some people challenge the fact that this is happening at all. This is common in the world of generative AI research as people understand more about this. So it moves fast, but at this point, all you need to know is the larger the model, the more capable it is.
So how do you train these models? You've got all this data. There are three stages. Once you have a model, the first stage is called pre training and in pre training, this is basically where a large language model learns the facts. It learns the relationship between all these words and all these positions and all these hundreds of millions and billions of documents.
So you start with a large quantity of low-quality, unlabeled data coming from the internet, and with this huge amount of data, you then put it into a transformer model, or a model based on the transformer architecture, and you train it. You basically start with this model. You randomly assign some values.
You give it the first part of a document, it predicts what the next part of the document will be, and then you correct it. It updates the weights and does this again and again, and you do that a head-scrambling number of times. This takes hundreds or thousands of graphical processing units, some of the most expensive processors out there, running for months at a time, costing by some estimates a hundred million dollars to run the pre-training phase for GPT-4.
You get something that is excellent at completing documents, a foundation model. Now, a document completing model is impressive but it's not terribly useful. So you have to do two more stages. So stage two is fine tuning, supervised fine tuning. And what happens here is you take this document completing base model and you give it a smaller amount of high quality examples.
And by that, what you do is you give it an example of questions and answers or long text documents and summarizations, the sort of things that we find quite useful about large language models. So you provide it with the input, it generates the output, you give the gold standard output, and then it updates the weights and keeps going.
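To give a feel for what one of those gold-standard examples might look like, here is a single, entirely invented supervised fine-tuning pair; the field names and the clinical content are illustrative assumptions, not a real training record from any lab.

```python
# One made-up supervised fine-tuning example: the model is shown the
# instruction and input, and is trained to reproduce the gold-standard output.
# Field names are illustrative; real pipelines each have their own schema.
example = {
    "instruction": "Summarise this discharge letter in two sentences for the GP record.",
    "input": "Admitted with community-acquired pneumonia, treated with IV then oral "
             "antibiotics, discharged on day 4 with a 3-day course of oral amoxicillin "
             "and a follow-up chest X-ray requested for 6 weeks.",
    "gold_output": "Four-day admission for community-acquired pneumonia, treated with "
                   "antibiotics and discharged to complete a short oral course. "
                   "Follow-up chest X-ray requested for 6 weeks.",
}
```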
Now, these examples are produced by human researchers based on principles from the manufacturers. And so at this point, this is where you can talk about alignment. This is where the model kind of learns right from wrong, what is harmful, what is not harmful. It has not got any kind of emotions or moral judgments, but it's just being fed examples that adhere to these principles.
And on the right, you can see some of those. This is OpenAI's, what's called a model card. And they basically have three layers of objectives, rules and defaults that the people creating the sample texts and input and output examples will use. So for example, the high-level objectives: assist the developer and the end user, benefit humanity, and reflect well on OpenAI.
This is OpenAI's model card, of course. And then next, the rules: follow the chain of command, comply with applicable laws, don't provide information hazards. For example, at this point, large language models, or GPT, will be trained not to provide any information about self-harm. Respect rights, protect privacy, and don't respond with not-safe-for-work content.
And then down to the defaults, assume the best intentions of the user, ask clarifying questions, and so on. And with this, the model starts to behave in a way that is helpful and not damaging.
But there's one more stage. Once you've got this supervised fine-tuned model, you then move into one final stage called reinforcement learning, where you'll give it some example questions. It will spit out an answer, and humans will rate those answers. And those ratings will do two things. Number one, feed back into the model.
And number two, train another model so it can do this at scale. And at the end of that, it trains and trains. And at the very end of all this process, you get something that you might recognize, such as ChatGPT, or Copilot or Gemini or Claude or Mistral, or any of the models that you have out there wrapped in a chat interface.
This is what you are using. So now you understand how they've been built, and you've seen how they work. What are the strengths and the weaknesses, the capabilities and the risks of these models? The basic things that they can do fall into these areas. First, it can expand: a model can take a small input and create a very large output.
How tall is the Empire State Building? The Empire State Building is in New York. It's 365 meters tall, blah, blah, blah. Small to large, question to answer, for example. Or it could be the opposite. It can summarize, going from a large amount of input to a small output. So that could be: classify this information.
What's the sentiment of these reviews? Or can you help me understand this complex journal article on artificial intelligence, summarize it and bring it down? Or patient records: summarize what's happened between last time and now, or the discharge summary.
And it can also translate from one form to another. Now, that could be English to French, German, whatever you like, but also from one use of language to another: clinical language to patient-facing language, different languages at different levels. But also computer language as well.
You can put in a free-text instruction that will give you computer code, so you can instruct it to assist you with coding. It can also help with reasoning and inference. Now, what's very important here is that it is not reasoning and it is not inferring. It is simply using the patterns that it's learned from data to predict what the next
most likely token is, until it should stop. However, within that, if you ask it to reason, it will give a prediction of what the reasoning should look like. And that in itself can become useful. That's a distinction that is sometimes slightly hard to wrap your head around, but remember, these are just tools and they are helping you to perform a useful task.
They can do it in a conversational form, and this is how they've been trained, and they can combine any and all of the above across a whole bunch of different domains, which is why they are so very interesting and so very useful. But they can't do everything. They're not magic. They're not search engines.
They've been trained on a large amount of information, but that has an end date. And so if you ask for anything outside the training data, it won't know it, it won't be stored, it won't be represented. It may give you an answer, and we'll come back to that in a bit. And you can actually use some large language models that have tools to search the internet, but they are not a search engine.
And oddly enough, they're not good at maths. They're not good at references either. Why? Because they're predicting the next token until they stop. They're not actually doing maths in the usual sense. And that surprises a lot of people, because these are, of course, computer models. What people will often do, or what manufacturers will do, is use large language models to predict when they need to use an external tool.
And then they will use a tool and then incorporate that answer. They're also not as good in less common languages. And if you think about the state of the internet, some languages predominate. And if you have a large amount of data in one language, it will perform better in that language than in other areas.
And this graph on the right shows the performance of GPT-4 against different languages. So the bit just below the blue line is the performance of GPT-4 in multiple tests in English. And then as you start dropping down, on the right here we have Italian at 84.1%, German at 83.7%, and we drop down and down.
Turkish is about midpoint, Japanese around about there, Welsh is there. And then once you start getting down to Thai and Punjabi and Telugu, it's actually down at the bottom. This is really important because if we're using this to provide services in a country where these languages are spoken, you can see how they won't perform quite as well.
Now remember, these models have no memory. They are stateless. They have no morality. They have no emotions. They have been aligned to give answers in a certain way, and they can't explain themselves. Although you can ask for a rationalization, it will predict one; it is not explaining itself as such. And then on top of that, they can be easily distracted.
They can be forgetful. They can be very eager to please. They've been trained that way. And so if you ask questions in a certain way, you're more likely to get a kind of confirming answer off the back of that. And that can reinforce bias. Now, all of these problems can lead to trouble and risks. And it's really important that you understand this, because this is what you have to bear in mind when you're using these models and trying to maintain safety.
So number one, they can hallucinate. They can be biased. They can be unfair. They can leak data. They can infringe copyright. They are beasts to regulate and they can be very costly to run. So you're thinking why on earth would you use these? They can do some very useful things. But let's step through each of these, and then you can understand them more and think about how you mitigate the problems.
So the first problem is hallucinations. And you've probably heard of these. Hallucinations are where these large language models, once asked to complete a task, produce output containing nonsensical or factually incorrect data. Now, hallucination isn't really a good medical word for this. I prefer confabulation, but hallucination is what has stuck.
And they're a product of how these models work. They're just predicting in a semi random way what the next word will be. And you can tune exactly how random that is, but they're tuned to essentially always give an answer. So as a result of that, even if they can't give an answer, they'll give the most probable token.
And as a result, you may get some hallucinations. Produced in extremely plausible way. It makes them very hard to identify, even sometimes by experts. What does that mean? The medical setting? You can imagine a doctor may miss something and follow some advice that's incorrect. The patient may have much less understanding of this and actually follow advice and come to harm or even die as a result of this.
So for this reason, large language models, particularly things like GPT Claude and so on, they are very clear. These are not medical devices. They are not intended for medical use. Now people will use them for that. And we can come back to that. I'm in, but. It is important to know this. So how do you fix this hallucination thing?
It's difficult to eliminate it completely, but bigger models hallucinate less. You can train them more to be less likely to. You can correct it at the time of generating with some prompting techniques that we'll mention. And you could even use some external tools like something called RAG, Retrieval Augmented Generation, or fact checking, or use one AI to check another one.
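To show where Retrieval Augmented Generation sits, here is a heavily simplified sketch: fetch the most relevant passages from a trusted store and put them into the prompt so the model answers from them rather than from memory. The retrieval function, the guideline snippets and the ask_llm call are all stand-ins for illustration, not any real library or real guidance.

```python
# A minimal sketch of Retrieval Augmented Generation (RAG).
# retrieve() and ask_llm() are placeholders; the "guideline" snippets are invented.
GUIDELINE_STORE = [
    "Hypertension: offer lifestyle advice before starting drug treatment.",
    "Asthma: review inhaler technique at every asthma consultation.",
]

def retrieve(question, store, top_k=1):
    # Real systems use embeddings and a vector database; a crude keyword overlap
    # is enough to show where retrieval sits in the pipeline.
    overlap = lambda doc: len(set(question.lower().split()) & set(doc.lower().split()))
    return sorted(store, key=overlap, reverse=True)[:top_k]

def ask_llm(prompt):
    # Stand-in: swap in a call to whichever model your organisation has approved.
    return f"[model response to a prompt of {len(prompt)} characters]"

def answer_with_rag(question):
    context = "\n".join(retrieve(question, GUIDELINE_STORE))
    prompt = (
        "Answer using ONLY the guideline extracts below. "
        "If the answer is not in them, say so.\n\n"
        f"Guideline extracts:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)

print(answer_with_rag("When should I offer lifestyle advice for hypertension?"))
```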
You can do all of these. But the most common thing you do is put a human in the loop: let's get a human to check this result. And I think most of the time, when people are using it, they will be the one doing that, and they'll be well placed to check it. So hallucinations are a problem. What about bias and fairness?
Models are like mirrors. They reflect the data they're trained on. And we know that the data that we have in medicine is actually highly flawed at times. It's very biased towards certain groups and certain ethnicities. And if you just pour all of this into a model without much thought, you are just going to get the same biases straight out.
It is important for you, therefore, to look at the data that's coming in and make sure that it's balanced and representative and potentially use synthetic data to make up the difference. But at the same time, we need to correct the problems that we have, not only in medicine, but elsewhere about focusing on small groups and making sure that we have more diverse representation in the data sets that we use.
What does this mean? Obviously misdiagnosis, inequitable treatment, miscommunication, particularly for people in groups that are less well represented in the training data. So we have to be very careful about how we train these models and how we use them. You can do things to improve this: choosing the right training data, using the right fine-tuning, the right kind of alignment strategies.
But for you, the right kind of prompts are very important as well. And what I think is really interesting about all of this is that, with this sort of excitement about realizing the potency of artificial intelligence, and realizing that the data we have is not very good, we're actually seeing AI helping to correct some of the intrinsic biases.
One paper back here from 2021 showed how an image recognition model, a deep learning model, was able to identify with much greater accuracy the progression of arthritis in the knee of people from black or African Caribbean backgrounds than the existing algorithm, which had been around since the fifties and was, unsurprisingly, trained predominantly on white men.
And as a result, the old algorithm was under-triaging or underestimating the severity of illness. So AI can actually help and start to pitch things back the other way. I've talked about how much data gets used to train the models. With large language models being trained on this, some of the data that's on the internet is going to be sensitive data.
And when large language models are producing their outputs, they will sometimes just regurgitate or spit this back out again. And that means that you could have leaking of data to other sources. You can lose control of this. What does this mean? It means any generated work can actually reproduce training data.
And you could inadvertently plagiarize someone else. If your data was incorporated, it could lead to data breaches, and famously companies like Samsung lost control of some of their code base because staff were using large language models to help assist with some of the coding. And what did that lead to?
Some of that data was used at the time to train the model, and then it went out further. Now, always make sure that you're careful not to share data in a way that you shouldn't. This is a general principle, but also, when you're using models to do some experimentation, you're going to want to make sure you use ones that aren't using the data you submit to train the model as well.
And all of the common ones will have that either off by default, or you can switch it off. And there are lots of different ways that you can mitigate this as well, in particular controlling what data goes into it and how you build the model. But remember, these models are being built by independent organizations that kind of don't really answer to anyone.
So this is an important area to watch. And I've just told you that, so let's look at what these organizations do. They have scraped large chunks of the internet, publicly available information, and used them to train the model. Now, copyright law didn't foresee people training large language models at the time it was written.
And as a result of this, there are some court cases going on around the world, and in the U.S. unsurprisingly, where AI-generated content is thought to have essentially broken or breached copyright, because it's trained on large amounts of data and it's replicating things. Now, the argument is going to and fro.
The manufacturers argue fair use and the importance of what they're doing. The creators, understandably, are upset at this infringing their rights. What does this mean for you? Anything that comes out of AI could technically be seen as a derivative work. You could be subject to infringement action.
Are you still comfortable about using AI? All the large manufacturers promise to indemnify you and protect you against this. That hasn't been tested just yet. But there are other solutions, around licensing the data, and you can see companies like OpenAI working on doing this right now.
Or you can build your own model. But as we've already said, that is no mean feat. What about sustainability? How much energy, how much power, how much heat is generated through training these models? They are very costly. You heard about them running high-capacity, high-performance models for months at a time.
What does this mean? It uses energy, heat, water and materials. And the impact is very high: a predicted 2 percent of global energy consumption currently comes from the use of cloud and AI infrastructure. That's predicted to rise to 4 percent by 2026. That's the same as Japan consumes. And even in the last few years, Google's greenhouse gas emissions are up 48 percent on 2019.
Microsoft's up by 29 percent on 2020, throwing into doubt their ability to meet net zero targets as well. And these models consume water for cooling as well. Not insubstantial impact on the environment. How do you mitigate against this? More efficient models, using them in the right way, only using a model that is the right size for the task that you have.
And even just asking the question: do I really need to use AI for this? Is there a simpler way of doing this? We are going to increasingly have to account for our use of cognitive carbon. And then, regulation and safety. I've already explained to you that these are difficult, and the question is why. It doesn't matter that large language models feel new; they're no different to any other medical device.
If you have an intended medical use, you have to comply with regulations. And regulations are written, correctly, in such a way that makes it very difficult for these models to actually pass, because of how they're built. They're built in a closed fashion. They're not releasing details about how they've been manufactured.
There's such a huge amount of training data that the provenance of all this data is hard to prove. And because they're very broad in their activity, how on earth do you evidence that they're safe to use in all these different use scenarios? What does that mean? Up to today, as far as I know, there's not a single large language model on a medical device register around the world.
Does that mean they're not being used for medical purposes? I'm sure people are using them. And people are also using large language models aligned to clinical use cases, so they're using them in this sort of borderline fashion. There's a lot of work going into how we can use large language models in medical devices. Will regulations change?
What's going to happen? You just have to watch this space. So how you build them, how they work, what the risks are. Should we use them in healthcare? Are we using them in healthcare? Let's have a look at that. How might you use generative AI, knowing everything we know so far in healthcare? We'll go back to that original table, shall we?
And I've listed some things here. We know that generative AI can help with expansion. What does that mean? It can help with personalization of medicine. It can help with creating treatment plans or the application of guidelines. Or question answering or decision support. You may be asking some questions about the patient in front of you.
Or creating content, referral letters, discharge letters. This kind of thing, generative AI is seen to have quite a promising future in. But remember, if it's a medical device, it's got intended medical use, it would have to be a medical device. This is limiting some of the use cases. What about summarization?
Yeah, the summarization of consultations. We've seen already the use of natural language processing to transcribe what happens in a consultation. It can then be converted into a political note, freeing up the doctor or time so they don't have to be typing in the notes. Or automate documentation as you go as well.
Then translation, image recognition, signal analysis, radiology, pathology, language translation from one to another, comprehension and readability translation, all can be very helpful. And then for reason and inference, decision support, modeling, and so on. trial simulation, digital twins, the list is endless.
And then you can stack that inside a chat bot, making it available to a clinician as a co pilot or a patient as a co pilot, or even a virtual tutor to take students and continue professional development forward as well. Everyone's very excited about all of this, but remember if it has an intended medical use, it would have to be a medical device.
And as a result of that, we know where things stand right now. So people are trying to find ways in. Anyway, even if you could find a way in, how do we know that AI is good enough to do this? Typically, what happens is people say, let's take doctors, for example. How do I know that a doctor is good enough to practice?
We put them through exams. So maybe if we get an AI to pass an exam, it's good enough to practice medicine. The answer to that is that AI is actually very good at passing exams. It's great at passing multiple choice exams. It's great at passing long form questions as well. And this graph here shows the progress from December, 2020, all the way through to March, 2023, and the progress of different models.
As you went along, the pass mark rose to 86.5 percent by March 2023. This is just multiple choice, of course; maybe that's expected, it's not so complex. What about long-form questions, here on the right? It's actually very good at that too. In this example, long-form answers from the AI were independently and blindly judged by clinicians, and the AI answers better reflected consensus, showed better reading comprehension, better knowledge recall, better reasoning, omitted less information, and showed less evidence of bias, less extent of harm and less likelihood of harm.
The only thing on which it was inferior to humans in this particular trial was including inaccurate or irrelevant information. Now, of course that's quite good, but is it going to top out? No, you see this all the time: AI beats AI. This continually happens. The state of the art at the moment is something called Med-Gemini.
There are probably some newer ones in the meantime; things move fast. A 91.1 percent pass mark in there as well. And it's starting to wrap in not just text, but also images too. So AI is going to be doing well, but medicine is of course much more than passing exams. It's about communication and empathy, where I'm sorry to say that AI has actually been shown in certain forms to be more empathetic than doctors.
Now it doesn't have any emotions, but in this particular study, which caused a lot of controversy last year, they got AI to answer questions posed on Reddit. And then they had judgments made of the human answers versus the AI's answers. And the AI answers were deemed to be better responses quality wise, but also more empathic as well.
Why? They were slightly longer, maybe more formulaic than some of the quickly thrown-off answers on Reddit from doctors. But there's some promise in there as well, about it providing some sort of empathic response, or maybe correcting the non-empathic responses from some of the users.
But medicine is of course more than just answering questions and empathy. It's about the consultation, isn't it? Surely it can't do that. I'm sorry. It is also very good at that too. This is another paper. Now, this is in a slightly artificial situation where they simulated a consultation happening over a chat interface.
It was using something called AMIE, a model specifically trained by Google. And in this, what they did was an OSCE-format study: they had some scenarios played by actors, and they'd randomly allocate them to either human clinicians or the AI model. They ran these different scenarios and they did a questionnaire for
the participants, and they also had specialist evaluation. Now, in this spider plot, the red shows the performance of the AI, and you can see it performs very well here. It was better in terms of diagnostic accuracy, management plan, empathy, openness and honesty, even the patient's confidence in care. The only thing that it fell down on here was escalation recommendation.
So that's pretty convincing. But then you look at the next part, and this is a very dense chart. The AI was comprehensively better than the human participants. In fact, it was superior in every metric apart from one, which was about the appropriateness of onward referral. Significantly so in most cases.
Now I put links to these papers at the back of this, but in this artificial situation with a specialist trained model over a small number of cases AI did seem to perform very well. What are we going to do with that? AI is not always great. There are some papers obviously which show that it is not without risk.
With large language models in clinical oncology, one showed that they actually didn't quite meet the benchmarks that were set elsewhere. For automated coding, off-the-shelf models actually weren't that good at it. Replying to patient queries actually increased the amount of time that it took clinicians to answer their questions.
Or even answering questions on cancer care, there were actually some dangerous answers in there as well. So these models are not ready in many cases for clinical use. And remember what we said about devices. So how about we combine clinicians and artificial intelligence? And this image here is the idea of cyborgs and centaurs: humans working with technology in a way that's integrated, the cyborg, or handing over, the centaur.
You can see the joint working. Actually, the study from AMIE shows that if you take a doctor in a similar sort of setting and you give them some complex cases, giving them access to the internet improves their performance, and giving them access to AI such as AMIE improves it again. The sad news is that if you took the human out of it and just left AMIE on its own, it still performed best.
Either way, we're seeing how decision support might be the way forward here. And another route is ambient documentation. This is where I have an interest: I'm an advisor to Tortus, and this company, and other companies like it, will transcribe the consultation and convert it into medical notes.
And in this situation, it's been showing some promising results. It reduces the workload for the doctors and the staff that use it. It sometimes actually produces more accurate records. They can be more engaged with the patient. It's quite promising. And certainly the commercial offerings have been well received around the world.
At all times, they ask for a human to be in the loop to check. And I just want to flag this: if we are using generative AI and we are putting a human in the loop, remember, humans are not very good at being in the loop. We're not very good at vigilance; we suffer from automation bias, fatigue, and a poor understanding of this technology.
It means that actually we may not be best placed for this. This is an area that I slightly worry about, because the standard answer for anyone implementing AI is, oh, we'll put a human in the loop. If we're not careful about this, the only role of humans in healthcare going forward will be to be that human in the loop.
And that is not a happy place, because that means you're the one that gets sued at the end of it all. So watch this space, and understand this: it will evolve quite quickly as we try and work out the best way to use humans and AI together. Okay, cramming a lot in, we're going to spend the last 10 minutes or so on prompt engineering, which is what you're here for.
So what is prompt engineering? Remember I told you about that supervised fine tuning, that point at which you give it examples and it gives you outputs. At that point, it's learning how to respond to instruction. And prompting is essentially instruction. You're giving instruction to this model to get it to behave in a certain way.
That instruction can take the form of a task, a question, a list of questions, or even actions. How that prompt is worded has a very significant impact on the output. And that's prompt engineering. And the reason it's called this is that it's closer to coding, closer to engineering. Those are called prompts, but it's also a slightly woolly art.
There's a little bit of science to it, but there's also an element of just practice. And this is why I give talks like this, and other, more detailed workshops. You understand AI best when you are using it yourself and you want it to be able to do the things that you want it to do. This is prompt engineering.
So the basic approach, AI is a little bit like an incredibly talented intern that is extremely literal and has just turned up. So if you were working with an intern, giving it a task, what would you do? First of all, you think about what you want to do. You want to get it set up to do this safely. And then you start by just asking, maybe the intern can just do it.
But if not, you might want to try some other things and there's some techniques you can add, which I'm going to show you just now, that can improve the quality of the output, the accuracy and the likelihood that you're going to get what you want out of this. First of all, if you're doing this, you want to get set up right.
I recommend you use something like ChatGPT. We use ChatGPT Team at Curistica, and the reason for this is it uses a real state-of-the-art model. It's secure, the data is not used for training, and I pay for access, meaning I can use it as much as I wish. Basically, I can upload documents as reference. It can do custom GPTs, which we'll come back to.
But there are other ones out there. You may have access to Microsoft Copilot, which uses OpenAI's models in a slightly different way. Go by what your organization suggests. Or you can use Anthropic's Claude, which is another fantastic model that works in a slightly different way. I use it for slightly different purposes.
By using these models, you get a sense of what they can and can't do. So set up to make sure you're using the right platform, check that the information you're submitting is not being used to train the model, and then you can start with completely non-workplace-related tasks, and then start moving to non-clinical ones while we work out how to deal with the clinical side of things.
So this is the first thing you do: you just give it a go. All right, so imagine a task in front of you. I'll maybe just describe a task right now. Let's say I have a discharge letter. Let's assume that we're working in a world where this is safe and a regulated device, but I want to take a discharge letter, and I want to extract the clinical codes from it as well.
I might just paste a copy in, and I might say, can you extract the clinical codes from this? Or can you write a version of this that's suitable for handing to a patient? Large language models are actually very good off the shelf, and it may be that it does it exactly right the first time.
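If you're doing this programmatically rather than in the chat window, the "just give it a go" step might look something like the sketch below. It assumes the official OpenAI Python client and an illustrative gpt-4o model name; use whatever platform your organisation has approved, and never paste identifiable patient data into a service that hasn't been cleared for it.

```python
# A minimal sketch of "just ask": send the task and a fully anonymised letter
# to a hosted chat model. Assumes the official OpenAI Python client
# (pip install openai) and an API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()
letter_text = "..."  # an anonymised discharge letter, pasted or loaded from file

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name; use whatever your organisation approves
    messages=[
        {"role": "user",
         "content": f"Extract the clinical codes from this discharge letter:\n\n{letter_text}"},
    ],
)
print(response.choices[0].message.content)
```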
If it does, that's great. In my experience, it doesn't always do that. But remember, if you try and it doesn't work, then try other things. Iterate on what you're doing. And the first thing that you're going to want to do is add some structure. What do I mean by that? These models can do almost anything across their entire training set.
And when you ask it to do something, it's going to try and work out: what role am I going to take? What am I trying to do with this? Now, part of it is you describing the task, but start out by describing the role, the sort of personality that you want it to have. Ask it to be good, because remember, it can perform in any of these different roles.
Don't tell it to be superhumanly clever, because it drifts into science fiction. So, for summarizing: you are an excellent document-summarizing artificial intelligence. Then the goal: I want you to take this document, and I want you to extract any clinically relevant codes from it, and I want you to then give them to me in SNOMED CT form.
I don't want you to give me any codes for, let's think, observations, or anything that would interfere with QOF, for my GP friends in front of us just now. I want it spat out in SNOMED CT in a table format. And maybe I'll give it some examples as well, and I can give it some context: I'm a general practitioner working in UK general practice.
I receive letters from hospital, and I have to put the codes in the system. So I'm breaking down that task, putting it in a structure. And one structure you can use is something called CO-STAR, which I've provided here. And after this, if you want copies of this presentation, just get in touch and we'll happily give you this as well.
This is a structure, again in a slightly different way: the context, what your objective is, the style of the response, the tone, who the audience is at the end, and then how the response should be formatted as well. And how do you do this? You structure it, with either caps or new lines, or even XML if you use a bit of coding. If you structure things and break them up, the model is much more likely to be able to follow your instruction. So number one, add structure. Number two, add context. Okay. It's been trained on a huge amount of information, but it might not have the information that's very specific to you. So what you can do is actually copy that context in; there's a sketch of a fully structured prompt just below.
Now I described a bit of it, I will get a better answer if I describe who I am and the situation I'm using it in. So I could post that straight into there. But let's say for the sake of argument, I'm wanting to ask a question about a new guideline from NICE that hasn't been used in the training data or even a local guideline.
I could either upload that document or paste the entire guideline in and say, based on the guideline that I just gave you, answer me these questions. So you could use that. Now, models have context windows, a certain amount of context they're allowed. Some allow file uploads, and some of the more modern models allow you to get more in there.
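As a sketch of that idea, assuming the guideline lives in a local text file (the file name here is hypothetical): you read it in and place it in the prompt ahead of your question, keeping an eye on the context window for very long documents.

```python
# A sketch of pasting a local guideline into the prompt as context.
# "local_dka_guideline.txt" is a hypothetical file name; very long
# documents may exceed the model's context window, in which case you
# would need to shorten them or use a model that accepts file uploads.
with open("local_dka_guideline.txt", encoding="utf-8") as f:
    guideline = f.read()

question = (
    "Based only on the guideline I have given you, "
    "what is the recommended initial fluid regimen?"
)

prompt = (
    "Here is a local guideline. Use only this guideline to answer.\n\n"
    f"GUIDELINE:\n{guideline}\n\n"
    f"QUESTION:\n{question}"
)

# The prompt would then be sent to the model as in the earlier sketch.
```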
But putting context in, or even examples, if I say this, you say that, the more examples you give, the more accurate it will get for you. And that's called in-context learning. The next thing is you tell it to go step by step. This is a very simple one in some regards. You literally say, do this step by step, or rather, let's think this through step by step.
This was discovered early in some of the testing of these models: asking the model to break the task down and show its working. Even though it's just predicting, that brings it closer to a more accurate output. So I will always incorporate "let's think this through step by step", or some form of that. Or if I'm providing examples, I'll try to break those examples down as well.
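A rough sketch of those two ideas together: a couple of worked examples (in-context learning) followed by an explicit step-by-step instruction. The example inputs, outputs and problem text are invented for illustration.

```python
# A sketch combining in-context learning (a couple of worked examples)
# with an explicit "step by step" instruction. The inputs are invented.
few_shot_prompt = """
You convert free-text problem lists into SNOMED CT coded lists.

Example 1
Input: "Known asthmatic, on salbutamol."
Output: Asthma - 195967001

Example 2
Input: "Longstanding hypertension, well controlled."
Output: Hypertension - 38341003

Now do the same for the input below.
Let's think this through step by step: first identify each diagnosis,
then check whether it is a current problem, then give the code.

Input: "{problem_text}"
"""

print(few_shot_prompt.format(
    problem_text="Recently diagnosed with atrial fibrillation, started apixaban."
))
```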
And then the final things: you might want to try repeating it, you can try shouting the important bits in capitals, you could try offering it tips, you could even give it some encouragement. These are last-roll-of-the-dice type things, but worth a go. And with Kieran popping up just now, I'm aware of time; there is a little bit more I could go through on ways of wrapping these up into custom forms.
And there are ways of learning more, either through online courses or by contacting me, and I can help sort you out. We'll come on to the questions now.
Fantastic. Thank you very much, Keith. That was absolutely outstanding, I think. We've got a few minutes, so let's dive into the questions, starting with Michael Fernando. You might be able to see them yourself, Keith, in the Q&A. All right, shall we? Yeah, let's go for it. Go for it yourself if you want to.
Let's have a look here and see, am I seeing the questions? I think I've got the chat and the Q&A. Yeah, here we go. So Michael, hello, Michael. It says: is one model better than others for use in UK healthcare at present, for learning or decision support, providing citations and science as well?
Okay, and then you've got the low-hanging-fruit bit. Okay, first question. These are not medical devices. They're not intended for medical use. You step into this area at your own risk, and you would always need to check the outputs. This is not a fully resolved area. That being the case, people will want to explore and use these tools to help augment their decision making, and perhaps their research.
So if we couch it in those circumstances, what would you want to use? I would always say, at this point, I don't have a recommendation, because we don't have a medical large language model as such. There are research versions that will do this, but they're not generally available.
Use the best model that you can get your hands on, the most advanced version you have access to. And then when you're actually instructing it, remember to provide the context of where you are: I am an X in Y. So, I am a UK-based clinician working in the NHS as a general practitioner. You are more likely to get an accurate answer relevant to that particular area.
And then the second part was about the low-hanging fruit for AI in clinical practice: increased productivity and augmenting clinical work. Okay. Yeah, so that is where you are more likely to gain some benefits. And when people are talking about that, they're doing things like, for example, using an AI model
to help with transcribing and summarising a consultation. As we understand it right now, that is not a medical device, and you would still have to check the content afterwards. But there are companies like Tortus, and I'll flag that I have a conflict of interest there, and there are a few others out there as well, such as Nabla, and Microsoft has one too.
There are a few out there that you can look at, but you would want to speak to your local practice or the team that procures for you. But the other thing is: is there anything else it can do to offload the administrative burden? And to that end, I think it could be really helpful.
For example, assisting you in producing patient-facing content, not specific to one patient but general content, or maybe some of the administrative work inside a practice if you're a GP, or helping the administrative staff. One of the examples I worked through recently was building an induction plan.
Sorry, I'm going to have to ask you to keep your answers slightly shorter. I'll tighten them up, yeah. Yeah, let's press on to Harriet. Yeah. What about specific medical large language models? Yeah, so keeping it short, there are some. Google itself has got a medical family of Gemini models.
They've done a bit of additional training, or fine-tuning, using medical reference data, and their intent is that these should be used in medical circumstances. They are currently only being used in certain research settings, and also in some settings in the US. The question is: are these models better for medical uses or not?
That is still being explored. Okay. Hannon, sure. Will AI be able to perform surgeries? It depends. If you're talking about a GP surgery and consultations, we showed an example of that. It's not ready for that just yet, but it could potentially assist in some forms. If you're talking about cutting people open and moving things around, then unless you put it inside a robot, no; but if you put it inside a robot, potentially, though that's a bit further off.
Fantastic. Keith, thank you very much. If you wouldn't mind typing a couple of answers in, because we do want to finish on the hour at two. Thank you. Thanks, everybody, for taking part, and thanks for adding the questions. This is the end of today's webinar.
Thanks once again to our speaker, Keith Grimes, for this excellent presentation and discussion. On behalf of BMJ Future Health, we're delighted to announce that registration for our event on the 19th and 20th of November in London is now open. Please scan the QR code on the screen or visit our website, BMJ Future Health, for additional information.
We're going to be pausing our webinar series for the summer holidays, but we'll be starting again in September. You can stay connected with us by joining our new BMJ Future Health LinkedIn group, and you can access that page by scanning the QR code on the screen now. Can I just say one thing as well, Kieran?
I'm going to be sharing some other links: links to some articles I've written, and also my contact details if you want to get in touch for more information, answers to some of the questions you may not have been able to ask here, and other courses. Do get in touch. Thanks. Fantastic, Keith.
Thank you very much. Thanks once again to everybody who's joined. We hope that you found it useful.