Generally AI - Season 2 - Episode 1: Generative AI and Creativity

Hosts Roland and Anthony discuss how AI is being used to make creativity more accessible. While some generative AI content lacks variety and artistic depth, there is potential for AI to assist human creators rather than replace them. They also explore the challenge of evaluating generative AI models: unlike the evaluation of non-generative models, generative AI evaluation lacks clear ground truth, especially for creative tasks such as image generation or music creation. The conversation covers different evaluation methods for language models and image generation systems, including metrics like BLEU, ROUGE, BERTScore, and CLIPScore, as well as the role of human judges in ranking AI outputs.

Key Takeaways

  • AI is rapidly increasing its presence in creative fields, with AI-generated books and music seeing exponential growth, but quality and variety remain challenging.
  • AI tools like Suno and MusicGen provide creative possibilities for generating music, but they often lack the nuanced creativity and variety of human-made music, which can lead to repetitive or shallow content.
  • There is potential for AI to enhance creativity by supporting human artists in generating samples and collaborating on creative projects, though the technology is still evolving to meet this vision.
  • Generative AI evaluation metrics such as BLEU, ROUGE, and BERTScore measure text similarity, while FID and CLIPScore assess the quality of generated images by comparing them with human-labeled data.
  • Human judgment still matters: despite advancements in automated evaluation, human ranking and Reinforcement Learning from Human Feedback (RLHF) remain crucial for fine-tuning and aligning AI outputs with user expectations.

Transcript

Roland Meertens: Welcome to Generally AI. Anthony, normally I start by reading fun facts, but I think in this case you might have some input about the fun facts. What I read last week was that the Dutch news magazine De Groene Amsterdammer researched at the start of June what percentage of books on the internet is written by AI.

They trained a classifier to indicate whether a book is AI generated, and they looked at the Dutch book website, and they tested 323,000 titles, and their classifier basically found 6,600 books which were AI generated, so that's 2% of the books which this bookstore sold. And when they looked at the growth of AI-generated books, it went from about 0.1% before the introduction of ChatGPT to 4.2% in April of 2024.

Anthony Alford: Wow.

Roland Meertens: Yes. And these books often did not get a rating higher than two stars.

Anthony Alford: Well, how do you think the actual person who wrote a book that only got one or zero stars feels about that?

Roland Meertens: No, that's a good point, I didn't realize that there are probably authors which have an even worse rating. The extra fun fact I have for you is that Amazon now has a publishing limit of three books per day per author.

Anthony Alford: Okay. Danielle Steele hardest hit.

Roland Meertens: But that's still an insanely high limit, right? Who in the world can write, manually, three books per day?

Anthony Alford: They'd be pretty short.

Roland Meertens: Yes, I think so too.

Anthony Alford: I will say, there are some InfoQ editors who produce so much news that it astonishes me.

Roland Meertens: Yes, that's also true. It is very interesting to see how much you can produce in a day if you want to. I did also find a case, by the way, on Reddit, where an author, instead of shipping the generated book, they actually shipped a book with the ChatGPT prompt in it. So, one Redditor received the generation prompt for a book to create crochet patterns, I believe.

Anthony Alford: Maybe they sell this book at IKEA.

Roland Meertens: Yes, that's a good point. They have all the do-it-yourself books.

Anthony Alford: That is a fun fact though. I don't know if that's surprising how high it is or surprising how low it is.

Roland Meertens: Yes, true.

AI Creativity [02:42]

Roland Meertens: All right, welcome to Generally AI, an InfoQ podcast. It is Season two, Episode number one. And in this podcast, I, Roland Meertens, will be diving deep with Anthony Alford into the world of AI creativity.

Anthony Alford: I cannot wait.

Roland Meertens: I'm excited about this, shall I start with my topic?

Anthony Alford: Please.

Roland Meertens: The thing I wanted to focus on for this episode is creativity. We talked about AI-generated music in episode two of the first season, and that is when you showed me Meta's MusicGen model.

Anthony Alford: Right.

Roland Meertens: And what I find interesting is that since that episode, generated music has gotten insanely better.

Anthony Alford: I think we should take credit for that, at least partially.

Song-Generating AI Suno [03:51]

Roland Meertens: Thanks to this episode. But did you play around with Suno?

Anthony Alford: I have not, but several people have come up to me and showed me the songs that they've created with it.

Roland Meertens: Yes, I must say these are insanely good. You can generate songs. The song I like the best is the cat song. Let me see if I can play that for you.

Song: Cat, cat, cat, cat, cat, cat, cat. Cat, cat, cat, cat, cat.

Roland Meertens: It is good, right?

Song: Cat, cat, cat, cat, cat.

Anthony Alford: Got a good beat, easy to dance to. What's the name of this song?

Roland Meertens: Cat. But just coming back to the fun fact we had at the start, I discovered that this person is basically releasing multiple albums a day on Spotify with only cat-themed music, all generated with this tool.

Anthony Alford: Let's see when Spotify imposes a limit.

Roland Meertens: Yes, like only three albums a day. But I must say that what I also tried is using Suno as a dedicated music player, so replacing my normal streaming music with Suno-generated music, and it is quite doable, it's quite fun. But after about half an hour of listening, your mind kind of dumbs down, because every song is super catchy, super poppy, and it clearly knows how to follow a recipe and then doesn't deviate from it, where normal human artists do, and that makes them interesting.

Anthony Alford: So, it's really cool for the first time, but maybe after a while you want some variety?

Using AI to Enhance Creativity [05:33]

Roland Meertens: Yes, the variety seems to be important for my human brain, but I can easily see how you can fill a radio station or fill time on a radio station with this music. But what I wanted to do in this episode was explore more how an artist can really use AI to enhance their creativity.

So, rather than replacing the creativity, which Suno basically kills for me, because it is too easy to get something which sounds good, I wanted to explore how AI can enhance your creativity. And one place where I found it was at Google I/O. Did you look at Google I/O?

Anthony Alford: I haven't yet, I think there was some news on InfoQ about it, but I haven't read it all yet.

Roland Meertens: Yes, so at Google I/O they presented this thing called MusicFX, and an artist I really like, Marc Rebillet, did a demo of this tool. He showed that you can have prompts and you can mix different styles and instruments live. For me, it felt like the actual tool was too easy, but he also showed how he mixed it into his own loops.

Anthony Alford: Okay.

Roland Meertens: Although, he uploaded one of those things online and then commented under it, "I know it's a weird upload, don't hate me, it's just part of a partnership". So, it seems that he's not very satisfied with it.

Anthony Alford: I feel like that's our end game, a nice end game: where we can get cooperation between people and machines.

AI-based Music Sample Generator [07:03]

Roland Meertens: Yes, indeed. And that's what I tried to explore for this episode. The main thing which I wanted to discuss with you is an AI-based sample generator. As you might know, sampling is a big thing, which artists like Daft Punk popularized. So instead of playing an actual instrument, people find the right sound in another song or record, and then they play and remix that. And I thought if you can just create your own AI-generated samples, you have an infinite amount of possibilities for your own creativity.

And the tools I looked at, number one was the AI Sample Generator on aisamplegenerator.com, where if you go to it, you can select what style you want something in, like some percussions, some bongo, some melody, some ambience, and then it will create this sample for you.

And this is using Audiogen, the brother of Meta's MusicGen, which we talked about last time. And this gives you a short sample of two to four seconds. And what I figured you could do is take these samples and put them into a sampling machine to create some drum beats.

Anthony Alford: Okay.

Roland Meertens: I've been playing around with this, I will let you hear some of the samples. So, this is the piano.

Anthony Alford: Okay.

Roland Meertens: This is a funky melody. That was the percussion woodblock.

Anthony Alford: Okay.

Roland Meertens: This is the metal rod strings.

Anthony Alford: Interesting.

Roland Meertens: Interesting is the right word. It's using AudioGen, and I must say that for me it felt like it's really hard to get good-sounding sounds. But still, in the marketing video for Google's MusicFX, someone compares it to digging through an infinite amount of vinyl crates.

What I tried to do is take one of those tools you can use to create samples and turn it a bit into songs. In this case, I got a Teenage Engineering PO Knockout II. So I tried to play with these samples, and I must say I tried my best for multiple days and this is the best I could come up with.

Anthony Alford: Let's hear it.

Roland Meertens: Well, I must say I'm ashamed of it. Here we go. You hear the bongos. Okay, let's stop it there. Okay. What did you think of it?

Anthony Alford: What genre would you call that?

Roland Meertens: The genre called: "I really can't do anything with how terrible these samples are".

Anthony Alford: There is a genre called lo-fi. Maybe, I don't know if that quite fits, but some people like that aesthetic of sort of a do-it-yourself, but it was catchy.

Roland Meertens: I also tried to make more lo-fi style music, but what I mostly noticed with this sample tool is that it's really hard to get good-sounding sounds. Sometimes the sound wave starts too early, sometimes too late, the buildup isn't really good, so you have to cut a lot to extract the small useful bits from the samples.

And also, AudioGen seemed to be too eager to produce a lot of noise. So, if you asked it to create a quacking duck, the duck would immediately quack a lot instead of just producing a nice single quack.

Anthony Alford: And I remember when we did the episode last season, a lot of the generated sounds sounded really noisy, and I wonder if that has to do with whatever decoding…a lot of times if they're doing something like a transformer, they're outputting a sequence of tokens. Those tokens have to be decoded into audio waveforms. So, possibly there's some quantization noise or something. I'm just speculating.

Roland Meertens: And the other thing which I personally didn't like is that a collaboration with a computer isn't really possible. So, I can't take a sample and say something like, "Oh, I want this bass, but a bit longer and a bit darker", or something, and get a response. So, I would just generate a lot of samples and then pick the best ones. But this was kind of the best I could find.

Anthony Alford: That's interesting, because that kind of thing is something a lot of generative AI can do when it has a text input: "Do that again, but change this adjective. More this, more that".

Roland Meertens: And that isn't really there for music yet. I also noticed that once I started shifting the pitch, the pluck of a guitar and a note on the piano sounded absolutely the same. So, I tried to create a song with some piano, some guitar, but once I shifted the pitch it was just the same sounds.

Anthony Alford: Interesting.

Roland Meertens: I figured that once you're able to do this, it can be a valuable resource for musicians, who can then actually use AI to enhance their own creativity. I must also say that I found it hard to find something where you can jam with a computer. I see some value in that: maybe I can play my instrument for a few bars, or you play the bass, right?

So maybe you could play a catchy bass loop and then say, "I want to keep playing this, but can you generate a drum beat under it and then maybe some strings over it? Or maybe some violin, find a violin sound which blends into the background". That also doesn't really exist. That's something which I think could be a nice add.

Anthony Alford: Yes, I can very well see that being extremely useful for me as a bass player, because nobody wants to hear just bass. It's not that entertaining without something like drums and maybe a guitar as well.

Roland Meertens: Yes, if you can get it small enough, you could fit it into one of those effect pedals: you step on it, it automatically detects your bar length, and then it starts jamming with you. Maybe a project for another time.

Anthony Alford: But that goes back to my dream scenario of the machines being there and working with us, working as tools.

Roland Meertens: Yes, indeed. Yes, and as I said, that's what I tried to explore. If people want to explore this world, the other thing you can use is a tool called Samplab. You can either use this as a standalone tool to create samples from text, or you can have it as a plugin in Ableton.

And I by no means know how much easier or harder that makes it, but at least you can drag the samples into your virtual drum kits. And it is text to sample, but it runs locally, so it at least runs on your MacBook. And this also uses the MusicGen model in the background, so downloading that model is the first thing it does.

Anthony Alford: Remind me, or do you know the license on the MusicGen, is that why we're seeing it pop up in these tools? Is the license one of those more permissive ones?

Roland Meertens: Yes, I think so. And also, this is where I do see value in these tools. I don't know anything about licensing in the music industry, but I can imagine that if you take actual samples from songs, you have to clear those samples with the artists. And I don't know if that's one note or one second: how long do you have to sample before it becomes copyright infringement?

Anthony Alford: I'm not sure. Now we're talking about the intricacies of case law and all kinds of good stuff.

Roland Meertens: But I feel like this is a legal area you don't want to risk getting into. And then AI-generated samples can really help you.

Anthony Alford: That's a good point.

Roland Meertens: Especially if you're a band like Justice, who built an entire album out of samples of single notes and tiny pieces of music.

Anthony Alford: Are you a Beastie Boys fan?

Roland Meertens: I am hearing that I should listen more to the Beastie Boys. Why?

Anthony Alford: Well, the reason I said that is that their albums are just chock-full of samples, and maybe sometimes not actual samples, but recreations of snippets of songs. Paul's Boutique, their second album, is just all samples.

Roland Meertens: Interesting. As I said, I'm trying to get better at it, and if I have extra samples, it seems to be easier than the AI-generated samples. You want to listen to my beat one last time?

Anthony Alford:  You know it.

Roland Meertens: There we go. Yes, the bongos already killing it.

Anthony Alford: When is the next Eurovision? I think you might have a winner on your hands here.

Roland Meertens: I don't know if the Netherlands is willing to send another candidate to Eurovision after what happened this year: the Dutch candidate got disqualified. I still don't know why. I don't think that the Netherlands wants to participate anymore.

Anthony Alford: Oh, are they done?

Roland Meertens: I don't know what's up.

How Good is Generative AI? [17:36]

Anthony Alford: All right, so we're talking a little bit about generative AI. And my angle this episode is, how good is Generative AI? How do you measure how good generative AI is? So with more traditional machine learning things like discriminative or predictive AI, like classifiers, regression models, we know how to measure how good those are.

Typically, there's some test data set which is input to your model, but you also have the expected output, which we call ground truth. So, given an input, we know what the model should output. And you can compare what you actually get to the ground truth that you know you should get, and then you calculate an error metric.

So with classifiers you have things like confusion matrices, you have precision, recall, F1. These are all measuring the relationship among true positive, false positive, true negative, false negative. With generative AI, what is ground truth? What's the ground truth for, "I want a picture of an astronaut riding a horse?" Or what is ground truth for, "I want a funky beat?"
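
As a rough illustration of the error metrics Anthony mentions, here is a minimal Python sketch that computes precision, recall, and F1 from the confusion-matrix counts; the function names are illustrative, not from any specific library.

```python
# Minimal sketch of precision, recall, and F1 from a confusion matrix,
# assuming binary ground-truth and predicted labels (0/1).

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: ground truth vs. classifier output
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```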

Roland Meertens: And how do you measure it? Yes, how do you know how creative or how good this is?

Anthony Alford: Yes. Let me just ask you, how would you do it if I charged you with the task of evaluating a generative AI model?

Roland Meertens: Well, one thing you can do is try to nudge it towards something which already exists and then see if it can do that.

Anthony Alford: Yes, that's definitely… and I don't want to spoil it, but yes, that is one way to do it.

Roland Meertens: Is there some kind of deeper feature space you can compare it to?

Anthony Alford: Pretty much, yes, so we'll get into that in fact. Now, with code generation, I think it's pretty easy, right? When I'm evaluating code that I write, there's one simple test: it compiles. "It works on my box".

Roland Meertens: If you're lucky.

Evaluation Metrics for LLMs [19:31]

Anthony Alford: And if we want to be really strict, well, it passes all the unit tests. We won't be talking really about code generation, but I will mainly focus on language models like ChatGPT and text-to-image generation models like DALL-E or Midjourney. We'll start with language models. And in the case of language models, there actually are some objective evaluation metrics.

Roland Meertens: Are these things like the BLEU score or something like that?

Anthony Alford: You are right on. That is one of the things I'm going to talk about. But first I'm going to talk about what a language model does. A language model, you give it several input tokens or words, and that's the context. And then you ask it to predict the most likely next token.

And so, the way that these models are trained is you actually do know the next token, because you took some text from the internet, chopped off the last word, gave that to the model, and you want the model to predict that last word that you hid from it. The mechanics of training require that we calculate a loss function, which measures exactly this: how good the model is at predicting the next token.
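
A minimal sketch of that next-token objective, using a small open model through the Hugging Face transformers library; GPT-2 here is just a convenient stand-in, and passing labels makes the library compute the cross-entropy loss Anthony describes.

```python
# Sketch: the next-token training objective for a causal language model.
# Passing labels makes the library compute cross-entropy between the
# predicted and actual next tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The quick brown fox jumps over the lazy"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])  # labels are shifted internally
print("loss:", out.loss.item())

# The most likely next token, i.e. what the model predicts follows the context
next_id = int(out.logits[0, -1].argmax())
print("predicted next token:", tokenizer.decode(next_id))
```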

Now, you probably remember several years ago OpenAI came up with these so-called scaling laws and those are equations that predict that loss metric. Given the size of the model, size of the data set, the amount of compute to train the model, you can predict the loss that the model will achieve on a test data set.
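
For reference, a commonly cited form of such a scaling law (the Chinchilla-style fit, not necessarily the exact equations OpenAI published) predicts the test loss from the parameter count and the amount of training data:

```latex
% Chinchilla-style scaling law (illustrative form), where L is the test loss,
% N the number of model parameters, D the number of training tokens,
% and E, A, B, \alpha, \beta are fitted constants:
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```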

Roland Meertens: Is there a super strong correlation in this case between the loss and the metric they actually care about?

Anthony Alford: There is, so that's a proxy, but that's a very good point. Nobody really cares about loss. When you're using ChatGPT, you don't care about the training loss. You want it to do some task like question answering. And in fact, with question answering you can produce objective metrics, because we've been doing this forever.

For human beings in school, we can give them a test and score their ability to answer questions. So you can do multiple choice or true/false, and then it's pretty easy to measure that. In fact, this is a common way that language models are evaluated. There are several benchmarks that are basically just tests just like students are given. There's a benchmark called MMLU, which stands for Massive Multitask Language Understanding, and it's got things like elementary mathematics.

Roland Meertens: I never really liked the fact that they do mathematics in those large language models, but fair enough, it's probably part of it.

Anthony Alford: And of course, different models have different strengths, but there are other things where you see headlines: they give a language model…I think GPT-4 took the bar exam, the legal exam, and passed.

But anyway, the website Hugging Face, which is sort of a hub for all things large language models, they have a leaderboard. And you can go and look and see which language model does the best on these benchmarks. This is automated.

Roland Meertens: Who is the best?

Anthony Alford: I was looking at the open language models, and of course, it's some language model that some random person uploaded to Hugging Face. Someone named David Kim right now has the top ranked model on the open LLM leaderboard.

Roland Meertens: Congratulations to David. I really like this. You can just create your own massive language model in your own computers.

Anthony Alford: And you can submit it to the leaderboard and they run the automated tests. Like you said, yes, that is pretty neat. Now, of course, being good citizens, we wouldn't use a large language model to take our high school exams for us.

Roland Meertens: Naturally.

Anthony Alford: But we're going to use it in a more proper way: maybe to write essays or summarize documents. In this case the task doesn't really have a bright-line ground truth, right? Let's say we want ChatGPT to summarize a long document. You copy and paste the text of a document and say, "Please summarize this". The answer that you get could be correct, but how do we compare that against a ground truth?

Roland Meertens: Or do you get a lot of English teachers to rate your LLM outputs every day?

Anthony Alford: Well, maybe you will. But so you mentioned BLEU. I'm trying to say that with an accent because it's not B-L-U-E, it's B-L-E-U.

Roland Meertens: Bleu, bleu.

Anthony Alford: Which stands for bilingual evaluation understudy. And there's another one called ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation. These are two metrics that can calculate the similarity between two pieces of text. BLEU was of course created for translation checking.

Roland Meertens: Yes, that's also what I used it for in the past.

Anthony Alford: Essentially, both of them compare the overlap of the words in your generated output versus the ground truth. BLEU is more like a precision: it's the fraction of generated words that appear in the ground truth. And ROUGE has recall in the name; it's based on the number of matching n-grams between the two texts. That's a similarity. So again, that is still kind of an objective metric. There's another way to do similarity, which is to compare the meaning of two different pieces of text.
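
A toy sketch of the word-overlap idea behind these metrics; real BLEU adds n-gram clipping and a brevity penalty, and ROUGE comes in several variants, so this only shows the gist.

```python
# Simplified word-overlap scores in the spirit of BLEU (precision-like)
# and ROUGE-1 (recall-like). Real BLEU adds n-gram clipping and a
# brevity penalty; this only shows the core idea.
from collections import Counter

def overlap_scores(generated, reference):
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())      # matching words, counted with multiplicity
    precision = overlap / sum(gen.values())  # of what was generated, how much is in the reference?
    recall = overlap / sum(ref.values())     # of the reference, how much was covered?
    return precision, recall

print(overlap_scores("the cat sat on the mat", "a cat sat on a mat"))
```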

Roland Meertens: Because that's what I didn't like about BLEU: you could have a translation with exactly the same meaning and still get a massive penalty.

Anthony Alford: To measure that kind of similarity, you can use an encoder language model like BERT. BERT, if you put in some text, will give you a vector representation of the meaning, or the semantics, of that text. If you have your generated text from the language model, you put that into BERT, you get a vector that represents the meaning of that text. And you can take the ground truth, do the same thing, put it into BERT, and you get a vector. Now you can see how similar the vectors are. You can use a distance metric like cosine distance.
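
A minimal sketch of that embedding-and-cosine-distance approach, using the sentence-transformers library as one convenient way to get BERT-style vectors; the actual BERTScore metric matches token-level embeddings, so treat this as the gist rather than the metric itself.

```python
# Sketch: compare generated text to a reference by embedding both
# and measuring cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
generated = "The report says revenue grew strongly last quarter."
reference = "According to the report, revenue increased significantly in Q3."

emb = model.encode([generated, reference], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {similarity:.3f}")
```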

Roland Meertens: Yes, that makes a lot of sense. I once had this idea about, I don't know, nine years ago, and then I tried to submit it to some kind of machine learning competition or camp or something and it got rejected. And I'm now like, "Ah, so it's actually a good idea".

Evaluating Image Generation [26:18]

Anthony Alford: Yes, it's called BERTScore. You were ahead of your time. So, that's language models. What about text-to-image generation? How do you evaluate those? Because again, there's no ground truth for an astronaut riding a horse, but you mentioned something, do you remember what-

Roland Meertens: Maybe it gives you the same feeling, gives you the same vibe?

Anthony Alford: There are some automated metrics where you have an existing data set and a metric, just like we had with ground truth and the BLEU or ROUGE metrics. So there's a data set for images called COCO that came from Microsoft. It has a bunch of images with captions.

Roland Meertens: Yes, it's Common Objects in Context, right?

Anthony Alford: Exactly. When you do text to image generation, it's kind of like reverse captioning. You give it the caption and say, "Make an image that matches this caption".

Roland Meertens: Yes.

Anthony Alford: So the idea is, from the COCO dataset you take the captions and you give those to your image generator and it generates a bunch of images. And now, the ground truth, sort of, is the images from the COCO dataset. You need a measure of similarity between those images and the generated images. And there's a metric, I'm struggling with these French words, I believe it's called Fréchet Inception distance, or I'm just going to say FID.

Roland Meertens: The inception distance.

Anthony Alford: The Inception comes from a pre-trained image classifier called Inception. This is very similar to the idea of using BERT: it's a classifier, but right before that last classification layer, you have a vector that represents that image in some space, which is what you said earlier, I think.

Roland Meertens: So, the embedding should be close to the embedding of a ground truth.

Anthony Alford: But you have to do it on a collection, so you have to do it on a distribution. And the FID compares the stats. When you put your generated images in, multiple ones, you get a distribution of those activations. You do the same with the COCO dataset: you get a distribution of the activations that those images create.
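
A minimal sketch of the FID computation itself, assuming the Inception feature vectors for the real and generated image sets have already been extracted; that feature-extraction step is not shown.

```python
# Sketch: Fréchet Inception Distance between two sets of Inception
# feature vectors (shape: [num_images, feature_dim]).
import numpy as np
from scipy.linalg import sqrtm

def fid(real_features, gen_features):
    mu_r, mu_g = real_features.mean(axis=0), gen_features.mean(axis=0)
    cov_r = np.cov(real_features, rowvar=False)
    cov_g = np.cov(gen_features, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

# Toy example with random "features" just to show the call shape
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(loc=0.1, size=(500, 64))))
```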

Roland Meertens: And so, the distribution of your generated images should match the distribution of the original images. So, it's not about being able to get the exact factor of the original image?

Anthony Alford: So, that's a shortcoming. There's another metric that can be used to measure how closely the image content matches the prompt, and that's the CLIPScore. So, just like the BERTScore for a text, you can get a CLIPScore for an image. Now, I know you know all about CLIP because you gave a talk about CLIP.

Roland Meertens: I want to talk about CLIP all day.

Anthony Alford: As you know, CLIP can measure the similarity between an image and the text description of that image.

Roland Meertens: Yes.

Anthony Alford: So, to get the CLIPScore, you put your prompt into CLIP and you put the generated image into CLIP. And if CLIP says they're similar, then your generated model did a good job.
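
A minimal sketch of that idea using the CLIP model from Hugging Face transformers; the published CLIPScore metric rescales the cosine similarity, and "generated.png" is just a hypothetical output file from an image generator.

```python
# Sketch: score how well a generated image matches its prompt using CLIP.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "an astronaut riding a horse"
image = Image.open("generated.png")  # hypothetical output of an image generator

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)

print("prompt-image cosine similarity:", (img_emb @ txt_emb.T).item())
```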

Roland Meertens: But they are probably already using CLIP to guide the generation, so then you're kind of double dipping with the same model.

Anthony Alford: That's exactly right, so these metrics are not perfect. For FID you need a collection of images. CLIP, as you mentioned, has its own shortcomings. Also, supposedly it's not good at counting, which maybe explains why generative AI is bad at fingers and hands.

Okay, so now, if I asked you how you would do it… you did mention having a bunch of teachers read the essays that the AI generates.

Roland Meertens: Yes.

Anthony Alford: Well, it turns out that is a great way to evaluate generative AI content. You ask a real person what they think. I don't know much about art, but I know what I like. That's an old saying.

Roland Meertens: Yes, yes. You can put it all in a museum and see what people are actually looking at, and that is good.

Evaluation by Ranking [30:23]

Anthony Alford: Exactly. Now, someone says, "Well, that's not objective or scientific. You're asking people for their subjective opinion". There's a trick you can do with that, and I call it the optometrist trick. Better one, better two.

You show the judge the output of two different models and say, "Which one do you like better? Which one did a better job? Given this prompt, which one gave you a better output?" This is a head-to-head ranking. And again, maybe you don't get an objective score for your particular model, but you can rank a model against other models. So, this is an E.L.O., is it E.L.O. or ELO?

Roland Meertens: I always say ELO.

Anthony Alford: ELO, so it's like chess ranking, right? You see how one model does against the other and eventually you can rank them all. And in fact, this is a thing for chatbots. There's a Chatbot Arena where real people can log in and take a prompt and it's given to two different models side by side, and you can pick which one you like.
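
A minimal sketch of the Elo-style update behind this kind of head-to-head ranking; the constants are the usual chess defaults, and real leaderboards such as Chatbot Arena use more elaborate statistical variants.

```python
# Sketch: Elo-style rating update from pairwise "which output is better?" votes.
# ratings maps model name -> current rating; winner beat loser in one comparison.
def elo_update(ratings, winner, loser, k=32):
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected_win)
    ratings[loser] -= k * (1 - expected_win)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for vote in ["model_a", "model_a", "model_b"]:  # human judges' picks
    other = "model_b" if vote == "model_a" else "model_a"
    elo_update(ratings, winner=vote, loser=other)
print(ratings)
```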

Roland Meertens: Chatbot Arena sounds like a way too exciting name for the fact that people are just ranking different texts.

Anthony Alford: Well, they had to drive traffic somehow.

Roland Meertens: Yes.

Anthony Alford: But a lot of research teams for text-to-image generation do this. They take their model, they prompt it, they prompt something else like DALL-E, and they give the results to a human judge and say, "Which one is better?" You'll see that in research papers: when they say our model outperforms DALL-E, it's based on that.

Roland Meertens: Yes. Interesting. I once looked at this, because at some point there was the rumor that someone spotted a really good model in the wild and then people on the internet started talking about what model it could be, because the output was so much better when they were ranking it. And to be honest, when I checked it out, this hidden secret model was indeed really good.

Anthony Alford: Was it actually some guy in a hidden basement drawing pictures?

Roland Meertens: I don't know, maybe it was just OpenAI testing a new beta version of a model. That was what people suspected.

Anthony Alford: Well, it turns out this idea of ranking model outputs, of a human ranking model outputs has another application beyond just evaluating the models. And so, you've probably heard of RLHF.

Roland Meertens: Reinforcement learning.

Anthony Alford: From Human Feedback.

Roland Meertens: From Human Feedback.

Anthony Alford: And that's what OpenAI used to fine-tune GPT-3 into InstructGPT. And they did it again with GPT-4. And now everybody's doing this instruct tuning of their models. This is to solve the problem where you get an output from a large language model, you read it, and you say, "Well, this is really amazing and interesting, but it is not at all what I wanted you to do".

One example is we know that they'll just make stuff up. Or it'll be very long-winded and go on and on and on, and it's just not what you want. So, we say that the model's output is not aligned with your intent.

So, the solution is this RLHF. And the way that works is, you've got to create a data set, a fine-tuning data set, and the way you collect that is you give a prompt to a language model and it generates several outputs. And then you give it to a human judge and say, "Rank these".

And so now with that fine-tuning data set you can do some reinforcement learning so that the language model being fine-tuned does a better job generating text that people like.
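
A minimal sketch of the first step of that pipeline: training a reward model on human rankings with a pairwise, Bradley-Terry style loss. The DummyRewardModel and random feature tensors are placeholders just to make the example run end to end; the later RL fine-tuning step (commonly PPO) is not shown.

```python
# Sketch: the pairwise preference loss used to train a reward model in RLHF.
# DummyRewardModel stands in for a real network that scores a prompt+response.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyRewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features):                 # features: [batch, dim]
        return self.score(features).squeeze(-1)  # one scalar reward per example

reward_model = DummyRewardModel()
chosen = torch.randn(8, 16)    # features of human-preferred responses
rejected = torch.randn(8, 16)  # features of the responses ranked lower

# Maximize the score margin between preferred and rejected responses
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
print("pairwise preference loss:", loss.item())
```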

Roland Meertens: Yes, they could also sell books online and see how many stars they get.

Anthony Alford: That sounds like a great way to do reinforcement learning.

Roland Meertens: If it's less than two stars you get rid of the model.

Anthony Alford: Reward function is stars. Well, of course, the whole reason we want these models to begin with is that we don't want to do this kind of grunt work, things like ranking LLM responses, so we're going to automate that. We're already kind of doing it with BERTScore and CLIPScore.

Well, BERT's a language model. What if you just gave some generated text to GPT-4 and said, "Evaluate this"? And so, researchers are doing this, especially with these smaller language models: they're taking the output of those and having GPT-4 evaluate it.
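
A minimal sketch of that LLM-as-judge pattern using the OpenAI Python client; the model name, grading rubric, and example summary are illustrative choices, not a standard benchmark.

```python
# Sketch: "LLM as judge" - asking a stronger model to grade another model's output.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

candidate_summary = "The article say AI books is growing fast on web stores."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You grade text for fluency and faithfulness on a 1-10 scale. Reply with a number and one sentence of justification."},
        {"role": "user", "content": f"Grade this summary: {candidate_summary}"},
    ],
)
print(response.choices[0].message.content)
```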

Roland Meertens: But it feels like the butcher is evaluating his own meat, which is probably not an English saying.

Anthony Alford: No, I don't think I've heard that one. But if I may allude to the podcast we did with Michael, it's robots versus robots now, it's robots all the way down. And so, I will go back to what I said earlier at the beginning of the episode. Where we really want to be, I think, is that the AI is helping us, is doing what we want. And so ultimately, I think the best evaluation is human judges. You get an output and you like it. You get a funky beat that you can dance to.

Roland Meertens: Hopefully. If you're lucky you get the cat song.

Anthony Alford: Well, that's all my content, Roland.

Roland Meertens: I really enjoyed it.

Anthony Alford: And I enjoyed learning about the cat song.

Words of Wisdom [35:55]

Roland Meertens: All right. Shall we do words of wisdom? What did we learn in this podcast? And did you learn anything yourself recently?

Anthony Alford: I did have a fun fact. I did some prep.

Roland Meertens: Please tell me.

Anthony Alford: The BBC had a headline about whether animals are conscious, and in there some researcher claims that bees can count, recognize human faces and learn how to use tools.

Roland Meertens: Oh, interesting.

Anthony Alford: I think bees probably have far fewer neurons than GPT-4, so there's hope.

Roland Meertens: How far can a bee count?

Anthony Alford: That was not in the article.

Roland Meertens: Because I can tell you that a crow can count to, I believe it's seven.

Anthony Alford: Okay, well, bees have six legs, so maybe they can count to six.

Roland Meertens: The way that they evaluate how far a crow can count is that they put up a hut just outside of a crow's nest, and then people walk into it. So, the crow stays at a distance, and if more than seven people go in, the crow will stay away. Once they start walking out, after seven walk-outs the crow apparently loses count and goes back to its nest.

Anthony Alford: Oh, what a great job that would be.

Roland Meertens: To test this?

Anthony Alford: Yes.

Roland Meertens: Ask a lot of animals to count.

Anthony Alford: Well, I mean…Clever Hans.

Roland Meertens: I love Clever Hans.

Anthony Alford: It's probably the biggest challenge. Was he Dutch, Clever Hans?

Roland Meertens: No, it's a German horse.

Anthony Alford: Of course.

Roland Meertens: Also, for those who don't know who Clever Hans is: Clever Hans was a horse which could apparently do a lot of tricks and calculations, but it turned out that the owner, knowingly or unknowingly, made a certain movement as soon as the horse should stop tapping. So, the horse was just really good at interpreting its owner.

Anthony Alford: Which is actually pretty clever anyway.

Roland Meertens: But I always tell people that sometimes someone comes to you with the output of a large language model and they say, "Look at how good it is". And it's just a person interpreting the output in a really interesting way, rather than the large language model being right.

Anthony Alford: I think so. So, are those words of wisdom?

Roland Meertens: Well, I watched the series Masters of the Air, and I learned more about the Norden bombsight. Do you know about that?

Anthony Alford: Yes, I've heard of it.

Roland Meertens: I find it so fascinating that in the Second World War they had a mechanical computer which could control the airplane and mechanically do all the calculations for when to release the bomb to get the correct trajectory to target a certain place.

Anthony Alford: It's still unclear how effective it was.

Roland Meertens: I don't know about that. But I do know that they were very secretive about it. You also see that in the series: as soon as they have to get out of the plane when it's crashing, they first take the Norden bombsight to keep it out of enemy hands, so the enemy doesn't get access to this mechanical computer which can do all these calculations and control an airplane, et cetera, et cetera.

Anthony Alford: That's true. They definitely thought it was a wonder weapon.

Roland Meertens: And otherwise you have to do all the calculations by hand, right?

Anthony Alford: Right.

Roland Meertens: So, I think there you can improve something.

Anthony Alford: Definitely.

Roland Meertens: If only they had an arena where they could test these control algorithms.

Anthony Alford: We're getting pretty dark.

Roland Meertens: Yes.

Anthony Alford: I thought it was funny that the two text similarity metrics are both French words for colors.

Roland Meertens: Yes.

Anthony Alford: BLEU came first. I suspect the person who came up with ROUGE did that on purpose.

Roland Meertens: Yes, I think there is already a large language model which can take the words you want to get as output and then generate what the acronym actually stands for?

Anthony Alford: I've done that a couple of times with ChatGPT.

Roland Meertens: How well did it work?

Anthony Alford: It's as good as ... I used it.

Roland Meertens: Nice, yes. Thanks so much for listening to the start of our second season. I'm very excited to be back.

Anthony Alford: Me too.

Roland Meertens: Yes, so if you enjoyed this podcast, please tell your friends about it. I think normally people ask you to rate a podcast, but I honestly don't know where you can find this function in the most popular platforms. But if you are listening and you happen to know where to find it, please try and search for it. Because I think that even for Spotify, you can't rate podcasts on the Mac app, you can only do it on your mobile phone.

So, telling your friends about it and telling them what podcast to listen to is the most effective way. I would say that you can follow us on loads of media platforms, but as we can see, media platforms are changing so rapidly that I don't know if this is true anymore when this gets recorded, or when this gets aired.

Anthony Alford: I just hope our rating is higher than the ChatGPT-generated podcasts.

Roland Meertens: Yes, that's also a good question. I don't know if there's already ChatGPT-generated podcasts, there probably are. And you can always find us back on infoq.com. There we publish interesting news. So yes, like and subscribe. Thank you very much for listening, and thank you very much, Anthony.

Anthony Alford: Thank you. See you next time.

Roland Meertens: See you next time.

The other thing which is interesting, by the way, about the books is that on this Dutch book website, if you search for a biography of Oppenheimer, the search results only contain AI-generated biographies. You don't even find the actual biography anymore.

Anthony Alford: Wow.

Roland Meertens: So there's a really good biography of Oppenheimer. I didn't read it.

Anthony Alford: Prometheus…American Prometheus?

Roland Meertens: Prometheus, yes.

Anthony Alford: I read that. Supposedly that's what the movie was based on, or at least was an influence. It was good.

Roland Meertens: Yes. But if you search for the biography of Oppenheimer, you get a lot of books which only cost three, four, five dollars or something like that. They don't have any ratings here. And the other thing which is interesting is that the people who buy the books and leave ratings don't tend to say the book was AI-generated. They just tend to say the writing was very repetitive and didn't go very deep.

So, the people who buy the books don't even seem to understand the difference between a really good book and a bad book. They don't distinguish; it passes the Turing test in some way, in that they can't distinguish between bad human writing and AI-generated writing. So, I suspect that nobody is asking for their money back.

Anthony Alford: Crazy.

Roland Meertens: Can you guess what the top categories are for these books?

Anthony Alford: How to do AI, probably.

Roland Meertens: No, it's worse.

Anthony Alford: Oh. Does it rhyme with corn?

Roland Meertens: No, no, no, no. It's better than that. No, so most of the books they found were about health, self-help, management, personal development, and language. There were a set of cookbooks, computer/IT books, and sports books. But can you imagine buying a book about health which is AI-generated?

Anthony Alford: Oh, man.

Roland Meertens: Yes. This is not the future I signed up for.
