#84 LAURA RUIS – Large language models are not zero-shot communicators [NEURIPS UNPLUGGED]

Hello folks, I was in New Orleans last week and I had the pleasure of interviewing Laura Ruis, the primary author on this paper, Large Language Models Are Not Zero-Shot Communicators. This paper explores the ability of language models to resolve implicature, which, from a machine learning audience's point of view, you might think of as being some kind of extrapolation or even abstractive reasoning. There's an example of this that we can try: Esther asked, "Can you come to my party on Friday?" and Juan responded, "I've got to work", which means no. Part of the reason I wanted to do this quick intro is that, since this interview, OpenAI has released ChatGPT, which is pretty impressive actually. So we can come in here and type something along the lines of: Esther asked, "Can you come to the party on Friday?" Juan responded, "I have to work." Can Juan come to the party? It looks like it has failed: it says it's not possible to say whether Juan can come to the party or not because we don't have enough information, and that Juan may or may not be able to come depending on his work schedule and other factors. So this is an example of a failed implicature. But anyway, if we come over to Laura's Twitter, she posted a little thread the other day saying that loads of people have been sending her implicatures which they used as examples in the paper, and apparently ChatGPT does understand some of them, which she's very happy about, so she wanted to write a short thread about it. She said that before they started writing the paper, she would try lots of implicatures that she came up with on DaVinci 2 with different wordings, with moderate success: some were always solved and some only half of the time depending on the wording, meaning random performance, since the test is binary, which is to say a yes or a no. That's why they decided to do a systematic test to figure out how good it actually was and how much it depended on the wording of the prompt. A few months later they had the answer: it was okay, but not close to humans. And okay means that on DaVinci 2 and 3, zero-shot performance on implicature is roughly 70%. Most of the other models fail even with few-shot in-context prompting. So anyway, she said that she gets that people are excited that ChatGPT is doing pragmatic inferences, but she felt the same with DaVinci 2. It's all anecdotal, she says, but a more systematic test shows a significant gap with humans nonetheless; it's the same for DaVinci 2 and presumably the same for ChatGPT. She says that once this implicature dataset gets solved, and she has no doubt that it will get solved relatively soon, since fine-tuning with human feedback helps a lot, they might have some baseline pragmatics in their models, and that's when it will get really exciting. She says that she's personally blown away by ChatGPT's capabilities: it's absolutely incredible at explaining things, compositional generalisation of concepts, simulating a VM (I'm not sure what VM means), coherence, creativity, writing essays, poems, and more. She said that the pragmatic language they studied is part of a type of casual language that we use in conversation and that might emerge from social interactions. She's personally thinking about why human feedback helps so much, and whether interactivity and social pressures might help even more. Anyway, enjoy the interview. Hi. It's lovely to meet you. Nice to meet you too.
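(For readers who want to try this kind of check programmatically, here is a minimal sketch of the binary yes/no implicature probe described above. The `query_model` wrapper is a hypothetical placeholder for whichever LLM completion API you use, and the prompt wording is illustrative rather than the paper's exact template.)

```python
# Minimal sketch of a binary implicature probe, assuming a hypothetical
# `query_model` wrapper around whichever LLM completion API you use.
# The prompt wording is illustrative, not the paper's exact template.

ZERO_SHOT_PROMPT = (
    "In the following exchange, the response has a meaning beyond the literal "
    "words, and that meaning is either yes or no.\n\n"
    'Esther asked: "Can you come to my party on Friday?"\n'
    'Juan responded: "I have to work."\n\n'
    "Does Juan's response mean yes or no? Answer with a single word."
)

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call (e.g. a completion endpoint)."""
    return "No."  # placeholder reply so the sketch runs end to end

def score_binary_answer(reply: str, expected: str) -> bool:
    """Count the answer as correct if it starts with the expected yes/no label."""
    return reply.strip().lower().startswith(expected.lower())

if __name__ == "__main__":
    reply = query_model(ZERO_SHOT_PROMPT)
    print(reply, "->", "correct" if score_binary_answer(reply, "no") else "incorrect")
```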
So I was speaking with Andrew Lampinen yesterday and he really highly recommended your paper. I looked it up, it's called Large Language Models Are Not Zero-Shot Communicators, and I also recognised Stella Biderman and Sara Hooker, of course. Sara's an absolute legend. Now, you lead in the paper by saying humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning no. You call this implicature, but I suppose I would think of it as some kind of extrapolation, being able to reuse knowledge that we have about the world in a different situation. Could you talk us through the paper? Thank you. That was a great introduction to the paper. Indeed, implicature is the technical term that we use for this example that you gave, and extrapolation is a sensible way to describe it. What we do in this paper is show that large language models are not really good at this aspect of communication, and we think it's a very important aspect of communication. The title says large language models are not zero-shot communicators, and what we mean by that is that to be a communicator you have to infer the meaning of utterances not only from the semantics, not only from how words combine into some kind of meaning, but by drawing on shared knowledge or shared experience of the world. That's what we look at in this paper, and what we find is that large language models are pretty bad at this. Specifically, we group them into different groups. We have base large language models like OPT and BLOOM that are just trained on next-word prediction, and we also have instructable models like Flan-T5, T0 or the instructable models by OpenAI. All models perform really badly, closer to random than to humans, but OpenAI's instructable models show much more promise; they're much better at it. Interesting. Okay, so now the zero-shot thing is very interesting. We take these models and it's kind of like self-supervised learning: we train them on loads of data from the internet. And you're saying that zero-shot is when we don't really give much information in the prompt, and there's a relationship between how big the model is and how much in-context learning we give to the model in the prompt. Yeah, that's true. The zero-shot case that we tested is: we give the model a short instruction saying, in the following exchange someone gives a response that has some meaning beyond the utterance, that meaning is either yes or no, resolve it. Then we give an example, and we evaluate it based on whether it can choose yes or no. So that's the zero-shot case, and for humans we don't give any instructions at all; we just say resolve this to a yes or a no. Okay, so is it the case that large language models in the zero-shot setting, almost irrespective of their size and irrespective of this human feedback alignment, just don't perform very well at this implicature task at all? The instructable models by OpenAI get non-trivial performance. The models like OPT and BLOOM, those kinds of base models, really can't do this task very well at all; they get about 10% above random. But OpenAI's models are around 70% at zero-shot. Interesting. So did you do some work looking at, okay, let's try some in-context learning, does that improve the implicature performance? Definitely. Yeah, although it's unclear, right,
whether zero-shot is a fair comparison to humans for these models; humans are primed in different ways. So we also wanted to try few-shot in-context learning. Personally, I thought in-context learning wouldn't help much in this case, because each implicature requires a completely novel type of inference. But in fact, we show that OpenAI's models are the only group of models that really benefits from this a lot. They can get up to 80% performance with roughly five examples, after which adding more than five examples kind of plateaus. There's still a significant gap with humans, but it's a great improvement. Yeah, that's fascinating. So can you give an example: if we were doing some in-context learning, let's say with DaVinci 2, what would that prompt look like? I don't exactly remember the wording of the prompts, but it would be something like: "The following are examples of the task", then a bunch of implicatures that are already resolved to yes or no, then the original instruct prompt that says resolve the following sentence to yes or no, and then the actual example. These in-context examples are all taken from a development set. Okay, okay. So can you tell us a little bit about how this reinforcement learning from human preferences works on language models? Reinforcement learning from human preferences is a method to fine-tune a base pretrained model. The base pretrained models are OPT and BLOOM, for example; that's part of that group, and they are just trained on next-word prediction. But they are not really aligned. There's this alignment problem where they're trained on next-word prediction, and that's not really what we are asking them to do. So then we do reinforcement learning from human feedback: what we further do, I mean, not we, unfortunately, other people do, is take some kind of human preferences from somewhere. For example, humans are shown prompts and completions by models and they say this one is better than that one: this completion for this prompt is better than that one. From that we get a sort of ranking by preference, and we can learn a reward model on those preferences with an interesting trick that was published in 2017. And through this reward model we can bootstrap the preferences from humans into the base pretrained model by fine-tuning it with regular RL against this reward model. Yes, how interesting. I was speaking with Sreejan Kumar, who won one of the outstanding paper awards at NeurIPS, and he's got this work on making models more anthropomorphic. We have these priors to help us understand the world, and he came up with a framework for importing these priors from language encodings into, let's say, a discrete program synthesis model. But I guess what I'm saying is that there's something really interesting going on with in-context learning. It's almost like we're giving the model the priors to extrapolate or to do something useful in this particular situation. Yeah, that's really interesting. I don't know the paper, I should check it out, but the way my thinking has been shaped this week, also by Andrew Lampinen, who wrote an interesting paper on comparing models and humans, is that it seems that for in-context learning on this specific task, implicatures, it's not really that the models learn how to use their shared experience from the in-context examples.
They're primed for the task by the few-shot examples in the context, and I think that's actually what's happening here. Like, if you test the model zero-shot, why would we expect it to do this task properly? There's no motivation or anything like that, but if you prime it with in-context examples, it does better. And that would also explain why it doesn't help to add more than five examples, because it's not using the inherent information in the examples; it's just being primed for this specific task. That's really interesting. Sara has done lots of work on interpretability in machine learning models, and one thing I wrestle with a lot is whether we should try and get models to think the way humans do. You can come at it from an intelligibility and interpretability point of view, but you can also come at it from a generalisation point of view. Like, maybe we do symbolic generalisation over cognitive priors and that's how we understand the world. But there are people who just say large language models are just a different mode of understanding and we shouldn't try and make them like us. What would you think? It's a good question. I'm not really an expert on interpretability; I always come at it from a very anthropocentric view. I would love them to be more like humans, because that would make them interesting subjects to study and better to communicate with. But at the same time, you can take the opposite view, and I think Stella, a co-author on this paper, often says you're making a category error, attributing something to these models that they don't have, knowledge, those kinds of things. So it might also very well be that we're trying to look for pragmatic or semantic understanding in these models, but that's just not how you should think about it. And I completely forgot to ask you, since some of the audience don't know about natural language understanding and linguistics and so on, what is pragmatics? Yeah, that's a good question. Pragmatics is an aspect of language, or of the way we study language, that doesn't really look at the syntax or the semantics. Those areas of linguistics look at, for example, what a word means and how you combine words into novel meanings; they look at how, when someone understands the utterance "John loves Mary", they also understand the utterance "Mary loves John". Pragmatics goes beyond that. It looks at how context and our shared experience really influence meaning. So usually the meaning determined by pragmatic inference is not directly part of the context window; you really have to tap into your prior knowledge. Right, yeah. So I'm a fan of Montague as well. It's almost like we have the semantic potential, and then we have pragmatics bringing in some additional context. Yeah, yeah, exactly. Okay. And that's a really great example, that extrapolative example about Mary loves John, because I think of that as being a symbolic generalisation. How could a large language model realistically do that kind of generalisation? Symbolic generalisation? Yes. Oh, that's a big one. I don't know. I really, really don't know.
In my research journey I kind of come from studying compositionality in language, which is really more the type of thing that we're talking about now, and looking at more neuro-symbolic approaches or stronger inductive biases. And now, you know, these large language models have really shown us that there is an insane amount of compositional generalisation going on without any inductive bias for that. Yeah. And ChatGPT kind of shows us that with all these examples on Twitter, right? You give it two novel concepts and it combines them beautifully into some kind of story. But yeah, to go back to your question, how can they do it? I don't know. Maybe scale will get us there, to the extent that humans are also imperfect symbolic reasoners. Again, to mention Andrew Lampinen, he did a great paper on symbolic behaviour: it's not really a discrete "I can do symbolic processing" versus "I cannot do symbolic processing", it's more of a scale. That's kind of shaped my thinking as well. I think it's a scale, and large language models are pretty far along that scale. They can do very interesting compositional generalisation and sort of symbolic behaviour, but they fail in catastrophic ways as well. Again, an example that I think comes from Gary Marcus is when you ask ChatGPT how horses ride cowboys and it just writes a whole story about how a horse rides a cowboy, even though that doesn't work. Yeah, it's so interesting, because I think it's easy to think of large language models in the binary. So Marcus says they're bloviators and Bender says they're stochastic parrots, and then you have the folks who think that it's emergent, you know, reasoning and symbolic generalisation. Yeah. I was a skeptic and I just can't ignore the evidence; they really are doing amazing things now. Yeah. And you were just speaking to Lampinen, and it's a similar thing with this idea of symbolic generalisation: it might not be a binary, right? Yeah, exactly. It might be a graded competency. Yeah. And humans also fail in certain cases. So, on this in-context learning, because that's something that has interested me: the first version of GPT-3 was zero-shot, we didn't really know how to prompt it, and it looked like a bloviator. We then went on this intellectual journey and we discovered prompt engineering, scratchpads, chain-of-thought reasoning. I spoke with Hattie Zhou the other day and she's got this kind of algorithmic prompting for in-context learning, and it's just remarkable what's going on there. So do you have any intuition? Is it like the prompt is some kind of a program interpreter or something? My intuition is rather that the prompt, I don't know how to formalise this intuition, but I guess that's why it's an intuition, that the prompt kind of primes the model and puts it into a sort of area of its weight space where it can better answer the actual question that's asked. And I think one thing that points towards this is that there is also some research coming out where they permute the labels in the in-context examples and show that the performance is similarly good, or they use completely random labels in the in-context examples and the performance is still pretty good. But there's also other work that shows that that doesn't always work; sometimes you do need actual labels.
So yeah, again, the answer is basically I don't know, but my intuition is rather that the model is really primed for the task. There's also another great way of viewing this that I read on LessWrong at some point, I don't know how to attribute it because I forgot the author's name, but it's the idea that these models are good at simulating anything, so you have to sort of prime them, to let them know what they're simulating right now. Yeah, it's weird, isn't it, because we have an anthropocentric view of the world, we're agentive, individual agents, and a language model is everything at once. It's almost like you need to give it a trajectory just to get it to go somewhere interesting. So with this in mind, we really want to make progress in natural language understanding; what do you think are the steps we need to take to robustify these language models? Yeah, that's a good question. Personally, coming from this pragmatics paper, I think pragmatics is one area where they can make huge strides. Even though they have semantic failure modes, they're really impressive at compositional generalisation, while pragmatics might be something that they're simply not trained for currently, and one very low-hanging fruit is the RLHF that we talked about. I think that clearly improves things, and intuitively it makes sense: in the InstructGPT paper you see that they ask the human labellers to really infer the human's intent in the prompts, and that's very reminiscent of implicature. But then on a broader scale, I think some kind of embodiment or interactivity might be really important. Pragmatic inference is really a social skill that we have; there are a lot of pragmatic pressures that you encounter while just acting in the world, navigating communication and navigating a lot of things. So I'm currently trying to look at a setup in reinforcement learning where we can see when pragmatic reasoning emerges, and I don't know how to consolidate that fully with large language models yet, but I think interactivity and social navigation are important. I'm really fascinated by the embedded tradition in cognitive science and I suspect you are as well, a little bit. How do you contrast what is essentially the representationalist view, where everything is in this big monolithic model, with some kind of relational view where we're using essentially the world as its own representation? Yeah, again, I don't know. To what extent is it also possible to express everything in just the representationalist view, where you have an internal world model? I don't know to what extent you really need an external world to learn, but intuitively it seems like that might be very important, and intuitively it also seems like the behaviours that can emerge are really limited by the world the model is acting in. Our language models only see text, and there are basic things they simply cannot learn, even though they have surprised us a lot. So I think, I don't know, it's easy to think that it's really important to have some kind of external world to interact with, but, you know, I'm happy people are working on scaling, and I'm not saying some type of AGI, whatever that means, might not emerge from simply scaling up basically what we're doing right now.
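(An aside for readers unfamiliar with the reward-model step Laura described earlier: below is a rough sketch of learning a reward model from pairwise human preferences, the "interesting trick published in 2017". This is my own toy illustration, with random vectors standing in for encoded prompt/completion pairs, not the exact setup of any specific paper.)

```python
# Rough sketch of preference-based reward modelling: given pairs where humans
# preferred one completion over another, train a scalar reward model with a
# pairwise (Bradley-Terry style) loss. Toy embeddings stand in for a real
# language-model encoder; this is an illustration, not a production recipe.

import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy reward model: maps a pooled text embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected): pushes preferred completions above rejected ones."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random vectors stand in for embeddings of (prompt, completion) pairs.
model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen = torch.randn(8, 128)    # embeddings of human-preferred completions
rejected = torch.randn(8, 128)  # embeddings of dispreferred completions

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.3f}")
```

The resulting reward model is then used as the training signal when fine-tuning the base pretrained model with RL, as described in the exchange above.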
Amazing, and are there any other parts of your paper that we haven't spoken about? Yeah, one thing that we found pretty interesting is that even though these OpenAI models can get really high performance, close to humans, within about 6%, and the raw number won't tell you much without the details, it's still a significant gap, though really close; I don't know whether a human could figure out whether it's a model or a human in that case. But when we drill down and make a taxonomy of the examples in this dataset, we find that they are mostly benefiting from the simple examples, where not a lot of context is needed. One example of such an implicature is: if you ask me how many people came to your party and I say "some people came", it's really the conventional meaning of the word "some" that tells you I meant not all people came. It's still an implicature, but it's a very common one. If we isolate those and look specifically at examples that are really context-dependent, like "Are you coming to the OpenAI party tonight?" "I have food poisoning", those need much more context to be resolved, and then the performance decreases again; there is roughly a 9% gap even for the best model, and all the other base models and instructable models like Flan-T5 then again completely fail on those kinds of examples. Fascinating. I'm really interested in this idea that understanding is a complex phenomenon, like the parable of the blind men and the elephant: we create all of these metrics and the metrics exclude most of the truth, and the metrics for pragmatics presumably are in some sense even more complicated than the metrics that we already use in natural language understanding. Is that going to be a serious problem for us in trying to encapsulate how well a model understands? Do you mean that we're sort of giving it a test that can't reach all of it? Well, I suppose one way of looking at it is that in this particular test we've come up with lots of examples of pragmatic inference, if you like, and what we're doing is taking a very complex phenomenon and putting pins through it, little slices through it at different angles, and then we've got this shortcut problem: if we optimise on any one of those slices we might be excluding everything else that's important. So is this a general problem in natural language understanding? It seems like you're getting at evaluation to some extent, right? Yeah, I think evaluation is the single most difficult thing in NLP. This is just a benchmark to give us some intuition as to what the current failure modes of these models might be, and if this benchmark is at some point passed by a model, that in and of itself, without trying to diminish my own paper, doesn't tell us much. There's much more to be done: we need more and different benchmarks, we need human testing, a sort of Turing-test style maybe. I think this is the most interesting problem in NLP, how to properly evaluate. Interesting. And do you take an interest in fairness and bias in the models as well? I'm very interested in it, but from a sort of spectator view; I haven't worked on it at all. Okay, yeah, because that's presumably a massive challenge. Yeah, definitely.
Amazing. So for the final question, what are you most excited about in your research trajectory over the next year or so? Well, I definitely just feel super excited to be working on stuff like this currently, given the capabilities these models show; they're absolutely amazing, and I love seeing how people interact with them. The creativity of people is really needed to get some kind of interesting response out of these models, and also the creativity of people is needed to find the failure modes. So what I'm most excited about now, personally, for my own research journey, is really trying to look at an interactive setup and see when pragmatic inferences might emerge, in what kinds of environments and under what kinds of pressures, and how we can translate that back to getting language models to be zero-shot communicators. Amazing, and where can people find out more about you? They can follow me on Twitter, it's first name last name, and I have a website, also first name last name dot com. Amazing. Laura, thank you so much, it's a pleasure to meet you. Thank you for having me. Amazing, cool.
