AI Trends 2023: Natural Language Processing – ChatGPT, GPT-4 and Cutting-Edge Research with Sameer Singh

All right, everyone. Welcome to our AI Trends 2023 series. Each year, we invite friends of the show to join us to recap key developments of the year and anticipate future advances in the most interesting subfields of AI. Today, we're joined by Sameer Singh. Sameer is an associate professor in the Department of Computer Science at UC Irvine and a fellow at the Allen Institute for Artificial Intelligence (AI2), and he's here to talk through some of the key research developments in NLP. Of course, before we get going, take a moment to hit that subscribe button wherever you're listening to today's show. You can also follow us on TikTok and Instagram at @twimlai for highlights from every episode. All right, let's jump in. Sameer, welcome back to the podcast and our Trends series.

Yeah, thank you for having me, Sam. It's great to be back.

I'm super excited to have you back. We were joking a little bit before we got rolling that we've picked big years to have you on. The last one was 2020, right in the wake of GPT-3, a big year. And of course, this has been a huge year for NLP with the relatively recent release of ChatGPT.

Yeah, it's always kind of crazy when you have these big changes happening in a year where research is still going on in parallel, people are exploring research questions, and a lot of those either become obsolete or have to be revisited in the middle of the year. And this year especially, the change came much closer to the end of the year. So looking back at the year, the first thing to think about is that transition: which ideas will still persist in what we're doing.

Yeah, great point. ChatGPT happened right at the end of the year. Do you think we'd have the same sense that this was a huge year in NLP if it wasn't for that late-year release of ChatGPT?

Oh, definitely. I think this year has been really impressive, I would say even bigger, even if you take out ChatGPT. Overall, this year has been really big for NLP, even compared to the era of GPT-3. I feel like it took us a while to come to terms with what these large language models are capable of, what they clearly fail at and what they are good at, and to try to build better tooling around them, build better support systems around them. So yeah, I think this year has been good even if you don't count ChatGPT.

Yeah, awesome. Well, we're going to dig into ChatGPT in a fair amount of detail, as well as some of the other advances you just hinted at. But before we do, I'd love to have you take a few minutes to introduce yourself to our audience, with a focus on your research focus and what your interests are.

Cool. Yeah, so I've been working in NLP for a long time now, but my focus has mostly been on what happens when these language models, or machine learning in general, get interfaced with real users, and what needs arise there. So a lot of my work has been in explanations and interpretability, but also in robustness, both from an adversarial perspective and from an out-of-domain generalization perspective. And also in evaluation: how do we know whether the models are doing well, how well are they doing, and in general, how can we understand and predict when the models will work and when they will not work.
And I'm imagining that the advent of large language models, and the dominance of that approach to NLP modeling, has certainly changed the tools and the approach that you take. Has it changed the fundamental way that you approach the problem?

I'd say yes and no. It has made a lot of my work obsolete, in the sense that we were doing a really good job of finding fundamental shortcomings in a lot of these language models, and it turned out a lot of them go away when you have a lot more data or a much larger size. And the other observations and insights we had, not all of them have persisted either. But the other differentiation in our work was always being somewhat model-agnostic, taking a black-box approach to the model rather than looking inside at what's going on. And that is something you can use in this world where you only have access to an API, so a lot of that work can still carry over. So yeah, it's been a mix, but it's been exciting to continue doing it.

So you've identified some themes that, from your purview, have been some of the key topic areas and research directions that have emerged in the field over the past year. Let's start there, and maybe before we dive into any of the individual items, what's your take on 2022 broadly and some of the areas that you are most excited about in the year?

So broadly speaking, and we'll dig deeper into a bunch of these topics, I think the importance of data, and of looking at what might be in the pre-training data, has been brought back into focus in a way that in earlier years we were a lot more agnostic about: what the model was being trained on didn't matter much, and more data was just better. This year, there's been a lot more thinking about what goes into the models, and also about ways to use the models, not just by prompting them with a simple thing, but trying to get them to reason, trying to get them to break down the problem into pieces, and trying to evaluate how much the language models can do that. And that, I think, is key when you start thinking about taking language models to higher-level decision making or higher-level reasoning.

Awesome. What's the first area you'd like to dig into?

Let's actually start with chain-of-thought prompting. This is work coming out of Google that came out earlier this year, and I guess the easiest way to summarize it is to say: let's think step by step. The idea here is to have the model not just generate the answer directly, but to have it go through the reasoning process and then arrive at the answer. This ended up being quite an effective method for getting the model to do a lot of things, especially when it comes to mathematical reasoning and problems you can break down into a bunch of steps. Chain-of-thought prompting did extremely well compared to what we had before. And part of the difference, I guess, is that you're not just prompting with questions and answers, but with something much more detailed: the prompt itself has a bunch of examples of breaking the reasoning down, and then the model is able to walk through that reasoning and get to the answer.
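To make that concrete, here is a minimal sketch of what a few-shot chain-of-thought prompt can look like. The exemplar wording and the helper functions are illustrative, not taken from the Google paper, and `call_llm` is a placeholder you would wire up to whatever completion API you use.

```python
# Minimal sketch of few-shot chain-of-thought prompting (illustrative only).

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM completion call; plug in a real client to try it."""
    raise NotImplementedError("wire this up to your favorite LLM API")

# An exemplar that shows the reasoning steps, not just the final answer.
COT_EXEMPLAR = """\
Q: A box holds 4 pens and Sam has 3 boxes. How many pens does Sam have?
A: Each box holds 4 pens. There are 3 boxes, so 3 times 4 is 12. The answer is 12.
"""

def cot_prompt(question: str) -> str:
    # Prepend the worked exemplar so the model imitates the step-by-step format.
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

if __name__ == "__main__":
    prompt = cot_prompt("A train has 8 cars with 12 seats each. How many seats are there in total?")
    print(prompt)
    # response = call_llm(prompt)  # the model then continues with its own reasoning and answer
```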
And in that work, is the idea that the user of the model should break the prompts down into more detail, or that the model should learn how to show its work and, given a coarse-grained prompt, break the prompt down itself?

I think the initial paper focused on the user providing a few examples of this breakdown. So if you're saying, here's a mathematical word problem: you have two apples and then somebody gives you double that, how many apples do you have? You break it down into: double means times two, and two times two is four. This is a very simple example, but giving an example or two of breaking the problem down like this can be quite powerful. One of the key insights here, and we can talk about other papers that showed related things later, is that this is very much an emergent property that seems to exist for really large language models. If you have smaller language models, it's kind of difficult to get them to do this kind of thing. So that's also been exciting to see.

And the results there, did you find them surprising? Was it counterintuitive that that would work?

I think how well it worked surprised everyone, because it's a very simple idea to just break things down a little bit. Everybody kind of assumed that the transformers were either doing this internally or completely not doing it, right? Showing that if you actually write out a bunch of examples, these transformer models are able to do this to the extent that they are was quite surprising, and the gains were quite impressive.

Can you talk a little bit about the evaluation of that method?

Yes, the evaluation was mostly focused on mathematical word problems. So there's the GSM8K dataset, and then there's the MAWPS math word problems dataset as well. The first evaluation was mostly looking at how well it can reason through some of those, and it was much, much better than anything we had before. And then they had some evaluations on symbolic reasoning as well. So if you give it tasks like finding a character inside a long string, say, what is the fifth character, you can break it down into a bunch of steps and give it a few examples. If you don't give it examples of how to break it down, the models are very bad at this.

And have you seen any work that looks to extend this beyond the math and symbolic domain?

I'll talk a little bit about related ideas in question answering a little later. But there is one related work that I like, called algorithmic prompting. This is stuff that came out of Google Brain as well; a lot of this stuff is coming out of Google Brain because you need really large language models to work with it, even bigger than GPT-3, for example. This algorithmic prompting paper was kind of interesting: they had essentially the same idea as chain of thought, except that they go really detailed into what those reasoning steps would be. They mostly focus on things that can be described as an algorithm, rather than just breaking the problem into a few pieces. So you can say things like: if I had to add 12 plus 24, right?
How would you do that? They literally break it down into digits: you take the ones place, that's two in one case and four in the other, you add them up, you get six, and there is no carry, so the carry is zero. That's the first step. The second step is to look at the tens place: it's one and two, add them up, that's three, the carry is zero, so it's just three, and then the answer is 36, right? So it's this very detailed breakdown, which looks extremely detailed. But what was really impressive to me about that paper is they showed that you can give examples of really low-digit operations, say two- or three-digit operations for addition or multiplication or any of these things, and at test time, firstly, even on two- or three-digit problems it was much, much more accurate compared to regular chain of thought, like going from 80% for chain of thought to something close to 100%, and I'm kind of making up numbers there.

And this is relative to asking the model to solve the same problem without any intermediate steps?

No, without the intermediate steps it's even worse, right? This is comparing to asking the model to solve, say, 12 plus 24 with chain of thought. I don't know exactly what the chain of thought would be, but it would be something at a much coarser level of detail. So when you give this detail, the models are more accurate, which is not so surprising. What was surprising was that they kept increasing the size of the numbers at test time, adding more and more digits, and even up to 18-digit numbers the model is able to do these operations much, much more accurately, even though the prompt examples only involved two- or three-digit numbers.

And does this type of work answer definitively whether this is already happening inside the model, or whether there's some other effect? In a sense, it's really counterintuitive that it would work at all. There are no registers inside the model that are tracking digits, you know, the ones place and the tens place. Why should that work?

Yeah, so I think people are still trying to come to terms with why this kind of reasoning works: is it something in the pre-training data, is it something in the model? There's been some interesting work there. But I think the tricky thing here is that you're making all of these steps explicit. You're not relying on the model to keep these bits somewhere latent in its memory; you're making them explicit, and of course it's attending to all of that, so the chances of it wandering off to a wrong place are much lower. Scratchpad and a bunch of other papers had similar ideas: let's give the model some space to think about things. So it's possible that this is just letting the model actually think things through, that it's somehow more computation the model is getting. And there have been some papers showing that yeah, that might be the difference: the fact that you're not just asking the model to produce a single number in one shot, but letting it think about it, and that it's not so much the example breakdowns themselves that help. But as with many of these things, I'm sure the answer is complicated and it's some combination of factors.
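As a rough illustration of just how fine-grained those algorithmic prompting traces are, here is the same column-by-column addition written out as ordinary code. In the paper the equivalent steps are spelled out in natural language inside the prompt rather than executed; this sketch and its function names are my own illustration.

```python
def add_with_trace(a: int, b: int) -> int:
    """Column-wise addition that narrates each step, mirroring the level of
    detail an algorithmic prompt spells out in natural language."""
    x, y = str(a)[::-1], str(b)[::-1]   # reversed digit strings, ones place first
    carry, digits = 0, []
    for i in range(max(len(x), len(y))):
        dx = int(x[i]) if i < len(x) else 0
        dy = int(y[i]) if i < len(y) else 0
        total = dx + dy + carry
        digit, new_carry = total % 10, total // 10
        print(f"place {i}: {dx} + {dy} + carry {carry} = {total}, write {digit}, carry {new_carry}")
        digits.append(digit)
        carry = new_carry
    if carry:
        digits.append(carry)
    result = int("".join(str(d) for d in reversed(digits)))
    print(f"result: {result}")
    return result

add_with_trace(12, 24)  # place 0: 2 + 4 = 6, carry 0; place 1: 1 + 2 = 3, carry 0; result 36
```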
The last thing you said almost sounds like the kind of multitask argument: it's not that the specific other thing you're asking the model to do matters, but that asking it to do another thing has, on the traditional side, some kind of regularization effect, some kind of effect that causes your results to be better just by overloading the model a little bit.

Yeah, exactly. In some sense you have more activations, more hidden states, you're just giving the model more to do, and so it has space to work through more reasoning. So maybe that's one explanation for why this kind of stuff works.

Amazing. And I should have mentioned earlier, but I'll mention it now: all of the papers that we're referring to will be available on the show notes page, so folks can check them out. So the next thing that you had on your list was decomposed reasoning. It sounds like it's in a similar vein.

Yes, that's why I kind of put them together, but I think fundamentally this is a very different approach to the same idea. Terminology is something that the field is going to keep revisiting, and "decomposed reasoning" is kind of a label I came up with; I don't even know if it will stick. There have been a bunch of papers here, and I'm just going to run through some of them, but the common thread is that you shouldn't rely on the language model alone to do the whole task. Suppose I give it a mathematical word problem, or a question answering problem that's a lot more complicated. I shouldn't rely on the model and its parameters to carry everything out. Maybe the model needs to use a calculator. Maybe the model needs to do a web search. Maybe the model needs to write a small Python script and actually run it to get the answer that I want. So this whole idea of language models getting you what you need, not just by relying on their own parameters, but by breaking down your problem and figuring out, oh, I need to call something else, and this is what I'm going to call, is an idea that came out post chain of thought, around the middle of the year, and there have been papers all the way to the end of the year doing this. So it's been kind of exciting. A lot of them have been on the QA side of things. The two I'll mention are successive prompting, which came out of my group, and decomposed prompting. The idea behind both of these was to take a complex question, break it down into simpler ones, and then have the language model call another language model that answers each of these simple questions. So if a simple question is a mathematical operation, you would use a calculator; if a simple question is a very simple lookup question, you would use something like a SQuAD-style question answering system, things like that. Being able to take what the user wants, break it down into pieces, and then compose the answers together to give you the actual answer, that's the idea.
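Here is a toy sketch of that decompose-and-route pattern. The sub-questions, routing rules, and helper functions are my own illustration; in successive and decomposed prompting the language model itself proposes the sub-questions and decides how each should be answered.

```python
# Toy sketch of decomposing a complex question and routing sub-questions to tools.

def simple_qa(question: str) -> str:
    """Stand-in for a SQuAD-style single-hop question answering system."""
    knowledge = {"How many players are on a soccer team?": "11"}
    return knowledge.get(question, "unknown")

def calculator(expression: str) -> str:
    """Stand-in for a calculator tool."""
    return str(eval(expression, {"__builtins__": {}}))  # fine for toy arithmetic only

def answer_complex_question() -> str:
    # Sub-questions that an LLM would normally generate from the original question,
    # e.g. "How many players are on the field in a soccer match, both teams combined?"
    per_team = simple_qa("How many players are on a soccer team?")  # lookup step
    total = calculator(f"{per_team} * 2")                           # arithmetic step
    return total

print(answer_complex_question())  # composes the sub-answers into the final answer: "22"
```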
Can you talk in a little more detail about the difference between successive prompting and decomposed prompting? How do the settings for those differ?

They came out pretty much around the same time, and they appeared at the same conference as well, so it's a bit difficult to separate them. Some of it depends on which datasets they used: decomposed prompting used mostly multi-hop datasets and tried to decompose questions that way, while successive prompting focused a little more on calculations and symbolic operations. So yes, there are differences between them, but it's kind of the same idea with different datasets and slightly different tooling. And we'll see, in some of these cases, other pairs of papers that are very similar and came out at the same time, because that's where we are as a field.

How about tool-augmented models?

So, tool-augmented stuff. There was a paper coming out of Google, I believe, called tool augmented language models, TALM. This is one of the papers that was essentially showing that instead of just calling a calculator specifically, or having a fixed set of tools, you can create a description of APIs that the language model has access to, and have the language model itself generate example calls to those APIs as it works. So if I ask, hey, GPT-3 or whatever, how hot is it going to get today in Irvine? The language model is going to say, okay, this is a question about the weather in Irvine, so I'm going to compose an API call to a weather service that asks, what's the weather in Irvine? That service will return some JSON object that says the high is this, the low is this, the probability of rain is this. And then the language model will kick in again, take that output, and say, oh, it's going to be really hot today, since it is Southern California.

And it sounds like this research is heading in the direction of how you would rebuild Siri or the like with LLMs.

I think this is one of the key advantages of these language models. It's not that they can do additions and subtractions internally; that's interesting from an intellectual point of view, but when you're making actual products, language is a way to interface with things that are external to the model. The language model should take in the user queries, but also be the interface to other things outside and be able to carry the task out. We'll talk a little bit more about this later, but one of the reasons I like this is that you can also now attribute the answer you're getting, not to some parameter in the language model, but by saying, look, this is the API call I made, this is the answer I got, and that's the answer I gave you. So in some sense it becomes a little bit more attributable.

And the idea of the language model writing a program to figure out the answer to a question is a fascinating one. It almost feels like, if anything around LLMs is going to be the path to AGI, it's that. What was your reaction to that research?

From a practical point of view, it seems quite exciting. It's useful from a code generation point of view as well, but the nice thing about the model writing code is that it's unambiguous: it's making explicit calls to an external database.
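A sketch of that tool-calling loop might look like the following. The `llm` and `weather_api` functions are stand-ins that return canned values, and the calling convention is invented for illustration rather than taken from the TALM paper.

```python
import json

def llm(prompt: str) -> str:
    """Stand-in for the language model; hard-codes the two turns it would produce."""
    if "WEATHER_RESULT" not in prompt:
        return 'CALL weather_api {"location": "Irvine, CA"}'  # model decides to call a tool
    return "It's going to be really hot in Irvine today, with a high of 92F and almost no chance of rain."

def weather_api(location: str) -> dict:
    """Stand-in for an external weather service."""
    return {"location": location, "high_f": 92, "low_f": 64, "rain_probability": 0.05}

def answer(user_query: str) -> str:
    first = llm(f"User: {user_query}\nAssistant:")
    if first.startswith("CALL weather_api"):
        args = json.loads(first.split(" ", 2)[2])  # parse the model's tool call
        result = weather_api(**args)
        # Feed the tool output back so the model can phrase the final answer.
        return llm(f"User: {user_query}\nWEATHER_RESULT: {json.dumps(result)}\nAssistant:")
    return first

print(answer("How hot is it going to get today in Irvine?"))
```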
And if I want to update the language model, or update this whole system, I can just update my knowledge directly, because the knowledge is external to the parameterization of the language model. That makes it super convenient to delete things, to add things, to get attributions, all of these things. The interface to that data source is always programs, either a simple API call or a more complex one, and I really like this idea because it lets the language model do the things it should be doing, which is to understand language, or let's not call it understanding, to be able to parse language, to be able to transform it. It doesn't necessarily have to know the temperature in Irvine every day, or things like that; that's not something I necessarily want it to contain.

So, very subtly in there, you kind of addressed another big conversation that's happening in the community now, this idea of whether language models understand. You call this decomposed reasoning, and the thing that is writing programs, that kind of requires some kind of reasoning. What's your take on these broader questions about reasoning and understanding in LLMs? Or would you like to defer that? Is there a natural point later for us to talk about it?

Let's come back to it a little later, maybe even in the next section, where we're trying to question what reasoning is and how to evaluate it in some sense.

Yeah, the semantic argument around understanding, that's not that interesting. But how a language model can reason, and the extent to which it's reasoning versus cutting and pasting at some level beyond that, albeit in an impressive way, that's kind of really interesting.

Yeah, definitely. And I would even push it a little further: what are the consequences of the fact that it is cutting and pasting versus reasoning? How should we calibrate what these things should be deployed for and what they should not be deployed for, based on that? Those are the kinds of questions I'm really interested in.

Awesome. Is that your next section?

Yes, and that ties in very well with what I think is exciting next, which I'm going to call understanding the relationship between the pre-training data and the output of the model. There are a few different threads here, but there's one that came out of my group that I think is a simple idea that really captures exactly what you said, the cutting-and-pasting versus reasoning thing. This paper is called Impact of Pretraining Term Frequencies on Few-Shot Reasoning. We were looking only at numerical reasoning: we started looking at all of these examples of GPT-3 doing addition and multiplication and things like that, and we started looking at the instances, and it turns out it doesn't always get them right. It's not 100% on those, it's 80% or 90% or whatever the number is. So we started looking at what differentiates the ones it gets correct from the ones it doesn't. For example, we saw that if you ask it what is 24 times 18, the model gets it right, it says 432. If you say what is 23 times 18, the model gets it wrong. So 24 times 18 is correct, 23 times 18 is not. Is this random? What's going on here?
And just to interrupt there, did you find that consistent across invocations? We've all run into that kind of thing playing with ChatGPT and other models: sometimes it gets certain things consistently wrong, other times it gets a thing wrong sometimes and right other times. Is it a random-seed kind of thing, or something else going on in the model? Did you explore that at all?

Yes, and we saw that it's both. If you're doing few-shot prompting, which examples you put in the prompt will sometimes change the output, and so will how you phrase it, whether you say "what is 24 times 18" or "what is 24 x 18", things like that. That definitely made a difference, but even after averaging these things out, 24 times 18 was in general more accurate than 23 times 18. And even more than that, we did further analysis, and it turns out that across all of our instances involving 24, the model was much more accurate than across all of the instances involving 23. So we decided to do this for everything from 0 to 100, essentially all one- and two-digit numbers, and it's a whole spectrum: we didn't see a clear reason why some numbers get low accuracy and others get high accuracy. So then, and this is the part I'm quite excited about, we decided to count how many times each of these numbers appears in the pre-training data, and it turns out, and you can see the plot in the figure, that if you plot the log of the frequency of these terms against how accurate the models are, it is pretty much exactly a nice, strong curve.

Which is intuitive: the model does better on things that it sees a lot of.

Yes, so it's expected, yet disappointing, because you don't want it to be such a nice, strong curve. If it's doing mathematical reasoning, it should know that 23 is one less than 24, and all of these things. It's one of those cases where it was expected that the model would be better on things it has seen before, but at the same time you hold this notion of, look, it is able to reason, it is able to do these things, and it's kind of difficult to resolve both of those. So this is one example, and I think we are barely scratching the surface, but this was a paper that started looking at some of these pre-training statistics, not just single-term frequencies but bigram frequencies and things like that, and showed that the model is quite sensitive to them. I don't want to claim that there is cutting and pasting going on or anything like that, but this effect is so strong that, at least when we think about reasoning and when we are evaluating reasoning in these language models, we should be taking it into account.
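A toy version of that counting analysis might look like the sketch below. The corpus, the accuracy numbers, and the matching logic are placeholders; the actual paper counts term and n-gram frequencies over the model's full pre-training corpus and compares them against measured few-shot accuracies.

```python
import math
import re
from collections import Counter

# Toy version of the frequency-vs-accuracy analysis: count how often each number
# appears in a (tiny, made-up) corpus and line the counts up against per-number
# accuracies measured separately. All values here are invented for illustration.

corpus = [
    "there are 24 hours in a day and 24 time zones",
    "a 23 man roster was announced",
    "she bought 24 eggs at the market",
]
accuracy = {23: 0.55, 24: 0.90}  # hypothetical few-shot accuracy on problems involving each number

counts = Counter()
for doc in corpus:
    for token in re.findall(r"\d+", doc):
        number = int(token)
        if 0 <= number <= 100:
            counts[number] += 1

for number in sorted(accuracy):
    freq = counts[number]
    log_freq = math.log10(freq) if freq else float("-inf")
    print(f"{number}: frequency={freq}, log10(frequency)={log_freq:.2f}, accuracy={accuracy[number]:.2f}")
# The paper's observation is that accuracy tracks log frequency surprisingly closely.
```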
And this is maybe a side note, but it looks like the model you evaluated was GPT-J, and clearly that's an open-source model where you had access to the pre-training data. It kind of raises the question: how do you get the same kind of insight into models that are behind APIs?

Yeah, so I kind of don't mind that the models are behind APIs; to some degree that commercially makes sense. It feels a little disappointing that the training data is also behind closed doors, though. I know that there's a lot in the training data, but if you want to understand why ChatGPT works, or why GPT works, or more generally when language models work and when they are safe to deploy, all of these become questions about the data. I think it's okay if we only have black-box access to the language model, but it would be good to have access to the training data, and to a bunch of these other things that can help us do simple analyses like this, and maybe more complex ones, and actually decide what to do with the model. So I think this whole direction of trying to understand what's in the pre-training data is key, and something that will persist for the next couple of years.

Do you think we have the right tools to do that at scale? I'm imagining that was not an easy task even just for simple mathematical problems.

That's true, but training GPT-3 is also not a simple problem, and people have solved it. Tooling is something that everybody right now is excited about: building tools that actually give information and insights into these language models. Even at AI2 we are at the early stages of building some tooling that can support this kind of analysis. But if the dataset is available, I think people will do amazing things. I thought this would be impossible; it seemed crazy, like, hey, this is almost a terabyte of text, how can you do anything with that? And it was not trivial, but it was easier than impossible.

So you identified some behavior, the relationship between accuracy and frequency in the training data. How do you identify what that is a consequence of? Meaning, is it specific to the way GPT-J was trained, is it all transformer-based language models, is it maybe something about that particular dataset? Are you able to say that it is a broad characteristic of LLMs in general based on the work you've done thus far?

That's a little bit difficult to measure, partly because we don't have the dataset available for too many models. At least we tried the whole suite of EleutherAI models that were trained on the same dataset, and we saw a similar effect across different model sizes. As datasets become public and that becomes more standard, it's fairly easy to extend this stuff. Since this paper we also have an online demo with a bunch more tasks that try to go beyond mathematical reasoning; it's a little bit difficult to even define what the relevant terms are and what you should be computing the frequency of. But I think we should be able to do this for other tasks and other models, and to me this is somehow a consequence of the language modeling loss, which encourages this in some sense. Yes, the model has seen some things more and it'll be more accurate on them, but even the things it has seen less, it has still seen billions of times, so there is no reason for it to be wrong on them, except that the language modeling loss would want it to be more right on the ones it has seen more.
Yeah, another paper you identified was out of the Goldberg group.

Oh yeah, I'll quickly talk about this. This work had a similar intuition of trying to look at things in the data and trying to figure out why the model has certain biases or certain errors, and it was a little bit more focused on trying to identify when two entities are related. So if you ask, where was Barack Obama born, the model tends to say Chicago, or it might say Washington, depending on how you phrase it. Why does it give the wrong answer? Why does it not say Hawaii? To answer that question you have to go back to the pre-training data and see what the model even saw. What I like about this paper is that it uses causality tools and builds a whole causal graph for where these kinds of predictions might have come from, then tries to estimate all of the edges in that causal graph and do some causal inference to attribute the prediction to specific statistics of the pre-training data.

So for this causal graph, would each individual document in the pre-training data be an intervention of sorts?

They worked at the level of, I guess, triples or something like that. Let's say you see Obama and Chicago together because he was a senator there, or something like that; that's kind of a triple. So they work on statistics of those triples to make it tractable and allow the inference to work.

But in applying the causality machinery, is each of those interventions relative to some prior relationship between the triples?

Yes, so there is the true relationship between these entities, and then there is the observed relationship, how many times they appeared together in the pre-training data. The idea is that when you're doing this over many different entities and many different relations, those together become your dataset in some sense: Obama has appeared with Chicago, Hillary Clinton has appeared elsewhere, and so on. And then, taken together, you can ask which of these relations seem to affect a specific prediction the most.

Awesome. Kind of continuing on the data theme, there's been a ton of work looking at the need for clean data. I think maybe one of the most surprising things for me is the return of supervision at the scale of LLMs. Talk a little bit about this category.

Yes, this was somehow the most surprising category for me this year. After GPT-3 came out, and at the end of last year, everybody was kind of excited about language models, but the solutions for what's next always seemed to be, hey, let's get more data, let's get larger models, let's train longer. Those are still useful things, nobody's denying that, but this year has shown that you can actually do a lot if you're a little bit careful about your data, if you start cleaning up your data and thinking a little bit about where your pre-training data should come from. That can be quite interesting.

So when you think of RLHF as an example, do you think of that as fundamentally just cleaning up your data, being more careful about your data, as opposed to...
Yeah, so I was thinking more of what happened with the BLOOM language model, which was trained with a much more thoughtful process for gathering the dataset, partly because they documented it and we know what they went through. But RLHF and those kinds of things are, I think, examples showing that language models are not quite ready for real use cases just based on pre-training on large data that has been gathered. You can call it cleaning up the data, but I think of it as reinforcing some of the nice signals in the data by providing these examples. People have been fine-tuning on supervised data as well, and the gains that you get from RLHF have become extremely evident this year. Somehow that has become the secret sauce of OpenAI and the other companies that want to have really strong language models, rather than scale and just raw pre-training.

And for completeness, we've talked a little bit about RLHF on the show before, but how do you think about it as a researcher?

I think it's quite exciting. It addresses a lot of my concerns with language models. I don't think pre-training data can be trusted, and you shouldn't just train something and expect the model to have clean output, or reflect your values, or any of these kinds of things, whatever that means in the context of large language models. Essentially, if you want real users to be interfacing with language models, you need to make sure there is some sort of check, and RLHF is not a full solution, but at least it's a way to say, okay, this is the actual task: your actual task is to be interfacing with humans, not just regurgitating what you've seen in the pre-training corpus. That intuition is captured by this approach.

And do you remember, if they were even published, the stats in terms of the number of human-generated prompts that were used in ChatGPT or InstructGPT?

I don't think they were published, as far as I know. I don't remember exactly what they are. I think InstructGPT had documentation of how they were gathered, and the breakdown, like how many of them were generation tasks, how many were classification tasks, things like that, but I don't think the exact data is out there.

Do you have a guess as to the relative cost of collecting the human feedback versus the cost of training the models?

I think it's much cheaper.

An order of magnitude, or much, much cheaper? Because we always say that collecting labeled data is the most expensive part of machine learning. Is that still true at the scale of LLMs? Or is RLHF extremely efficient, and you just need a little bit of guidance on top of the pre-training data?

I feel the true answer is somewhere in between. I don't think it's very little data; you need a lot of data to be able to do it. But I don't think it comes close, at least the way these are trained right now, to the cost of training the model itself. But when you think about
ChatGPT, it's been released publicly and a lot of people are using it, and a lot of that data is going to go, in some form, back into the model and improve it. Was that expensive to collect? In some sense yes, because they had to run ChatGPT, and they'll probably pay some annotators to clean it up, but I don't think that compares to actually training it.

It's also a really interesting example of bootstrapping: there's a certain amount that they collected themselves, the InstructGPT work, and then they created something that was good enough to set loose in the world, and now they've got this virtuous cycle where, I'm imagining, it's a lot cheaper for some annotator to clean up what millions of people are creating than for them to create it themselves.

And I think this year has also shown, maybe even to the people at OpenAI, the value of these things. When they released GPT-3 they probably didn't know how valuable this would be, and then they collected data, released InstructGPT, and on their benchmarks it was good, but when people started using it you realized how much better it is. I think similarly with ChatGPT: they probably knew how good it was, but they probably didn't know how good it actually is. And I think this idea of human feedback being a secret sauce, something that's proprietary, will continue to be a bigger piece in the future.

So let's talk a little bit about ROOTS.

Yes, ROOTS is this nice dataset that was gathered by the BigScience group, and I've been following the BigScience group because there are a bunch of interesting things there.

I guess I'll jump in to refer to the interview that I did with Thomas Wolf. I don't think ROOTS came up explicitly, but we talked about that work, which eventually resulted in BLOOM, and we'll talk about that a little bit more as well.

Yes. I like ROOTS because I really like what EleutherAI did with the Pile dataset, releasing the dataset that was used to train all the GPT-J models, and I think the BigScience group took that intuition and went further with it. They have a really well documented, and not just well documented, I would say a very thoughtful, process of gathering this dataset. It's multilingual over many, many different languages, and they've been careful about listing which sources they even want to crawl in the first place, so it's not a post hoc cleanup of the data; they thought about it up front. They gathered a dataset that is huge, and Hugging Face has tools on top of it to quickly search it and see what's in it, and I kind of like that approach to building a large language model. Getting the right dataset is crucial for these language models, and doing this documentation is good in the long term.

So your next category is decoding. Talk a little bit about what that means.

This is a theme that I like in some of the work that has come out this year. We have these language models where we have this black-box interface to them, and a lot of what people do is just prompting, changing things on the input side to see what the model generates, and the only thing most people change on the output side is, oh, let's change the temperature a little bit and see what we get.
So there's been a bunch of work saying, let's not just do that, let's actually think about what's happening at the output of the model during decoding, and maybe we can do smart things there that change the output considerably. Some of this came out late last year; there was the work on nucleus sampling, which is a little bit older, and then there was the work on constrained decoding as well. The constrained decoding paper came out of Semantic Machines, and they showed that, suppose you want the language model to generate programs: programs come with a certain grammar, a syntax they need to follow, so you can constrain the output of the language model as it's generating, token by token, to adhere to that syntax. And just by applying this constraint you will, firstly, obviously get programs that are syntactically correct, but you can also actually get the right things out of the model. So there has been a lot of work looking at how we can decode with constraints on the decoding.

One of the papers that came out this year, which got a best paper award as well, is called NeuroLogic A*esque decoding. The idea here is that instead of just doing left-to-right decoding, where you're being greedy or doing some kind of beam search or sampling, why not use some of the computer science ideas we already have, like A* search, and try to find the best possible decoding? And once you're doing that kind of thing, you can also think about constraints you might want to put on the decoding. You want to say, look, I want the output to have these three words in it. Say you're generating a recipe: make sure it has these five ingredients somewhere in the generated text. You can also flip it around: generate whatever text you want, but make sure it doesn't contain these specific words. This paper uses A* during decoding to generate text such that your constraints are satisfied, and it shows that once you do that properly, you can do a lot of tasks much better just by controlling decoding, without changing much on the input side.

It seems like this is another example where it's predicated on having open access to the model internals, and you potentially lose a lot if you don't.

From what I understand, you can still do these kinds of things with GPT-3 to some degree. You can do this with a black-box model as long as you get the probabilities of all of the tokens at every step. I don't think the API actually gives you that, but you could imagine an API that says, okay, here's the distribution over all of the tokens for the next step, and then you should still be able to do these kinds of things. One of the concerns is that if you want decoding to be fast, it's difficult to use some of these ideas; the A* one specifically is a lot slower. But it's able to satisfy your constraints, so there can be settings where you're okay trading off some time, letting the model take longer to make sure the output satisfies your constraints.
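As a flavor of what constraining the output distribution looks like, here is a very small sketch that masks banned tokens at each decoding step before choosing the next one. This is simple greedy filtering over an invented toy model, not the NeuroLogic A*esque algorithm, which adds lookahead search on top of ideas like this.

```python
# Toy constraint-aware decoding: zero out tokens that would violate a constraint
# before picking the next token. `next_token_probs` stands in for a real model.

BANNED = {"cilantro"}  # e.g. "generate a recipe, but never mention this ingredient"

def next_token_probs(prefix: list[str]) -> dict[str, float]:
    """Stand-in for a language model's next-token distribution."""
    if not prefix:
        return {"add": 0.4, "cilantro": 0.3, "basil": 0.2, "<eos>": 0.1}
    if prefix[-1] == "add":
        return {"the": 0.5, "cilantro": 0.3, "basil": 0.2}
    if prefix[-1] == "the":
        return {"cilantro": 0.5, "basil": 0.4, "<eos>": 0.1}
    return {"<eos>": 0.9, "and": 0.1}

def constrained_greedy_decode(max_len: int = 6) -> list[str]:
    out: list[str] = []
    for _ in range(max_len):
        probs = next_token_probs(out)
        # Enforce the constraint by masking banned tokens before choosing.
        allowed = {tok: p for tok, p in probs.items() if tok not in BANNED}
        token = max(allowed, key=allowed.get)
        if token == "<eos>":
            break
        out.append(token)
    return out

print(constrained_greedy_decode())  # ['add', 'the', 'basil'], never "cilantro"
```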
And often, once you see, hey, we applied one method, A* in this case, people go back to the computer science toolkit and apply everything else. Have we seen that here?

Not yet. I guess it came out early enough in the year, but we haven't seen that much yet. I think that's the kind of thing that will happen next. This is attracting a whole different kind of thinking, where people were not thinking about decoding at all, and now they will be, which is always exciting.

Awesome. Well, those are great themes to reflect on as we think about the past year in NLP research. Our next category is some of the new tools and open-source projects that we saw in the year. We've already talked a little bit about datasets, which is kind of related, but I think the first thing you have here is OPT. Tell us about OPT.

Yeah, OPT came out fairly early this year, and I think it kind of surprised everyone, because looking back at last year, there weren't that many open-source reproductions of models at large sizes. EleutherAI was sort of leading it: GPT-J was 6 billion parameters, they were growing it slowly, and they had gotten to 20 billion parameters. And then OPT came onto the scene, and there were a bunch of nice things about it. They documented their whole training process in a logbook, with all kinds of insights about what training a model at that scale is like.

It was released by Meta, right?

Right. And, not to say too much against Meta, but it was also a bit surprising that reproducibility and open source seemed to be key aspects of OPT, so that was nice. They also released a lot of models at all different sizes, including 175 billion parameters, which hadn't been available at all, and even right now, if you want to do research at 175 billion parameters, the OPT model is probably the most useful one to use. The idea of documenting the whole data gathering process, documenting the whole model training process, and then releasing all of these models for research has helped the research community a lot, and I expect it will keep helping people who want to build models, fine-tune language models, and do all of these things.

Sounds like pretty big benefits. Have you seen much in terms of benchmarking it against GPT-3?

Yes, people have been benchmarking it, and I think it performs reasonably well. The tricky thing is, of course, that there is InstructGPT: when you call GPT-3 on the API right now it often defaults to the Instruct version, and that one is a lot more difficult to beat. But for most purposes, I think of it as, yeah, OPT is basically the same.

We talked a little bit about the BigScience project and one of its outputs; another is BLOOM. What was your take on BLOOM?

BLOOM was again a really big model, I think 176 billion parameters, so similar in size to GPT-3. And I think it's a really good demonstration of the dataset work and the training process, combined with the fact that this was done by a group of people essentially volunteering their time to do it. It's one thing for a company to say, hey, we can do this, because they have done this kind of thing before.
But what BLOOM has shown is that a bunch of enthusiastic, excited, enterprising folks can actually do things that maybe even a year or two ago would have seemed impossible.

So in our trends conversation from last year, or maybe it was the one prior, there was a point in time where we were lamenting the loss of the individual academic researcher's ability to contribute to fundamental model research because of the resources that were required. And Hugging Face and the BigScience team showed that, well, not so fast.

Right, exactly. And the other thing I like about the BLOOM effort, and the corpus that came with it, is that they were also focused on being a lot more inclusive, on having a global perspective. They were trying to cover many, many different languages.

Very principled in the way they pulled the data together.

Yeah, and also multilingual in a way that none of the existing models are. So it's quite great.

And conceptually this is a great example of how one model at 175 billion parameters and another model with the same number of parameters could be very different, at least in the data they were trained on, and you would expect that to result in very different behavior when using the model. To what extent have we characterized that? At that scale it's still a lot of data, still a lot of raw internet data. Does it all kind of come out in the wash, and all their efforts at being principled just get lost, or do we know how to compare that?

So there have been a bunch of benchmarks, including in their papers, but in general also, and don't hold me to this, I would say BLOOM is not the go-to language model for people who want to do English-language things right now. Maybe that's because of some of the trade-offs they made in collecting the data, or just because it covers all of those languages. It's definitely really good for multilingual things, but that's not what our benchmarks have been designed for, unfortunately, and so if you just look at the benchmarks, which are traditionally designed for English, I don't think BLOOM is quite at par with OPT or GPT-3, and definitely not with the Instruct models.

And when I mention benchmarks, there's the aspect of applying the traditional performance benchmarks for LLMs to BLOOM and comparing its results to the others. But I'm also curious about how we characterize qualitative differences between the way BLOOM responds and the way GPT responds, for example, in terms of fairness considerations or that kind of thing. Are there qualitative differences in the kinds of responses you get that aren't picked up by the traditional benchmarks, or are the traditional benchmarks so expansive at this point that we've characterized a lot of that stuff explicitly?

Again, I think the answer is somewhere in between. I don't know if people have thoroughly compared the two to see, for example, levels of toxicity and things like that. I think when OPT came out they did a lot of this analysis in their paper, looking at how toxic their model is and how safe it is, and they realized that on some things they were worse off than some of the existing models.
But with BLOOM specifically, I don't know off the top of my head how it compares on these other aspects.

Okay, talk about the inverse scaling competition.

Yeah, this was a pretty nice thing that came out, and I suppose it's still going on even though the submissions are done, so I'm curious to see what the actual effect of it is. It was introduced around the middle of the year, and the idea relates to what the scaling laws were showing: when you scale up your models, performance goes up on pretty much everything, and that's exciting to see, but it also tells us there are many things the models will just get better at as time goes by, because they'll get bigger and have more data. The inverse scaling prize was this intuition of asking: can we characterize the phenomena that don't follow that trend? Can you create a dataset, something everybody will agree is a reasonable dataset, where larger models actually do worse? This prize, this competition, is an effort to identify what those tasks would be, and the better your inverse scaling is, that is, the worse the bigger models are on the dataset you've contributed, the more likely you are to win. They've had the submissions and they're evaluating them; as far as I know they haven't announced the results yet, but I think a lot of interesting things could come out of this effort. One thing I could imagine is deeper kinds of misinformation, where the model is relying so much on what it has seen in its training data. Maybe misinformation isn't the right word; it's more like not being able to update its information sometimes. These large language models have memorized so much of the pre-training data that they kind of reject evidence against it, whereas if they're smaller there's less memorization. So I think it could be pretty exciting to see what those things are that actually get worse with scale. It's quite an interesting question.

And next up you have Galactica. Can we call it a debacle? Galactica is this LLM that Meta released that was tuned to generate scientific and research text, and, was it even up for three days? It got pulled down pretty quickly, right?

Yep, maybe a little bit more than that, but thereabouts. To me it's a story not so much about anything the Galactica team did with the model itself; I think the model, the training, everything there was the right thing to be doing. The tricky thing was just how it was pitched, and that there weren't clear caveats about what this model is capable of doing and what it's not capable of doing, which led to such a backlash. It was a language model trained on a lot of science papers, so it's going to produce text that looks like scientific papers; I think that was expected. But the backlash it got essentially tells everyone something, and I hope the message is not "don't demo language models anymore"; I think the message should be to make sure you're not pitching things as more than they are.
If you reflect on ChatGPT, which came not very long after Galactica, and on the launches of those respective products, is there a clear do and don't list?

ChatGPT itself was also not completely without hype attached to it; somehow they managed it, but there was a lot of hype. I will say that they were fairly clear about the fact that, hey, don't trust the factual stuff, it's not a lookup engine. Maybe they could have been clearer, and I think they could have done a lot more of that, but there were at least some caveats. And more than that, part of their RLHF work was to make sure the model isn't producing at least obviously sexist or otherwise toxic output.

Yeah, there was a lot, and maybe we're jumping into ChatGPT, which actually is the next thing we're going to talk about, but especially early on there were a lot of things it just would not do, a lot of, yeah, no, you're not going to sucker me into going there.

And I think when you're building something that's public-facing, that you're selling as a tool, as a product, that is necessary. I don't think you should be doing otherwise. Galactica should not have been a public-facing tool for every scientist to start using to write their papers. The gap between what the tool is and what the product is, is a gap that other people can help fill in, and that was the missing piece when I think about ChatGPT versus Galactica. ChatGPT has some of the caveats about what it's doing, some of the caveats about the model, if not the product, to some degree, and Galactica was missing that.

Well, we were talking about open source; next up is commercial developments, and top of that list is ChatGPT. Let's talk about it.
You said early on that, hey, even without ChatGPT this was a huge year. That's clearly not to say that ChatGPT wasn't a huge contribution to the year. Certainly one of the things I found most interesting was the degree to which it broke out of the ML echo chamber, to the point of random friends asking, hey, have you tried this ChatGPT thing?

Yeah, so that's been, I guess, the most surprising thing, and in some sense the longest-term impact of ChatGPT: the fact that it commoditized this, made it mainstream in a way that nothing before it had. Whether it deserved that, what the actual innovations are, and all of those things is a different question. It is clearly, even from a research point of view, qualitatively better than GPT-3; whether it met some threshold for becoming the big thing that it did is difficult to evaluate in hindsight. But it is definitely something that became mainstream, and everybody is talking about it. There is still a question in my mind whether that's a good thing or not in the long run, because we can talk about some of the problems, the biggest one being that we know it's a language model. To some degree we've spent the last couple of years figuring out what these things are good at and what they are not, and I can, in a couple of minutes, come up with tons of examples where it would fail. That's not quite the case when you put it out in public: most people don't know what a language model is. I have played around with it, I've gotten a bunch of my family to try it, and the biggest difficulty I've had is conveying to them that it's not looking anything up when you ask it something. That is a conceptual jump that is very, very difficult to get over. People say, of course it should know about this, it happened yesterday, and it's such a big news item, why would it not know? And no, actually, it doesn't know anything beyond a certain point in time, and even saying it "knows" anything from before then is a little bit difficult. So the best analogy I've found, and this applies to my research as well, in trying to explain to people what ChatGPT does is: don't think of it as a stochastic parrot or anything like that, but if you have to think in terms of animals, think of it as a chameleon. It's trying to fit in with a bunch of humans, and it's trying to write things that will make it pass as if it knows all of those things.

This reminds me of a Twitter exchange. I'd asked ChatGPT to explain RLHF, and it came up with this expansion of the acronym that was, oh, I forget it, but it was really funny, something like "leaderboard" and "humans", it was so far off. Interestingly enough, I'd had earlier interactions with it about RLHF where it knew what it was, so to your point, it's about where it sits in the context of the prompt. And I just posted, you know, is it trolling me, or is it just trying to BS me?
And one of the responses I got, which on reflection is really insightful, was: it's always trying to BS you. That's all it's doing, trying to produce some text that you will think is reasonable, and to its credit a lot of the time it's right, but that's all it's trying to do. Yeah, and especially when it comes to factual stuff, it is a very useful bullshitter in some sense, because when it's right, or even partially right, that's still useful. But if they had sold it as "hey, we have built a really good bullshitter," then people would know not to use it for a bunch of the tasks they're currently thinking of using it for. That's the divide in messaging: researchers and NLP folks know, oh, it's a language model, obviously all it's doing is predicting text, and yes, RLHF can help to some degree, but clearly it's not going to be able to do this other set of things reliably. That understanding is missing from the general public, and also from how a lot of people are planning to use it, so that's the part we need to think a little more about. So we've got PaLM and Flan down next. Tell me a little more about your take there, because I hear about them in a vague research context, since no one outside Google really has access to these, much more than as something huge from a commercial perspective. Is this a prediction or a reflection? I think of PaLM as a huge commercial development this year: Google built this really, really large model, and obviously they haven't released it. What's the ideal situation? They completely release it, open source, and everybody gets access; that's not going to happen. Another possibility is that they put an API on it and charge people; from a Google perspective that doesn't make much sense either. So it's something they've built that is valuable internally, and there are reasons not to make it public, but it also carries a lot of research insight, because nobody else has such a big language model trained in a similar way, and I want to give them props for at least publishing about it and evaluating it. It is of a size that we will not see publicly available for maybe another year or two, and yet we get to hear insights about what to expect and what emergent behaviors come out of language models at that scale. It would be ideal if we could all audit it and contribute to finding out what the problems are and when it works and when it doesn't, but given the circumstances, I think they did a good job. Specifically, I will say that that size unlocked a bunch of capabilities, like the whole chain-of-thought thing we talked about at the beginning, which somehow became possible at that size but was not at smaller sizes. That's also why that line of research is all coming out of Google.
It applies mostly to LLMs of that scale; PaLM is 540 billion parameters. So they have access to it and they can produce a string of papers, and yes, nobody else can write those papers, but I'm both a consumer and a producer of research, and from the consumer side I love to read research and I'm glad they're writing them, because there's a lot of interesting stuff in all of those papers. There's a whole string of papers I would recommend, and I can point you to them offline, but there's work in there that we'll see reproduced publicly next year, or maybe the year after, when models of that size become available. So I'd file that under commercial but not commercialized. Right, or soon to be commercialized, I'm sure, but maybe not by Google. Awesome. Next up, the intersection between search and LLMs; what do you see there? That's been an interesting commercial development, again questionable to some degree, because I don't think the research and these models are quite up to some of it yet, but it roughly coincided with ChatGPT. ChatGPT certainly raised a ton of questions about, hey, is this a Google killer. Exactly, and around the same time there were at least three search engines that I know of. There was Perplexity, which I don't think existed before: the product they came up with is a search engine that gathers the results from a typical search engine, then uses GPT-3-like models to summarize the content of those links and produce a paragraph that actually answers your query. You.com is a search engine that has been around for a while, but they brought this whole chat aspect to their search, where you're chatting your way to an answer, and again it's not just showing you a bunch of links but composing the information into text. That's Richard Socher's company, and we'll drop a link to my interview with him in the show notes. Cool. And Neeva is another one; it's a private search company, a startup, that also has an AI answer you can talk to. I haven't played around with all of them extensively, just a little, and again it's very easy to find problems and realize that, okay, this interface is great, and it would be great to get the right paragraph if it could get there, but often it doesn't quite work because of fundamental issues with language models. Still, as a commercial development I'm pretty excited about what search will look like in the future and where language models will fit into that product. One of my thought experiments with this, in the context of ChatGPT, not that it was particularly deep: there was this early meme along the lines of "Google search is crap now, it's all ads; ChatGPT, I love this interface, it's going to kill Google." So I asked ChatGPT to basically build a response with an ad in it, and it works, it can do it. I wouldn't be so sure that LLM-based search won't have any ads. Yeah, where the ads will come in and how subtle they will be once you throw a language model into it, that's kind of interesting.
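A quick aside to make that retrieve-then-summarize pattern concrete. This is a minimal sketch that assumes nothing about any particular engine's internals: the web_search and complete functions below are hypothetical stand-ins for a conventional search backend and a GPT-3-style completion API, not real product APIs.

```python
# Sketch of LLM-augmented search: gather ordinary results, then ask a language
# model to compose one cited paragraph that answers the query.
from typing import Dict, List


def web_search(query: str, k: int = 5) -> List[Dict[str, str]]:
    """Hypothetical stand-in for a conventional search backend."""
    return [
        {"title": "Example result", "url": "https://example.com",
         "snippet": "A short snippet from the page."},
    ][:k]


def complete(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-3-style completion API."""
    raise NotImplementedError("plug in a real LLM call here")


def answer_query(query: str) -> str:
    results = web_search(query)
    sources = "\n".join(
        f"[{i + 1}] {r['title']} ({r['url']}): {r['snippet']}"
        for i, r in enumerate(results)
    )
    prompt = (
        f"Search results:\n{sources}\n\n"
        f"Query: {query}\n"
        "Write one paragraph that answers the query using only the results "
        "above, citing them as [1], [2], ... after each claim.\n"
    )
    return complete(prompt)
```

As the discussion above notes, the hard part is not the plumbing but whether the model sticks to the retrieved snippets instead of filling gaps from its own parameters.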
And I guess next up on your list of commercial developments is what I might call the LLM-ing of all the things. Yeah, it's been two years or so since GPT-3 came out, and the question is, okay, where are the world-changing products that are using GPT? When it came out, it was going to change everything; has it changed everything? I would say, for the most part, no. The products that do seem to show some promise, some of which will appear in the future but have been semi-announced, are writing assistants. Notion AI is the one I think about: Notion is a mainstream product that anybody can use, and it now has this GPT-3-based feature built in that can write to-do lists for you and things like that. I think that's a pretty strong first version of GPT-3 as a commercial product that anybody can use, and I'm quite excited about it. I feel like the timing there is very ChatGPT-influenced. Obviously they've been working on it since GPT-3 came out, but I think they made it available right after ChatGPT, and Jasper has been around for a while, but there are a lot of new writing-assistant-type products, and there does seem to be a step-function increase in energy in the space of using LLMs since ChatGPT, even though they're all based on GPT-3, which has been around for two years. Yeah, I don't know exactly why the timing aligned that way. GPT-3 was announced, but it was a while before the API was rolled out to everybody, and maybe after that it takes a while to make the business case for these things, so maybe it's just the timing of what worked, or people were already working on these on the side and decided, hey, now we get to ride this wave and launch. I don't know exactly which it is. But the fact that it aligns also generates a lot more excitement: people know, okay, ChatGPT is something I've played around with, and now ChatGPT-like technology is being applied to something I actually do, and there's a lot of value in that. Am I detecting an underlying pessimism, like, where's the flying car I was promised, all I have is this GPT-3? It's not so much pessimism. When I saw GPT-3, it became evident to me that this is a great language model, but it was not clear how it, as it is, could be made into a product. It still came with a lot of hype, and yes, it can generate a bunch of things, but we haven't quite seen what the product versions of those look like. I think language models are extremely powerful, and not just as language models; they can be converted into products. I just don't feel we're at a stage where that happens purely through prompting and a little tweaking. There are a bunch of products that will come out of doing just that, but there's a whole slew of products where the language models need to know a lot more about the context they're going to operate in to be effective. Yeah, effective tools. And you mentioned Microsoft; what do you have in mind there?
That was news that came out recently, where they're trying to take a bigger stake in OpenAI, but also generally thinking about having OpenAI-like tools available in Word, available in PowerPoint, and so on. They don't have it yet, but I think that's where this kind of thing is headed. Do you think a ChatGPT-based Bing is a Google killer? Oh, not with that branding; they would have to call it something else at this point. I mean, that seemed to be the suggestion, right? ChatGPT comes out, they take a big stake, and it was mentioned, if not in the official announcement then at least as the conjecture, that there would be some tie-up with Bing, explicitly to target search. I think there needs to be a lot more fundamental work, and we can talk about this in the future predictions, before we're able to kill search just by putting a language model on top of it. That gap is not as simple as replacing something or just augmenting existing search. You have to think about what kinds of things language models can actually do, and you still want to rely on sources and things like that. So I think it's going to happen at some point, but it won't be replacing search, because it'll be a different thing. It's not going to be search in the way we usually think about it; it'll be question answering, or it'll be a helper, or something else, but search it may not be. One quick thing before we jump into predictions: you reflected on your top use case for the year, and that was Copilot. Tell me a little more about how you're thinking about that. Copilot came out probably not exactly in this calendar year, but I feel like it got a lot more adoption this year and started becoming part of the tooling people use for coding, and I personally started using Copilot this year, so I'm going to count it. I will say that before Notion AI, Copilot was probably the only use of large language models that I saw anywhere, so from that point of view it was interesting that GPT-3 came out and then nothing really landed until Copilot. But from a use-case point of view it has been incredibly useful. It has let me do things that have made me a lot more effective as a coder; not that I code much, but when I do, I want to do a lot, and Copilot has let me do that, and that's been amazing. It feels like the right combination of a nice user interface and the right training data to really help people with what they want to do. Now, of course, Copilot has issues: it can produce code that is dangerous or buggy, and of course there are questions of copyright and plagiarism. I hope those things will get resolved, but those are exactly the issues you have to confront when you start using a language model, and I'm glad Copilot is bringing all of them into the discussion by being out there.
Yeah, I've had the same experience with it. I think I've shared this on social media or in a podcast conversation: I saw all the Copilot demos and played around with it on toy problems, and I don't do a lot of coding, but I do tend to binge on coding every once in a while, usually an end-of-year holiday project. I did that this year using Copilot, and it was amazing: the productivity it creates for you, attacking a new problem with new tools without the context switching of going to Google and Stack Overflow. It's incredible; I'm a total believer. Yeah, and that is exactly the kind of thing I expect language models to be useful for. Going back a little bit, people are talking about how people are going to lose jobs, how it's going to change everything, how we'll replace X, Y, Z with ChatGPT, and I don't quite see that happening, but I do expect a lot of people in many different areas to become a lot more productive because of it. Copilot is an example of how language models can make you a lot more productive without replacing you; I don't think it's replacing specific programmers, it's allowing them to do a lot more, and that, I think, is the best use of the technology. Awesome, awesome. Well, let's jump into predictions. What are you most excited about, looking into your crystal ball? So I think ChatGPT is the one that reminded everyone: language models are just trained on data and making predictions, and even if the language-modeling part is, quote unquote, solved, even if you have a really, really large language model, that doesn't mean you're done. One of the biggest aspects of that is making sure that what you're generating is not just BS, that it's somehow valid, somehow true, somehow something you can cite and rely on. They definitely shined a light on how challenging that is. Exactly. So this isn't necessarily a prediction just for 2023; maybe 2023 is when we'll see the first attempts. But being able to generate text that doesn't contain misinformation, that differentiates factual content from creative hallucinations, that can cite its sources and point to, look, this is the paragraph on which I'm basing the text I'm generating: those things are needed, and they're probably the next aspect of language models that becomes a big topic of research. In a sense, how do we get there? Is it applying the same tools, RLHF for example, to this specific problem, or do you think we don't have the tools and it will take new inventions to get us there? There are going to have to be new inventions, and I want to think of it not just as attributing output to specific pieces of text, but as being able to use other tools, other things available to the language model, including while it's being trained.
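To make the tool-use idea concrete before it gets expanded on below, here is a toy sketch of routing a question to a calculator instead of trusting the model to have memorized or computed the answer. The complete function is a hypothetical placeholder for whatever completion API is in use, and real systems let the model itself decide when to call a tool rather than hard-coding a router like this.

```python
# Toy tool router: send arithmetic to a calculator, everything else to the LM.
import re


def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call."""
    raise NotImplementedError("plug in a real LLM call here")


def calculator(expression: str) -> str:
    # Only allow digits, whitespace, and basic arithmetic operators.
    if not re.fullmatch(r"[\d\s\.\+\-\*/\(\)]+", expression):
        raise ValueError("unsupported expression")
    return str(eval(expression))  # acceptable here because of the whitelist above


def answer(question: str) -> str:
    match = re.search(r"what is ([\d\s\.\+\-\*/\(\)]+)", question.lower())
    if match:
        # Arithmetic goes to a tool that cannot hallucinate the result.
        return calculator(match.group(1).strip())
    # Everything else falls back to the language model.
    return complete(question)


print(answer("What is 17 * 23?"))  # -> 391
```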
The model should not have to rely on memorizing facts to any degree; it should rely on using existing tools, including search, maybe calculation, maybe even a Python interpreter, whatever else it needs, while still doing what language modeling does. So there is some combination of being able to refer to external resources while still doing language modeling that we haven't quite cracked yet, and I think that's going to come into the picture. I'll give you an example of how some people have been thinking about it: there's this whole idea of retrieval-based language modeling, where you're still generating the text token by token, but you're always retrieving some set of documents and conditioning on them as you generate each token. That's one step towards what I'm talking about, in that at least you're looking at those documents while you're generating, but it doesn't guarantee that what you're generating is actually grounded in them. You spoke earlier about decomposed reasoning. Is the prediction that those ideas become more real in some way in 2023-24 in terms of what we do with a trained model, or that to get decomposed reasoning we'll push even deeper into the fundamental creation of the model, at training time and so on? More the latter. Right now we expect the model to be able to do decomposed reasoning, but we only ask for it at test time, in some sense. Let's actually start putting that in during training. I don't want to anthropomorphize too much, but when you're training a human to do things, you don't just give them pairs of inputs and outputs; you give them a bit more of a decomposition, and based on that they're able to do what they do. If you want them to use the Python interpreter, you don't expect them to finish everything on their own; they can use the interpreter when needed. So I think of language models the same way: maybe they're still doing the language-modeling task, but they have access to a bunch of other tools. Maybe this is farther out than 2023, but in the long run you want a system that's able to do those things. Your next prediction is around diffusion models; it's kind of surprising that term hasn't come up yet. Yeah, I guess it is surprising, but in NLP in general I feel like we're barely scratching the surface of what diffusion models can do. Clearly in the image-generation space we've seen a lot of progress with diffusion models, and we've seen some in NLP, but not enough. What I find attractive about diffusion models is that they try to generate more than just a single thing at a time. When diffusion models are applied to text, the model isn't producing one token at a time; it tries to produce a whole sentence, or whatever we decide the right unit is. That idea of a model trained not to do one token at a time but to do something bigger really appeals to me, because I feel like a lot of the issues we talk about with language models fundamentally come from the fact that they're trained to do one token at a time.
That is the loss, after all. So if we can have the model be trained to generate more and then give it a loss at that level, that's fundamentally interesting, and diffusion models provide one way of doing that. Would you visualize this as a model that, in a first iteration, spits out bullshit and then successively iterates towards the truth? I guess that's one way this could play out. Probably not literally; it's probably going to happen somewhere in the latent space. But the way I think about it is this: if we were doing this token-by-token thing for images, it just wouldn't make sense, and I'm pretty certain it wouldn't produce the images we see coming out of Stable Diffusion. What such a model learns is something different: given the image I've seen so far, let me predict the next pixel or the next patch. That feels like a fundamentally different task from being able to generate an image as a whole. I think the same idea for text just makes sense: writing the summary in one shot and then seeing how wrong it is feels fundamentally different from "hey, you got a bunch of tokens correct, but you also got a bunch wrong." In some sense there are analogies to RLHF and to using PPO for training, for example, where the losses are designed not on a token-by-token basis but over something longer, and we know how useful those have been. So I feel there may be something in taking that idea and applying it to pre-training itself. Interesting, interesting. I expect a lot of people will be wanting to figure out how to do that. And online updates to models? Yeah, so one of the problems with language models, and let's set aside the grand vision of how language models will use search and all these other things, one of the fundamental problems is that the world changes but they don't. That seems to be a fundamental issue with language models, so thinking about how we can update them every month, or every week, or every day, is an interesting problem and becomes increasingly relevant. A model that doesn't know anything about COVID isn't useful for a bunch of applications, even though there's otherwise nothing fundamentally wrong with it, and that kind of staleness is just not workable; I think there will be research on trying to fix that. So what's the current, I won't say state of the art, but the current approach for doing this at the scale of a GPT-3? Is it collect more data and retrain from scratch, or how do they approach some kind of incremental training ability, if at all?
Yeah, so there hasn't been that much work on that front; I would say it's something that needs a lot more attention. There is parameter-efficient training, which asks how we can slightly change the model without completely changing it: find the set of parameters that should be updated so that you're not updating all the parameters, just a small part of them. One way to think about the fundamental problem is that the transformer isn't a layered architecture like a CNN, where you can just chop off the last layers and retrain from that point; it's a much more complex and interconnected model, so that kind of incremental updating doesn't work so easily. Yeah, there's been some work on taking something like 1% of the parameters, spread over the transformer, and updating them with new text (a small sketch of that pattern appears a little further down), but solving this problem is something that needs to happen pretty quickly. So, to be clear, taking a step back: this is all the looking-forward section, and those three things, misinformation and attributable generation, diffusion models, and online updates, are specifically in your category of the greatest, most exciting opportunities in the field, areas where we're likely to see a lot of research attention and possibly some really interesting results in the next year or two, and also fundamental problems that language models need to address. And that brings us to your top three predictions for the field proper; what do you see there? Yeah, so the first one, and maybe part of it is a little bit of disappointment as well, is multiple modalities. There's been a lot of exciting work, and I don't want to take that away, but to me, after GPT-3 came out and then you saw CLIP and DALL-E and Whisper, and now video models and things like that, I understand technically why they're not the same model, but it's still a little disappointing that they're not. Why is there not one model trained over the same data GPT-3 was trained on, but also over the LAION dataset with all the images and text, and over audio and video and so on? So my near-future prediction is that we're going to see ways of pre-training models that cut across multiple modalities. CLIP was a good early example of what you can do when you have a lot of text and images, but it still didn't have access to a lot of text-only data, and I want a model that can do ChatGPT-like things but also generate images for me and maybe read them out, and so on. So multiple modalities is an exciting opportunity and definitely something that's going to happen. When I first heard you describe this, I thought, well, multi-modal, that was the big thing we were talking about in these trends conversations last year, but you're going a level deeper: you don't just want multi-modal use cases or outputs, you want a single architecture doing multi-modal things. That's what I want. My prediction is going to be a little more grounded, so to say: video, for example, is a more concrete case, and for text-to-video we've already seen some initial versions.
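Here is the small sketch of parameter-efficient updating mentioned a moment ago. It is a rough illustration, not any specific published method: it assumes the Hugging Face transformers library and PyTorch, uses GPT-2 as a tiny stand-in for a much larger model, and crudely picks the bias terms (well under 1% of the weights) as the subset to update on new text.

```python
# Freeze almost everything, then take a few gradient steps on "new" text while
# updating only a small subset of parameters (here: just the bias terms).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # tiny stand-in for a much larger language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Freeze everything except bias parameters.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"updating {trainable} of {total} parameters ({100 * trainable / total:.2f}%)")

# Toy example of text the original pre-training run may never have seen.
new_text = "In 2020, a novel coronavirus pandemic changed daily life worldwide."
batch = tokenizer(new_text, return_tensors="pt")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

model.train()
for _ in range(3):  # a few quick update steps on the new text
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Published approaches choose the updated subset far more carefully, but the overall shape, freeze most of the network and nudge a small slice of it with fresh text, is the same.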
Text-to-video is probably where a lot of the initial work will go. And, you know, I've been really excited about the MineDojo world of playing with text and Minecraft and having an agent that can do a bunch of things in Minecraft. I feel there are things models can learn from images that would benefit even language modeling; there are just a bunch of things in images that we never talk about in text. So from an AI-agent point of view, I think it's useful to think about something that has access to everything. More concretely, though, we're going to keep pushing these pairwise: it's going to be audio and images, and a bunch of other pairs will happen first, but eventually having genuinely multiple modalities, not just more than one at a time, would be exciting. Awesome, awesome. What's next? Next, I'm excited about better training and better inference, better in the sense of being more computationally efficient. This is exciting work that a bunch of people are already doing, but I think it's just going to become increasingly important, from a sustainability point of view, but also for universities to survive and keep doing interesting things, and for small companies to contribute to research. It's important to be able to train these models and to be able to run them, and there's going to be a lot of research on both. And you've got a few examples that we'll put in the show notes; anything you want to point out? Yeah, let me mention two that I saw recently. One of them is a paper called Cramming. Think of the scaling-laws line of work, which asks what you can do as the models get larger; the Cramming paper turns that on its head and asks, what if I have just one GPU for one day, what's the most I can do with that? It's a very different question, but it's a lot more relevant to many more people, because a lot more people have a single GPU for a single day, and they show you can get almost BERT-level performance if you make the right choices, and the paper details what those choices look like. I like that idea: if we were scrappy about training these models, how far could we get? That's a very interesting question that Google and OpenAI are not going to ask, but it might be relevant for a lot of other research. The other one I want to talk about is the Petals work that came out of the BigScience effort. I haven't read too much about it yet, but it seems like a really interesting take on the problem of running really large language models. Even if someone releases a 175-billion-parameter model, how do you actually run it? It doesn't really help most people; even if you have a big cluster, it's difficult. What Petals does is build a framework that takes the ideas behind BitTorrent, distributed computing over volunteers, and brings them to language models, so you should be able to run these hundred-billion-parameter language models distributed over a bunch of commodity consumer computers. I think it's an interesting idea. I haven't played with it enough to see how far you can push it, and it partly depends on a bunch of people also running Petals, but once we get there, it could be a pretty exciting way to run language models.
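This is not the Petals API, just a toy picture of the BitTorrent-style idea it implements: each volunteer peer holds a contiguous slice of the model's layers, and a client pushes activations through the peers in order. Peer discovery, fault tolerance, quantization, and the actual networking are what the real system adds on top.

```python
# Conceptual sketch of swarm-style inference: the "model" is split layer-wise
# across peers, and a forward pass hops from peer to peer.
from typing import Callable, List


class Peer:
    """One volunteer machine hosting a slice of the model's layers."""

    def __init__(self, layers: List[Callable]):
        self.layers = layers

    def forward(self, activations):
        for layer in self.layers:
            activations = layer(activations)
        return activations


def run_through_swarm(peers: List[Peer], activations):
    # In a real system each hop is a network call to a remote machine.
    for peer in peers:
        activations = peer.forward(activations)
    return activations


# Toy "model": six layers that each add 1, split across three peers.
layers = [lambda x: x + 1 for _ in range(6)]
peers = [Peer(layers[i:i + 2]) for i in range(0, 6, 2)]
print(run_through_swarm(peers, 0))  # -> 6
```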
Interesting, interesting. So your third prediction is editing and revising models; what do you mean there? These are a family of models that are not so much interested in generating text as in taking existing text and editing it, and I think this is a very interesting idea that can become increasingly important. In some sense, this could eventually be the way you fix language model output: have another model that takes the output of the language model and fixes it. Some of the work here: there was a paper out of Yulia's group, now at the University of Washington, that looked at summarization. You have systems that generate summaries; how can you take the generated summaries and edit them to correct the factual mistakes they've made? Editing is, let's not say definitely a simpler problem, but in some sense it could be simpler than writing the whole summary from scratch. When you write left to right, you can't go back and revisit something you've already produced, whereas these editing models have the whole picture, to some degree, and all they have to do is fix it so that the picture is consistent. So you could generate something, and maybe this also connects to diffusion models, where you write something that's not quite correct, then revise it and it becomes better. There's a bunch of work in these directions that came out this year, mostly in the second half, some of it earlier, that gathers datasets of edits, or even generates datasets of edits, and trains models that are able to make those fixes. And the prediction, specifically, is that teams will build on this and produce models that can actually deliver on editing and revising? Yes, and I could imagine, for example, an editing model that fixes bias issues, an editing model that fixes toxicity, an editing model that fixes factuality, and these editing models could make web searches and use that information to edit the output. So I could imagine this being a practical way of solving many of these issues. It's a really interesting idea; it's a separation of concerns, almost: the language model doesn't necessarily need to do everything if we can compensate elsewhere. In a way it's decomposition as well: let it generate, and if the way to get something that's not toxic and is accurate is to have another type of model support it, so be it. Yeah, I think that's right, and at least for summarization and other settings where the output is supposed to be factual, I could see it addressing those problems. Of course, if it's generating long text with longer-range consistency issues, it might be a little more difficult for editing models to come into the picture. What I also like about editing is that it's something we can imagine working not only on language model output but on human output, on text that's been written with a writing assistant and so on; you can still do a post-processing editing step to polish it up, and I think that could be great as well.
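A minimal sketch of that post-editing pattern, assuming only some generic completion API (the complete function below is a hypothetical placeholder, and the prompt framing is illustrative rather than taken from any of the papers mentioned):

```python
# Post-editing: instead of writing a summary from scratch, hand a draft plus the
# source text to a model whose only job is to fix unsupported claims.
def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call."""
    raise NotImplementedError("plug in a real LLM call here")


def revise_summary(source_document: str, draft_summary: str) -> str:
    prompt = (
        "Source document:\n"
        f"{source_document}\n\n"
        "Draft summary (may contain factual errors):\n"
        f"{draft_summary}\n\n"
        "Rewrite the draft summary, changing as little as possible, so that "
        "every claim is supported by the source document:\n"
    )
    return complete(prompt)
```

The same wrapper could just as easily target toxicity or bias instead of factuality by changing the instruction, which is the separation of concerns discussed above.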
So our last category in the NLP predictions is the top people, companies, organizations, and teams to watch in the field in 2023, with the caveat, of course, that any omissions here are not meant to slight the work of any particular team. Who has your mindshare, and who are you expecting to see interesting things from in the upcoming year? This is a little bit of a difficult question every year, but one thing I would say, and this is maybe the most obvious answer, is to keep an eye on OpenAI and what they're up to. Once they do something, people always come back and say that what they've done is not so exciting: oh, they only scaled it up, or oh, they only did this one thing. But the fact is that they are the first ones to do it, the first ones to bring it out and make it available, and to get people excited about language models in a way they weren't before. That happened with GPT-2, GPT-3, and ChatGPT, and I'm sure GPT-4 will be the same. I'm sure that retroactively we will all talk about the problems with GPT-4 and how it's only incremental, just more data or more parameters or whatever it is, but I think qualitatively it will bring something interesting to the table, and I'm really curious about what that next interesting thing is going to be. Do you think the general predictions that are floating around, basically a spring release and 100 trillion parameters, is your money on those? To offer a completely different perspective: there's this nice paper that came out a little earlier called the Chinchilla paper, which showed that these models are extremely under-trained and data hungry. So one version of GPT-4 could potentially be not a different architecture and not more parameters: keep it at 175 billion and somehow get ten times the data, if you can get it. So everybody who shared that image with the little dot inside the big circle would be totally wrong. Yeah, the dot would just be replaced with data. And it might still turn out that way. For those not on Twitter, that image has dominated LLM Twitter over the past couple of days. I think when GPT-3 came out, the colloquial articulation of what they did was "trained this language model on the entire internet"; is there 10x more data to train on? I don't know how much they've trained on and how much there is, but there is definitely 10x more data out there, a lot of it proprietary. They could even transcribe a bunch of videos and audio and books; they do have that Whisper model, which is really good at transcribing, so they could use that, and they didn't create it for no reason. They can also go into scientific papers. I don't think the 48 million papers that Galactica was trained on were something GPT-3 was trained on, and that's a pretty valuable resource; the Galactica paper also showed that they were actually better on mathematical reasoning and things like that, so scientific papers may be useful for a bunch of things we don't realize. Where that data comes from is unclear to me, but it is clear that more data is somehow maybe even more interesting than more parameters, and more data could include more RLHF-style data as well; I don't know what OpenAI has gotten there.
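As a rough sanity check on that Chinchilla point: the paper's compute-optimal recipe works out to roughly 20 training tokens per parameter, and GPT-3 was reported to have been trained on roughly 300 billion tokens. Both numbers below are approximations, but they show why "same size, about ten times the data" is a plausible reading.

```python
# Back-of-the-envelope Chinchilla arithmetic (approximate figures).
TOKENS_PER_PARAM = 20            # rough Chinchilla compute-optimal ratio
gpt3_params = 175e9              # GPT-3 parameter count
gpt3_training_tokens = 300e9     # roughly what GPT-3 was reported to train on

optimal_tokens = gpt3_params * TOKENS_PER_PARAM
print(f"compute-optimal tokens for 175B params: ~{optimal_tokens / 1e12:.1f}T")
print(f"vs. GPT-3's reported training set: ~{optimal_tokens / gpt3_training_tokens:.0f}x more")
```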
The other top company I would say to continue keeping an eye on is Hugging Face. I've been constantly amazed by how much they've been doing. One telling signal: EMNLP, which is the top conference in NLP, has a demo track that highlights not research papers but products or demos that are relevant for research, and for the last three years, I think, Hugging Face has won the best demo paper award at EMNLP. That kind of thing shows they've been doing very different things, but also things that are impactful and interesting. They've done many, many things this year, but the first one I want to highlight is the Evaluate system, a whole evaluation framework for reproducing evaluations and evaluating models and making all of that really easy: you can introduce a new metric and evaluate it on thousands of models, easily compare models, easily reproduce papers, and I think that's a really valuable service to research. The other one goes back to where we started, asking what's happening inside the pre-training data. One of the tools they have is the ROOTS search tool, which takes the ROOTS pre-training corpus and lets you search it and find all kinds of things happening inside that data. If you have a specific prediction and you want to know whether there's anything in the training data that looks exactly like it, you can run that search and get results. So they are just being pretty creative and thoughtful about what is useful and building tools around it, and that's really exciting. The last one I'll bring up, and this is something that was top of mind for me this week, so it can change, is a group called Ought, O-U-G-H-T, I believe at ought.org. It's a research nonprofit, and they've been doing interesting things around building tools. They have a tool called Primer, and this goes back to decomposed reasoning: you can give it a question and it tries to come up with an answer, but in the process of coming up with the answer it can do a web search or write a small program, and they do a nice job of visualizing what the decompositions are and what is being done. It's a really interesting use case for language models. They also have another tool called Elicit, which is in some sense a little bit like Galactica, but it's not so much interested in generating papers for you as in helping you do the research for your paper: you have a specific question, and it finds a bunch of relevant papers, pulls out snippets from those papers, and works from there. They have a bunch of tools that keep coming up when I look at decomposed reasoning and systems like these, so I'm really curious about what they're doing, and I'm going to look into it in more detail.
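For reference, the Evaluate workflow mentioned above looks roughly like this; toy labels here, and in practice the predictions come from whichever models you are comparing.

```python
# Load a metric by name and score predictions against references, independent of
# which model produced them.
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(
    predictions=[0, 1, 1, 0],  # model outputs
    references=[0, 1, 0, 0],   # gold labels
)
print(result)  # {'accuracy': 0.75}
```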
Awesome, awesome. Well, I think we're done; you've been a champ, and this has been awesome. It's been fun. You really rose to the occasion of capturing an amazing year in NLP, so thanks so much for joining us. Thanks for inviting me. I think the time we've spent reflects how much of a year NLP has had, and I'm really curious to see where NLP goes from here. I will mention that ChatGPT came out right before, or maybe even during, NeurIPS, so I attended NeurIPS and saw first-hand the reaction of the whole machine learning community, and then I flew to Abu Dhabi to attend EMNLP, and that's where I saw the reaction of the whole NLP community. It's been interesting to see how the reactions span everything from optimism and excitement, which is kind of where I am, like, hey, what can we build with this stuff, to pessimism, where people say it's not really going to change anything, it's just a bigger language model, all the way to, I want to say, some form of denial, where the take is, look, it's a proprietary, closed-off system and therefore it doesn't matter to research, and that's definitely not a view I agree with. So yeah, it's been exciting. There's also a fourth reaction, which maybe is less common, and maybe less so in the research community than in the general sphere, which is fear of the implications. Did you find it less so on the research side? I guess less so, yeah, definitely less so, because there is a little bit of fear becoming more obvious, but because a lot of people in the community have been pointing out problems with large language models for a while, we as a community should know what not to do. It is a little bit scary, though, when people are using it for things where, right at the outset, the question should be, hey, why are you doing this? Yeah. Awesome. Well, once again, Sameer, thanks so much; a really great session and conversation, and I appreciate all the work you put into prepping for it. Yeah, thank you so much, it was fun.
