Reinforcement Learning for Personalization at Spotify with Tony Jebara – 609

All right, everyone. Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington, and today I’m joined by Tony Jebara. Tony is a vice president of engineering and head of machine learning at Spotify. Before we get going, be sure to take a moment to hit that subscribe button wherever you’re listening to today’s show. Hopefully Spotify. Tony, this conversation is a long time in the works. We’re super excited to have you. Welcome to the podcast.

Thank you, Sam. Great to be with you and with all the folks joining us here today. Very excited.

I’m super excited. I mentioned to Tony before we got started recording that I’ve got this memory of, I think, 2017 or something like that at the REWORK deep learning conference in San Francisco. I think it’s the Hyatt that’s up on the hill just by Chinatown. And I think you did a presentation while you were at Netflix about maybe machine learning for breakfast or something. And I kind of harangued you, you were on your way to a meeting out of the hotel, and I’d been trying to get a conversation going for a while. So super excited to finally connect.

Yeah, great. We’re doing it finally. You know, thanks for keeping up, and, you know, hopefully there’s more folks in the audience today than in 2017 as your podcast seems to be taking off. So that’s great.

Yeah, absolutely. Absolutely. There’s no better time than the present. This episode is going to be part of our NeurIPS series, and you participated in a couple of workshops at NeurIPS, mostly around the topic of reinforcement learning. And that’ll be one of the main things we focus on in our conversation. But before we dig into RL in particular, or at least the workshop participation at NeurIPS in particular, I’d love to have you share a little bit about your background and how you came to the field of machine learning.

Sure. So I was, you know, excited by what computers could do a long time ago, before machine learning really was a thing. I was looking at computer vision, looking at a bunch of computer science in general, and saying, how do we automate what people seem to be doing so well, how do they recognize faces and images? And you can write a bunch of rules, thinking here’s how I would do it as a person, and those were quite brittle. And so, you know, when I first started in the 90s, it was write rules for things to make computers intelligent. And then that didn’t really seem to scale. It turns out it’s much better to show it the data and have an algorithm that learns all by itself what to do from examples of real data, real examples of intelligent behavior. And so that was, I think, how I really started. And I was an academic because there wasn’t much going on in industry in the 90s and the early 2000s. But then of course, industry is where a lot of the big machine learning is happening these days. And so I went to Netflix, and I’m at Spotify now, where machine learning really is driving a lot of our advances and is a huge part of our business and our personalization offering.

Maybe talk a little bit about your role as head of ML at Spotify. What are some of the aspects of machine learning that kind of fall under your purview there?

So I’m focused on how machine learning can help across all of Spotify. But in particular, I lead the teams that build the homepage, the search engine, the programming platform, the user evaluation and understanding models, and the content understanding, from music understanding to talk understanding.
And as well, some of the technologies that help us with playlists. And then I also advise all the business units about how they can incorporate best practices of machine learning. So our ads business unit, our, let’s say, messaging and user membership growth business units. All of them really have to work with the same machine learning infrastructure and best practices. And so I’m also tapped into those conversations, to ensure that we’re pushing for the right investments for the long term across all aspects of Spotify’s business.

Any thoughts on how the company’s use of ML has evolved since you joined?

Yeah, so ML has been an important tool in our tool chest, but now we’re increasingly relying on it, as we have it making so many more decisions, more often, in every part of our personalization and our, let’s say, connections between the user base and the content. And we see this huge-scale growth on both sides. So we’ve not only added music to our catalog, but we’ve also started looking at talk and audiobooks and podcasts and live and video as well. And our user base has grown across countries and cohorts and different plans. So the complexity on the people that want to listen, and the creators of the content that should be listened to, that’s growing on both sides. And so machine learning is actually increasingly important to figuring out what are the right, valuable connections to make between the listeners and the artists and creators of content. And so that’s been growing at a great clip.

The other really interesting thing is Spotify kind of started off as a curation engine. You would come to Spotify and organize your own playlists and curate what you thought would be a great audio experience. But over the years we’ve evolved from a curation-first experience to a recommendation and machine learning experience. So we’re still having people curate their playlists and build them and share them with friends and the world. But we’re also creating playlists for you automatically, creating a homepage for you that’s tailored to what you would be interested in, driving discovery, promotions and search that are tailored for you. And so we’re moving very much from a curation-first product to a recommendation-first product. And then even beyond that, a recommendation-first product that also explains to you why you should care about this recommendation. So giving meaning to the recommendation is another aspect, understanding why this makes sense and how to explain it to you and convince you to give it a try. So that’s been the journey. So more and more machine learning throughout every stage of that journey.

Can you talk a little bit about the kind of business value that those recommendations provide for Spotify? I’m thinking back to a paper from Netflix, actually. It may have been while you were there. I forget the names of the authors, but one was the head of product, and there was this note in the paper that even back then, you know, whatever that was, 2015, ’16, they attributed a billion dollars of value to the recommendation systems that were built at Netflix. Can you talk about how the business thinks about the value of these kinds of recommendations? And I’ll maybe contextualize this a little bit with my personal experience as a Spotify user. I’m not, you know, I’m not a big music person necessarily. I’ve got some things that I listen to and, you know, that I like, and I’ve got them in lists in Spotify.
And I usually just go and, you know, play one of those, or play one of, you know, the same couple of playlists. Like I’m not your best consumer of recommendations necessarily. And so I’m wondering kind of how broadly they play out across the business.

Sure. So we’ve been investing in a series of A/B-tested wins. Each test improves retention, reduces churn, increases engagement. And we can take all those tests, where we’ve layered on more and more machine learning, and say, well, what’s the sum value of all of those? The problem is that’s happening while users are coming to the platform and being acquired even, not just retained, but acquired by machine learning. And if you go and survey people, users and, you know, premium subscribers, which is the majority of how we generate revenue, and you ask them, what drives you to Spotify? What makes you sign up for Spotify versus consume and find your music elsewhere or through some other channel or medium? 81% say it’s because of the personalization. And that’s really heavily driven by machine learning. So you can think of it as the majority of users are coming to Spotify and staying at Spotify because of the personalization. And of course, the content is crucial. You can’t personalize when you don’t have content. There’s nothing to personalize. But the bigger the content catalog becomes, the more the personalization matters. So we’re now at, you know, over 100 million tracks, over 100 million podcast episodes. We’ve added hundreds of thousands of books, and we keep going. And so that means personalization constantly gains more and more importance. And the machine learning algorithms find the right thing for you, especially when “for you” means one out of almost half a billion people. That value becomes much bigger as the user population grows, as the content catalog grows, and as we improve and roll out more intelligent algorithms with better wins that are A/B tested.

I guess that brings us to your presentation at the offline RL workshop, which talks about some of the ways that you apply offline RL for personalization. But before we jump into that, talk broadly about the techniques you use. Are you kind of mostly using RL for personalization, or is that one of many tools that you use, depending on the specific scenario?

I’d say it’s one of many tools. So we leverage a lot of machine learning tools. Think of the machine learning treasure chest as having, you know, many, many buckets. RL is one of those buckets. You’ve got deep learning in another bucket. You’ve got causal techniques, causal inference, in another one, probabilistic models. But one of the things we’ve realized is we kind of iterated our way along and realized that we need more and more RL. And that’s because we started doing, heavily, let’s say, multi-armed bandits at the beginning, where you basically do things like try things out, do a little exploration, and then start taking actions that seemed to get good responses from the user. And we did that in all sorts of places in our systems, from our, you know, surfaces like our homepage, to our banners and our promotions. We used these techniques from baby RL, which is multi-armed bandits. This is like, if you have a forgetful RL agent that doesn’t know what state it’s in, then basically it’s a bandit. But that’s maybe chapter one of the RL textbook. There are many more sophistications after that, where you start to say, well, there’s a state. People change as they consume and discover and develop new habits.
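Before the conversation moves past bandits, here is what that “chapter one of the RL textbook” baseline might look like: a minimal epsilon-greedy multi-armed bandit sketch in Python that explores a little and then favors whichever recommendation has gotten the best responses so far. The item names and the 0/1 click reward are hypothetical placeholders, not Spotify’s actual surfaces or reward signal.

```python
import random

class EpsilonGreedyBandit:
    """A 'forgetful' agent with no notion of user state: it only tracks the
    running mean reward of each arm (candidate recommendation)."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon                 # exploration rate
        self.counts = {a: 0 for a in arms}     # how often each arm was shown
        self.values = {a: 0.0 for a in arms}   # running mean reward per arm

    def select(self):
        # With probability epsilon explore a random arm, otherwise exploit.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental mean update with the observed reward (e.g. 1.0 = clicked).
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Hypothetical usage: show an arm, observe whether the user clicked, update.
bandit = EpsilonGreedyBandit(["daily_mix", "discover_weekly", "new_podcast"])
arm = bandit.select()
bandit.update(arm, reward=1.0)
```

As the conversation goes on to note, an agent like this has no state, which is exactly the limitation that motivates moving to full RL.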
And so you can’t just think of it as a multi-armed bandit in the casino, which kind of doesn’t really remember what happened before. You have to really understand that users are impacted by your recommendations, or by what they consume; that changes who they are. And then they come back the next day, and maybe the right decision is actually a little bit different. So then you understand that you’re really not transacting in the moment only with the user. You’re building a journey that’s going to last many months with them, especially if you’re a subscription service like we are. Users are with us for many months, many years. We’re now thinking of our business more about building a journey rather than getting you to just click on something with a bandit, and building a journey is much more of an RL-style problem. If you look at what you’re typically trying to do with RL, you’re trying to play games and get to the end of a maze and so on, or get a robot to do some useful tasks. Well, it turns out RL is also about getting a user to go on a journey and discover new things and enrich the way they use Spotify in their day-to-day life.

And one of the themes that recurs in your presentation in the workshop is this idea of kind of transitioning from a single-step reward to optimizing over the lifetime value of a subscriber. Can you talk a little bit more about that and the complexities that it presents for you?

Yeah, absolutely. So we started by just optimizing for the next click. And if you look at our old-style homepage, old-style search pages, old-style playlists, it really was about just getting that next consumption, that next track, that next play session, which is really kind of click-through rate maximization. You want to get a good click-through rate.

Which makes sense if you’re serving ads, but I guess it’s better than the customer closing the app.

You’re right. So you don’t want to completely give up on instantaneous rewards. You need to do something when the customer opens the app. However, you can’t just always think about what’s the easy thing that gets the immediate click. Because for us especially, the easiest thing to recommend is just listen to what you listened to yesterday. Here’s the exact same playlist. Here’s the exact same tracks. We know it’s not going to be a bust. It worked yesterday, so that’s a pretty good click-through rate. The problem is, if you keep doing that over and over again, and you don’t worry about the user tomorrow and six months from now and how happy they are with Spotify, you realize you quickly wear out this user, because you haven’t layered the familiar recommendations with discoveries and long-term growth. You’ve got to go for the short-term instantaneous reward, but also set up for long-term success, so the user keeps coming back and along the way is enriched and feels more long-term fulfillment at Spotify, because they’re building new discoveries of content, new habits. Maybe now they’re listening to a weekly podcast on Mondays; that is a new habit for them. It’s not just listening to dance music on Saturday, which might be why you started your subscription with Spotify to begin with, but we want to add new habits. Monday afternoon is listening to this podcast. Maybe Thursday evenings a meditation podcast, on Sundays maybe an audiobook. Those habits really keep you coming back for the long term. Maybe they have a lower click-through rate immediately, but once you do click, you keep coming back afterwards.
We’ve unlocked not just the next reward, but the sum of cumulative rewards into the future, which is what RL is really all about.

You’ve got this interesting pictorial illustration of this where you talk about machine learning kind of moving you in circles around your current state, versus RL, which does a better job of getting you to that higher-value state.

Yeah, so I like that picture because it really captures what we actually did see in our data, where users would just go around in these circles of, they play yesterday’s thing, and then they play their typical kind of routine stuff, and they just circle around through their five, six frequent playlists. And that’s great, but if that’s all you’re doing, eventually you’re going to get worn out by that, and you’re not really getting more value out of Spotify. You’re kind of stuck in that rabbit hole, going around in circles, and that’s happening a lot with a lot of, let’s say, recommendation engines. So how do we make you deliberately break out of that rabbit hole and go to a higher-altitude location, where you are expecting many more future rewards because you’re now open to content categories you didn’t think existed before? You know, you’ve discovered jazz, and now you’re going to open up that new source of reward, which is amazing jazz discoveries for many, many months to come. So that’s how we think of altitude. You’re at a higher point where you know much more about the audio landscape and you can consume for the future much more easily.

I’m not sure I thought of myself as stuck as a Spotify user before this conversation, but now I’m going to be super self-conscious about it. And if it’s not the recommendation systems changing my behavior, maybe this conversation will.

We’ll take it any way we can.

When you start thinking about, you know, this broader, longer-term journey with the user and trying to make decisions around that, do you run into challenges with attribution?

Yeah, absolutely. Attribution’s a big problem, because it’s very easy to attribute an instantaneous reward to an action, because the reward shows up, you know, a few hundred milliseconds later.

You presented the thing, I clicked.

Exactly. So then it’s pretty clear that, you know, our recommendation that got a click, you know, a hundred milliseconds later was a good one or a bad one. But when we’re talking about, you know, the user retaining longer and building habits, it’s harder to say it was one specific action which triggered this thing that may actually have needed several nudges, as we call them. You know, I nudged you to try out a podcast once. It got it into your mind. You thought about it but never actually followed through, and then some other time you typed a search query that was related to that podcast, and there it showed up again. And so it takes a few steps before we actually get the user to build a new habit, and then you have to reinforce that habit by showing it in shortcuts at the top of the page. And so that’s kind of a long-term outcome. And it’s harder to say, here’s the thing that actually created the reward. It’s several things in sequence, and it’s pretty easy to lose the attribution when it’s several sequences of actions that led to the reward and the reward is delayed. So causality and attribution become much trickier.

What are some of the ways that you apply causality or causal modeling in trying to model attribution?

Well, I mean, we do track lots of actions.
So we have the ability to stitch together an entire, let’s say, set of actions that led to an actual stream. So we have linkages through our data sets, which lets us follow the user throughout the app and the sequence of actions that led to the final stream, and not just think it was the last page that triggered it. Maybe you searched and went to an artist page, then came back and then searched for something else, came to a different artist page, and then finally clicked play. We connect those little steps along the way. So we have little trajectories, and those trajectories get stitched into longer trajectories. So we’ve built trajectories for all the users, the history of what they’ve done and what they’ve consumed, and that trajectory data is actually a rich data set that’s perfect for things like offline RL, because we see not just action, reward, action, reward; we actually see the whole trajectory of state-action-reward triplets in a time series. And then we can say, all right, this time series clearly is leading to great outcomes and long-term rewards and consistent rewards. This other time series doesn’t lead to a bigger sum of cumulative rewards. So we actually are now working with sequence data and are able to do things with our offline sequence data before going into an A/B test. And we also are building simulators, which is another way to capture this kind of long-term attribution problem. We simulate how our homepage will look and how users will respond to it. We simulate playlists and how the users will skip and play them. And so simulation is another key technology, and both of those were topics of the workshops. So how do you work with offline sequence data, which we have and we’ve logged across our user base? And how do you work with simulation, which we’ve built for our homepage, at least, and our playlists?

Let’s jump into the offline data. How do you do that?

So for every user, we don’t just keep track of, let’s say, the last session. We actually look at the series of actions they’ve taken, the recommendations we’ve made, the rewards they’ve generated for us. And so we actually have time series data. And that’s been valuable. We also use this to build lifetime value models. So we look at users on the service for many months and say, okay, who’s retaining after X many months and who isn’t? And those models look at the history of consumption and also some sequential aspects of consumption and engagement with our app. And that sequence is used to make a prediction saying, how much longer is this user going to stay? How long will they survive on Spotify, for instance? And those are things we were able to build also because we have long-term historical data on retention and what led to retention. So we look at sequences of actions, we look at long-term consumption histories and how they’ve led to retention and survival. And then of course, we build the simulators. But one of the things we’ve converged on is, in a way, lifetime value models have been around for a long time. They’re used in kind of subscription services. We’ve realized lifetime value models are really just the value function in RL, because it’s the sum of cumulative rewards with a discount. And that’s literally what a lifetime value is in businesses, in subscriptions. You know, the lingo is a little different, but they actually turn out to be the same thing, you know, in equation land at least.

Yeah, how does that translate to implementation land?
Like, are you building some model where, you know, at the end of the month, if the user is still around, there’s a $9.99 reward, or whatever that is nowadays?

Yeah, so one aspect is we literally say, each month when you subscribe, that’s a big reward for us, because you say, hey Spotify, you did a good job last month. I’m going to keep betting on you. Here’s my $9.99, or 10 bucks. And so what we’re trying to capture is, you know, we got 10 bucks for this user, but how many more months are they going to keep giving us that 10 bucks? And if I can change that, make it go from 15 more months of 10 bucks to 17 more months of 10 bucks, then I’m really happy. That might not show up for a while, right? That user is going to be around. We’re not going to know it until 15 months go by and, oh wow, they stayed a little longer. But what we’re trying to do is calculate the sum of those monthly rewards. We actually do it over a horizon of 60 months. So we look at, over the next five years, what’s the probability each month that you’re going to stick around. We sum up all those probabilities multiplied by the dollar value of that month. And it’s roughly 10 bucks a month for subscribers. But for free users, it’s coming from their ads and their ad load, and actually it’s a little more complicated. So it’s summing all their future months, with various degrees of ad load as the dollars. And we also apply a discount factor, because a dollar today is worth more than a dollar tomorrow, let’s say. And that’s exactly what the RL textbook problem is: maximize the sum of discounted cumulative rewards. And it turns out LTV is exactly the sum of discounted cumulative dollar rewards.

It seems that in a lot of areas of machine learning applied to kind of business types of problems, you’ve got to create these proxy metrics, because either your actual metric is, you know, too opaque or too difficult to turn into a metric suitable for a machine learning model. Is this an instance where you’re able to more closely map the business metrics to machine learning than in other examples in your experience? Or is it just different?

It is a good proxy metric, as you say, because at the end, yes, a business wants to optimize.

So it’s still a proxy metric, is kind of the first thing you’re saying here. It’s not a holy grail of, like, we’ve fully captured the business need in this computational model here.

Yeah, it’s not the perfect metric. It doesn’t capture every aspect of the business. There’s all sorts of other costs and revenue, and, you know, we’re not trying to put the CFO and his entire organization into one ML model. But it is a very good proxy, let’s say, because it really is capturing, especially for a business like ours where we have subscribers and actually multiple plans, and free and, you know, churns and premium, we’re trying to capture all of that with a model that really summarizes, let’s say, a good portion of the revenue and the margins for the business. But not all of it. It’s not a simulation of the entire business, but it’s actually the best we have, considering, you know, it’s a very complex business at the end. So we’ve captured, I would say, a good chunk of the business complexity with this proxy, but it’s not as good as the actual real data that’s showing up every day, when we actually see the real dollars and the real payouts to the artists and to the creators and so on. So there’s still some proxy there. It’s not perfect.
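To make that arithmetic concrete, here is a minimal sketch of the LTV calculation as described: sum, over a 60-month horizon, the probability that the user is still around each month, times the dollar value of that month, with a per-month discount factor. The survival probabilities, the $9.99 monthly value, and the discount rate below are illustrative placeholders, not Spotify’s real numbers.

```python
def lifetime_value(monthly_survival_probs, monthly_value=9.99, discount=0.99):
    """Discounted sum of expected monthly rewards.

    monthly_survival_probs: P(user is still subscribed in month t), t = 1..horizon.
    monthly_value: dollar value of a retained month (for free users this would
                   instead come from ad load and is more complicated).
    discount: per-month discount factor, since a dollar today is worth more
              than a dollar tomorrow.
    """
    return sum(
        (discount ** t) * p * monthly_value
        for t, p in enumerate(monthly_survival_probs)
    )

# Illustrative only: a user whose monthly survival probability decays slowly,
# evaluated over a 60-month horizon.
survival_probs = [0.97 ** t for t in range(1, 61)]
print(round(lifetime_value(survival_probs), 2))
```

Written this way, LTV is literally the discounted cumulative reward of the RL textbook, which is the connection the conversation keeps returning to.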
A big portion of your talk is kind of reviewing some of your team’s papers over the past year. So the first one you talked about is LTV and survival models. And I think that’s what we just kind of talked about, this idea that LTV is really the sum of the probabilities, or this weighted sum of, here it’s expected gross profit, but I think you map that to the survival model and the probabilities there. What’s the right way to say that?

Yeah, so it’s basically the sum of survival probabilities scaled by the profit for each month. And then you also add a discount factor because, you know, capital has a time discount. So you could think of it as net present value lifetime value. Or you can literally think of it in the RL nomenclature as the value function, V of S, where S is the state. So if the user is in this state, what’s the value? And it’s basically the sum of expected rewards if you play well from this starting state, according to, you know, a good policy. So it depends on the policy, of course. So if you continue to act as we’ve acted, this user in this state will generate the following future rewards. And we’ve been modeling that with these, we call them beta-geometric survival models, because you don’t want to just use geometric. Geometric is kind of, you know, users don’t really flip a coin each month and say I’m going to stay or not. They actually have these more complicated probabilities that depend on something more than just a single coin. And so we look at everything we know about the user, and we actually describe their survival through basically two numbers that are computed from everything we know about them. And those describe the shape of their future survival. And that’s something we published in a paper in 2021.

And then, this past summer, we published a version which extends beyond survival to multi-state. So it turns out, you know, users aren’t in just one of two states at Spotify. They’re not just either subscribed or not subscribed. They actually can have many states. They could be subscribed. They could be in a free state. They could be in a family plan. They could be in a duo plan. They could be churned out. They could be churned out, but still registered. We still have information, they still have emails, and we can still, you know, potentially resume their account where they left off. And then there’s users who just have never even interacted with Spotify whatsoever and have yet to enter any information into our, you know, into our logins, let’s say. And then furthermore, you can slice those states into more granular states of, is this user in this country or that one, and we can keep going. So we’ve extended the survival modeling to multi-state survival. And then that starts to look like multi-state reinforcement learning. And also a lot of the lessons learned from survival map to this kind of multi-state world. So, you know, LTV was really about a binary survive / don’t survive. We’ve extended it to multi-state. And it actually now looks much more like a nicer connection with RL, because RL almost from day one was multi-state to begin with. It never was just a binary state.

And in that paper you get into talking about categorical distributions and Dirichlet distributions. Where did those come into play?

So, just like, you know, with whether you’re going to stay subscribed or not stay subscribed, we said it’s kind of like a user flipping a coin.
Each month, the user’s flipping a coin, and if it lands on heads, they churn; if it lands on tails, they stick around for another month. That model is not perfect, because it turns out users aren’t just flipping a coin each month. Think of it as they’re drawing a coin from a coin factory and flipping that coin each month. And that coin factory is called a beta distribution. And then the coin flip itself is like a Bernoulli event. The coin factory is for their next state, and the flip is whether they go there or not. So what we’re doing for each user is trying to predict what kind of coin factory you are running as a user. And each month you grab a random coin from that coin factory, you flip it, and that decides what you do. And so that was the analogy. And it turns out that fits the data way better, if each user is described as having their own coin factory. And that fits the data way better than saying each user has a secret coin that they flip.

So it’s analogous to the beta survival model, where you had these two parameters that you’re trying to figure out for each user. Now you’re trying to figure out a coin factory number of parameters. How many parameters characterize a factory?

So the beta is the coin factory. And then the Dirichlet is the dice factory. And so when you have multiple states, you don’t just flip a coin, you roll the dice. And so I’m in state one, two, three, four, five or six. And it turns out users don’t transition by just rolling, you know, the same dice. What they also seem to be doing is they have their own dice factory. They grab a die from it every day. The dice are slightly loaded differently. And then they roll the die. And that actually fits the data better. So it turns out, you know, human beings are not a single die. They’re not a single coin. They’re acting more like a factory of these things. And there’s a distribution of dice or a distribution of coins. And that captures the dynamics of multi-state transition better than what we saw with just the simple models, like the Markov models and so on, that are, you know, single-die and single-coin models. And that’s what we’re using in our systems.

Got it. Got it. And so the next paper you talked about is the RL and temporally consistent survival. So this sounds like an extension of the idea to temporal consistency. What were the challenges that you were looking at there?

So this was kind of like the last piece, you know, putting the bow around the connection between LTV and RL. And so this was published last week at NeurIPS. And what we said was, these survival models are great. They look like RL, kind of, as well. But there’s one aspect of RL which is missing, which is, when you estimate a survival model — let’s estimate my survival model for today. Tomorrow, you’re also going to look at my data and estimate my survival model. Those two survival models should be consistent. You shouldn’t estimate completely different survival from one day to the next. And yes, maybe they can start to change a little bit, because today, maybe I discovered one more great podcast, the TWIML podcast with Sam, and so maybe now I’m going to survive much longer. But there should be some consistency in time.
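Before getting to temporal consistency, here is one way to picture the coin-factory and dice-factory analogy in code. This is a toy simulation of the analogy exactly as told here, with a fresh coin or die drawn each period; the per-user Beta and Dirichlet parameters below are invented for illustration, whereas in the papers the per-user parameters are predicted from everything known about the user and fit to real retention data.

```python
import numpy as np

rng = np.random.default_rng(0)

def coin_factory_retention(alpha, beta, horizon=60):
    """'Coin factory' churn: each month draw a fresh coin from the user's
    Beta(alpha, beta) factory and flip it. Heads means churn, tails means
    the user stays another month. Returns the month of churn (or the horizon)."""
    for month in range(1, horizon + 1):
        churn_prob = rng.beta(alpha, beta)   # this month's coin
        if rng.random() < churn_prob:        # the flip
            return month
    return horizon

def dice_factory_transition(concentrations, current_state):
    """'Dice factory' multi-state version: draw a loaded die over next states
    (free, premium, family, churned, ...) from the user's Dirichlet factory,
    then roll it to pick the next state."""
    die = rng.dirichlet(concentrations[current_state])  # this period's die
    return rng.choice(len(die), p=die)                  # the roll

# Hypothetical per-user parameters, not fitted to any real data.
print(coin_factory_retention(alpha=1.0, beta=20.0))
states = {"free": 0, "premium": 1, "churned": 2}
concentrations = {states["premium"]: np.array([2.0, 40.0, 1.0])}  # a sticky premium user
print(dice_factory_transition(concentrations, states["premium"]))
```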
And if you enforce that consistency, you actually get a much better estimator for these survival models, which works better than just fitting them to the data with maximum likelihood, which has been how we did this before, or Bayesian kind of marginal likelihood. So if you enforce temporal consistency, everything also seems to work better. And that was an aha moment. And that led to, again, a performance improvement in our models. So you add temporal consistency to survival models, and you go from coin flips and dice rolls to factories of coins and factories of dice. Both of those two ideas really seem to improve how well these models fit our real human data. And those are the lessons learned. And it turns out those bring survival modeling very close to RL. And we feel like now, you know, there’s almost kind of a one-to-one correspondence between the two communities, where you can say, okay, I’ve got this concept of temporal consistency. Oh, that’s related to, you know, how RL enforces Markovian dynamics and Bellman equations and temporal difference learning. So there’s kind of a nice correspondence between these two technologies that have existed in very different communities, all mapping into one real big framework that’s consistent.

That strikes me as a lot of evolution in the sophistication of the way you’re applying RL to your problem over the course of just a year or two.

Yeah, I mean, we have researchers thinking about this and trying to connect it. What’s great about Spotify is we’re not just building, you know, science for science’s sake. We’re really thinking about the business, thinking about our users, how we can give them the most value and understand their behaviors, as opposed to just building, you know, algorithms in an isolated way.

Yeah, yeah.

We’re also testing some of these things now in production and seeing the benefits. So some of the learnings are, now that we’ve understood these LTV models and we start connecting them to value functions in RL and Q functions from RL, we’re now understanding how they can help us better make recommendations, now that we think about our recommendations as an RL problem. And what are we trying to maximize in RL? You’re trying to maximize the sum of future rewards, or maximize the Q function, really. That Q function we can now start to understand better, and realize, for our domain, that the Q function is a combination of getting you to click, but also giving you something that’s very valuable for the long term when you do click. And so this was in the last paper that we presented, and it’s actually encouraging us to view recommendation as not just maximize click-through rate, but maximize the click-through rate of something that’s going to continue to generate, let’s say, a long-term sum of rewards, so some high-value consumption item. So don’t just show me something I’m very likely to click on, but something that, if I do click on it, is going to increase my lifetime value by a big amount. And so that’s how we’re shifting our recommendations now.

That’s the “Optimizing Audio Recommendations for the Long Term” paper.

That’s right. So we’re realizing that we’re not going to just get things that are clicky, but get things that are clicky and sticky. So once I click on it, I’m going to keep coming back to it. It’s going to be something that becomes a habit for me, as opposed to I click, consume, and forget, and there’s no real change in my long-term value once I do click on something.
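A toy sketch of that “clicky and sticky” idea: instead of ranking candidates purely by predicted click-through rate, weight each candidate by the long-term value it is expected to unlock if it is clicked. The scoring formula, field names, and numbers here are hypothetical; the production systems estimate these quantities with learned models rather than hand-set values.

```python
def long_term_score(candidate, ltv_weight=1.0):
    """Roughly: expected value = P(click) * (immediate reward + expected
    downstream value unlocked by the click)."""
    return candidate["p_click"] * (1.0 + ltv_weight * candidate["ltv_uplift_if_clicked"])

# Hypothetical candidates: a familiar playlist that is very clicky but adds
# little long-term value, versus a new weekly podcast that is slightly less
# clicky but could become a habit.
candidates = [
    {"name": "yesterdays_playlist", "p_click": 0.47, "ltv_uplift_if_clicked": 0.2},
    {"name": "new_weekly_podcast",  "p_click": 0.45, "ltv_uplift_if_clicked": 3.0},
]
for c in sorted(candidates, key=long_term_score, reverse=True):
    print(c["name"], round(long_term_score(c), 3))
```

Under a pure click-through objective the familiar playlist wins; once the post-click value is included, the habit-forming podcast ranks first, which mirrors the 47% versus 45% trade-off described a bit later in the conversation.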
So we’re trying to show you something you’re likely to click on and try. But if you do try it, it also will increase your LTV. And that’s kind of how we’re shifting our recommendations now.

So can you talk about the process of going from the research to actual recommendations? You know, there’s this one idea of, hey, we’ve got these researchers, they identify these methods, we take these methods, we implement them against our data, they produce these models, and poof, the models will recommend the content. What you’re describing here is the models are informing you about ways to think about how to make recommendations, and then you make different recommendations. I’m not hearing you put models in production that make different recommendations. Like, bridge the gap for me around this. Are we just talking time scales, or…?

No, you’re right, this is a longer process. I’m kind of, you know, jumping to the conclusion. But the reality is, the way this starts is we have the ideas, we write down, you know, some modeling assumptions and some, you know, aha moments, maybe a paper, we build a prototype. Then we maybe take that prototype and refine it with the offline data that I talked about. We get some good offline results. We may even put the model through the simulator and see how well it does in simulation. Then we actually get it to be a productionized model that we can run live on real users. And what we then do is we A/B test it. We say, let’s run this model in kind of a side-by-side horse race against what we’re already doing from last year, let’s say. So we’ve got model A and model B running: half the users get model A, half the users get model B. And then that’s really how we evaluate. We don’t just stop offline or stop at the prototype. So then we have these two production models running side by side. And then we actually say, do we see, after, you know, X many days of running this model side by side, better engagement? Are people sticking around longer, retaining better, after a couple of months of horse racing these models side by side? And it turns out we do see better long-term metrics with these kind of RL-inspired models. And it turns out what the models seem to do is they have a lower click-through rate, the new models, but long term, the users are streaming and retaining better. So I’ve given up on showing something that, you know, you’re going to click on as often. But what ends up happening is I get to show you things that maybe have a tiny bit lower click-through rate, but once you do start clicking on them, then you keep coming back to them and they become habits. And then you’re listening more on Spotify and you’re retaining better on Spotify. And so that’s kind of what we’re going after. We’re going after the long-term outcomes. It’s okay to go from a click-through rate of, let’s say, you know, 47% down to a 45% click-through rate. What I’d much rather see is, okay, but then two months later, I’m actually getting more listening in total, even though my click-through rate is lower. Because I’m actually building longer-term habits and coming back to that same podcast. It’s a habitual podcast. It’s a new way of engaging with Spotify. Maybe now I listen to a, you know, a meditation podcast in the evenings, and I listen to my news podcasts in the morning on my way to work. So all these new habits now have been added. They might have actually had a lower click-through rate in the moment, but long term, they generate more engagement for that user.
The user spends more time on Spotify and retains better on Spotify at the end of the 60-day or 70-day trial. And then we roll it out so that everybody gets that better experience.

And so is that cycle something that, you know, the models produced by this RL approach have kind of been through that full cycle, and, you know, that happens on some frequency? Or, you know, given that we’re talking about long-term value here, is this also a long-term assessment process and the jury’s still out on the models — you’re liking what you’re seeing, but it’s not a full-fledged commitment to this particular approach?

Yeah, that’s a great question. So some of these things are actually fully rolled out in parts of our product. So we’ve got a fleet of machine learning systems. Some of them are now completely on this kind of approach. So we tested it, A/B tested it, it was a win, we rolled it out. And so now this is the default approach in some parts of the app. Other parts of the app we’re still testing. Other parts of the app we haven’t even tried it out in yet. So the approach has legs. It doesn’t mean it works everywhere, but we’re past the stage of just prototyping and trying stuff in simulation or offline. It’s getting real users giving it a thumbs up. And in some parts of the app, it’s actually fully deployed as an approach.

Yeah. So this is an approach that is kind of being pioneered on the research side. Are there ways that your data scientists and machine learning engineers need to think differently about modeling to, you know, use the RL types of approaches, or to kind of embrace what you’re doing here? Or is it just another tool in the toolbox for them?

So it’s a great question. We are obviously trying to make some of these things, let’s say, easy to reuse and try out in different places, so you’re not starting from a blank slate in other parts of the product when you want to try out these techniques. But really, this is a multi-disciplinary endeavor. We worked first off with researchers, and we spoke to users and got user research even, to tell us people liked this idea of, you know, recommendations that aren’t just clickbait for the moment, but are actually great for the long term. And then we took those intuitions and fleshed out research prototypes, and those prototypes have to look promising. We brought in engineers who could build the scalable, productionized versions of them. We A/B tested them, we got data scientists to look at those results. The data scientists say, okay, this is what’s happening in the metrics. This is how they move. This is what we recommend: use this setting of this algorithm, that’s what we would recommend rolling out. You know, and product managers are also involved. So it really is all the expertise coming together. And it’s not just researchers doing great research and then throwing it over the fence. They really sit down with the engineers, the data engineers, backend engineers, machine learning engineers. So I would say that’s kind of the Spotify way. It’s bring all the skills necessary for the problem to the table. So then we go from end to end. We go from an idea to an actual user-facing, productionized value add.

Awesome. Awesome. That’s a great case study in real-world applicability of reinforcement learning. It’s been an interesting topic for folks for a while. And I also hope this is the first of many conversations.
I’d love to have you kind of close us out with just what you’re most excited about in terms of the future of this particular work, where you think that goes.

Well, I mean, I think reinforcement learning is about doing things with human feedback and what really matters to people for the long term, not just building algorithms like click optimization algorithms, which is maybe where the internet started. But where’s it going? It’s going toward long-term user feedback and human feedback. And we’re trying to do that now at Spotify. We’re seeing other companies do that. You know, for example, people are fine-tuning their large language models with RL from human feedback, so they start to do more intelligent things. So we’re really viewing reinforcement learning as a way to incorporate more valuable human feedback into how the algorithms behave. And I think this is maybe now a nice inflection point where RL is moving out of the textbooks into the real world more and more. So I’m very excited that it will help us build algorithms that are actually more long-term intelligent, and not just kind of clickbait-like, myopic click chasers.

That’s awesome. All right, well, thanks so much, Tony. I really appreciate you taking the time to chat with us.

Thank you, Sam. It was great talking to you today.
