Engineering Production NLP Systems at T-Mobile

All right, everyone. Welcome to another episode of the TWiML AI podcast. I am your host, Sam Charrington. And today, I’m joined by Heather Nolis, a principal machine learning engineer at T-Mobile.

Heather, welcome to the podcast. Thanks for having me, Sam. I’m excited to be here. I’m excited to chat with you. It’s going to be a fun conversation about your journey with machine learning at T-Mobile. But before we jump into that, I’d love to have you share a little bit about your journey in machine learning more broadly.

How did you get into the field? So my undergraduate degree is actually in neuroscience. And when I was studying neuroscience, I had this study where I had to keep rats alive for a year and measure their blood pressure. And at the end of my study, I had this notebook full of data where I very diligently had written down every single thing about all of these rats. And I was so excited about my results. And I gave them to the PI of my lab. And I was like, here we go. And she was like, great. Now you can pass this off to our analytics team, who will analyze your results.

And being, like, a micro-managy person who likes to know everything, I almost blew a gasket. I was like, what? My data? I’m perfectly qualified. And she’s like, are you? And I was like, I will become qualified, watch. And so I went and took my first Python courses. I took some bioinformatics courses. And at that point, I was like, what do I really want to do? Well, I kind of fell in love with computer science at that moment, and I imagined I would end up with a Ph.D. in molecular neuropharmacology, doing big data that way. So I went to get a master’s in computer science.

And while I was getting that master’s, I started working at T-Mobile. And while I was there, the team that I was on said, oh, we’re thinking about doing some AI proofs of concept. And my hand just shot straight up in the air. I was like, big data, it’s the whole reason I started programming. All of my side projects are in NLP, like, please pick me. And so that’s kind of the story of how I got into doing this professionally.

Awesome. I think we’re going to have to call this episode machine learning, comma, cooler than rats. It is. It is. So yeah. And so I guess we can start from the beginning then. You were there. Was this proof of concept the beginning of all ML at T-Mobile, or in your particular corner of T-Mobile? Yeah. So what had been happening is, I think, the same thing that many large companies had, where you had huge data warehouses, lots of analytics people doing decision science to put numbers into PowerPoints to help executives make really smart decisions. But we didn’t have any real-time models.

And we weren’t doing anything that I thought was super cutting edge. This is five years ago, so at the time, that meant deep learning. And so my team’s goal in this proof of concept was to put T-Mobile’s first real-time deep learning model into production. And that’s kind of where we’ve staked our claim. It’s like, we do things real time. We focus on more cutting-edge-style solutions. And we don’t do a lot of that batch analytics. We are the real-time AI team. Got it. And how big was the team at the beginning?

So for the initial proof of concept, there was me and one other engineer, and we were like, we need a very strong data leader in the mix. So we recruited my wife, Jacqueline Nolis, who has been on your podcast before. And so she helped us with that original proof of concept. And it was the three of us that took that first API into production. Wow. So how that happened is kind of a cool story. If I can digress, how we even got to do this proof of concept was that T-Mobile has, like, an internal Shark Tank innovation round. So we pitched all these ideas, and then they gave us $100,000 and three months to pull something off.

Okay. And so that’s when we developed our first NLP deep learning model. We got it released into production. And then at the end of the innovation round, you pitch to executives, trying to get them to buy your product forever. And we were bought. So after that, we said, okay, now we have to build a full-scale product around this, get it in front of our frontline to actually be used in contact centers, and scale it. And so, maybe we’re going further down the rat hole, coming back to rats again, but how was that $100,000 spent?

Was that, like, salaries, or were there other hard costs associated with this POC? It was mostly salaries. Some of it was also training, because we had the question of, should we be building this stuff ourselves? Of course, I think we should. But we wanted to make sure that we weren’t doing T-Mobile a disservice by not considering a lot of these big vendors that claim to offer intent modeling as well. So some of it was in trials with those sorts of things. And then a lot of the rest was just staffing and research. And so, talk about this thing that you pitched, the product.

What was it seeking to solve? What was the use case? So the team that I was on was focused on building all of the software plumbing that connects the experts in contact centers solving T-Mobile problems to the customers that are trying to talk to us via digital means. So this is mostly Facebook, Twitter, our app, however they’re typing to us. We built the plumbing for all of that. And so when it came time to say, well, T-Mobile wants to do some AI stuff.

What should we do? Of course, I say, we are sitting on top of tons of conversation transcripts. We have all of this data of our customers telling us exactly what their problem is. And so our very first, like, just proof that we could even build a deep learning model was a simple intent model. We had 88 different intents that we identified, or topics really, throughout T-Mobile that people are really calling in about.

And so it was developing and deploying that first model. And the product that we stood up around it: one messaging expert who’s responding to Twitter, Facebook, app messages, web messages might have 10 different windows open at one time, because these chats are asynchronous, and context switching is really difficult. So actually showing the topic of the conversation, and whatever quick facts we can pull up about this person’s account related to that topic, is super, super useful, because then they don’t have to prep themselves as soon as a customer comes in.

We’ve already got their first message. We’ve calculated what their intent is. They’re coming in, they’re looking to talk about upgrading their phone. We’ve actually gone out and pulled the data to say what phone they have now. And then we’ve surfaced links that say, in case you don’t know how to upgrade someone’s phone, here’s actually how you do it. And so that’s the first product that we came out with, and we called it Expert Assist. Now blank-assist is industry-standard terminology, but I will confidently say we coined that one. And so, you had a bunch of data. Was your data already labeled? I presume it wasn’t already labeled for something, or no?

What was it? Yeah, I’m so glad you asked about that, because when we talk about where that first $100,000 went, so much of it went to the most expensive labeling job of all time. T-Mobile’s very into customer privacy, so there was no way they were going to let us have an external vendor label our data. And at the time, they were like, we don’t even want you to bring on anyone new, because they might not have the business knowledge to accurately label data. So we paid data scientists to label our first 20,000 conversations manually, which, yeah, most expensive labeling job of all time.

We now have an internal labeling team of five data annotators, so a much different story now. And so what were some of your first steps in kind of building this out? The first thing that we said is we wanted to try unsupervised learning, right? We didn’t want to have to take a supervised deep learning approach if we could do something with unsupervised learning. But using LDA, and almost any unsupervised approach that we could, what it really came up with is that there are two major types of conversations that happen at T-Mobile.

About 80% of them are about the topic T-Mobile, and about 20% are about the topic phones. So that’s when we said, okay, I don’t know about this. And I actually found that even if you set your number of classes to be much bigger than that, it’s still just different ways to say phone and different ways to say T-Mobile. Yes, or different types of phone. And that’s not necessarily business actionable, right? Like, we weren’t seeing any business-actionable classes shake out. And I talked to another team who has managed to do this sort of modeling in an unsupervised way.
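As an aside, the kind of LDA pass Heather describes can be sketched in a few lines with scikit-learn. This is purely illustrative, not T-Mobile's pipeline; the example messages and the choice of ten components are made up.

```python
# Minimal sketch of an LDA pass over chat messages (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

conversations = [
    "hi I need help with my bill",
    "my phone screen is cracked, can I upgrade",
    "is there an outage? my signal dropped",
]

# LDA works on raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(conversations)

# Even with a larger n_components, low-information chat text tends to
# collapse into a few broad themes ("phone", "T-Mobile"), as described above.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {idx}: {top_terms}")
```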

But it involved, like, a Rube Goldberg machine of data pipelines to get somewhere. And so we said, we don’t have time to think about that. At this point, let’s label 20,000 conversations and see what we can do. And you were ultimately hoping for your intents, these topics, to pop out of this unstructured, unsupervised process. Yeah. And the thing that was really important is, because we wanted to do real-time stuff, topics like phone, T-Mobile, even the type of phone someone’s talking about, are not useful in real time to an expert, because you have a human being sitting there also listening to the conversation.

Like, they very well know that it’s about an iPhone, you know? So for any of the classes that we create, we want to make sure there are actually business actions we can take there. And so the unsupervised thing didn’t pan out. And so what did you move to? So after that, that’s when we decided to do supervised learning. But then we came to the problem that everybody has, which is, what is our taxonomy going to be? Like every legacy company, we had taxonomies that already existed. I believe that each taxonomy should be created for the specific problem that it’s looking to solve, instead of shoehorning as many meanings as you can onto classes that are already created.

And so we had to spend a lot of time advocating against reusing taxonomies built for post-call analytics. I had to prove that the topics that we need in real time are not the same things that are useful after the call to say what people are generally talking about. And that took a very long time and was very difficult. So proving that we needed to actually develop our own taxonomy for this problem was very difficult, because from a business perspective it’s introducing confusion.

We have one dashboard, we know what all these things mean, please don’t make us learn another one. But after we won that argument, we asked our product and business people to go basically lock themselves in a room and come out with some sort of hierarchical taxonomy for us. And then we refined that taxonomy during our labeling process. Since we had data scientists doing the labeling, we were able to point out when classes were going to be easily confused based on the language being used, or to call out a class where, like, nobody ever talks about that.

So an example is, they had five high-level categories for what different topics would be about at T-Mobile. And one of the categories was network. And customers don’t have a lot of nuanced language to speak about the network. They really just say, hey, I have a question about my network or my signal. So they had spent time creating, like, 16 subclasses of network conversations, only for us to have to come back and say, you wasted your time. Our customers only speak vaguely about the network, because they’re not engineers.

Yeah. Yeah. So going back to this idea of, you know, building the case for your own taxonomy, in retrospect, was there a silver bullet? Like, how did you win that argument? Or for someone else who’s maybe embroiled in that now, like, what? Yeah. Well, the first thing I will say is, if you’re going to have this argument one time, you’re probably going to have to have it a hundred times. So for me, like, I won this argument once, and then I’ve had to win it every time since. And what I keep going back to is, I think that business users often forget that modeling is kind of the easy part of data science. So when we talk about creating new models, new taxonomies, they get very nervous.

But for me, what takes a long time on our projects is figuring out the business case. Can we actually bring value if I build this model? I can build shiny models forever that sound really cool but aren’t making anyone money. And so that’s what takes a long time. So one of the things that’s really helped me is being able to document that and say, you’re scared of us changing the taxonomy because you’re afraid it will take time.

That’s not the bulk of the time, so you don’t need to be scared about that. And then the second thing is just trying to drive home a culture of small models for small problems: build things specific to your use case to answer it exactly. Otherwise, you will get a deteriorated product. And then we also took all of the enterprise taxonomies and lined them up against each other, and I was able to show places where I needed a piece of information that the old taxonomies did not have inside of them. A good example there might be: sometimes when people can’t pay their bill, they set up a payment arrangement.

And sometimes they can’t pay their payment arrangement. And for me, that can’t-pay-payment-arrangement is very important, and I had that as a separate class. And in the other taxonomy, it was all just billing. And it’s like, billing might be useful for a PowerPoint when you need to know how much time call center experts are spending on billing, but it’s not useful for helping an expert solve a call about billing while it’s happening. And that was kind of the light bulb moment. But every time we go to build a new model, I do have to have this fight again. And so I have, like, a pre-planned deck that just kind of lays it all out.

Hmm, nice. Talk more about this idea of small models versus big uber models. Yeah. So our very first model that we released, I mentioned it, it had 88 different classes and was kind of a disaster. It still exists. It’s still running. Like, it’s still running. We are doing the giant refactor on it right now. But I learned a lot in doing that. So the first thing was, we were really dedicated to doing a representative sample of the data. So we did not subsample anything. It was a truly representative sample of conversations we labeled.

And so we labeled this. But it turns out that many business-actionable things are small and rare. So great that I picked the 88 most frequent topics, but what they actually need is to know when somebody is requesting a password update, and that’s nowhere in my 88 topics. And so that’s when we started saying, well, if we need password questions, let’s build specific models and then talk about how we can route to those specific sub-models if necessary. But because that’s kind of hard to picture, I’ll give you an example of my favorite small model that we have, a natural language model.

And I like it because it’s an example of a time that we removed a chatbot from people by being smart. So when you message us, we used to have… Well, thank you, for those of us who didn’t want to use that chatbot. Right! Yeah, that’s why I like it. I’ve built chatbots before. I think that there’s a place for them. But sometimes, when you can get away without it, don’t. So it was, whenever you would message us for the first time, like if you were normally someone who called T-Mobile but this time you’re texting us, we would send you a picker that said, are you a customer, select yes or no. And the reason why we had to do that is some…

You would think T-Mobile knows your phone number, so we should know if you’re a customer or not. But we actually don’t in many situations, depending on which app you’re coming to us from. And what we quickly realized is, asking people yes or no there is very silly, because we can really tell from their first message whether they’re a customer or not. If they message us and say, I need help with my bill, nobody who’s not a customer needs help with their bill. But if they say, I want to switch to T-Mobile, nobody who is a customer says that. And so we were able to just take the first messages and the picker responses that were already selected, build a really quick, shallow neural network between the two, and eliminate that picker for 80% of customers who chat with us.
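A minimal sketch of what such a first-message classifier could look like, using a single-hidden-layer network over TF-IDF features as a stand-in for the shallow neural net described above. The messages and labels are made up; in practice the labels would come from historical picker selections.

```python
# Illustrative sketch of an "are you a customer?" first-message classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

first_messages = [
    "I need help with my bill",
    "my data stopped working this morning",
    "I want to switch to T-Mobile",
    "what plans do you offer for new customers",
]
is_customer = [1, 1, 0, 0]  # 1 = existing customer, 0 = prospect (from picker answers)

# One small hidden layer keeps this "shallow", in the spirit of the quick model above.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
model.fit(first_messages, is_customer)

print(model.predict(["hi, I'd like to join T-Mobile"]))      # likely prospect
print(model.predict(["why did my bill go up this month"]))   # likely customer
```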

Going back to the previous example, you had this 88-class model and you’re kind of in the process of refactoring it out into smaller models along this idea. What do you hope to replace it with? Is the router still an 88-class classifier? If it’s a learned router… or are you looking at heuristic approaches as opposed to learned approaches? For us, one of the major things was that when we created this out of a representative sample of data, we weren’t able to get the classes that business users are asking us for.

If they’re asking us for something that occurs 0.001% of the time and we keep pulling this representative sample of data, we would have to label millions of conversations before we have enough data for what they want to even show up. So one of the major things in our refactor that we’re focused on is reducing the number of pieces of labeled data we need to create these new intents or topics. We still have the same, well, we’ve done some taxonomy refactoring since then, but we are shifting from our own in-house neural network to using DistilBERT from Hugging Face as a baseline and then retraining on that.
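For readers who want to see the shape of this, here is a rough sketch of fine-tuning DistilBERT for intent classification with the Hugging Face Trainer API. The CSV path, the "text"/"label" column names, and the label count are hypothetical placeholders, not T-Mobile's actual setup.

```python
# Rough sketch: fine-tune distilbert-base-uncased for intent classification.
# Assumes a CSV with "text" and "label" columns; all names are placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

NUM_INTENTS = 88  # or far fewer, per sub-model

dataset = load_dataset("csv", data_files={"train": "labeled_conversations.csv"})
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=NUM_INTENTS
)

args = TrainingArguments(
    output_dir="intent-model",
    num_train_epochs=1,  # per the conversation, one epoch already gave usable results
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=dataset["train"]).train()
```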

Then we’re able to reduce the number of labeled data points we need per class dramatically. Before, it would be 5 to 10,000, and now we’re at 1,000, which makes us faster for our business. That’s the major thing that we’re focused on. The other thing that’s changed since we released is, we originally released our models in R. So it’s just R in a Docker container as an API. They ran pretty okay. We were doing two million responses a day, but this is kind of where we switched to voice. So we were building this all in messaging. We had to have 20 containers of our model, but it was serving two million responses a day, and then the business came to us and said, it’s really cute what you’re doing here for all of the customers contacting us in messaging, but 90% of care traffic is in voice.

So they said, okay, we want you to build this where it will work for people who call in as well. Okay, well, we can’t have over 200 pods of our topic model running when this goes to production. What do you mean, just sprinkle some Kubernetes on there? It would be the most expensive thing of all time. Well, and the thing is, messaging conversations tend to be pretty succinct. If you’re chatting with somebody, you’re direct. On a voice conversation, you’re going to talk about the weather and your kids. And so even 200 is not a good estimate, because it really just depends on how chatty people are. And so that’s when we said, okay, we need to do something smarter with how we’re serving these models. And what we ended up doing is, they are now deployed as Java Spring Boot services that run Python as a sidecar. They have API endpoints, but they’re also Kafka consumers and producers.

So we use a Kafka streaming architecture, because we have so many predictions that we’re making at this point. Was Java just an engineering standard there that folks were comfortable with running in prod? Yes, like, we have people who like Java, but we did try Python first, because we were like, our data scientists know Python, let’s do this in Python. At the time, the Python streaming libraries for Kafka were not mature enough to do the joins that we need to get all of our data filtered correctly. And so we ended up switching to Java because of that. But it’s something that we do look at every once in a while, to see how mature those Kafka streaming libraries for Python have become, because once they can do really great streaming joins, then we would love to get Java out of our stack.
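To make the consume-score-produce pattern concrete, here is a minimal Python sketch using confluent-kafka. It is illustrative only: the production services described above are Java Spring Boot with a Python sidecar, the topic names and prediction function here are hypothetical, and this sketch does not attempt the streaming joins that pushed the team toward Java.

```python
# Minimal consume-score-produce loop with confluent-kafka (illustrative only).
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "intent-scorer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["customer-messages"])  # hypothetical topic name

def predict_intent(text: str) -> str:
    # Stand-in for the real topic-model call.
    return "billing" if "bill" in text.lower() else "other"

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        event["intent"] = predict_intent(event["text"])
        producer.produce("message-intents", json.dumps(event).encode("utf-8"))
        producer.poll(0)  # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```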

I don’t feel the need to keep languages around just because they look good enterprise-wide. But yeah, there’s a whole other rat hole maybe around streaming joins. Where did the complexity come in there? Well, at the time we were trying to build a dashboard for the business operations center that helped them with staffing. So I’ll kind of set it up: we have all sorts of bells and whistles on the network side that tell you whenever towers are going down, but no matter how fast we make those, we will never be faster than our customers, who are going to tell us the second there is any problem. And when cell phone towers go down, people flood to messaging.

You want to do, like, some kind of event correlation across all these things to try to figure out what is the actual thing happening. Even that, or even being able to tell the operations center before the network engineering team has figured out what is going on, right? So the network engineering team, they might get a blip, there’s an issue, they’re still investigating, but we can read it in customers’ words, and we know where that customer usually is using their phone, because they have a primary place-of-use address. And so we can make some assumptions there.

They also wanted us to look at the topics that people were coming in on and say, are there any anomalies? Because, like, if there’s a billing systems issue, it could take a while for us to figure that out, but our customers might be telling us right this second. And if we just look at how often customers are coming in with billing issues, maybe we can predict some of those downstream issues. We are also looking at it for staffing. So if we can predict how many conversations we’re going to get and what sort of topics we’re going to have, how do we make sure that our contact centers are staffed appropriately? And it was in doing that: they had a bunch of different filtering they wanted to do on that dashboard, like click a state and then see all the topics.

They also wanted to do trending topics for those states, so unsupervised utterances we were pulling out of there. And it was in all of that clicking that the Python was not performing well on the joins to do that type of filtering. Even taking a step back, going from text to speech, it sounds like there’s a ton of complexity that I’m imagining is introduced in going from the one to the other. Can you talk a little bit about how you dealt with that? Yeah. So I’ll tell a personal story first, because I think it’s really fun. I had worked in NLP, but only on text, and we had one other machine learning engineer who was our speech specialist.

When we finally got the green light to start working on speech, she was like, Heather, here’s some initial reading. I’m going to be out for two weeks. You study speech-to-text, and when I come back, we’ll have a conversation. And she came back, and I sat down, and I was like, so I’ve been thinking a lot about tropical semirings, because you can use them to optimize a lot of speech-to-text calculations. And she just looked at me and said, you’ve gone down the wrong path. You will never need any of that math. I was like, oh, whoops. So that was my first lesson: everything here is different.

There’s a lot that you have to think about, first from the technical perspective: with text, a computer gets text, but how does your computer get a phone call? I didn’t know the answer. I do now. The signals that phones are sending are, I think, RTP and SIP streams, and those are not WebSockets or anything that a computer can consume. So you actually have to take those phone streams and route them through something like Zoom, which can take computer calls and make them into phone calls and vice versa, and use pieces of that to actually convert the audio streams into WebSockets, so that we can even get the audio to start with. And so at that point, it was kind of mind-blowing.

And then in the speech space at all, you have to start thinking about not only the language that customers are saying to you, but the acoustic environment that they are in and how that’s impacting the transcription. And most speech-to-text modeling that’s done is trained on data that is men reading audiobooks slowly, and it’s normally trained for people speaking slowly in their acoustically nice living room. But that’s not our customers. Our customers are in line at Starbucks and they are trying to do this on their lunch break. They are yelling and their dogs are barking in the background.

And so figuring all of that out, and learning about that, and about the way that different accents work, and how we can try to make this technology we’re building work equitably for all of our customers, has been… there’s nothing like it in the text world whatsoever; you don’t have those features. You started talking about the compute environment, pods and all that stuff. How are you running all that? Is this a GPU-heavy workload? Where is it running? Yeah. So our transcription stuff is incredibly GPU-heavy.

Our topic modeling less so, but the transcription is. And so we started out with an on-prem provider specific to GPU hardware, because at the time we were getting, like, ridiculously good latency. Because that’s one thing that’s really important: a lot of the time with speech-to-text, it doesn’t matter if it’s slow, because you might be transcribing stuff for post-call analytics. But for us, if we’re trying to build experiences that pop up to help an expert solve a problem, we have to be faster than the human being hearing the sentence.

So if someone’s like, I need to set up… I would love to pay my bill. And I say, great, I’m here to help you pay your bill, let me dig around. And then I click into the app, I start to pay their bill, and then a pop-up comes up and says, would you like to pay their bill? Because our transcription is slow, I’ve just created an annoyance. That sounds like Clippy. Yeah, exactly. And so latency is super important to us to build trust. And so we originally had an on-prem provider, because we were getting good latency. And we took it to AWS and were like, hey, what can you guys do here? At the time, their latency was not very good. But since then, we actually have been able to switch to a cloud provider. So we do use AWS to do this, but we have our own internally hosted Kubernetes that is then hosted on AWS, if that makes sense. Okay.

But we do. It is a GPU-heavy workload, so we do have them running on GPU instances for our topic models. Right now we’re in the process of converting those to AWS Inferentia chips, and we’re still waiting to see what the actual improvement will be from using the Inferentia chips. This may be going back a bit, but you mentioned that with the current version of this, I think as part of the transition to the speech-based system, you went from something to DistilBERT, like pre-trained DistilBERT and fine-tuning that. What was the thing before?

It was a hand-crafted network using Keras that was, like, a shallow convolutional neural network. Okay. So very bespoke. Right. Exactly. And so it takes a lot to say we’re not going to do that anymore. We still do bespoke neural networks for other problems, but for our giant topic model, DistilBERT’s good. In our first experiment with DistilBERT, we just pulled it off the shelf, took our already labeled data, and trained against it. And after one epoch, we saw very good results. So we’re like, okay, this is probably the direction that we’re going.

If it can get this in just one epoch. And have you had to do anything special to get it to deal with the number of classes that you’re working with? Not so far, but as I kind of mentioned, we’re in the process of maybe breaking apart the models into a bunch of different sub-models and having a better router at the top. So we might end up doing that. We have zero interest in introducing new classes into this model because it’s so old and giant. One of our data scientists calls it the Toyota Corolla of models. It’s just reliable. It does what it does, but no one’s super impressed by it. So we want to kind of let it continue doing what it’s doing.

But for any of the new types of topic modeling that we’re looking to do, we are building other models to do it. And we will never have an 88-class model again, because maintaining it is awful. And I don’t know if you answered the question about what you’re anticipating to use for that router. Oh, yeah. So we don’t know yet. It really depends on the type… well, it depends on the type of experiences that our care stakeholders want. So for the first go that we have, we have two different experiences that we are rolling out.

One is about network and one is about customers who are dissatisfied. And we think it’s just going to be a logistic regression, right? Like, just which one do they go to: dissatisfied, or network, or otherwise? Maybe we can get away with that. But we’re not sure how that’s going to scale. It really depends on their roadmap, and so we’ve got to be in tight collaboration. So the idea then is that, as opposed to kind of this flat 88 classes, you’re going to have a kind of hierarchical taxonomy, and you just have to figure out which of the handful of branches they need to be routed to, and then you’ll do the kind of fine-grained classification lower down.

Yes, yeah. So it will be a hierarchical taxonomy. We will pass it off. And then within each of those general experiences that we’re trying to roll out, we want to have, like, a proper state engine, where we have intent being recognized and then we’re kind of deciding what to pop up for the expert next, using an LSTM of some sort. So that’s kind of the future vision that we’re building, where we’re able to walk experts through these flows smartly, using the NLP that we’re doing to check off any checks or fill in any boxes for them as they go through these workflows. And now there are tons of these off-the-shelf kind of intent engines and chatbot engines and conversational AI tools. How do you think about the build-versus-buy decision?
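Before the build-versus-buy discussion, here is a rough sketch of the hierarchical routing idea just described: a cheap top-level router (a logistic regression, as suggested above) picks an experience, and a finer-grained sub-model runs only for the branch that needs it. Every name, label, and the placeholder sub-model here are hypothetical.

```python
# Sketch of a two-stage router: coarse branch first, fine-grained model second.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

router_texts = [
    "my signal keeps dropping at home",
    "no service since this morning",
    "I am really frustrated with this support experience",
    "this is the third time I have called about this",
    "I want to upgrade my phone",
]
router_labels = ["network", "network", "dissatisfied", "dissatisfied", "other"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(router_texts, router_labels)

def fine_grained_network_model(text: str) -> str:
    # Placeholder for a dedicated sub-model (e.g., outage vs. coverage vs. roaming).
    return "possible-outage" if "no service" in text.lower() else "general-network"

def route(text: str) -> str:
    branch = router.predict([text])[0]
    if branch == "network":
        return f"network/{fine_grained_network_model(text)}"
    return branch

print(route("no service in downtown Seattle"))  # -> network/possible-outage
```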

So what I tell my stakeholders always is, if there’s something out there that honestly works better than what I would build, I would love to buy it. But I’m very hesitant to put something that works less well than what we have in front of our experts. And so I can give a very easy example here. Recently we had a vendor solution that we were looking at that identified promises and commitments inside of conversations. And that’s very exciting, because we always want to know what our experts are promising our customers, what we’re committing to.

And if they call back in later and they say, you promised me, we want to be able to say, yes, we did promise you, you’re right, I’m sorry. But when we dug into it, the out-of-the-box accuracy was like 53%. So we said, okay, let’s dig into some more of these, and they didn’t make any sense to us. They weren’t what we would consider promises or commitments from a business perspective. Whereas we put one of our data scientists on the task for one week, and we came up with something with about 80% accuracy, because we know what we are looking for.

And that’s the very hard story that we’ve been telling our stakeholders over and over again: yes, general language models work very well for general English, but we don’t really speak English at T-Mobile, we speak T-Mobile-ese. There are so many words in the English language that will never appear, and there are so many things that we are going to talk about that are completely different. So an example I like to use is that the word jump has a literally different meaning at T-Mobile, because JUMP is the name of our insurance plan. So it’s an argument for custom embeddings. And so for me, I think it’s totally appropriate to take an off-the-shelf solution to prove a concept out and see if there’s any value.

But then I always like to ask, can you do something cheaper and better? And so I can kind of speak about our speech-to-text here, where we originally did roll out with vendor partners. We did a huge RFP. Every major speech-to-text provider in the world that exists, I have reviewed them. And we launched our original proof of concept with AWS Transcribe. So we did use a vendor, but immediately, once we had the audio data, we started looking at open-source solutions and saying, what can we do on our specific data? And the word error rate on our data that we have right now is nine. State of the art, for standard, slow English, is like five to seven, and our word error rate is nine. Anytime we test a vendor product against it, it’s 18-ish, and you lose measurable meaning after 20.
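For context, word error rate is (substitutions + deletions + insertions) divided by the number of words in the reference transcript, so the "nine" quoted above is 9%, or 0.09. A quick way to check a machine transcript against a hand-corrected reference is the jiwer package; the example strings below are made up.

```python
# WER = (substitutions + deletions + insertions) / reference word count.
import jiwer

reference = "I would like to pay my bill and set up a payment arrangement"
hypothesis = "I would like to pay my bill and set up payment arrangements"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # one deletion and one substitution over 14 reference words
```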

And so, all the time, even when vendors say they have state-of-the-art speech-to-text, when you run it against our calls, it’s not, because our acoustic environments are very strange and the language that we use is different. We have a lot of strange factors in our business. And so we’re just trying to reiterate constantly: yes, the numbers that you’re seeing in the media are great if we were Wikipedia, but we’re not. Our conversations are not Wikipedia. So you mentioned dealing with accents, for example, and, you know, that’s representative of other biases that you might encounter. How do you approach that whole spectrum of factors? Yeah, so for me, when I think about it: if I’m building efficiency tooling for the frontline, I’m trying to make them smarter, better at their jobs.

Most contact centers compensate people by how happy the customers are afterward, how quickly they solve calls, how quickly they go through them. And so, to my thinking, it is so incredibly important for me personally to build models that serve everyone equally, because if I’m building models that help some people, that’s literally going to increase their paycheck. And if I’m building models that don’t help other people, that’s going to literally decrease their paycheck. And so we talk about that all the time on our team; it is the most important thing to us. And so we actually have a full-time AI ethics specialist. She’s an engineer, and she’s focused mostly on our voice models. And there’s a data set called the Mozilla Common Voice data set, where anybody can go, speak and transcribe your audio, and tag it with your demographic data.

So we started with that, testing our models against the Mozilla Common Voice data set. We found it to be pretty insufficient in some striking ways. So an example is, they have multiple different Asian accents listed in the data set, but for Africa, they just have African, as if there aren’t multiple African accents. And so we used it originally to kind of get a benchmark there. But what we’re really working on now is building our own T-Mobile version of that: what is the data set that accurately represents all of our customers and all of our employees, that we can use to measure different word error rates against and see how we can improve?

And one of the reasons why this is super important to us is that our speech-to-text solution is very hard to scale. By the time we are at full scale, we will be the largest speech-to-text deployment for contact centers in the world. Or at least that was true a year ago, I haven’t checked since. But it’s very hard to scale, and so we are right now live with 7,000 agents, but we plan to be live for everybody in the US by the end of 2022 and everybody globally by the end of 2023. And so for me, what’s very important is, before we roll out to global partners, so anybody who is answering the phone for T-Mobile from any other country, I need to make sure that my models work well for them. And so we have a huge focus on those specific areas, and on collecting that data, to make sure that we will not release models that do not perform within an acceptable window of the standard in this situation.

And so that’s really kind of how we’ve done it, but there’s not a good open-source data set that has all the data that we need to accurately bias-test it, so we’re just having to curate our own. You’ve been talking about speech-to-text. Is the transcript an actual product that you need, or is the transcript an input to downstream things? And, you know, the next question is, do you think about, like, some uber deep network that goes straight from speech to intents and skips the whole text thing? I’m glad that you asked that, because that was my dream originally. When we said speech-to-text, I was like, and we’re going to build intent modeling directly on audio signals. And that’s when my speech scientist was like, Heather, no. She’s like, the computation for that would be so ridiculous. So what our speech-to-text mostly does right now is power a product we call auto memo.

After the end of every call, our experts have to spend three to five minutes typing out everything that it was about, and we said, actually, we think we can summarize some of this. Doing that took a very long time. It turns out that human speech is not very good to even do an extractive summary from. Summarization is still hard. Right, well, and especially when people speak, we’re not very succinct. So it took a very long time to figure out what that product would look like, and it is a combination of: we add the top three of what we call call drivers, the things that we think made them actually call in, to the memo, and then we also have it where experts can click and view the entire transcript if they need to dig in for some reason.
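As an illustration of the extractive approach discussed above (and of why chatty speech makes it hard), here is a toy pass that ranks a call's utterances by average TF-IDF weight and keeps the top three as candidate call-driver lines. This is not the auto memo model; the utterances are invented.

```python
# Toy extractive pass: surface candidate "call driver" utterances by TF-IDF weight.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

utterances = [
    "hi how are you doing today",
    "good thanks, the weather has been terrible here",
    "I wanted to ask about upgrading my phone",
    "my screen cracked and I think I have insurance",
    "also can you check if my bill went up",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(utterances)

# Score each utterance by its mean TF-IDF weight over its non-zero terms;
# small talk tends to score low, problem statements higher.
scores = np.asarray(tfidf.sum(axis=1)).ravel() / np.maximum((tfidf > 0).sum(axis=1).A1, 1)
top_three = np.argsort(scores)[::-1][:3]

for i in sorted(top_three):
    print(f"candidate call driver: {utterances[i]}")
```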

I don’t love showing the full transcript, because I’m like, we’re going to make mistakes, and people are going to give us feedback, and it will be embarrassing. But so far, so good; the experts seem to really like that. And we’re even toying around with the concept of, if they can see the transcript in real time while they’re talking, could they flag things for correction, so that we can go in and correct the transcripts later. But the ultimate dream is that the speech-to-text is powering all of these Jarvis-style pop-ups that just automatically do all the silly things for the expert, while they can just focus on having the human conversation, the thing my robots will never be good at, probably. What are you using for summarization?

That’s what I was saying: it’s another internal model that just pulls out those three call drivers. But we did spend over six months on an extractive summary thing. The thing is, it’s very hard to get it to be business actionable. It would come up with stuff like, I really like my phone plan, I want a new phone, you know? And that’s not a good summary, because a good summary for that is: upgrade call. But yeah, one thing that’s kind of interesting, and maybe it’s not interesting, maybe it’s just expected, but you haven’t mentioned the term: a lot of what you’re experiencing connects to a theme that we’ve been exploring on the podcast over the past few weeks, a couple of months, data-centric AI. Like, you started with this heavy focus on models, but a lot of the things that you ran into required that you refine the data, refine the data. Does that term resonate for you, or is that something that you’ve looked at at all?

Yeah, yeah. I would say that I know that there’s, like, an official data-centric AI movement, and I think we’re largely in alignment, right? Like, don’t go creating AI for the sake of doing a stunt, and make sure that what you’re doing actually delivers value. So yeah, I do think it resonates. Can you talk a little bit more about the MLOps aspect of keeping all these models up and running in production? Like, what kind of platforms and tooling have you built out to support all this?

Yeah, so our team is so focused on product development that we honestly have not done our due diligence in many ways for model maintenance. Right now, we do have model audits that happen: we have some Lambdas that automatically put a percentage of conversations into AWS Ground Truth for our data annotation team to actually check on, and then we have some reporting that can be done around that. But for the most part, we haven’t seen significant enough data drift to touch our models, except for when T-Mobile and Sprint merged, because all of a sudden Sprint is no longer a competitor, and people talking about Sprint are talking about us. That was significant enough data drift for us to retrain everything. But for the most part, we are just like, it works until somebody tells us otherwise, which is not the best strategy. Mm-hmm. And otherwise, it sounds like you’re fairly invested in AWS’s various offerings from a tech perspective, although you did say you kind of rolled your own Kubernetes cluster that you just happen to be hosting on AWS.

Yeah, yeah. I would say we are interested in AWS’s offerings only as their bare components. So we are rarely a consumer of, like, a cognitive-services-type service; our team doesn’t really buy those. We mostly build our own models and have them end up deployed on AWS. But we recently moved to Google, and in my spare time, when I’m doing side projects, I use GCP for most of my stuff. So hopefully we will have a tighter partnership with Google in the future, and maybe I’ll be talking about TPUs instead of Inferentia chips, but we’ll see. And do you have opinions on either of their data science workbench environments, SageMaker or Vertex?

So, SageMaker, I feel like it works, like, it’s fine. What I really like about working in GCP is they separate compute and storage completely, all the time, so you always have to really think about your compute and your storage. I also like that it’s, like, a push architecture, so you’re always waiting for pushes instead of other things; that’s how I like doing software. But as far as the different platforms for individual development, I feel like it’s all mostly fine, it’s all kind of apples to apples. We end up having to build so many custom components on top of whatever we’re doing that it ends up the same. We’ve done trials with Azure Databricks too, so, like, Databricks through Microsoft Azure, and we’re like, we already have an AWS bill, so I guess we’ll just keep doing that one for now. Awesome, awesome. Any thoughts on kind of, you know, future directions? What’s next on your roadmap?

Yeah, yeah. So there’s definitely rolling out speech to everybody and making sure that it works great for all of our customers. We also have some interesting stuff potentially coming up with legal. You know, on the phone, people always read you terms and conditions and you have to accept. When you are audited, to show the auditors that that happened, you have to, like, pull conversations and show them audio recordings. The thing is, why can’t we just build a dashboard that does a cosine similarity between the terms and conditions and what was actually said, to show, yes, they read them and then they agreed? So that’s one of our really quick wins that’s coming up that I’m pretty excited about.

But for the most part, everything that we do is trying to move toward this dream of the autonomous desktop, so, like I said, our care experts can sit back, chat, have the human interaction, while we are taking notes for them and opening the right apps at the right time. And so what I’m hoping is, we are standing up a very robust click-tracking infrastructure for our care desktop, and I’m excited to predict clicks, right? Like, to open windows at the right time, do things of that nature. That’s outside of the natural language and intent stuff that we’ve been focused on. We also are doing a bunch of stuff in what we call the virtual retail space. Lots of people want to buy phones and don’t want to go into a store, so we’re like, how can we use AI to help salespeople sell? And that’s, like, helping create positioning statements, helping them fit the right accessories to a phone that this particular customer is trying to buy. So that recommendation space we’ve also just begun to dip our toes into, and that’s very exciting. And then measuring the impact of recommendation systems is a whole separate bag that I’m excited to unpack.

Nice, nice. Well, Heather, thanks so much for taking the time to share a bit about what you’re up to. Very cool anecdotes and lots of interesting lessons learned in there. Yeah, thank you, it was a nice chat.
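The terms-and-conditions check from the roadmap above, comparing the required disclosure against what the transcript shows was actually said, could look roughly like this. The disclosure text, transcript excerpt, and threshold are all illustrative assumptions, not T-Mobile's implementation.

```python
# Rough sketch: cosine similarity between a required disclosure and a transcript excerpt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

required_disclosure = (
    "by accepting you agree to the terms and conditions of your new plan "
    "including monthly charges and the cancellation policy"
)
transcript_excerpt = (
    "okay before we finish by accepting you agree to the terms and conditions "
    "of the new plan including the monthly charges and cancellation policy"
)

vectorizer = TfidfVectorizer().fit([required_disclosure, transcript_excerpt])
vectors = vectorizer.transform([required_disclosure, transcript_excerpt])

similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"similarity: {similarity:.2f}")

# A dashboard might flag calls that fall below some agreed-upon threshold.
if similarity < 0.8:
    print("flag for review: disclosure may not have been read")
```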
