11-785, Fall 22 Lecture 28: Deep Learning in the Real World – Agot.AI (Guest Lecture)

So, everybody, everyone, all seven of you in this room, eight, and hopefully the other 200 watching: this is the last lecture of the course, my last lecture. It's my pleasure and honor to introduce my old friend and former student here, Alex Litzenberger. After leaving CMU he founded a startup, which now has a large number of employees down in Wilkinsburg and in other places around the world. His startup is all about applying computer vision technologies of a not particularly complex kind, basically the level of sophistication you have already learned in this course. Over the next hour he's going to talk about how you can take what you have learned out into the real world and into a real product, and about the development of his own company, actually making an impact. So let me hand it over. Thank you.

So, as Bhiksha was saying, I'll be talking about my company, Agot, and making a bit of a comparison between the things you've learned in this class and how we apply them at a startup in the real world; not necessarily generalizing to all cases, but at least our experience.

I'll start with a bit about Agot. We are a computer vision company, but more broadly, the vision is that humans are imperfect and can only see a portion of what's happening. If you are using cameras and AI, you can see everything that's happening in a restaurant and use that information to maximum effect. This can be used in a number of ways; a few of our current products are advanced analytics, streamlining the drive-through, and real-time order accuracy. I'm sure that if any of you have ordered fast food, you have at some time or another encountered a case where you didn't get exactly what you ordered. What we do there is watch the preparation process with cameras and intervene to prevent incorrect orders from going out the door, making sure they're corrected and delivered to the customers correctly. The advanced analytics look at how people are actually working in the restaurant, how ops procedures are being followed, which procedures are effective, and what you need for higher throughput, accuracy, and labor utilization. For the drive-through, we use information about where cars are and the state of the kitchen to make informed decisions about things like whether a car should be routed to a waiting bay, or whether somebody should be prepping on the drive-through line rather than the in-store line. The point is to make use of this data not just to analyze what happened after the fact, but to make better business decisions in the moment.

Here's some sample video of order accuracy, just to give you all a bit of an idea. That's also not the correct link, somehow or another. No, I think it always comes with an ad because it's on YouTube, so all of you are going to get a free ad along with it. There we go.

Agot AI provides interventional order accuracy technology that helps your team consistently delight customers with accurate preparation of their orders, no matter how customized. Agot observes the entire order preparation, tracking each ingredient that is added to a meal item. The employee finishes preparing the order.
The Agot system checks that order against what was communicated on the kitchen display system to confirm that each meal item was prepared correctly. If an error is detected, the Agot system alerts the employee through a message on the kitchen display system, allowing them to correct the error before the order is delivered to the customer. In this case, the employee forgot to add pickled onions to the rice bowl, but Agot's error notification allowed him to correct the issue. Agot's platform captures previously inaccessible data and provides unique analytical insights to help drive operational improvements in the restaurant. Visit agot.ai today to learn more.

So that's just to give you a little taste of what it looks like. We have cameras overhead watching the preparation process, and, importantly, what is actually prepared is a function of the entire preparation process. This isn't a case where you can just look at the meal at the end of preparation and determine whether it's correct or not. It really matters that it was the same bowl that had rice in it that was supposed to have pickles in it and did not. That was a demonstration video from a partner restaurant, but a lot of the restaurants we work with have significantly more difficult operations, with a lot of small items moving quite quickly. I'll get into that a bit more.

Particularly with the order accuracy product, there's lots of occlusion, not only from people's arms and other objects but also from kitchen elements. For maximum efficiency a lot of these kitchens are very vertically stacked, so a lot of the prep actually happens under equipment where you can't effectively see. And imagine you have several burgers that all have buns on them: you can't do too much with their appearance, because they look very, very similar. More broadly, we need to compose a lot of low-level vision information toward higher-level objectives, and a very high level of accuracy is required to do this effectively, since most of the time the employees are preparing the meals correctly. For most of our notifications to be correct, our accounting of what was prepared needs to be at a similarly high or higher level of accuracy, to allow that level of accuracy in the interventions. The base rate of errors is higher than anyone would like it to be across the fast food industry; typically about 15% of orders have some sort of mistake. On the level of the item and the ingredient, that still corresponds to needing very high accuracy to produce interventions well, without slowing down their operations, while still allowing them to correct the errors.

Things also move relatively fast and erratically over large areas. That was a relatively calm video; in a lot of fast food preparation, people are moving as quickly as they can, preparing a large number of food items over the course of a few minutes, and, as I mentioned, an item can be surrounded by functionally identical other objects. Couple that with the fact that there's a lot of data. You can't treat this as a function of the end state of preparation, so you really need to understand the preparation process and know that it was the same meal that had one ingredient and also had another.
So if somebody orders a burger with pickles and no tomato, it's important for us to distinguish that from some other burger that was not supposed to have a pickle and was supposed to have a tomato. By the time they add the tomato, the pickle may have been added long ago. So we need to keep track that it's the same object over significant periods of time, through significant occlusions, across multiple cameras, all of these things.

We operate on 1080 by 1920 frames with three color channels. Meals are prepared across two to fifteen cameras, and preparation takes about three minutes. In terms of raw data, that's something like 56 to 419 gigabytes per instance, and the actual order we're trying to predict, to make sure it's correct, is about 10 bytes. So it's a lot of data for the end signal. You don't have to store that amount of raw data, because the codecs are fortunately very efficient, but there are also hundreds of orders per day per store and thousands of stores per brand. So you've really got a lot going on and a lot of data to process to ensure this is effective, and compared to some other notable data sets, it is a lot of data relative to the end signal being predicted.

As you've all learned in this class, inductive biases are helpful for dealing with a large amount of data per signal, because you're imposing some structure on the solutions to the problem. You can do image recognition with an MLP, but you're much better off doing it with a CNN and imposing that structure, because you know there ought to be things like positional invariance. Having even more data per signal accentuates the need to do these sorts of things. All of those numbers are rough, but as I've been saying, we need to be very intentional about how we do this: we need to be efficient with compute, and we need to build in inductive biases suited to the structure of the problem.

On the representation and labeling front, this is a very large amount of data to just get fully labeled, so the way the problem is represented and what is actually labeled needs to be efficient, and you also need a breadth of instances. There are a lot of customizations that can be done to the meal items, and you want to make sure you're seeing all the ingredients; if you're getting things very densely labeled, that becomes a very expensive proposition. One of the differences between a class like 11-785 and working on this at a startup is that you spend a significantly larger amount of time asking these sorts of questions: how do I actually wrangle the data, how do I represent these things so I can make the most efficient use of my resources, be that compute or labeling time or bandwidth, to get the most bang for my buck.

I thought I might just list some of the techniques we use and talk about the overlap with this class: we use MLPs of course, CNNs, RNNs, transformers, autoencoders, and GANs, just about the whole kitchen sink of things you've learned here, and they are all pieces of the overall system.
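As a quick aside, here is a back-of-the-envelope check of the raw-data numbers quoted above. The frame rate isn't stated above, so the roughly 25 fps figure below is an assumption chosen to match the quoted range.

```python
# Back-of-the-envelope check of the raw data volume per prepared meal.
# The frame rate is an assumption (~25 fps); the other numbers are quoted above.
height, width, channels = 1080, 1920, 3        # pixels per frame, RGB
bytes_per_frame = height * width * channels    # ~6.2 MB uncompressed
fps = 25                                       # assumed frame rate
prep_seconds = 3 * 60                          # ~3 minutes of preparation

for num_cameras in (2, 15):
    total_bytes = bytes_per_frame * fps * prep_seconds * num_cameras
    print(f"{num_cameras:>2} cameras: ~{total_bytes / 1e9:.0f} GB of raw frames")

# Prints roughly 56 GB (2 cameras) and 420 GB (15 cameras) of uncompressed
# video, versus an order signal of about 10 bytes; codecs make storing it tractable.
```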
Each of those techniques has its own accompanying representations and inductive biases, which, as I mentioned, are very important for structuring the problem in a way that is learnable and workable in terms of the amount of labeling, compute, and so on that's needed.

One of the principal ways that deep learning in industry, or at a startup, differs from classes and academia is that there aren't public data sets. We get raw data from the point of sale and from the cameras, and at that point it's up to us to decide how we want to represent the problem, how we want to pose it, how we want to train our models, and how we want to get our data labeled. We effectively launched knowing that there are not really relevant public data sets for our tasks, so the ball is in our court for how best to do that. Obviously what already exists informs that, in terms of the network architectures and representations that have been shown to be effective, but it's a much more two-stage process. Let me actually skip ahead. In academia, if you're working on a project like in this class or in your research, you might typically find models that work on your data set and build on them. At a startup the process can look more like this: you look for vaguely similar problems and representations that sort of get at what you want, you figure out how they map onto your problem, and then you make modifications and build from there. That's just to say it's not necessarily available to you to have off-the-shelf data sets and models that really get at what you want.

As I was mentioning, there are trade-offs between the closeness of the representation to the raw data and the tightness of supervision on one hand, and the time and effort to get that labeled on the other. Of course, if you have a lower-level representation, you're also putting more onus on translating it up into the higher-level objectives, so there's a trade-off there. And there are very practical concerns: you'd like what your data looks like in training to actually match what it looks like in practice, especially if you have a different processing pipeline for the compute running at the edge versus in training. Let me skip around here a bit.

Not all data neatly fits well-studied molds; sometimes it has a unique structure. Particularly for a problem with a very large amount of raw data relative to the amount of end signal we want to predict, it's very important to represent that structure well, because if you don't, you end up overfitting and not appropriately generalizing, or just failing to fit, or using exorbitant amounts of compute. Principally it comes down to different notions of locality and the relationships the data has with other data, and that matters a lot for how you learn. Like I was saying before, you can train an MLP on ImageNet and maybe you could get it to fit your data, but you probably aren't going to see the amount of generalization you'd like, you probably aren't going to have the efficiency you'd like, and you're going to end up with a model that's over-parameterized and a lot larger, and on top of that one that doesn't appropriately capture the crux of the problem.
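To make that MLP-versus-CNN point a bit more concrete, here is a rough comparison of first-layer parameter counts on a frame of the size mentioned above. The layer widths are made up for illustration and are not anyone's actual architecture.

```python
import torch.nn as nn

# Illustration of the inductive-bias point: a dense layer over a full
# 1080x1920 RGB frame versus a small shared-weight convolution.
num_pixels = 1080 * 1920 * 3

mlp_first_layer = nn.Linear(num_pixels, 256)        # every pixel connects to every unit
conv_first_layer = nn.Conv2d(3, 64, kernel_size=3)  # 3x3 filters shared across positions

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(f"Linear first layer: {param_count(mlp_first_layer):,} parameters")   # ~1.6 billion
print(f"Conv first layer:   {param_count(conv_first_layer):,} parameters")  # 1,792
```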
So, not all problems are very well studied, and I for one was surprised, on our journey so far as a startup, at the degree to which problems come up that are not well studied, that there isn't really material on in the literature or effective off-the-shelf architectures for. That leads back to the point that the process is pretty different, and for us at least it has involved a lot of composing different pieces of research and bringing them together. More broadly, I'd characterize it as figuring out what questions you should be asking before actually asking them and just trotting ahead. Is this the right representation? Is this the right way to go about it? Is this the right way to label the data? Is this the right data to label? Is this the right data for us to get? There is a very large degree of control when you are the one installing the cameras, placing the cameras, deciding which product problems are being tackled, deciding how the data is processed and labeled, how the model represents the problem, and how the training process is run. There are a lot more degrees of freedom than there typically are when you're working off a fixed data set with a fixed set of inputs and outputs. As I mentioned, I've been very surprised on this journey, relative to my expectations going in, at the extent to which, despite the thousands and thousands of papers being published on deep learning, there are significant green-field areas and blind spots that retroactively seem pretty significant.

In academia, typically, though it varies on the particulars, you'll be working with a well-defined task, inputs, and outputs. At a startup you're optimizing an entire pipeline, and you have to worry about things like how your training pipeline varies from what you have in practice. This actually happened to us: at one point the training pipeline ran through JPEG, so there were JPEG compression artifacts in the training pipeline that were not present in the inference pipeline. You need to make sure these things are consistent. It's a very practical problem, let's say.

So, as I was saying, there's a lot of overlap with what you're learning, both in terms of the subject matter and, I think, in terms of the approach: break down problems, simplify, test on small cases and make sure you can fit on those before worrying about generalizing, create some toy problems and work through them. In the second part of the talk I actually want to go over some toy problems that are a bit analogous to problems we're facing and that in some cases helped me work out problems we were facing. But maybe I'll stop for questions, if there are any right now.

Since the system needed to be trained on data from this specific domain, how did you start collecting the data, and how much time did it take to be able to come up with a basic system? Yeah, so we started out in some restaurants where the problem is simpler than in a lot of the large fast food chains.
Unfortunately, we're not able to publicly share videos from some of our larger corporate partners, but the problem there is significantly harder than in either that video or where we started out, which was actually a local restaurant here on Craig Street, Sushi Fuku. It being a simpler problem made it much easier to get a relatively small amount of data labeled, sometimes labeling it ourselves, sometimes hiring some folks to label that data to get things started. We built a proof of concept there, presented it to some of the larger brands, and started working with them.

How much data are we speaking of, and how was the labeling spread across the staff? Right now we work with a third-party labeling vendor. For our initial models it was probably on the order of a few thousand, single-digit thousands, of labeled images, and thankfully, because it was a relatively simple problem in that restaurant, that was sufficient.

Still, the number of classes in these images is not insignificant, surely double digits, and then there's the variation: when you're looking at guacamole, there are so many different presentations. Yeah, so in Sushi Fuku it was about 30 classes; with our customers now it's more on the order of 80 or 90. That's actually something I forgot to mention on the earlier slide: another challenge is that a lot of the ingredients on the items are very small, so you can't necessarily see them easily, and the same ingredient also looks very different in, say, a taco versus a burrito. You effectively end up with a lot of variations of the same ingredient, and that's definitely a big challenge. Part of dealing with that effectively is integrating information over time: not trying to get all the information in one shot, but using different signals for when an ingredient was added and when an ingredient might be visible to determine the composition of the order.

Given the nature of the ingredients, I get the feeling you're classifying based more on texture than on shape. So you have a two-stage problem: one is segmentation, the other is classification. Segmentation is reasonably well defined when you have objects with shape; here the notion of segmentation itself gets much harder to define. How do you solve these two problems, given how many classes there are and how much variety? I'd say that whether it comes down to texture versus color varies pretty significantly based on the ingredient and the brand; there are some things that are basically sauces and have functionally the same texture, just different colors. And as you mention, with multiple ingredients there isn't necessarily a clear segmentation boundary. You can even imagine sauces that get mixed together and don't have any well-defined visual boundary between them at all. That's certainly a challenge, and I think what you're saying hits on what I was describing before: it really matters to pick the right representation to make the problem both learnable and labelable.
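As one hypothetical illustration of what a representation choice can look like here (a sketch, not Agot's actual model), you could treat each ingredient as an independent binary present-or-not output on a tracked item rather than forcing a clean per-pixel segmentation. The feature dimension is a placeholder; the class count is the order of magnitude mentioned above.

```python
import torch
import torch.nn as nn

# Hypothetical multi-label ingredient head for a tracked meal item.
NUM_INGREDIENTS = 80  # roughly the class count mentioned above

class IngredientHead(nn.Module):
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, NUM_INGREDIENTS)

    def forward(self, item_features: torch.Tensor) -> torch.Tensor:
        # One logit per ingredient; ingredients are not mutually exclusive.
        return self.classifier(item_features)

head = IngredientHead()
features = torch.randn(4, 512)                              # features for 4 tracked items
labels = torch.randint(0, 2, (4, NUM_INGREDIENTS)).float()  # 1 = ingredient present
# The loss gives the outputs their semantics: with a sigmoid/BCE loss, output k
# reads as "ingredient k is on this item", independent of the other outputs.
loss = nn.BCEWithLogitsLoss()(head(features), labels)
```

The point, which comes up again just below, is that the same set of sigmoid outputs means whatever the labels and loss attached to them say it means.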
Some of the challenges we face with that have to do with the fact that there are multiple ingredients on an item, some of which may or may not be included at a given time. It's hard to define a solid boundary between when an ingredient is visible or not, and it's hard to define good segmentation boundaries for a lot of these ingredients.

So when you define your classes, how do you land on the label set? It's not just labeling a basic palette of ingredients you're given; you have to come up with your own labeling scheme. What is the label set, how do you come up with the labels, and how do you instruct the labelers? Yeah, there are a lot of considerations with labeling, and I could probably give a whole talk just on that. Some of the major ones: for one thing, while the interventions are on the level of what on the menu they got wrong, a lot of places will let you order the exact same thing in multiple ways. You could get a cheeseburger, or you could get a burger with cheese, and there is no way to visually distinguish those; they're preparing the same thing. So you need to make sure that what you're representing is something that actually exists in the data you're looking at. The distinction between a cheeseburger and a burger with cheese exists in the point of sale data, where they represent it and display it to the employees, but it simply does not exist in the real world, so there's no functional way to establish it visually. Then you get into the particulars of the representation used for labeling, and also the loss function, because it's not just the representation that matters: effectively, the loss function dictates the semantics of your network outputs. The only difference between several sigmoid outputs is the labels for those binary outputs that you're actually using. So if you represent things in a way such that the loss you're evaluating doesn't correspond to something the network can functionally learn, or to something that exists in the real world, you're going to run into problems.

Does the work transfer between brands, given that something like guacamole presumably looks different under different lighting conditions and so on? Most of our agreements with customers don't allow sharing data between brands, but functionally much of the work we do is at the architectural and process level: how we represent things, how we get them labeled, how we get things validated, how we integrate with their point of sale systems. All of that is transferable. So while a new brand entails getting different labeled data sets, a lot of the work is tied up in those other things.

What about when they add things to the menu? Yeah, a lot of fast food restaurants have limited-time offers quite often, so this is an important thing to handle well, and we have several methods of dealing with it. On one level, we need to make sure we're working with them and are appropriately integrated into their point of sale, so that when we see a new item come across the menu, we know there is something different we should expect to be made.
And we need to be able, at a basic level, to actually parse what the point of sale is saying there, since it's already digital information. The approach we take is broadly one of continuity: if something just shows up, it's hard to correct errors in how it's prepared if we haven't seen it before, but we can definitely avoid making false-positive interventions on it, and then relatively quickly bring corrections on the preparation of those new items into scope.

A simpler question: what is the name of the company, what does Agot stand for? It stands for "Agot: good order technology." It's a recursive acronym, but truthfully it's a recursive acronym because we came up with the name before we came up with the acronym. Basically we looked through some short Greek names, weren't too picky, and ended up toward the beginning of the alphabet.

Do you have to consider privacy issues, with employees being recorded? Yeah, so because our camera views are top-down, we don't actually see employees' faces very much. More broadly, I think there's also the matter of the expectation of privacy: there isn't too much of a baseline expectation of privacy in a fast food restaurant. You're around other people, they see what you're doing, and the restaurants broadly have cameras installed that see things anyway. But it's extremely important to us that we are something positive for the restaurant employees and, more broadly, for all of the stakeholders in our value chain: the line employee, the shift manager, the general manager, the area coach, the franchisee, the brand. For instance, if the interventions aren't something people pay attention to and respond to by correcting the orders, then the order accuracy product doesn't achieve its end goal of getting those orders corrected. So it's very important to us, toward that end and for its own sake, that we are something positive for the employees.

Earlier you were talking about cases with heavy occlusion and hard-to-see features, and sometimes the lighting conditions may change. Have you considered other sorts of input data apart from video? Yes, we have at various points looked at other input sources. Some kitchen equipment is equipped with IoT sensors, or its control mechanisms are already digital. We have looked at doing some things with audio, and the other data stream we use is the point of sale data. So the answer is yes, but one of the principles we operate on is that cameras can function as relatively universal sensors: you could figure out whether somebody put their hand over here with some sort of IR barrier, or you could have a camera equipped with computer vision figure that out, along with a whole litany of other things.

Have you considered sharing this data, with students or more publicly? We have actually worked with some teams from 11-785 with some of our data in the past.
The trouble with releasing public data, right now at least, is principally one of the contracts we have with brands and the ownership and privacy of the data, rather than any sort of competitive consideration. To the extent we're able to, having some public data and stimulating some thought about our problem space more broadly is something we're interested in.

There's a trade-off between making the system accurate and the compute it takes to run it; how do you deal with that? And, second, how does the system handle multiple workers? Yeah, so on the first question, the accuracy of the system versus the amount of compute required: there is certainly a trade-off there, but I think it can be less of a trade-off than it seems, for several reasons. One is that if you're thoughtful about which data you are processing and how, you can extract a lot of efficiency. And second, broadly speaking, if you're running neural networks with floating-point-32 operations and doing full forward passes, you're probably using somewhere on the order of 200 times more transistors than you actually need. To a significant extent there are good compute products for inference that take advantage of that and do quantization and sparsity, and that allows us to do a much higher number of operations with less silicon and less cost for the compute. And sorry, what was the second question? Right, whether there are multiple workers in the frame. Yeah, so one more thing I forgot to put on the slide of difficulties: the typical situation is that multiple workers are working together on an item, and there are also multiple overlapping orders being prepared, and we have to deal with that not as an exceptional case but as the standard case. There's an even greater extent to which it defies simplification, because it's not some rigid division where this person always adds the pickles and this person always adds the tomatoes. They work together, and if one person is backed up with their work, the other person will come over and do it. So yes, it is something we deal with.

On the models you've built: different restaurants obviously have different processes, and the visuals change. Does the model architecture carry over completely across all the tasks, or do you still have to make changes and decisions about the architecture for each new brand? Yeah, we do have to make some changes and decisions on the architecture front, but broadly the approach is one representation and one model architecture across all brands, with different data sets and integrations on a per-brand basis. Sometimes when we're working with a new brand, aspects come up that are new, but the approach is to integrate those into the single system and continue to use one system across brands.

You mentioned the scale of the data you started with and that it was self-collected. Was that much data enough for vision, or did you rely on some kind of pre-training so the model could handle the different kinds of objects it might encounter?
Yeah, so as far as pre-training: if you're using an architecture that exists publicly, and for the feature extraction there isn't necessarily much reason not to, then there likewise isn't much reason not to use a pre-trained initialization. Even if something about your architecture is different, it learns helpful features from a public data set, even if you aren't predicting any of the things that are in that data set. So yes on the pre-training. And for the specific use case we were starting with, that smaller amount of data was more or less sufficient, but a lot of that had to do with the relative simplicity of the problem in that restaurant, Sushi Fuku. Circling back to the difficulties: in Sushi Fuku, items are typically prepared almost entirely by one person, there's not too much occlusion, you can see all the ingredient containers, items move relatively slowly, they're large items, they're usually well separated from others, the ingredients are reasonably well separated, you can see them grabbing things from the bins, and there are probably a whole lot of other things that make that situation easier.

The problem you're solving isn't purely a computer vision problem; it's a combination of a computer vision problem and a constraint-satisfaction or optimization problem. Can you explain the different components of the overall solution, how you break it down and what you're actually doing? Yeah, at this point a lot of what we're doing has to do with the higher-level way we deal with things, and there is an overall principle, again an inductive bias, that if you're trying to predict what was made, your default assumption should be that they made it correctly, because even though they get orders wrong far too often, most of the time they're still getting them right. I don't want to go too much into the specifics, just because this is going to be posted on YouTube and everything, but a lot of it does have to do with the higher-level processing of these signals and this information.

I'm guessing that on a bad day the system can still make mistakes. I'm curious: false positives seem easy to track, but false negatives are harder. How do you track them and feed that back into the system? That's a good question. First off, we are fortunate in our problem space that, unlike some other applications of deep learning, nobody dies if we get something wrong. On occasion we can be wrong; effectively we waste a bit of the employees' time and undermine their trust in the interventions a bit, but that is an acceptable trade-off, and at the level of the brand they make different trade-offs between how much they want to weight false positives versus false negatives. We track false negatives several ways. There are validation data sets that we get labeled for what was actually prepared: we have people watch the video and mark down what was actually prepared, as opposed to what's on the KDS.
Then we can see, of the orders where what was actually prepared differs from what was ordered, what percentage we are intervening on. We can also look at things like review and refund data to see the impact we have and get a statistical sense of it.

I have a couple more questions, but I'll keep them for the end. Yeah, there will be more time for questions at the end.

So, one problem we had recently had to do with multi-set prediction. We have orders, and they contain various items, and you can have two of the same item. To set up a toy problem, let's keep it nice and simple and suppose you have a multi-set with a maximum size of two, but orders vary in size, so the sets vary in size. And again, just for simplicity, suppose our model is just f(x) = a: it outputs some constant set, but a is a learnable parameter of the model. So it doesn't pay any attention to the inputs, but it has learnable outputs. One thing you might do is read DETR, see what they say about sets and how to approach them, and conclude that you should use the Hungarian algorithm: do linear sum assignment and find a bipartite matching between predictions and ground truth. So let's go ahead and do that.

Suppose you have two instances of ground truth data, one and two, you have a mini-batch size of one, you're doing gradient updates, and this is how your predictions start out. You start with instance one, you assign it, you get a gradient that looks like that, and you move the prediction over. You go to instance two, get a gradient that looks like that, move it over, and you keep going, and you end up stuck like that. Each time, one prediction alternates between being assigned to one and two, and the other prediction never gets assigned to anything; it gets assigned to the empty prediction, maybe moving toward some zero somewhere. That's obviously not what you want to happen. You'd want something where there exist predictions associated with each of the items, though the particulars depend on your loss function. But you can end up with a problem like that. Interestingly, we recently had functionally this kind of problem, where the learning procedure was set up in a way that did not appropriately deal with these unassigned items and had a broadly similar failure mode. I actually don't know whether this is something that happens more broadly, or how, on a practical level, something like DETR deals with this and avoids these scenarios; but at least in our situation, with this particular network, this sort of thing came up. More broadly, there's got to be a better way. I think there's probably some dynamic program for doing the optimal assignment that we haven't seen.

It's very analogous; so you have an assignment to be made while computing the loss? Yeah. I discussed this a few weeks ago with Bhiksha, and I think there is probably some sort of dynamic program in there somewhere that can do this another way, though I'm not sure exactly what it would look like. There's probably some degree of a temperature aspect to it, too, reflecting how certain the assignment is. I'm not sure of the exact structure, but it's interesting that this problem came up.
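Here is a minimal sketch of that toy setup: a model that ignores its input and outputs two learnable slot values, matched to ground-truth sets of size one with the Hungarian algorithm. The L1 loss, learning rate, and starting values are made up for illustration, and unmatched slots simply receive no gradient here (DETR-style setups instead push them toward a "no object" class).

```python
import torch
from scipy.optimize import linear_sum_assignment

# f(x) = a: two learnable output slots that ignore the input entirely.
slots = torch.nn.Parameter(torch.tensor([0.4, -3.0]))
optimizer = torch.optim.SGD([slots], lr=0.1)

# Two training instances, each a (multi)set containing a single value.
ground_truths = [torch.tensor([1.0]), torch.tensor([2.0])]

for step in range(40):
    target = ground_truths[step % 2]                    # mini-batch of one, alternating
    cost = torch.abs(slots[:, None] - target[None, :])  # assignment cost: L1 distance
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    matched_loss = torch.abs(slots[rows] - target[cols]).sum()
    optimizer.zero_grad()
    matched_loss.backward()
    optimizer.step()

print(slots.data.tolist())
# The slot that started closer keeps getting matched and ends up bouncing
# between the pulls of the two targets, while the other slot never receives
# a matched target and stays at -3.0.
```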
Coming up with this toy problem and working through it was how I got that result. So here's another one. Everybody wants clean, perfect data, but it's pretty expensive, and what you really want is a model that works well. If you had bad data and your model worked well, you wouldn't necessarily be too bummed out about it; make sure you have at least good validation data so you know it's actually working well, but if you can make do with bad data, then you may want to. Broadly, you can spend money on labelers or you can spend money on engineers. You can have engineers work on making sure the data is good and clean and analyzing it, or on experimenting with different architectures and doing other work, and you can have labelers work on higher-quality labels or just more labels. So can you actually learn with noisy labels? Is it worthwhile to consider these sorts of trade-offs, or is it something that will just really screw you up if your data isn't clean? There's much ado about clean data and data-centric AI and all of that. Certainly the data matters, but what actually matters more: having more data and putting your resources toward improving the architecture and such, or making sure the data is as clean as possible? To what extent does it matter in your particular use case?

It's a worthwhile hypothesis to test, so I set up a very simple experiment with MNIST: a data set that randomizes some percentage of the labels. It's actually known that some of the larger data sets have a degree of label error, even though they have been very heavily scrutinized; don't quote me on this, but maybe 1 to 3% on ImageNet, and networks seem to learn just fine on that. But what about significantly higher proportions, and what's the relative impact of that versus having a smaller data set? With the full training set, this particular network got 99.3% accuracy; things get however many nines of accuracy on MNIST, and this was just a simple, basic sort of model. With 20% of the labels randomized, it gets 99%. With 50% of the labels randomized, it gets 98%. With 80% of the labels randomized, it gets 95%. There are a lot of scenarios in which getting 20% of your data correctly labeled is significantly easier than getting 100% of your data correctly labeled, or more broadly than getting a data set of a size that would let you achieve the same accuracy. And compare 99% accuracy against 93% for the data set with half the data. It's not conclusive, since it's a toy problem, but it helps inform whether it's worth spending the time and effort on that incremental bit of label quality, or whether you should look at different approaches to labeling that might give you lower-accuracy labels but could get you a lot more of them, a lot more cheaply and easily, and could help you iterate quicker. You follow up by actually evaluating on the real problem, but something like this is very informative, especially since it takes half an hour. I was pretty surprised by the result, and it raises some interesting questions. If you have a data set where half the labels are wrong, could you do better than with a data set where all the labels are correct but half the size?
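A rough sketch of that label-randomization experiment follows. The dataset handling uses standard torchvision MNIST; the classifier and training loop are left out and would need to be filled in, and none of this is the exact setup behind the numbers above.

```python
import torch
from torchvision import datasets, transforms

def randomize_labels(dataset, fraction: float, num_classes: int = 10):
    """Replace `fraction` of the training labels with uniformly random classes."""
    labels = dataset.targets.clone()
    noisy = torch.rand(len(labels)) < fraction
    labels[noisy] = torch.randint(0, num_classes, (int(noisy.sum()),))
    dataset.targets = labels
    return dataset

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
train_set = randomize_labels(train_set, fraction=0.5)  # half the labels now random

# Train any small classifier on `train_set`, evaluate on the untouched test
# split, and compare against training on the clean labels; in the experiment
# described above, accuracy fell only from ~99.3% to ~98% at 50% label noise.
```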
Conceptually, if you knew which half of the labels were wrong, you'd have the same amount of supervised data plus some additional unsupervised data, and you should be able to use that, but it's a bit tricky. And could you use the information about the proportion of wrong labels to create a better training procedure? If you know you have some level of noise in your labels, could you set up training so that, for instance, the network is allowed to disregard the 30% of labels it's having the most trouble with, because there's a decent chance those are wrong anyhow? Something like that. I'm sure somebody has explored this further, but it's something I thought was interesting.

So yeah, let's open it up to Q&A. Also, we are hiring.

You started as recent graduates who just went out with an idea, and today here you are. What was the process of stepping out and saying, I'm going to start a company? Can you give us an idea of it? Yeah, so I actually started while I was still at CMU, over the summer. As far as the initial inspiration, it was the very long lines at the Chipotle over on Forbes. We started out not with order accuracy but with automatic checkout: in a place like Chipotle, where you order by talking to the person and telling them what to make, the idea was to ring you up without needing a cashier there, and you need to get the order right for that as well. The process was that we worked on getting some sort of prototype. We just walked in and got in touch with Ting, the owner of Sushi Fuku (he's now the owner of Aratashi, because they split that up), and asked: we want to work on this, can we put cameras in your store? He said, yeah, sure. We did that, got some data, and just kept hacking on it to make something interesting that did what we wanted. We did end up with something that more or less was able to ring people up at Sushi Fuku, but we didn't end up actively pursuing that, because we ended up pursuing things with some large brands instead. Basically, we had a demonstration system at Sushi Fuku, and my business partner, Evan, worked on a lot of the business development to get us in contact with the right people at those brands and start working with them.

How long did the whole process take? It was about a year and a half in when we first started working with our first major brand, and then we were able to fundraise aggressively off of that and go from there. And how big is the company now? We have about 35 employees. Where are they based? About half are in Pittsburgh.

Any other questions? I can go back to the students. What's the setup time for a new store? Overnight: a team goes in and installs the cameras, compute, and networking equipment overnight.

Is the system sensitive to the positioning of the cameras? Yes and no; it's a bit of a tricky thing. Broadly, you need to see the preparation of the food, but you want to minimize any specific sensitivity to the minutiae of where the camera is mounted or where it can be mounted.
So we need to make sure we have visibility of the kitchen elements, and consistent visibility of that between stores, but in the design and approaches we take, we try to minimize any specific sensitivity to small changes in camera position. Broadly, we want to see this part of the line, that part of the line, the area with the drinks, but not "it needs to be four centimeters from here" or anything like that.

Any problems with degradation over time, with cameras being in a hot or steamy environment? Yes, we have to replace cameras on occasion. As far as heat and steam, we try, to the extent we can, to keep cameras away from that and to use cameras that have some protection against those sorts of things. But there is sometimes some cleaning that needs to be done.

You mentioned the data and the architecture. Which has contributed more to the success of the company, and which have you spent more time on? Probably the latter.

Not all customers are equally difficult, right? What are the key challenges you encounter when you take on a new customer, technical and otherwise? Yeah, some of the major things are just how vertically stacked and occluded the kitchens are, how fast objects move, and how small and poorly visible the ingredients are. I think those are the principal things. There are also some pieces of minutiae around how well defined their ops procedures are: where exactly an order is supposed to be prepared, or how exactly it's supposed to be prepared. There's a lot of information that is implicit in the operations and not necessarily centrally documented, in terms of what employees are allowed to do versus not do, what the expected operations are, and what can be handled as an exceptional case. So we'll talk extensively with their operations teams and try to get as good an understanding as we can, but that doesn't stop us from finding something a hundred hours into the footage and wondering what the heck happened.

Most of the imagery you use is RGB. Have you considered, say, IR cameras, since different ingredients might show up differently in other wavelengths and that information might also be helpful? Yeah, so we have looked at IR; we've looked at a lot of things. We have looked at depth and done some things with depth, and we've looked at hyperspectral options as well, along with some IR things. The nice thing about RGB cameras is that they're very standardized, commoditized, and relatively inexpensive, and you're also doing the labeling in the visible spectrum, which is more straightforward: you don't need some other source of ground truth, or to give labelers specialized training. So we have considered a variety of other options, but there are a lot of benefits to keeping it relatively simple with RGB.
I guess more broadly, this connects to the way things are represented and approached. A big consideration for us is which things we are not sensitive to, and to what extent. Some of the biggest things we've sought to be insensitive to are camera positions, the precise positions of ingredients, and, to the extent we can, the sorts of variation between stores in lighting conditions or the precise angles the cameras are at. But it's not something that comes for free; it's care we put into the design to make sure that's the case.

Could this extend to other domains, for example installing a camera in a home kitchen or some other setting with a similar strategy? So yes, there are possibilities to scale to other industries. I think the most applicable use of our technology would probably be checking the correctness of complicated procedures. It's not necessarily something I think is applicable to household use, but it is potentially applicable to manufacturing, medical, and other services.

One last question: at this stage, do you still get to work on the technical problems yourself, or is it mostly other responsibilities? There's a lot else involved, at least at the stage we're at right now. Certainly early on the answer is yes, but other responsibilities start to come in a bit later, so the composition of the work can change a bit.

All right, thank you very much, everybody. Thank you.
