[ML News] Multiplayer Stable Diffusion | OpenAI needs more funding | Text-to-Video models incoming

A lot of text-to-video models have recently come out, but not only that, a lot of other stuff has happened too, such as multiplayer Stable Diffusion, and OpenAI is looking for even more money from Microsoft. Stay tuned, this is ML News. Hello everyone, as you can see, I’m not in my usual setting, I’m actually currently in Poland. It is the last day of the ML in PL, the Machine Learning in Poland conference. This conference is absolutely glorious, absolutely fantastic. It was really cool being here, it is over now, I’m going home, but next year, please be here. Or if you’re a company that’s looking to get rid of some money and sponsor an awesome conference, the ML in PL conference has been organized at least as well as any of the NeurIPSes or ICMLs that I’ve ever been to. And it is very likely that this conference is going to grow and become better known in the next few years. So it was a great lineup of keynote speakers, tutorials, and other content, and I even had the pleasure of joining in on a bit of a concert at one of the poster sessions, which was certainly a unique experience. So thanks again to the ML in PL organizers, see you there next year, alright? So Stable Diffusion is going multiplayer; this is a Hugging Face Space. There’s essentially a giant canvas, and you can just come in here, you drag this square somewhere, you give it some kind of a description, and it will just kind of fit it into what’s already there. All of this is collectively drawn by people, and I’m always afraid, because I don’t want to destroy something, right? Because all of this is just very, very cool, what people come up with. Just another example of something that I would have never thought of, but because stuff is open and released, this is, you know, something that can be built. So absolutely cool, give it a try, and maybe this inspires you to build something that is even cooler than this. I don’t know what it’s going to be, but I’m sure one of you has a great idea right now. In other Hugging Face news, they introduce DOIs, digital object identifiers, for data sets and models. DOIs are sort of a standard way in scientific literature of addressing things like papers and artifacts, and now Hugging Face is introducing them for the models and data sets on their Hub. So on the Hub, you’re going to see this little box with which you can generate, essentially, a persistent identifier for a model or a data set that is never going to change in the future. Now, you can outdate it, so you can say, well, this one is deprecated, I have a new version of this model, but it is a unique identifier for that model that you have. And this is really good if you want to put it inside papers, so as to make them reproducible. And given that it is a standard, it just integrates with the whole rest of the scientific ecosystem. So definitely a big plus for anyone who does work in research. The Wall Street Journal writes: Microsoft in advanced talks to increase investment in OpenAI. There isn’t much detail in this article; essentially, OpenAI is apparently asking for more money, more investment. Microsoft has previously invested about $1 billion into OpenAI, and on top of that probably given really preferential access to Azure, in exchange for OpenAI providing Microsoft preferential access to its products. It’s funny because here it says: last week, Microsoft announced it was integrating DALL-E 2 with various products, including Microsoft Designer, a new graphic design app, which is cool.
And the Image Creator for the search app Bing. Is that their big plan? Is that the $1 billion investment to get Bing off the ground finally? I’m not sure. Now, keep in mind that just because OpenAI goes and asks for more money, that doesn’t mean that they’re bankrupt soon. It could also mean that they’re planning for an even bigger push. Startups, and I don’t know if OpenAI can still be considered a startup, often do take on more money whenever they want to start scaling even more. Now, how much more OpenAI wants to scale, I don’t know. It could also be that they’re just out of money and need more. The Stack is a data set by the BigCode project, and it’s three terabytes of permissively licensed source code. So this data set is fully open. You can download it if you want to train anything like a Codex model or something similar (a small loading sketch follows at the end of this paragraph). The data set pays specific attention to the licensing of the code that is included: the code is MIT licensed, Apache licensed, BSD-3 licensed, essentially licensed such that you can do whatever you want with it. Now, that doesn’t get you out of the weeds legally of doing anything and everything, because you still have to do things like provide a copyright notice if you copy one of these pieces of code verbatim. The Stack not only pays attention to this when they collect it initially, but also, as you can see on the Hugging Face entry on the Hugging Face Hub, there are terms of use for The Stack. And one of the terms of use is that you must always update your own version of The Stack to the most recent usable version. This is because they have essentially a form where you, as a source code author, can go and request removal of your source code from The Stack. So even if you’ve licensed your code under the MIT license, they don’t want to include anyone’s code who doesn’t want to be part of The Stack. You can go and request that your code be removed, they will then do that and update the data set, and by agreeing to these terms when you download the data set, you essentially agree to always download and use the newest version of the data set, so as to propagate the removal of that code. Now, as I understand it, and I’m not a lawyer, this is not legal advice, but as I understand it, you are entering into a binding agreement by clicking this checkbox and clicking this button. So think about whether you want that or not, but it is good that another option is out there next to just scraping it up, I guess. Google releases Vizier open source. Vizier is a black-box optimizer that works at scale, so for many, many different experiments that need to be hyperparameter-optimized, Vizier essentially decides which hyperparameters to try next. You can run this as a service if you have a lot of parallel workers and you want to run hyperparameter optimizations. They have APIs for users, and the user here is essentially someone who wants to do hyperparameter optimization. They have APIs for developers, which means that you can put in new optimization algorithms; so if you’re a developer of a black-box optimization algorithm, you can integrate that with Vizier. And they have a benchmarking API. Apparently this thing has been running inside of Google for a while, and now they finally decided to release it open source, so it’s certainly tried and tested.
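As a quick illustration of the point above about downloading The Stack: here is a minimal sketch, under my own assumptions (a dataset id of "bigcode/the-stack", language subfolders under data/, and a "content" field, all of which may differ), of streaming a small slice with the Hugging Face datasets library instead of pulling all three terabytes. Keep in mind that downloading it means agreeing to the terms of use described above.

```python
from datasets import load_dataset

# Minimal sketch, not an official recipe. Streaming avoids downloading the full
# ~3 TB; the data set is gated, so you must accept the terms of use on the Hub
# and be logged in with the Hugging Face CLI first.
ds = load_dataset(
    "bigcode/the-stack",      # assumed dataset id on the Hub
    data_dir="data/python",   # assumed layout: one subfolder per language
    split="train",
    streaming=True,
)

for example in ds.take(3):
    print(example["content"][:200])  # peek at the first few source files
```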
All right, now we get into the video models. There have been a few; they were released a while back, but I’ll just summarize them briefly here. Imagen Video is a text-to-video model. You can see a bunch of samples right here, and they look really, really cool. So this is a video diffusion model, but as far as I understand it, it is kind of a combination of fully convolutional networks and super-resolution networks in order to get this effect. They describe this further in a few diagrams on their website: Imagen Video uses a video U-Net architecture to capture spatial fidelity and temporal dynamics. Temporal self-attention is used in the base video diffusion model, while temporal convolutions are used in the temporal and spatial super-resolution models (a tiny sketch of this factorized spatial/temporal attention idea follows after this paragraph). There is a paper to go along with it if you are interested. Now, also from Google Research is Phenaki. I’m not exactly sure how to pronounce that, but it is a different text-to-video model that can produce up to minutes-long videos with changing text. So here you can see a prompt that constantly changes, and as it does, the video changes as well. Rather than being a diffusion model, this model compresses video to a tokenized representation and then essentially uses a causal autoregressive language model to continue that tokenized representation. With that, they’re able to essentially produce unbounded video, as the beginning of the video simply drops out of the context. But as long as you feed in, on the side, more and more text that you want to be produced, you can see that the video keeps changing, keeps adapting and keeps being faithful to the currently in-focus part of the prompt. What’s interesting is that the training data seems to be mostly text-image pairs, with just a few text-video pairs inside of the training data. Now, we’re not done with the text-to-video models yet. Meta AI actually released Make-A-Video, yet another text-to-video model. And this one is also a bit special, because it essentially only produces a single image from text. So it is essentially a text-to-image model, and then an unsupervised video generator from that image. The text-to-image model is essentially as we know text-to-image models, but the video model is unsupervised: it simply learns from unsupervised video data how video behaves, and is then able to take a single picture, a single frame of that video, and make the entire video out of it. The results look really cool. What I think is cool about all of these works is that they each have a different approach to the same problem. The results they produce are very cool, and it’s going to be interesting to see how this text-to-video problem will ultimately be canonically solved. I don’t know, but I’m keeping my eyes open. Now, slightly different, but not entirely different, is DreamFusion. This isn’t text-to-video, this is text-to-3D. Now, if you think that is relatively straightforward, then no: none of these things actually involves 3D training data, at least as far as I can understand it. Rather, what they do is they represent the entire scene essentially like a NeRF. So they start with a random 3D scene: you pick your 3D scene, you fill a bunch of voxels and don’t fill the others, and then you optimize that 3D scene so that renderings of it satisfy a text-to-image model, which essentially takes the place of photographs of that scene. So it is a lot like NeRF, except that you don’t have pictures; you optimize against a text-to-image model rather than optimizing against actual images. And that is a really cool idea, and it actually seems to work pretty great.
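To make the temporal self-attention idea from the Imagen Video description a bit more tangible, here is a tiny, self-contained PyTorch sketch. It is purely my own illustration of the general factorized spatial/temporal attention pattern, not the actual Imagen Video code: spatial attention treats each frame on its own, temporal attention attends across frames at each spatial location.

```python
import torch
import torch.nn as nn

# Minimal sketch of factorized spatial/temporal self-attention (my own illustration,
# NOT the Imagen Video implementation).
class FactorizedSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, pixels, dim)
        b, t, s, d = x.shape
        # Spatial attention: fold time into the batch and attend over the pixels of each frame.
        xs = x.reshape(b * t, s, d)
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(b, t, s, d)
        # Temporal attention: fold space into the batch and attend over frames at each location.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt, _ = self.temporal(xt, xt, xt)
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

attn = FactorizedSelfAttention(dim=64, heads=8)
video = torch.randn(2, 8, 16 * 16, 64)   # 2 clips, 8 frames, 16x16 spatial tokens
out = attn(video)                        # same shape: (2, 8, 256, 64)
```

The appeal of this factorization is that attention is never computed over the full (frames × pixels) sequence at once, which keeps the cost manageable for video.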
Now, there are other works still improving text-to-image diffusion models themselves. ERNIE-ViLG 2.0 is one of them. This is an iteration of the previous model, and it is using a mixture of denoising experts. I don’t want to go too much into this, but you can definitely see right here that the results are breathtaking and very good, with great resolution. Now, there is a demo on the Hugging Face Hub, but as far as I understand, this model isn’t released, so the demo and the code that they put on GitHub simply call some API where the model is actually stored. This is a neat tool not directly related to machine learning, but if you’ve ever wondered what the difference between a bfloat16 and an FP16 is, I never knew, but Charlie Blake has a very cool tool on a blog that essentially shows you the different trade-offs you can make when you choose a number format. It shows you, for the different formats, what kind of ranges you can represent with them, where they’re good and where they’re not good. So you can see here clearly the difference between a bfloat16 and an FP16: one can represent a huge range of numbers, and the other one can represent just a very small range of numbers, but with more precision (a quick snippet after this paragraph makes this concrete). GriddlyJS is a tool that allows you to interact with grid-world reinforcement learning environments. There are a number of cool features right here: you can edit levels directly, you can also try out the levels, you can debug your policies, you can record trajectories. So right now I don’t have a trajectory, but what I can do is I can hit record right here and I can move this thing around, here, going into the lava, and then I die, and you can see the steps I’ve taken right here. So you can use this to do various kinds of things: debugging, investigating, and so on. If you are into reinforcement learning and you work with grid worlds, then by all means check this out. Meta announces their new box, I guess. This is the box: this is a hardware architecture for deep learning, Grand Teton. Essentially, they release the architecture open source. Their engineers have sat down and thought long and hard about what it takes for a great machine learning system, a bit like the better-known DGX boxes, and they essentially tell you: look, we believe that this combination of hardware, these processors, these GPUs connected like this, with these power supplies, will be a very great base for doing research. Yeah, they’re releasing these specs essentially for you to just buy or assemble, I guess, whatever you want to do with it. But I can tell you, it is relatively hard to decide exactly on every component of the hardware, so it’s really great that people who are very competent in this actually think about it and give their suggestions. So if you have a lab or a company and you really want to buy your own hardware, maybe this is a good option for you. Hugging Face Diffusers from version 0.5.1 onward supports Diffusers in JAX: if you like JAX, if you like Stable Diffusion, go for it. Muse is an open-source Stable Diffusion production server. Well, it is not so much a server as it is sort of a tutorial on how to bring up a server. It is based on the Lightning Apps framework, which is open source and is kind of an easy way to bring together all the components you need to deploy machine learning things. And this repository is essentially a specification of how to pull up a Stable Diffusion server. So if you want to deploy Stable Diffusion yourself, this is probably the fastest and simplest way to do so.
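To make the bfloat16 versus FP16 trade-off mentioned above concrete, here is a quick snippet. It is not Charlie Blake's tool, just PyTorch's torch.finfo printing the range and precision of the two formats.

```python
import torch

# bfloat16 keeps float32's 8 exponent bits (huge range, coarse precision),
# while float16 spends more bits on the mantissa (small range, finer precision).
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.1e}")

# Typical output:
# torch.float16   max=6.550e+04  smallest normal=6.104e-05  eps=9.8e-04
# torch.bfloat16  max=3.390e+38  smallest normal=1.175e-38  eps=7.8e-03
```

The large epsilon of bfloat16 is exactly the "less precision" part of the trade-off, while its float32-like maximum is the "lot of numbers" part.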
trlX by CarperAI is a library that allows you to do reinforcement learning for text models. So you can see right here, you can give either some sort of a reward function, or you can give a data set that assigns values to expert demonstrations, and you can train a language model to incorporate that. This is a relatively new domain, doing reinforcement learning on text models, but it is cool to have another library to tackle the problem. RL Baselines3 Zoo is a training framework for Stable Baselines3 reinforcement learning agents. Stable Baselines is a library that tries to give reference implementations of reinforcement learning algorithms, because they’re very tricky and very hard to get right, so these are good, solid and performant reference implementations. Stable Baselines3 is the third iteration of it, and this repository right here, the Zoo, contains a number of surrounding things like scripts that make it very easy to interact with it, but also pre-trained agents and prepared hyperparameter settings that work well in different standard environments. JaxSeq is a library that allows you to train very large language models in JAX. The cool thing is that with this library you essentially get things like data parallelism or model parallelism; you can just specify them and trade them off however you want. This is due to the power and simplicity of JAX. Albumentations, I hope I’m pronouncing that correctly, 1.3 is out, and it introduces a bunch of new image augmentations. This is a library for image augmentations, so it’s good that they introduce new augmentations that fit very well with the augmentations they already have. There’s also a bunch of bug fixes and more. If you’re looking for image augmentations in Python, this might be a good library (a minimal usage sketch follows at the end of this paragraph). This is a really cool thing you can do with diffusion models: these people have trained diffusion models on brain images and were able to create new synthetic brain images with a degree of controllability. There is a paper on arXiv if you are interested, and you can also download the data set of 100,000 synthetic brain images. CodeGeeX is a multilingual code generation model. As it says, this is essentially something similar to Codex, but it is released: you can actually go and download the model and use it yourself. Meta AI releases AITemplate, which is an inference engine. The goal here is to make inference faster. They get a lot of speed-ups over just running standard inference in something like eager-mode PyTorch. This does two things: first of all, it optimizes your computation graph. If your computation graph contains a lot of little operations that could be fused together into something that’s really optimal for the given hardware, or that could just be expressed in a smarter way, then a graph optimizer can do that. And in a second step, there is a compiler that compiles all of this to highly performant C++ code that runs on backend hardware, such as a GPU that uses CUDA or even an AMD GPU. So if fast inference is a concern for you, this is definitely a thing to check out. Nerfstudio describes itself as a collaboration-friendly studio for NeRFs, but it is more like a collection, an entire collection of software to handle NeRFs: anything from training and validating to even experiencing them yourself. You can see they have a viewer that allows you to just explore the NeRFs that you make and create little videos from them, but really it covers everything to do with NeRFs.
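Coming back to the Albumentations library mentioned a bit further up, here is a minimal usage sketch of composing a few augmentations; the specific transforms and the random toy image are just my choice for illustration.

```python
import albumentations as A
import numpy as np

# A toy random image stands in for a real photo (loading from disk omitted).
image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)

# Compose a small augmentation pipeline; each transform fires with probability p.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])

augmented = transform(image=image)["image"]   # same shape, randomly augmented
```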
Now, speaking of NeRFs, nerfacc is a PyTorch NeRF acceleration toolbox. It gets significant speed-ups over simply using the NeRF code that’s out there. For example, a vanilla NeRF model with an 8-layer multi-layer perceptron can be trained to better quality in one hour, rather than the one to two days as in the paper. dstack’s logo doesn’t exactly work on a dark background, but dstack is a library that wants to standardize the ML workflows that you run in the cloud. Essentially, you check your workflows into GitHub, and dstack helps you to run them uniformly anywhere. In a workflow you can specify things like your workflow name, obviously, but then you can say, okay, my provider is bash, so this is essentially a bash script. Now, what are the commands? I want to pip install some stuff, I want to run this training script right here. But it also has things like artifacts, and you can also specify things like: I want to load data from this S3 bucket over there, I want to run on this cloud over there. So all of this is quite geared towards machine learning. It’s certainly not the first workflow engine, or the first take on “hey, let’s check our things into source control”, but it is very targeted at running ML workflows in the cloud. Several people have figured out massive speed-ups for the OpenAI Whisper model. For example, this person here has figured out a 3x speed-up on CPU inference, but refers to a GitHub thread where someone else has found an even bigger 3.25x speed-up. Again, it’s very cool to see what people do when you just give them the model. Lastly, I want to point to a couple of databases for stuff, mainly around Stable Diffusion. DiffusionDB is on the Hugging Face Hub; it’s a data set of prompts that have been entered by real users into Stable Diffusion and the corresponding images that they got out. Public Prompts, that’s publicprompts.art in your browser, is a database of free prompts and free models. These models are mostly trained using DreamBooth, but if you’re looking for inspiration for prompts and how they turn out, then this is maybe a place to go. Likewise, visualized.ai is a website that goes a little bit more business-y: it lets you create some stuff for free with things like Stable Diffusion, but then it also acts as a bit of a marketplace for these things, such that you could also buy them or sell them. It’s cool to see that different business models are trying to spring up around this ecosystem. Ultimately, someone will figure out how to really make money off of this stuff, but, you know, it’s good to be part of the time when people are just trying stuff and seeing what happens, not only on the research side but also on the business side. Lastly, BigScience has released PromptSource, which is an IDE for natural language prompts. This is a way to give people a bit more help and a bit more standardization when they use prompts to achieve certain goals, for example when they use prompts to tackle some of the NLP challenges that are now more and more phrased simply as prompts to these large language models, rather than as data that goes into a specially trained model for that task. So if you find yourself in this situation or a similar one, then PromptSource may be for you. And lastly, this is a database of all Lex Fridman podcasts, transcribed. This is a website by Andrej Karpathy, and he used a simple combination of a YouTube download script and OpenAI’s Whisper to transcribe all of Lex Fridman’s podcast episodes. You can go to any one of them, you can click, and they are there with time annotations and all.
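For a rough idea of how such a transcript site can be built, here is a sketch of the download-then-transcribe approach described above. It is not Karpathy's actual script, the URL is a placeholder, and the model size is just an example: pull the audio with yt-dlp, run Whisper on it, and keep the timestamped segments.

```python
import subprocess
import whisper  # pip install -U openai-whisper yt-dlp

url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder episode URL

# Extract the audio track of the episode as an mp3 file.
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "mp3", "-o", "episode.%(ext)s", url],
    check=True,
)

# Transcribe with Whisper; larger models are slower but more accurate.
model = whisper.load_model("base")
result = model.transcribe("episode.mp3")

# Each segment carries start/end times, which gives the time annotations.
for seg in result["segments"]:
    print(f"[{seg['start']:8.1f}s] {seg['text'].strip()}")
```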
It’s a very simple but very cool project. Thank you, Andrej, and I thank all of you for listening. I’ll be home again next week, and till then, stay hydrated. Bye bye.
