Diffusion models with KerasCV
Hey everyone, my name is Devya, and I'm representing the Keras team at Google. I'll be talking about diffusion models, what can be done with them, and how to use them with KerasCV.

So what are diffusion models, and why should you care? Historically, you could generate fake shoe pictures. These aren't real shoes, and they're not particularly interesting. You could learn the latent space of a dataset: the top row here is real handwritten digits, and the bottom row is not real handwritten digits but is generated from the latent representation of the row above, which is basically a very compressed format of the images. Or you could generate deepfake videos: on the left is Obama talking, and on the right is a synthesized video of Obama talking. All of this is quite interesting, but nothing really useful, and it was really difficult to control the output.

Then, in January 2021, OpenAI launched DALL·E 2, a very powerful and popular image generation model. This was the first mainstream text-to-image model. This is the first prompt they launched it with: an astronaut riding a horse in a photorealistic style. It's pretty incredible. It's cool, but the model was still closed source, so nobody could really use it. They do have an API that you can use, especially now that they've opened it up a little bit more.

But recently, about two months ago, Stable Diffusion was launched. It is a fully open-source text-to-image model, released by a startup called Stability AI, and it is generously licensed. It works very similarly to DALL·E 2, and even more similarly to another model called Imagen, which is also not open source.

Here is an example. You can provide a text prompt, something like "paradise cosmic beach," and the model outputs this image of a paradise cosmic beach. I have a few more examples here to give you an idea of what some of these outputs look like. These are pretty cool, completely artificially generated images. This is a gentleman otter in a 19th-century portrait. This is a cute magical flying dog, fantasy art drawn by Disney concept artists. This is a pencil sketch of robots playing poker; you can even see the poker chips in there.

That's pretty cool, but that's not all. You can also do image-to-image workflows guided by text. Here, the model accepts two inputs: one is the image of the paradise cosmic beach, with the region you want to change masked out, and the other is a text prompt, "a pirate ship," and the model outputs something like a pirate ship on top of your paradise cosmic beach.

Here is my colleague Luke sitting by the river, and what we are about to do is called inpainting, where you mask some part of the image, for example the boats, and provide prompt keywords like "man sitting by the river bridge," "river bridge," and "unicorn sitting by the river bridge." So that was image inpainting.

You can also do image outpainting and variation generation. Here is the painting Girl with a Pearl Earring. If you have ever imagined what's outside the frame of this picture, you can extend it while preserving its style; this particular picture was extended by DALL·E 2, and this is called outpainting an image. You can also do other things like generating variations of an image by seeding the model with the original image. This was done using Stable Diffusion, and we'll talk more about it later.

So let's take a step back and try to understand how all of this works. Unlike what you might expect at this point, Stable Diffusion doesn't actually run on magic; it's just a lot of data. It's a kind of latent diffusion model.
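Before digging into how it works, here is a minimal sketch of what generating an image like the ones above looks like in code. It uses one of the example prompts from the talk; the exact argument names follow the KerasCV Stable Diffusion API as I recall it, so treat this as an illustrative sketch rather than a definitive reference.

```python
# Minimal sketch: text-to-image generation with KerasCV Stable Diffusion.
# Assumes TensorFlow and keras_cv are installed (e.g. `pip install keras-cv tensorflow`).
import keras_cv
from tensorflow import keras

# Instantiate the model; 512x512 is the resolution Stable Diffusion was trained at.
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

# Generate one image for a prompt from the talk.
images = model.text_to_image("a pencil sketch of robots playing poker", batch_size=1)

# `images` is a uint8 array of shape (batch, height, width, 3); save the first one.
keras.utils.save_img("robots_playing_poker.png", images[0])
```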
Let's dig into what that actually means. You may be familiar with the idea of super-resolution: it's possible to train a deep learning model to denoise an input image and thereby turn it into a higher-resolution version. The deep learning model doesn't do this by magically recovering the information that is missing from the noisy, low-resolution input; rather, the model uses its training data distribution to hallucinate the visual details that would be most likely given the input.

So what happens when you push this idea to the limit? You may start asking: what if we run such a model on pure noise? The model would then denoise the noise and start hallucinating a brand new image. By repeating the process multiple times, you can turn a small patch of noise into an increasingly clear, high-resolution artificial picture.

Now, to go from latent diffusion to a text-to-image system, you still need one key feature: the ability to control the generated visual content via keyword prompts. This is done via conditioning, a classic deep learning technique that consists of concatenating the noise patch with a vector that represents a bit of text, and then training the model on a dataset of image-caption pairs.

This gives rise to the Stable Diffusion architecture, which consists of three parts: a text encoder, which turns your prompt into a latent vector; a diffusion model, which repeatedly denoises your latent image patch; and a decoder, which turns the final latent patch into a high-resolution image. First, your text prompt gets projected into a latent vector space by the text encoder, which is simply a pretrained, frozen language model. Then the prompt vector is concatenated with a randomly generated noise patch, which is repeatedly denoised by the diffusion model over a series of steps. The more steps you run, the clearer and nicer your image will be; the default value that's typically used is 50 steps. Finally, the latent image is sent through the decoder to properly render it in high resolution.

So how would you use this? You'd simply install the KerasCV package, instantiate the Stable Diffusion model, and get creative with the text prompts. It's 7 to 8 lines of code, it takes three or four seconds to give you an output, and you can have fun with it. Aside from the easy-to-use API, KerasCV's Stable Diffusion comes with some powerful advantages: you can run the code with graph-mode execution, you can enable XLA compilation by passing jit_compile=True, and it also supports mixed precision computation. When these are combined, the KerasCV Stable Diffusion model runs orders of magnitude faster than a naive implementation (see the sketch below).

For variation generation, you can switch out the text encoder for an image encoder and seed your model with the original image. Here is an example of Van Gogh's Starry Night, and on the right you see generated variations of it.

You can also teach your model new concepts; this is called textual inversion. For example, you can collect a few pictures of your object, call it something like "S*", and teach your model this new concept. Then you can start giving prompts like "an oil painting of an S*" or "an app icon of an S*". You do that by collecting some images, adding the new word to your vocabulary, and training on the new image-caption pairs. A KerasCV tutorial for this is coming up soon, and you can give that a shot.
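Before the next example, here is a short sketch of the performance options mentioned above: mixed precision and XLA compilation via jit_compile=True. The argument names (jit_compile, batch_size, num_steps) reflect the KerasCV API as I understand it, so treat this as an assumption-laden sketch rather than a definitive reference. Note that the very first generation call also pays the one-time graph tracing and XLA compilation cost, so time a second call for a fair benchmark.

```python
# Sketch: enabling mixed precision and XLA for KerasCV Stable Diffusion.
import time

import keras_cv
from tensorflow import keras

# Mixed precision: run compute in float16 while keeping variables in float32.
keras.mixed_precision.set_global_policy("mixed_float16")

# jit_compile=True asks TensorFlow to compile the generation graph with XLA.
model = keras_cv.models.StableDiffusion(
    img_width=512, img_height=512, jit_compile=True
)

start = time.time()
images = model.text_to_image(
    "a cute magical flying dog, fantasy art drawn by Disney concept artists",
    batch_size=3,
    num_steps=50,  # more denoising steps -> cleaner image, but slower generation
)
print(f"Generated {len(images)} images in {time.time() - start:.1f} s")
```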
Okay, so here is an example of training your model to learn your cat, and then you can provide prompts like "a photo of Tom wearing a top hat," where Tom is whatever special name you want to give it. It's pretty cool. I also have a cool demo video: my colleague Ian worked on a hackathon project to automatically synthesize a music video from a song's lyrics and timestamps, and the song's name is American Pie. Okay, next slide please. You can go and watch the whole video on YouTube. So the possibilities are limitless, and a lot more tutorials are coming up on the keras.io page, thanks to the KerasCV team. I encourage you to give it a try and showcase how creative you can get with this. Good luck, and thank you.
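If you want to give textual inversion a try, for instance with the cat example above, the starting point is a handful of images paired with captions that all use a new placeholder token. The snippet below is only a rough, hypothetical sketch of that data-preparation step; the token name, file paths, and the load_image helper are made up for illustration and are not part of the KerasCV API (the upcoming KerasCV tutorial covers the actual fine-tuning procedure).

```python
# Hypothetical sketch: assembling image-caption pairs for textual inversion.
# The placeholder token, file names, and helper below are illustrative only.
import tensorflow as tf

placeholder_token = "<my-cat>"   # the new "word" the model will learn, e.g. Tom the cat
image_paths = ["cat_01.jpg", "cat_02.jpg", "cat_03.jpg"]  # a few photos of your object
prompt_templates = [
    "a photo of {}",
    "a close-up photo of {}",
    "a rendering of {}",
]

def load_image(path):
    # Decode and resize to the 512x512 resolution Stable Diffusion expects.
    image = tf.io.decode_image(tf.io.read_file(path), channels=3, expand_animations=False)
    return tf.image.resize(image, (512, 512))

# Every caption refers to the placeholder token; fine-tuning then learns an
# embedding for that token so prompts like "an oil painting of <my-cat>" work.
pairs = [
    (load_image(path), template.format(placeholder_token))
    for path in image_paths
    for template in prompt_templates
]
```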