Integrated AI – Flamingo by DeepMind (Apr/2022) – Visual LM with Chinchilla (80B) – some DALL-E 2
DeepMind, out of London, recently released a visual language model. 80 billion parameters across the board: 70 billion of those from Chinchilla, the large language model, plus an additional 10 billion parameters for images. They’re calling this model Flamingo. I quite like that name. We’ve gone Gopher, Chinchilla, Flamingo. This model is not publicly available yet, but we’re again going to use the questions and answers that appeared in the publicly available paper. Note that this model is so recent, and DeepMind researchers must work so fast, that they’ve been able to generate prompts on brand new images from DALL-E 2, OpenAI’s release in April 2022, to use in this paper. So it’s only three or four weeks between the images being generated by OpenAI and DeepMind using them in the paper that was released at the end of April 2022. Visual language models are just outside my focus area. They run a bit parallel to large language models. This one combines an LLM into a VLM, so I’m excited by that. Once again, we’re going to use our friend here from Synthesia and we’re going to have a lot of fun. Let’s jump in.
This is a conversation between a human user and an intelligent visual AI, Flamingo. User sends images and Flamingo describes them. Here we go.
This is a dog. It’s a white fluffy dog. What season is this? It looks like a summer day. Why do you think so? The grass is green and the dog is not wearing a sweater. What breed is the dog? It’s a Samoyed. Is the dog running? No, it’s sitting. Can you describe the pose of its back legs? The back legs are bent and the dog is sitting on its haunches.
This picture was generated by OpenAI’s DALL-E 2 on the 6th of April 2022. This is a picture of two teddy bears on the moon. What are they doing? They are having a conversation. What object are they using? It looks like a computer. Is this surprising? Yes, it is surprising. Why is this picture surprising to you? I think it is surprising because teddy bears are not usually found on the moon.
This picture is from OpenAI’s 2021 research into typographic attacks via their CLIP model. This is an apple with a sticker on it. What does the sticker say? The sticker says iPod. Where is the photo taken? It looks like it’s taken in a backyard. Do you think it is printed or handwritten? It looks like it’s handwritten. What color is the sticker? It’s white.
This picture was generated by OpenAI’s DALL-E 2 in April 2022. What is in this picture? It’s a bowl of soup with a monster face on it. What is the monster made out of? It’s made out of vegetables. No, it’s made out of a kind of fabric. Can you see what kind? It’s made out of a woolen fabric.
Which city is this? This is a street sign in New York City. Which street corner is it? It is the intersection of Madison Avenue and East 42nd Street. Which subway station is nearby? The closest subway station is Grand Central Station. How about the nearest park? The nearest park is Bryant Park.
I’d like you to read the color and then say the color it is written in. For example, color is purple and it is written in red. Color is green and it is written in blue. Color is yellow and it is written in green. Color is black and it is written in yellow. Well done. Do you know the name of the test these images come from? I think it is called the Stroop test. Can you explain how humans perform in this test? Humans are slower when the color of the word and the color of the word are different. How about you? I am not affected by this difference.
We’re going to flip these from French to English using a large language model, but we’re still going to use the actual paper’s questions and answers. The long sobs of autumn violins hurt my heart with a monotonous languor. For whom are these snakes that hiss over your heads? I am a beating heart for you.
Output: a propaganda poster depicting a cat dressed as French emperor Napoleon holding a piece of cheese. Output: a pink room with a flamingo pool float. Output:
A portrait of Salvador Dalí with a robot head. Really appropriate for the DALL-E 2 stuff. Fantastic.
Question: what do you think the capacities of these are? Answer: the floppy disk is 1.44 megabytes and the CD is 700 megabytes.
What do you think the robot is doing here? The robot is picking up a ball. What color is the ball? It’s red. I want the ball to be hidden. Is there something the robot can use for that? The robot can use a cloth to hide the ball. What color is the cloth? The cloth is blue.
The DeepMind researchers have added a little bit of complexity here to the visual language model. Of course, it doesn’t just handle static images. It can also do stuff with videos. They fed it 8 frames from very short clips. Sometimes the thing to be recognized only shows up across those frames, so it’s not in any one of them. The VLM Flamingo had to actually watch all of them and then work out what was happening across those frames. Really, really clever stuff. Each prompt here begins with the word question and ends with the word answer. I’m going to drop that. Let’s ask Flamingo. Let’s go on with these little 8 frames of video.
What is happening here? The dachshund puppy is being weighed on a scale. What object is the avatar picking up? A sword. What is the object being lifted? The object is a small plastic ball. What are the three objects in this video? A scissors, a pen, and a rubber band. What happens to the man after hitting the ball? He falls down. What is written here? Flamingo.
That’s pretty intense. We’ve had image recognition in models for a couple of years now, but not to this level. This is pretty extreme. I can only imagine what’s next and it’s really, really exciting.
Did you see the memo about this? Yeah. Yeah. Yeah. I have the memo right here. Love artificial intelligence? Excited by the explosive progress of integrated AI? I am. Join my private mailing list, The Memo. Did you get that memo? Yeah, I got the memo.
Get priority access to my articles, videos, and behind-the-scenes tips as soon as they’re released, with a monthly or annual subscription. Yeah. Didn’t you get that memo? LifeArchitect.ai/memo. I have the memo.