
Two Minute Papers: OpenAI’s Whisper Learned 680,000 Hours Of Speech!

Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

OpenAI’s new Whisper AI is able to listen to what we say, and transcribe it. Your voice goes in, and this text comes out. Like this. This is incredible and it is going to change everything! As you see, when running through these few sentences, it passes with flying colors. Well, stay tuned, because later in this video you will see whether we were able to break it. Can it be as good as a human? We will test that too!

But first, let’s try to break it with this speed-talking person. Wow, that’s going to be hard. So, let’s see the result. Whoa. That is incredible. And that’s not all, it can do so much more! For instance, it does accents too. Here is an example.

So good. Now, when talking about accents, I am here too, and I will try my luck later in this video as well. The results will be… interesting, to say the least. But wait, it knows not only English: the scientists at OpenAI said, let’s throw in 96 other languages too. Here is French, for example. And as you see, it also translates into English. So cool!

This is all well and good, but wait a second: transcription AIs already exist. For instance, here on YouTube, you can request transcripts for many videos. So, what is new here? Why publish this paper? Is this better? Also, what do we get for these 680,000 hours of training? Well, let’s have a look. This better be good. Whoa! What happened here? This is not a good start. At first sight, it seems that we are not getting a great deal out of this AI at all. Look. Here, in the 20 to 40 decibel signal-to-noise range, which means a good-quality speech signal, Whisper is the highest. So, is it the best AI around for transcription?

Well, not quite! You see, what we are looking at here is the word error rate, which is subject to minimization. That means that smaller is better. We noted that 20 to 40 decibels is considered a good-quality signal. Here, Whisper has a higher error rate than previous techniques. But wait, look at that! When going to 5 to 10 decibels and below, the signals are so bad that we can barely tell the speech from the noise. For instance, imagine sitting in a really loud pub. This is where Whisper really shines. Here, it’s the best.
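For Fellow Scholars who would like to see the two quantities from this comparison spelled out, here is a minimal sketch in Python. It assumes the usual textbook definitions: the word error rate is the word-level edit distance divided by the number of reference words, and the signal-to-noise ratio in decibels is ten times the base-10 logarithm of the power ratio. The function names are made up for illustration, and the lowercasing is a simplification; real evaluations normalize the text more carefully.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Simplification (assumption): lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # a reference word was dropped
                curr[j - 1] + 1,     # an extra word was inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from two average powers."""
    return 10.0 * math.log10(signal_power / noise_power)


# One wrong word out of five gives a word error rate of 0.2.
wer = word_error_rate("this is two minute papers", "this is too minute papers")

# A signal 100 times more powerful than the noise sits at 20 decibels.
clean = snr_db(100.0, 1.0)
```

Lower is better for the first number, and higher means a cleaner recording for the second, which is why a method can look worse on clean audio and still win in the loud pub.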

And, this is a good paper, so we have plenty more data on how it compares to a bunch of previous techniques, look. Once again, we have the word error rate. This is subject to minimization, lower is better. From A to D, you see other previous automatic speech recognition systems, and it beats all of them. And what do we have here? Now hold on to your papers, because…can that really be?

Is it as good as a human? That can’t be, right? Well, the answer is yes, it can be as good as a human! Kind of. You see, it outperforms these professional human transcription services, and is at the very least competitive with the best ones. An AI that transcribes as well as a professional human does. Wow, this truly feels like we live in a science-fiction movie. What a time to be alive!

Okay, it is as good as many humans, that’s alright, but does this pass the ultimate test for a speech AI? What would that be? Of course, that is the Károly test. That would be me speaking with a crazy accent. Let’s see. Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. And Dear Fellow Scholars, I don’t know what is going on here: it got my name perfectly. Perhaps that is a sign of a superintelligence in the making.

Wow. The capitalization of 2 Minute Papers is alright too. Now, dear Fellow Scholars, let’s try this again. And now, that is what I expected to happen. The regular speech part is transcribed well, and it flubbed my name. So, no superintelligence yet, at least, not reliably.

So, what is all this good for? Well, imagine that you are looking at this amazing interview with Lex Fridman on superintelligence. And it is one and a half hours long. Yes, that is very short for Lex. Now, we know that they talk about immortality, but where exactly? Well, that’s not a problem anymore, look. Andrej Karpathy ran Whisper on every episode of Lex’s podcast, and there we go. This is the relevant part about immortality. That is incredible. Of course, you Fellow Scholars know that YouTube also helps us with its own transcription feature, or we can also look at the chapter markers. However, not all video and audio are on YouTube.

And here comes the kicker: Whisper works everywhere! How cool is that? And here comes the best part. Two pieces of amazing news: it is open source, and not only that, but you can try it now too. I put links to both of these in the video description, but as always, please be patient. Whenever we link to something, you Fellow Scholars are so excited to try it out that we have crashed a bunch of web pages before. This is what we call the Scholarly Stampede.
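For those who want to try it on their own recordings, here is a minimal sketch in Python of the idea from the podcast example: transcribe a file, then look up where a word is spoken. It assumes the open-source `openai-whisper` package is installed, and it relies on the transcription result carrying a `segments` list with `start` times and `text`. The helper names and the file name `interview.mp3` are hypothetical, and this is not necessarily how the podcast transcripts mentioned earlier were produced.

```python
def transcribe(path, model_name="base", translate=False, model=None):
    """Transcribe an audio file; with translate=True, translate it into English."""
    if model is None:
        # Imported lazily so the helpers below also work without the package.
        import whisper  # third-party: pip install openai-whisper
        model = whisper.load_model(model_name)
    task = "translate" if translate else "transcribe"
    return model.transcribe(path, task=task)


def find_mentions(result, keyword):
    """Return (start_seconds, text) for every segment that contains keyword."""
    keyword = keyword.lower()
    hits = []
    for segment in result["segments"]:
        if keyword in segment["text"].lower():
            hits.append((segment["start"], segment["text"].strip()))
    return hits


def format_timestamp(seconds):
    """Turn a number of seconds into an h:mm:ss string."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"


# Example usage (hypothetical file name, needs the package and the audio file):
# result = transcribe("interview.mp3")
# for start, text in find_mentions(result, "immortality"):
#     print(format_timestamp(start), text)
```

Larger model names trade speed for accuracy, so starting with a small one is a reasonable way to check that everything works before transcribing a long interview.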

So, I hear you asking, okay, but what is under the hood here? If you have a closer look at the paper, you see that this is using a simple learning algorithm, a transformer, with a vast dataset, and it can get very, very far with it. You see here that it makes great use of those 680 thousand hours of human speech. The results for languages other than English and for translation improve a great deal if we add more data, and even the English part improves a bit too. So, this indicates that if we gave it even more data, it might improve even more.

And don’t forget, it can deal with noisy data really well, so adding more might not be as big of a challenge! And it is already as good as many professional humans. Wow. I can only imagine what this will be able to do just a couple more papers down the line! What a time to be alive!

Thanks for watching and for your generous support, and I’ll see you next time!
