Computation, Bayesian Model Selection, Interactive Articles
Hello folks, welcome back to the Machine Learning Street Talk YouTube channel. Today is an interesting episode. We are here with two legends: MIT PhD and co-host Dr. Keith Duggar, and also our resident Bayesian, Alex Stenlake. We spent an hour talking about the theory of computation, intelligence, Bayesian model selection, and the intelligence explosion, and then for 30 minutes we covered this really interesting distill.pub article on interactive articles. Distill.pub is an open-source publishing platform which promulgates information in a really innovative way, rather than the traditional scientific papers of old where it's just lots of mathematical formulas and language. Now the focus is on animations and graphics and interactivity and simulations, allowing us to communicate ideas more powerfully and more effectively. But there are some trade-offs to this approach: it can't be consumed through standardized channels, and it's harder to create that kind of content. So we'll go into detail on that. Anyway, I hope you enjoy the episode today. Remember to like, comment, and subscribe, and we'll see you back next week. Tony Hoare has this brilliant quote: there are two ways to do software design. You either make it so simple that there are obviously no errors, or you make it so complicated that there are no obvious errors. We've got Alex on tomorrow to talk about kernel methods. Cool. And so I need to study chapter six of PRML today. Oh, do you? Yeah. Oh, they're a very practical, practically useful tool. They're not really going to get you state of the art, but they have their utility on compute-limited platforms. They have a special place in my heart, even if they're a bit old-hat. Hey, what's old is new, right? If not this year, then five years from now they could be renamed to something else and become state of the art. To give away some of the spoilers here: attention mechanisms are just kernels. The way to start pulling them apart is old-school kernel theory; view it as an attention kernel, and what's going on there makes a lot more sense. Plus, you know, all the sparse kernel stuff is starting to come through with randomized kernels, but I'd really want to see some of that old-school sparse kernel literature come into the attention systems we see these days. Yeah, it makes you think that everything is everything, because someone commented on our YouTube video yesterday saying, oh, don't you know that transformers are RNNs? Keith and I were basically saying that these statistical language models are shit, that they just learn these kind of crappy patterns and so on, and we were using a feature visualization of a CNN as an example of this. We were saying language models are just doing the same thing, but apparently they are basically RNNs, which means they are Turing complete, and they could actually represent a context-free grammar system. So they could be learning the rules of language, basically. It made me think, because they are computer programs, but they're not like computer programs, because they're neural networks; they just do all of these matrix multiplications. So could they really represent a context-free grammar? What's the intuition there? Even if we get a very good facsimile of a language or a language model, how close of a facsimile is it, and how many of the things that we feel define a language
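The "attention mechanisms are just kernels" point can be made concrete in a few lines. Here is a minimal numpy sketch (toy shapes and names of my own choosing, not any particular paper's formulation) showing single-query softmax attention written as a Nadaraya-Watson kernel smoother with an exponential kernel:

```python
import numpy as np

def kernel_attention(q, K, V):
    """Single-query softmax attention viewed as kernel smoothing:
    an exponential kernel between query and keys, normalized, then
    used as weights for an average over the values."""
    d = K.shape[1]
    weights = np.exp(q @ K.T / np.sqrt(d))   # exponential (softmax) kernel, shape (n_keys,)
    weights = weights / weights.sum()        # normalize: Nadaraya-Watson smoother
    return weights @ V                       # weighted average of values

# Toy usage: 4 key/value pairs of dimension 8, one query.
rng = np.random.default_rng(0)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
q = rng.normal(size=(8,))
print(kernel_attention(q, K, V))
```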
are reflected? Things like nuance, things like implicit meaning? Yeah, I have an answer to that one. So CFGs, context-free grammars, are computable by Turing machines, and computable by even simpler machines; a pushdown automaton, for example, can recognize CFGs. Natural languages are CSGs, context-sensitive grammars. Right. And I think a lot of times people throw out this "XYZ is Turing complete" or whatever, but it elides the fact that an actual Turing machine has an infinite amount of memory. So no actual practical machine is a Turing machine, because none of them have infinite amounts of memory. And it also elides how things scale. Okay, fine, if RNNs are, quote-unquote, Turing complete as a computational machine, how does the amount of memory or circuitry or whatever that they require to calculate a particular thing scale with the size of the input? If we want them to recognize all context-sensitive sentences of 1,000 characters versus 10,000 characters, do they grow quadratically, exponentially? That's all just left out. Right: oh, it's Turing complete, therefore all of that problem goes away. No. And we care a lot about efficiency. The concept of efficiency comes up in, sort of, Chollet's concept of intelligence and whatnot. Yeah, I guess everything could be computed with an infinitely large number of NAND gates. Are those intelligent? Well, we don't know. There's some cool stuff coming out of, I guess it's the control theory literature, looking at Koopman operators. Basically what they let you do is take a nonlinear problem and reduce it to a linear problem, and all you need is infinite dimensions, right? Easy. In practice that means you truncate your SVD at a couple of hundred dimensions, but, you know, all the proofs go out the window. Everyone says, oh, we can do this because it's theoretically justified. Of course, it's probably not so theoretically justified. Infinity, yeah, it's a good concept. Right, gents, should we jump in? Because I'm sure we're going to go down some rabbit holes anyway. I'm so tempted just to continue the conversation we were having, because before, Keith and I were asking: do neural networks reason or not? And Keith made the argument that you can unroll them, so it doesn't matter that it's a neural network; you can unroll the looping semantics and the conditions into something like a static network, so it is reasoning. But then other things come into the mix. You mentioned efficiency. If you have a function that sorts a bunch of numbers, what you could do is just memorize n factorial permutations. So you just have a memorization machine. It's clearly not doing any reasoning; it's just retrieving the sorted list. And then you have the concept of generalization, which is Chollet's thing: if you're memorizing everything, you're not generalizing, by definition. So there seems to be some kind of trade-off between generalization, memorization, and efficiency. So that's why I'm going to come back to the scaling thing. The first class I ever took that made me realize, yeah, I have natural limitations, was the theory of computation. I took the theory of computation from Michael Sipser, actually, who wrote the textbook and is a well-known theory of computation professor.
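The "memorize n factorial permutations" sorter is easy to spell out. A toy sketch (illustrative only): precompute, for every possible ordering of n distinct items, the index permutation that sorts it, then sorting is pure retrieval with factorial space cost.

```python
import itertools

def build_sort_table(n):
    """'Memorization machine' for sorting: for each of the n! orderings
    of n distinct items, store the precomputed argsort. Zero computation
    at query time, factorial space."""
    table = {}
    for perm in itertools.permutations(range(n)):
        # argsort of this ordering, memorized rather than computed on demand
        table[perm] = tuple(sorted(range(n), key=lambda i: perm[i]))
    return table

table = build_sort_table(8)            # 8! = 40,320 entries already
print(table[(3, 1, 4, 0, 5, 2, 7, 6)]) # retrieval, not reasoning
# n = 13 would need ~6.2 billion entries; sorted() generalizes to any n.
```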
And it was very mind-bending in some parts and strained the limits of my personal neural net capabilities. The core aspect of analyzing computation is looking at scalability. So, for example, the difference between an NP-complete problem and a polynomial problem is one of scaling, because if you just take a finite input, then any problem is finite. A traveling salesman problem with, whatever, 50 nodes or something, eventually you can solve that problem just by waiting long enough, computing long enough. The point is that as you increase that number, the cost scales with the number of nodes in a very different way than a polynomial problem. So if we don't pay attention to the scaling, I don't think we're really going to get anywhere. And so with the unrolling thing that I brought up in the last call: okay, take a multiplication circuit, and unroll it so that it can be computed in a single cycle, a single flow of electrons through enough NAND gates. I don't know, because I'm not a circuit designer, but let's suppose that to do eight-bit multiplication we need a certain number of circuits. Now, to do 16 bits, if that goes up by, say, a factor of two to the n, that's exponential growth, versus how much circuitry I need to do it iteratively, which may be far less. Then yes, in principle, you can do it with an unrolled static circuit, but it scales very differently. And so if your concept of, quote, intelligence really is more formally defined to include some of these scaling aspects, then I think we maybe get at the heart of what we're talking about when we talk about reasoning and that sort of thing. How would you link the scaling? Because we're talking about two different substrates of computation. I think you're talking about traditional code, and then we're talking about neural networks. With traditional code, something like the traveling salesman problem has, I think, an exponential time complexity. I think it's two to the n or something; I can't remember, but it's huge. But once you've found the optimal solution, then you've got it, right? So it's super quick. But neural networks are a bit different, because you have a training phase, and now with these GPT models you need a hell of a lot of computation even after you've trained it, just to do the inference. So how would you compare the two paradigms? I would look at their time-space trade-off, in the traditional sense, and in this case space would be the size of the circuit. So let's suppose we set the requirement that I want a program that solves the traveling salesman problem exactly. No probabilistic solutions, no Monte Carlo, no Las Vegas algorithms; it just actually solves it. Okay, and if I don't have time, if I just have a static circuit that evaluates it, then to handle problems of up to n nodes, where n is the number of states to visit, or locations, I need some function of n nodes in my neural network. Whereas if I can do iteration, I need some other number of nodes in that circuitry, and it's going to be a much smaller number.
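To make the scaling point concrete, here is a sketch of exact TSP by brute enumeration. (The exact dynamic-programming route, Held-Karp, runs in O(n² 2ⁿ), which is presumably the "two to the n" being half-remembered above; plain enumeration is (n-1)!. Either way it explodes, which is the point.)

```python
import itertools
import math

def tsp_brute_force(dist):
    """Exact TSP by enumerating all (n-1)! tours starting at city 0:
    fine for n around 10, hopeless for n = 50."""
    n = len(dist)
    best_len, best_tour = math.inf, None
    for perm in itertools.permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

# 5-city toy instance (symmetric distance matrix).
dist = [[0, 2, 9, 10, 7],
        [2, 0, 6, 4, 3],
        [9, 6, 0, 8, 5],
        [10, 4, 8, 0, 6],
        [7, 3, 5, 6, 0]]
print(tsp_brute_force(dist))
```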
So I think that's where you would look to start comparing how intelligent things are in terms of this efficiency metric: what problems can they solve, pound for pound, per circuit? This is the time-space trade-off, because the neural network, if it can solve the problem exactly, will give it to you in this single pass, which, let's just say, takes unit time, whereas the thing that computes iteratively will maybe take 32 units, or whatever the number is. But I think one of the reasons the two paradigms are different, though, is that TSP or sorting is structured. You can analytically enumerate all of the possible routes, and as we said, it's two to the n, it's ridiculously high. Whereas a computer vision algorithm, a neural network, has to be statistical. You have to have some kind of representation. A CNN has this trick with the receptive field, the weight sharing, these filters. You couldn't possibly enumerate all of the pixel space, because that would just be insane. That's okay. So then you're just redefining what the definition of success is: I got within X percent of the correct answer, Y percent of the time, that sort of thing. And I think it's worth noting that for a sufficiently small pixel space, you can enumerate all possible combinations. We're doing discrete maths here; fundamentally, it's going to be discrete at some level. If your neural network's small enough, you may even be able to fit it on a multi-terabyte hard drive. But all we're doing is finding an efficient approximation scheme. The reason it gets hard with things like traveling salesman is that the underlying space isn't necessarily regular, with some standard dimensionality. The relationships can change, so we can't just fit the whole thing neatly into a fixed matrix size, a fixed matrix space, and solve it there. Even now, if we use a matrix representation of the graphs, we're dealing with graphs of different sizes. And if we know anything about graphs, it's that local and global topology interact in very strange ways that may not generalize between different-sized graphs. You said if we made the problem small enough. But even if you had a three-by-three grid and you had three color channels and, what, 32 bits, what's the data structure for the pixels? It'll be eight-bit unsigned ints, because that's typically how it's stored. Okay, so that's two to the eight, raised to the power of three cubed values. It's pretty big. Yeah, sure. These things do grow quickly. These time-space trade-offs can have very high scaling in either dimension, quite commonly in the space dimension. The only reason I brought this up is that I think we need to be careful again, because I don't want to be in a situation where some particular black box, okay, we deem that it's reasoning. Everybody agrees this black box is reasoning, and when we put in inputs, we get the answer back in five minutes or something, and it's doing some reasoning there. And then when version nine of that black box comes out, alpha nine, going back to the AlphaZero thing, all of a sudden when we put in the problem we actually get the answer back in one nanosecond, because the circuit inside there got so complex that some electrons pass through and boom, the answer comes out. Now this thing is stupid, man. It's not even thinking about anything.
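The arithmetic from that three-by-three example, spelled out (assuming 8 bits per value, as discussed, so 3 x 3 pixels x 3 channels = 27 stored values):

```python
# Number of distinct images for a 3x3 grid, 3 channels, 8 bits per value:
# each of the 3*3*3 = 27 stored values takes one of 2**8 levels.
n_values = 3 * 3 * 3
n_images = (2 ** 8) ** n_values    # = 2 ** 216
print(n_images)                    # ~1.05e65 possible tiny images
print(len(str(n_images)))          # 66 digits
```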
It's just giving the answer instantaneously. That's not a reasoning machine. I don't know if you guys ever read the original graph neural network papers, or at least one of the first big ones that Google did as part of the GNN push, but they motivated that paper with an interesting analogy taken from Noam Chomsky, talking about intelligence and reasoning as the ability to take two existing ideas and synthesize them into a third, potentially unrelated, novel idea. And I wonder if that's what we're missing here. Is it reasoning if we do a pass and we get an answer? We're getting an answer, but that doesn't feel like reasoning. If we're going to talk about reasoning in this sense, I think it almost has to have an element of synthesis, of creation of new ideas. So things like AlphaZero: once it's trained, it's not really reasoning anymore. It's just a very efficient search, a very efficient tree search. I think in the background of those algorithms there actually is a tree search that's just optimized, and the board state evaluation is simplified using a CNN or something like that; I don't remember the details. But to get to this idea of reasoning, it may not necessarily have to be iterative, but I think it almost has to be combinatorial. That's quite an interesting assertion. So Chomsky thought that intelligence was about synthesizing ideas? Don't hold me to that. Well, if you look at evolution, or an evolutionary algorithm, although I think evolutionary algorithms are actually quite convergent, but if you look at real evolution, that's doing what you said. It's combining concepts in a new way to synthesize something new. And that's super interesting, because you were saying that reasoning is deductive, and that seems like a contradiction with something that could synthesize something new. Yeah, I wouldn't necessarily say it's strictly deductive. I'd say there is probably an inductive element there, to say, hey, this thing may apply to a new situation where I haven't necessarily seen it before. But again, I'm getting into territory where I'm starting to feel a little foolish, because I'm anthropomorphizing an algorithm at this point. This is the point where I normally get very cautious and start warning business people: hey, don't think that this is a thinking object. It's a bunch of maths. But if we are going to think of it as a bunch of maths, thinking of it as function composition, but function composition with respect to some input, maybe that's a useful paradigm for thinking about this. And I still haven't read Chollet's work, everyone, so I know I'm letting the team down here. Oh, that's all right. I was just thinking, maybe we should start up a little drinking game where every time Chomsky's name is mentioned, we have to take a sip of something. I'm happy to leave him out of it. It just came up in the paper. I didn't bring this up. That was their motivation. Don't blame me. No, no, no. I hear what you're both saying. I don't want to beat a dead horse here. Am I allowed to say that? Is that offensive in any way? I don't know, but I don't want to keep repeating myself.
It's just that what worries me is that as we start to formalize what reasoning is, whatever it comes out to be, let's just say one day we decide lambda calculus is reasoning, or second-order logic is reasoning. It's like a throwback to the expert system days: if we start with some statements and can deduce new statements, then that's reasoning. My worry is that as we formalize that, we may eventually decide that nothing is intelligent anymore. Because maybe after we learn more about expert systems or lambda calculus or whatnot, people will lose their fascination with it and say, yeah, well, that's not actually reasoning. And eventually we're just going to find out that humans aren't actually intelligent either, and that we're not actually reasoning. Well, there's a song about that: you'd better hope there's intelligent life somewhere out there, because there's bugger all of it down here on earth. On that point as well, you can start with humans. We know humans are intelligent. And Sam Harris has some interesting thought experiments on this. You can subdivide consciousness many ways; you can separate the hemispheres of the brain and they're independently conscious. And when you move over to neural networks, well, clearly they're not conscious. And there's a problem in philosophy: at what point do you become bald, or at what point does a pile of sand become a pile of sand? Because if you just keep adding grains and grains, at some point it'll become a pile of sand. And similarly, at some point there will be intelligence and there will be consciousness, which is a bit of an offshoot; consciousness seems to be something completely separate from intelligence. But surely this is just an argument for a fuzzy set, because there are degrees of being a pile of sand. It doesn't have to be a pile of sand one moment and not a pile of sand the next. We're working in reverse here, obviously. I don't know, I think that's a spurious argument. At what point does consciousness emerge? At what point does intelligence emerge? Well, maybe it's just that the more neurons you stack together, the more you get something that looks like an intelligent system. If we think of it purely in terms of the neurons, then we're going to back ourselves into this corner where it's just a matter of throwing enough compute at the thing. Yes. So look, Tim, maybe let's agree that humans are intelligent. I'm happy with that. And then if we just say that, it's part of our definition that whatever's going on up in our brain, we're going to say that's intelligent. What I want to make sure is that it's consistent, in the sense that if we start to find machines that do subsets of that, we don't decide those are not intelligent, even though they're doing the same thing that we do, at least in part. And I completely agree with Alex, which is: hey, let's just put a number on it. Let's call it IQ, for example. And we'll just say that AlphaZero in the chess domain has an IQ of this, whereas Garry Kasparov has an IQ of maybe more, not because he can beat AlphaZero at standard chess, but because if you had him play Fischer random or something like that, he could take those ideas and translate them over much more effectively than AlphaZero could. And so we can have this measure of intelligence, or AIQ, artificial intelligence quotient, whatever you want.
But we just don't want to be in a situation where we have to constantly keep changing our definitions because we find machines that, under the current definition, are intelligent, and we don't like the fact that machines are intelligent. Yeah, and that was one of the principal reasons why Chollet conceived of intelligence as an information conversion ratio: how efficiently can you turn information into something which generalizes to lots of situations? Because if you don't have something like that... Chollet's big thing is that being good at a skill means nothing, because you can buy skill: you can just create a massive neural network, pour large priors and experience into it, and it's going to do something really well. But that's rubbish, because, as you say, you play a different variant of chess and now it's completely useless, whereas Garry Kasparov would be able to reuse many of his concepts from chess. And the fact that he learned to play chess so efficiently in the first place implies that he would be able to learn a new game. And you say efficiency there again, and I'm starting to come around to the view that efficiency is a key aspect of intelligence. Because, fine, I could build this circuit with X many parameters and spend X amount of time training it, and then it can give answers in Y amount of time. But with that same physical, or that same logical, set of resources, could I have done more? I think so. I think we can definitely do far better than current neural networks, by whatever we've talked about, lots of possible enhancements. So if we include this efficiency idea as part of our measure of that AIQ, if you will, it's probably an important factor to consider. And this is why the meta-learning literature is so fascinating, especially the zero- and one-shot learning cases, where they're literally trying to learn how to generalize. It's still pretty rudimentary at this point in time, but the work coming out of there already shows promise for industrial applications: massively reducing data volumes, et cetera. And in the reinforcement learning case, learning from single episodes. That sort of area is starting to feel like where intelligence is coming from, and you're starting to see these interesting conceptualizations of how to build something that learns fast. I think that's worth keeping an eye on, for exactly this reason. The focus is no longer how do we do a task really well; it's how do we do a task well enough with zero experience, or minimal experience. And statistical procedures, we used to call this kind of stuff statistics, have always been terrible at this. Just horrifically bad. So, like most people, I went through an orthodox statistics education, always thought it was quite cookbookish and arbitrary, and never really found it that interesting. And then I was trying to analyze problems in graduate school where I was collecting photons. I was doing two-photon fluorescence microscopy, and I was collecting photons at very high scanning speeds, so for a particular voxel I may have had a small number of photons, let's say 25 or 30, that got collected. And I was trying to calculate the lifetime of the fluorophores in there. I could plot these things on a graph and see with my eye what the decay rate was, but I needed a statistical procedure to calculate it.
So I'm out there reading papers, looking for proper ways to do this, and I find a paper that, lo and behold, is: how many photons do I need to measure the lifetime of a fluorophore? And it goes through all kinds of statistics, Shannon entropy, information, whatever, and concludes: yeah, you need 10,000 photons. I'm like, crap, I've only got 25 photons, but I can see it with my eye. Why do I need 10,000 photons? And at about that same time, and this is one of these freak coincidences in life, I'm sitting in the office and my colleague over there is laughing. He's reading a book and just laughing. So I walk over: hey, Mike, what book are you reading there? He says, oh, I'm reading this book called Data Analysis: A Bayesian Tutorial, and it's showing how traditional statistics just totally breaks down in places where this other method works perfectly fine. Can I see that book for a minute? I take it, I read it, and that begins my whole exploration into probability theory as a generalization of logic, and Bayesian analysis and whatever. And I realize that we teach this really silly path where we extract so little information from the data, with so many arbitrary rules and calculations, that it just doesn't work well. I remember having these arguments with people, because Bayesian Data Analysis, third edition, is one of my bibles. And I remember chatting with all these biologists who were looking for p-values and everything. They're like, oh, the p-values tell us what we're looking for. No, let me tell you, you don't need that much data. We just need to make some assumptions over the space. And they're like, yeah, but we're going to bias our estimates. Your value is not going to lie up near infinity or negative infinity; we can say that much. It'll be somewhere within, like, plus or minus 10 standard deviations, if we assume normality. We're making the same assumptions; we're just making the search so much more efficient. And here are all the other things that you'll get. Yeah, this whole "we're going to bias our estimates" thing. What was hilarious to me is that after I started learning about Bayesian analysis, I started making all kinds of connections to the statistics I had been taught, and I finally had a mathematical foundation to understand things. Oh, to unbias our variance estimate, instead of dividing by n, we need to divide by n minus 1, or n plus 1, or whatever. Why? Where does that come from? What mathematical basis do you have for that? Actually, it corresponds to different assumptions about the prior for the variance. If you use the Jeffreys prior, then it's minus 1. If you use a constant prior, it's just plus zero. In other words, it's n plus gamma, where gamma is the power that you use in that prior. And so you have a mathematical foundation for it. It's not that orthodox statistics isn't making prior assumptions; it's just hiding them, and it's doing so arbitrarily and ignoring them. Yeah, I was just over here amen-ing this idea. It's really hard to make this argument to people, because they think they're doing the right thing. But I remember teaching some psychology students, and they were talking about, oh, but if we're making these assumptions, how do we know that we're actually finding things? Until I unpacked it: look, you're making exactly the same assumptions. In fact, you're making way stronger assumptions over here. Yeah, yeah. Make some smart assumptions.
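The divisor bookkeeping Keith is describing, written out. The first two estimators are the standard ones; the n + gamma form simply restates his claim in notation, with gamma standing for the exponent of the power-law prior (that symbol is this sketch's assumption, not a standard name):

```latex
% Standard point estimates of a Gaussian variance differ only in the divisor:
\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2,
\qquad
\hat{\sigma}^2_{\mathrm{unbiased}} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 .
% The claim above: a power-law prior on the variance shifts that divisor,
% giving estimates of the form
\hat{\sigma}^2 = \frac{1}{n+\gamma}\sum_{i=1}^{n}(x_i-\bar{x})^2 ,
% where \gamma is determined by the prior (Jeffreys, constant, ...), so each
% textbook divisor corresponds to a definite prior assumption.
```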
Make your job easy. Psychology is super noisy; you're going to be dealing with enough problems. Don't make your job any harder than it has to be. This is quite interesting, though, because Keith and I have been talking ad nauseam about how we introduce priors into the machine learning process, because statistics has strong priors about the stochastic data-generating mechanism. And the beauty of it is, if you do make those assumptions, then you have so much more parsimony. Stay tuned for when we talk about kernel methods and the choice of kernel functions. Part of the big problem is that to do probabilistic optimization, it's all integrals. And so if you're not doing Monte Carlo methods, which are going to be hideously inefficient, you need to find some cheap and effective way to even estimate those probabilistic models in the first place. And then on top of that, what does it even mean to put a prior on a neural network? You can say, okay, our weights are Gaussian, cool, we're doing something like a Gaussian process. Okay, but we want to say that we expect it to be a wide-tailed distribution. How can we even formalize that? And I guess that's probably the question you guys have been discussing. But I don't necessarily see a way forward there, because we don't really have a theoretically grounded model of how a neural network behaves at a more fundamental level than "the gradients are going to do this, the gradients are doing that." Why are they doing that? Until we get all that... Yeah, and on that, one of the issues is that a neural network is a weird substrate in which to create priors, because, again, we were talking yesterday about the dichotomy between priors and experience: you could create a 3D graphics simulator which would allow you to do spatial reasoning, so the ball can't intersect the wall, and then you could generate infinite amounts of experience and put it into the neural network. But the elephant in the room, the real problem, is that you can't create useful inductive priors in the substrate of the neural network. All of the inductive priors are crap. They're things like the CNN: things that work in this weird substrate of a planar manifold and make weird assumptions that bear absolutely no resemblance to our physical reality. There are some really interesting attempts to do this sort of stuff. Every attempt I've seen has been basically unsuccessful, but there are people experimenting with, for example, strange algebras to try to encode, say, periodicity. And what was that paper? A couple of weeks ago there was a paper about using neural networks to compress images by mapping directly from the pixel coordinate space to the pixel intensity space, and one of the tricks there was that they were using periodic activation functions. So I think it's not impossible to encode some information there, but doing so in a principled way that would correspond to the sorts of invariances we want... Like, I've been doing a lot of SLAM up until a few weeks ago, and a lot of image alignment stuff. We're doing Lie algebras; we're looking at image homographies. There's a manifold there that, if we could somehow encode it, would render the problem trivial. But even approaching that problem is the stuff of nightmares. You end up breaking more than you fix.
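The half-remembered compression paper sounds like the SIREN line of work on sinusoidal activations. Here is a toy numpy sketch of the idea being described, mapping pixel coordinates to intensities through periodic activations; the layer sizes and the initialization scale are made up for illustration, not the actual paper's scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, scale=6.0):
    """Tiny sine-activation MLP: periodic activations let the network
    represent high-frequency structure that ReLUs struggle with."""
    return [(rng.uniform(-scale, scale, (m, n)) / m, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, xy):
    """Map pixel coordinates (x, y) in [-1, 1]^2 to an RGB intensity."""
    h = xy
    for W, b in params[:-1]:
        h = np.sin(h @ W + b)      # periodic activation
    W, b = params[-1]
    return h @ W + b               # linear output layer

params = init_mlp([2, 64, 64, 3])
coords = np.stack(np.meshgrid(np.linspace(-1, 1, 8),
                              np.linspace(-1, 1, 8)), -1).reshape(-1, 2)
print(forward(params, coords).shape)   # (64, 3): one RGB triple per pixel
```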
And I want to pull back to something, Alex, that you said earlier, because as much as priors are talked about and attacked and whatever in the, let's say, Bayesian space, they're actually not the biggest part of the problem. The biggest problem we face is what you brought up, which is marginalization: calculating these integrals. For example, it's very easy to derive the math that you need to perform to get the answer you want under a model and a set of priors, and the priors aren't actually that important in a lot of cases. But to get the parsimonious answer that you want, to do model selection, for example, to actually figure out whether I should have five parameters or seven or six, which is actually more likely given the data, you have to do these multi-dimensional integrals. And we just don't have a good way of doing that. So we know what we're supposed to be doing; it's just too hard to do computationally. So instead we run off to approximate techniques or whatever, or, to make ourselves feel better, some people just say, oh no, we shouldn't be doing marginals, that's all nonsense, that's subjective, blah, blah, blah. It's actually not. Those are just excuses to make up for the fact that we can't do what we know we should be doing. Yeah. Just on that, a classic example of marginalization: we were talking about interpretability methods, right? And there's the partial dependence plot. That basically marginalizes over all of the things that you're not interested in, so you can just focus on one of the features. And what it's basically doing is extrapolating across all of the other features. So it's stupid, right? There might be lots of complex interactions between those features which the method is ignoring. So marginalization isn't necessarily a good thing. Oh no, marginalization, if it's mathematically demanded, is exactly what should be done, and it's a good thing to do. What we're doing right now with neural networks is constantly doing maximum likelihood, right? We're just creating the network that happens to be the single point in this complex geometry that gives us the optimal value. Are you telling me that all the other information surrounding that point is not useful? That's absurd, if there's a complex geometry there. And by the way, I wasn't saying you, Tim, were telling me that; it's people in general telling us to just ignore everything except the perfectly maximum-likelihood optimal solution, whether it's neural networks or any model, it doesn't matter. What the math tells you you really need to do is explore that landscape. You have to do some averaging over it. You have to combine that with the sort of information complexity of the parameters themselves that you're optimizing. All of that has to be taken into account, and we're just not doing it, because it's mathematically intractable. On that, one of the first tastes I had of Bayesian deep learning was Yarin Gal's kind of magnum opus, where he was putting together a Bayesian theory of dropout, which does exactly what you're talking about: instead of looking at a single point, we do multiple dropout passes through our network, which in essence kind of samples from it, bootstraps our model.
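Mechanically, those "multiple dropout passes" look like the following. A minimal sketch with a hypothetical two-layer toy network of my own construction, not Gal's actual derivation: keep dropout on at test time, run many stochastic forward passes, and read the spread of the outputs as an approximate uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, weights, p_drop=0.1, n_samples=100):
    """Monte Carlo dropout: dropout stays ON at prediction time, and each
    stochastic forward pass is treated as a draw from an approximate
    posterior over networks."""
    W1, W2 = weights
    preds = []
    for _ in range(n_samples):
        h = np.maximum(0, x @ W1)                 # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop      # Bernoulli dropout mask
        h = h * mask / (1 - p_drop)               # inverted dropout scaling
        preds.append(h @ W2)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)  # predictive mean and spread

# Toy 4 -> 32 -> 1 network with random weights, single input row.
weights = (rng.normal(size=(4, 32)) * 0.5, rng.normal(size=(32, 1)) * 0.5)
mean, std = mc_dropout_predict(rng.normal(size=(1, 4)), weights)
print(mean, std)
```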
It's not a sample from the true posterior, but it can be treated like a sample from a distribution. So we sample from this model multiple times, average over the outputs, and we get fairly principled uncertainty estimates, et cetera. So it's not completely intractable; it's just that these approximations often aren't good. They don't accurately reflect the true uncertainty in the parameters, because they're not designed to. It's a hack. And it's an approximation, right. The actual marginal you would really have to compute would be some mixed-integer, multi-dimensional integral. I had a friend doing Hamiltonian Monte Carlo over entire neural networks, and, you know, I used to laugh at him. Isn't this the fundamental problem in machine learning in general? As you're saying, we do this maximum likelihood estimation, and what we really want to know is the probability of everything else, and then we want to normalize by that, because even in Yann LeCun's energy-based models he uses this Gibbs distribution. Yeah, so even within a certain dimensionality, let's say n parameters, even if we decided that we're going to use maximum likelihood within those n parameters, okay, that's a bit of a problem. The bigger problem is: how are we comparing that n-parameter space against the n-plus-one-parameter space and the n-minus-one-parameter space? There are all kinds of hacks to do it. For example, in other contexts, AIC, the Akaike information criterion, is a complete, total hack. BIC is a complete, total hack. If you're lucky enough to find the CIC paper written by Carlos Rodriguez, you'll get a much better explanation, at least something that starts to approximate what you should be doing, because he calculates a curvature metric at the maximum likelihood point that gets penalized, and it's closer to what the marginals would actually give you. But again, that's at a single point in the space, and there are multiple optima in there; it's a far more complex thing. Do you think the models would be significantly better, though, if we could compute this? Oh yeah, I have no doubt they would, because I know that in lower-dimensional spaces, where you can do it, and where I have done it in the past for real-world problems, the performance just obliterates the hacks. It's so superior. But on this MLE thing, essentially you would just use it to compute a better probability. It's the distribution of potential model parameters: with MLE you're just finding the single most likely parameter value, but there are multiple parameter values that can give rise to your model, especially if there are hidden covariance structures or something. If you throw away that information, you end up with a brittle model that will not at all reflect the uncertainty over those parameters. In particular, imagine you've got two highly correlated parameters out of, say, 10. It may be stable enough to converge, but you'd be very hard-pressed to tell which one is actually contributing to your solution, and so the uncertainty should really be shared over those two parameters. Sorry, Alex. What Bayesian model selection, which is what requires the marginalization calculations, really buys you is parsimony.
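For reference, the integral everyone is groaning about here is the standard marginal likelihood, the quantity that AIC and BIC crudely approximate:

```latex
% The marginal likelihood ("evidence") of model M: integrate out all
% parameters \theta to get one number per model,
p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta ,
% then compare models via Bayes' rule,
p(M \mid D) \propto p(D \mid M)\, p(M).
% The automatic parsimony comes from the integral itself: a model with more
% parameters spreads its prior mass over a larger space (the "Occam factor"),
% so extra complexity is penalized unless it buys enough extra fit.
```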
It's the way of controlling model complexity, and that's what buys you generalization, what buys you accuracy outside of the training set. You would find that models perform far better out of sample if we actually controlled their complexity using proper, theoretically sound complexity controls rather than arbitrary things. LASSO and, in the linear space, dropout layers in ML, these are all just hacks that are not even remotely close. But it sounds like a contradiction, though, because you were saying earlier that neural networks have this explosion of parameters, they're exploding in complexity, and then that's what you're describing here for the Bayesian models, because you're saying, no, this controls that. If we could have this parameter space where we have every possible result for every combination of parameters, then it would generalize better. Isn't that the same as saying that neural networks are too big? No, what I'm saying is that this Bayesian methodology, these marginal model selection criteria, actually penalizes you very heavily for introducing more and more complexity. So it's what would keep the reins on this complexity in a theoretically sound way, and I just know that in lower-dimensional cases, where you can actually compute it, it works wonders. I might be clutching at straws here, but the model itself might be parsimonious, and yet you still need to compute that integral, and you need to have all of that information to have something that will give you the generalization you're talking about. So forget about the model: actually holding in memory that integral is what gives you what you're proposing. Yeah, but it's only in the training phase that this happens. But even then, if you compute the integral, it will give you a very complex multi-dimensional landscape, and you need to do something with that landscape. Yeah, actually, computing the integral reduces the dimensionality. That's what it does: you're integrating over many dimensions to come down to basically one number, which is the probability of the model. So what you would end up with is a probability distribution over the possible neural networks, and the ones that are more complex would be heavily penalized, so you would end up selecting as your model a much simpler neural network, and that would have much better generalization accuracy, better parsimony, et cetera. So the analogy here, just to translate it into conventional terms, would be: instead of doing grid search over the number of layers or the architecture or whatever you want, that just comes for free as part of the process. Assuming we could even formulate this, and we can't, and we definitely can't compute it, you could say the most likely model that gives us the optimal solution is a ResNet-18 with the following parameters, because you can actually take the gradients back to the hyperparameters themselves and figure out what the likelihood of your hyperparameters is. Sorry, but didn't you just say before that maximum likelihood estimation is exactly what you've just described? That is, saying I'm going to find all of the different possible parameters and I'm going to see what the results are.
It's the difference between a normal distribution and the mean of a normal distribution, right? You swap out your thinking: instead of thinking of fixed parameter values, you think of every parameter value in your entire modeling landscape as a little distribution, and you are setting the mean and variance there, sure. It's maximum likelihood on a far reduced space, because you do all these marginalizations to strip away every dimension down to a single number, which is the probability of the model, and then, sure, at that point I'm selecting the most likely model. And even doing that, by the way, is a hack, because really I should be using all the models simultaneously, weighted by their assigned probability, and I can't even imagine how complicated that problem is. But I think the real key takeaway here is just that we know how to properly constrain complexity mathematically; we just can't calculate it. It's kind of like GR: we know the equations, but it's really hard to calculate solutions even in ideal setups, let alone in an actual solar system or something. Score one for rabbit holes. Yeah, I'm glad we took a detour into Bayesland, because I think the Bayesian approach gets overlooked and it's super powerful. And who cares if we can't calculate anything? It allows us to understand our hacks better, at least. It's fine to do the wrong thing as long as you know you're doing the wrong thing. Sure, yeah. And it's good to keep an eye on what the right thing is, because then, as we as a civilization contemplate and cogitate on it over time, somebody out there will find some better tricks to get us a step closer to the right thing. I used to hang out with a bunch of mathematicians, and that's basically where I learned everything that I know. And I remember annoying them one day, coming in and being like, hey, if only we could define a group operation that's like a derivative, but, you know, works for integrals, all this nonsense about trying to figure out a way to make differentiation the hard operation and integrals really simple. And all these guys are working on Monte Carlo integrators, and they just look at me like, what do you think we do all day? I felt terribly ashamed in that moment. We learn through our mistakes. Yeah, I want to come back to one thing before we move on, which is this, let's call it AIQ, artificial IQ measure, right? There's one thing that annoys me about a lot of the AI doomsday thinkers, the ones who say we're going to be killed off by machines: they assume that AIQ can go to infinity, and that infinite AIQ can somehow warp the laws of nature. It's like watching the movie Dark City, where they've developed this technology where they can control reality with their minds. You're the only other person I know who's seen that movie. Oh dude, that movie. That predates The Matrix, by the way, for people out there, and presents a very interesting take on a very similar idea. I don't want to spoil it for anybody, but I highly recommend that movie, Dark City. I think it's phenomenal, and it provokes deep thought. It's worth noting that if you're watching that movie, at the beginning you're going to think it's very odd; stick with it, there's a reason. And after you watch the movie, watch the director's commentary version if you can, because
it's fascinating how they actually use cinematography to convey this feeling of lostness and disjointedness. Yes, it's pretty cool. Very cool. But again, that's my thing: I wonder, is intelligence really infinite, in the sense that if you were smart enough you could be like, let's say, the Mule in Isaac Asimov's Foundation series, where I look at you and, by wiggling my eye in a certain way, I interfere with your mind and cause you to do what I tell you to do? I just don't know that that's really the case. Even if machines can become smarter than us, fine, I think there's a limit to how much intelligence can rule the laws of physics, and so I'm not sure we're really going to face a doomsday. And one of my big peeves, I spend a lot of time getting grumpy about social science research, because I love social science research. Is that a field of science? If it's got "science" in the title, it's not a science, people. The psychology literature is great for this, because they claim these huge effects from small, minor interventions, the wiggling of an eyebrow, and it's just impossible. It's impossible that all these effects coexist. They're mining noise, but they don't realize they're doing it. And people keep bringing up the point that human beings are incredibly complex and incredibly noisy systems, and even though at the macro scale we have very common behaviors, in terms of individual reactions to small events it's very hard to predict how an individual human will respond unless you hit them with a very large influence. Right. Do you know Vernor Vinge? And I'm not sure I'm pronouncing his name right, but he wrote, for example, A Fire Upon the Deep, which is a very cool science fiction novel about AIs that become sort of god-like and do interesting things in the galaxy, if you will. And there's a scene in there where this malevolent AI aims a laser beam at the navigational systems of a ship and, by just modulating things in a certain way, is able to hack into it and take it over. I have read that, yeah. It's an interesting idea. If you keep vague tabs on the weird and murky world of cyber weapons, there are hints that that's not as crazy as it sounds. However, I don't think it's going to work on people, because digital systems are digital systems and they do what they're told, whereas neurons, sometimes they do what they're told and sometimes they don't. There's a resilience in the processing, and a fuzziness. Yeah, there's an interesting thing, I can't remember where I read this, it might have been in Yuval Noah Harari's book, but he said that chickens are attuned to a certain clucking noise, so you can give a hen a robot baby chicken, and as long as it makes the clucking noise, the hen will nurture that robot as if it were one of her own. It goes to show that even humans learn lots of shortcuts which might seem intelligent on the surface but are actually a far cry from it. And I wanted to quickly respond to what Keith was saying before. Again, Chollet has written some interesting stuff about this intelligence explosion. He's got this article on Medium called The Impossibility of an Intelligence Explosion, and these are the concluding statements at the end. He said: remember, intelligence is situational; there's no such thing as general intelligence; your
brain is just one piece in a broader system which includes your body, your environment, other humans, and so on. So, in a roundabout way, he's saying that there are lots of limitations to intelligence, and of course we're an expression of our environment. He says that no system exists in a vacuum; any individual intelligence will always be both defined and limited by the context of its existence, and it's our environment, not our brain, that acts as the bottleneck of our intelligence. And he said some really interesting things as well: he was skeptical about Neuralink, and he thinks that putting a Neuralink in our brain, giving us access to the internet, isn't going to make us more intelligent, because we're already pretty stupid and quite slow at understanding information as it is. He says intelligence is largely externalized; he even thinks that culture and books, for example, are forms of externalized intelligence. He doesn't like this conception of the brain as being the seat of intelligence. And he said that recursively self-improving systems, because of contingent bottlenecks, diminishing returns, and counter-reactions arising from the broader context in which they exist, cannot achieve exponential progress in practice; typically they display linear or sigmoidal improvement. He uses the case of science, I remember this: even though there's an exponentially increasing amount of research happening, if you look at the number of papers published, the actual progress is quite linear. This is a comforting argument. I don't buy into the doomsday scenario, just to put this on the record, but they're talking about our civilization, our tooling. What happens once we've got hundreds of thousands of AI agents, or millions, billions, hundreds of billions across the solar system? That kind of undercuts this argument a little bit. I don't think they're going to take over or anything like that; that's just silly doomsday alarmist stuff. But the argument that it will necessarily be non-exponential, that we're going to see linear growth: science at the moment feels pretty logarithmic, and there comes a point where, from the point of view of a logarithm, a linear growth rate starts to look pretty exponential. There's some debate about whether that kind of logarithmic appearance of progress, or maybe it's root n or whatever, is because of us humans, that we're the bottleneck; on the other hand, you can argue that it's a fundamental fact of information communication. There are different ways to take that. I wouldn't so easily dismiss the doomsday folks. I don't think it's going to happen either, but I want somebody to pose the question to them: do you think intelligence is infinite, and can infinite intelligence control the laws of physics? Because they just ignore all the realities of physics. You don't need infinite intelligence to be dangerous, and that's true. The safety-in-AI literature is good, because they're actually thinking through these problems, but in a non-alarmist way. They're thinking about the practical problems we're going to need to solve, the realistic dangers we may expect. They will invoke superintelligent agents, but they say that's a tail risk, and it's a tail risk such that, if we solve these other problems, by the time it becomes pertinent we will be well prepared to tackle those problems before they become problems.
Yeah. Cool. So, Tim, shall we attempt to talk about this interactive articles piece? Robert Lange, one of our co-conspirators, actually told us that this would be a great thing to talk about. And by the way, if we do a search for "Robert Lange machine learning", Robert has pioneered this kind of thing; this isn't an interactive article, but it is a very creative article, using animations and so on. And you might have noticed, for example, on distill.pub, if we look up Chris Olah, I always use this feature visualization example. Distill.pub is a new way of presenting information, with lots of visualizations and simulations and graphics, and it's so much better for comprehension. Some of these things are actually interactive, and there's a trade-off to doing it this way. Personally, I love being able to have information as a PDF that I can stick on my Kindle and so on. But if you think about it, some of the benefits of this new paradigm are reducing cognitive load, making systems playful, prompting self-reflection, connecting people and data, personalized reading, and so on. And I think you can actually make a lot of very complex concepts much more accessible than you otherwise would. Yeah, so the first thing I want to point out is that people have been thinking about this for a long time. Do me a favor, if you would: just on your screen there, search for Minard, M-I-N-A-R-D, that's it, Napoleon, Russia, and let's pull up this very cool graphic. One of those images down there, no, further down, yeah, click on one of those. So this is a plot he did of Napoleon's march through Russia, where they start off as this very wide band that represents the size of the fighting force, and it shows when forces break off and go elsewhere, and they shrink over time, taking losses from the weather, et cetera. By the time they get as far as they go, they're basically almost obliterated, and then in black is their retreat back. This is a beautiful visualization of data, and people have been trying to do this sort of thing for a very long time, and it's a lot of work. You have to put in tremendous amounts of human effort to generate very clear and impactful visual diagrams and interactive content. There's a comment made in the article about adaptive text, I forget what it's called, which people have been doing for a while, where you can set it to novice level and it shows you one paragraph, and you up the scale to intermediate and it shows you more advanced text. And they commented that this is a lot of work and difficulty for authors, because they have to write multiple versions of the text. It only gets more and more difficult when you start to think about creating impactful visualizations, more difficult still when it's animation, and more difficult still when you want to include music or audio effects. Yeah, even this YouTube channel, right? We're putting a lot of extra effort in to make it visual, to edit it afterwards, to make it more accessible. So you're saying that you can reduce the cognitive load and make information more accessible, but it's just more onerous to do, and scientists basically have better stuff to do; they're busy being clever boffins, so they shouldn't really have to be making their information more accessible to others. No, you're right that they shouldn't be. Maybe every great scientist needs an assistant that goes and helps edit their content and makes it interactive. I
don't know, but I'm just pointing out that part of the reason why there hasn't been a lot of progress, or why we don't see as much of this content as we would like, is that it's not easy to do well. It takes a tremendous amount of effort, and even if you're willing to devote the effort, it's often a very difficult problem. Think of even just a simple app: how many times has Facebook tweaked and optimized just that one little app? Let's move this button over here and make it five pixels smaller, and it has an effect on the way people use the thing. Actually, just a quick segue on that, because we're going to be talking about The Social Dilemma in a few weeks, and they portrayed it as very nefarious that people at Facebook are trying to manipulate us and make the product as addictive as possible. I think that was a side effect, though. You and I both know, I think we've all worked in application development for decades, it used to be called user experience design. A lot of the best web developers knew this; at Google there was a seven-second rule, and they knew that if a page wasn't responsive in a certain way, you would not stay engaged for very long. And now that's become so efficient, they're A/B testing and systematizing it to such an extreme degree that, if you were being a really cynical person, you might think they were deliberately trying to hook people on the platform. There is this technical overhead, right? Just to recenter, I will call back to what we were talking about two seconds ago, about scientists not having time to do this. Under the current model, and this was also brought up in the paper, under the current incentive structure for researchers, there's no reason to do this. It's better to turn out more publications than to have a good, impactful website where people can truly understand your ideas. And I think part of the way forward here, in an abstract sense, is to embrace the open science model: instead of thinking in terms of peer review, which, God bless, the current epidemic has shown to completely break down when put under pressure, think of things like open post-publication review. You put out a paper, people can criticize it, and you can improve the work, because it's not about a publication. Ultimately, science isn't about publications; it's about building knowledge, and the more people can criticize, contribute, and build on an idea, the better. And, human nature being what it is, these are all lovely statements and the actual implementation would be different. But instead of just putting these papers out on arXiv or something like that, imagine developing these sort of interactive pages on sites like Distill and having them open to collaboration. You might be a student working on a problem, channeling all your effort into it, but no one's understanding you; you could pair up with someone who can build a visualization, and they get partial credit for the contribution of communicating the idea. Someone finds a flaw in your theorem, and they contribute a solution. I don't know exactly how it would work; it might look like Wikipedia, but different. But thinking beyond the model of standard publication, in fact thinking beyond the model of standard authorship, may be the way to address that cognitive overload. Yeah, that's really interesting, because the article talks about research debt. It calls it out and says that one of the problems is that we lose perspective of the bigger picture, because
That lack of understanding can cause a kind of convergence. There’s also the credit assignment problem, and there’s this issue, I don’t know if you heard about it in Europe, that the whole peer review system is basically broken: the second reviewer can kill your work, and there’s no incentive there, because there’s no reputation on the line, it’s an anonymous system, and usually they can figure out what institution you work for beforehand anyway. It all links into this bigger picture that there’s no real incentive to have an open system. What if arXiv, for example, had public comments on your work? At the moment there’s OpenReview, where you can have some kind of real collaboration, but as you just suggested, what if research was a continuous thing? What if folks could just add to your work and it just went on and on?

Well, I think it almost has to start working like that, because arXiv Sanity exists for a reason: there are thousands of papers on machine learning released every day, and most of them will never be read by anyone. Talking about informational efficiency, there’s a reason the gains are slowing down: the space is saturated. But if you could find a way of doing credit assignment, and again, this is hard, crediting people for contributing to a larger project in an open and transparent way, where disagreements could be resolved without someone deleting the entire body of work, then the people who are strong coders or engineers could work on one side, and the people who know how to communicate ideas through the written word or through visualizations would play their role too. It may be an idealized system, but given the technology we have, how much of it is just a failure of imagination that we’re not already doing this?

Yeah, I was just about to point that out: we almost have a system that supports that now. We have the internet, we have multimedia, and HTML5 can do any manner of animations and multimedia content. I don’t know if the toolset is that friendly to use or not, but what we’re asking for is basically the internet, maybe with a little bit of modification. It’s not a hard idea, and a lot of these big research bottlenecks aren’t unsolvable.

It’s funny you say we need the internet, because I agree that the internet is the fundamental democratizing force, and we have HTML and all of this. The article says that even the United Nations want to protect the status of the internet from interference, because they think it’s a fundamental human right. It’s going to be a nature preserve at some point; we’ll have to make it a World Heritage Site. So hang on, you’re the American here, you’re meant to be against socializing public assets! What I am for is humor at almost any cost, even at my own personal cost; that’s why I make comments like that.

Exactly. But I think we are, to a certain extent, wasting the modern web, because it’s interactive and most of the content out there isn’t. A lot of that is because we consume via multiple channels and multiple devices, so it’s become a bit lowest-common-denominator. The article said that places like WordPress and Medium are optimizing for social networking as opposed to the interactivity of content.
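To make that “wasting the modern web” point concrete, here is a minimal sketch of the kind of reader-driven content HTML5 already supports: a slider that redraws a little figure as you drag it. The element ids and the sine-wave “figure” are invented for this example; it assumes a page containing an `<input id="freq" type="range">` and a `<canvas id="plot">`.

```typescript
// Minimal interactive "figure": redraw a sine curve whenever the
// reader drags the slider. The ids below are assumptions of this sketch.
const slider = document.getElementById("freq") as HTMLInputElement;
const canvas = document.getElementById("plot") as HTMLCanvasElement;
const ctx = canvas.getContext("2d")!;

function draw(freq: number): void {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  ctx.beginPath();
  for (let x = 0; x < canvas.width; x++) {
    const y = canvas.height / 2 + 40 * Math.sin((freq * x * Math.PI) / 180);
    if (x === 0) ctx.moveTo(x, y);
    else ctx.lineTo(x, y);
  }
  ctx.stroke();
}

// Re-render on every input event so the reader sees the effect live.
slider.addEventListener("input", () => draw(Number(slider.value)));
draw(Number(slider.value));
```

Nothing here is exotic; the gap the conversation is pointing at is one of incentives and tooling, not of capability.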
But I want to quickly point out an example, because a lot of that is down to the audience: I can’t tell you how many times I’ve been reading a white paper, looking through various diagrams and whatnot, and just thinking to myself, can you just show me the system of equations, please? I’ve already developed the ability to look at those equations; I have the abstractions, whatever is in my neural net, to take that input efficiently, and I don’t want to have to go through all the hand-holding, whereas another audience might get a lot of value out of it.

I will raise my hand and say I’m that other audience. I’m self-taught in maths, and especially when I’m reading physics papers, it’s: what, you’ve got an E here, what’s the E? Is this an expectation, is this an operator, is this the energy function, what’s going on here? But within the paper itself, the complexity slider was a great idea, and I love the idea, attributed to someone, of annotations for your equations that you can toggle on and off. That was really perfect. Yeah, you’re reading a paper in an unfamiliar domain, switch on the subtitles, and you can understand it a little more easily.

Yeah, and that’s an important point you bring up: to some degree, the more we can make it accessible, without going too far, the more we can enable, I would hope, greater achievement. I would say, though, it’s a double-edged sword, because I read a couple of comments in the paper that gave me pause. I want to read you this quote: “unit visualizations have also been used to evoke empathy in readers”, in other words, covering grim topics such as gun deaths. And he goes on to say something else which I can’t quite put my finger on right now... oh yeah, he suggests anthropomorphizing data, borrowing journalistic and rhetorical techniques to create novel designs or interventions to foster empathy in readers when viewing visualizations. And I’m sitting here thinking: okay, if I’m doing science, and I’m trying to apply logic and science to find things that are real and true, do I want to be invoking emotion in people? They were talking about how it can be used for various things, education, research, et cetera, and then they said policy, and my concern is: okay, we’re going to use empathy to make a more convincing argument, and now we’re going to apply it to the real world via policy. Who’s got the bigger budgets to present these better arguments?

And this is what happens anyway, because as you say, there’s one thing which is dissemination of information and making it as comprehensible as possible, but then there’s persuasion, and a lot of this evoking-empathy stuff is just like marketing. When you design applications or presentations and you talk to creative types, that’s what it’s all about: creating a story and evoking emotion. If you look at the Venn diagram, a lot of this is very much in that domain.

Yeah, and so, like a lot of tools, this can be used for good or for ill. I think using it to make information more accessible, to help people actually understand it from a factual point of view, those are all very noble pursuits. Using it to yet further tweak people’s emotions, I think, is not so great; that’s just going to make things worse, and there was another comment in the paper that made me realize that. Oh yeah, one of the things he brings up at the very top: personalized reading, letting readers choose the content that is relevant to their own experience.
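That equation-annotation idea from a moment ago is the benign face of personalized reading, and it is technically simple. A minimal sketch, assuming an invented markup convention where every annotated symbol carries a hidden `<span class="note">` explanation and the page has a checkbox with id `subtitles`:

```typescript
// "Subtitles" for notation: toggle every inline annotation at once.
// The .note spans and the #subtitles checkbox are assumed markup,
// invented for this sketch.
const subtitles = document.getElementById("subtitles") as HTMLInputElement;

subtitles.addEventListener("change", () => {
  document.querySelectorAll<HTMLElement>(".note").forEach(note => {
    // Show "E = the expectation over the data distribution"-style
    // explanations only when the reader opts in.
    note.style.display = subtitles.checked ? "inline" : "none";
  });
});
```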
Okay, if we’re talking about intellectual experience, like we were saying earlier, add annotations to this equation and help me understand it: wonderful. If, on the other hand, we’re talking about my demographic, I don’t know; that’s just going to further inflame and polarize people. That’s an algorithmic feedback loop based on recommendations, and we know this is a problem on the internet. It’s quite dangerous, because that’s what social media does now: you see the reality that you want to see. And this lived-experience thing is very dangerous, because, the French philosophers, there’s this Foucauldian idea that your reality is constructed from language, and that’s quite a dangerous road to go down.

But I wanted to quickly demonstrate one other thing on here. There’s a wonderful example called A Visual Introduction to Machine Learning, and this style is apparently called scrollytelling, or data journalism, or something. Scrollytelling has apparently been demonstrated to be no better than slideware, but this is an example of a wonderful, intuitive experience teaching machine learning concepts. And I want to touch on what Keith was saying earlier, because Keith was basically asking, not in a snobbish way, whether this is about making it more accessible to people who wouldn’t otherwise understand. I don’t think it’s necessarily about making it more accessible to people lower down the totem pole. If you look at Yannic’s channel, for example, he’s making machine learning concepts more accessible to people who are studying a PhD in the field. So it’s not that it can’t be used for sophisticated folks; it’s just that it hasn’t been used that way yet.

Yeah, look, accessibility is not about lower or higher to me, it’s just about breadth, and I totally believe people can contribute to fields from anywhere. I think it is an efficiency thing, which is: if I’m a scientist and I communicate effectively to another scientist, society gains x amount of forward progress in science; if I communicate efficiently to, say, a layman who just likes it as a hobby and thinks it’s cool, do I make any progress towards science? So it’s just about where we invest our resources. And if we could ever get into the state Alex is talking about, where it’s more open and crowd-sourced, and yet people are still motivated to contribute, i.e. sort of the internet, but with maybe some enhancements, modifications, better tooling, some cryptocurrency that tracks how much you contribute... I was waiting for blockchain to come up; it always has to. Then that’s all great.
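Since scrollytelling came up: the core mechanism behind pieces like A Visual Introduction to Machine Learning is just reacting to prose scrolling into view. A minimal sketch, where the `.step` sections and the `updateGraphic` stub are both assumptions of this example rather than anything from the actual site:

```typescript
// Scrollytelling skeleton: when a prose step scrolls into view,
// update the graphic that sits beside it. Stand-in implementation:
function updateGraphic(step: string): void {
  console.log(`render state for step: ${step}`); // real code would redraw
}

const observer = new IntersectionObserver(
  entries => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        // Each <section class="step" data-step="..."> names its state.
        updateGraphic((entry.target as HTMLElement).dataset.step ?? "");
      }
    }
  },
  { threshold: 0.5 } // trigger once half of the step is visible
);

document.querySelectorAll(".step").forEach(el => observer.observe(el));
```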
And by the way, when you were scrolling through that scrollytelling piece, I wanted to point out, for those who maybe weren’t born yet: back up to, let’s say, the mid-to-late 90s, when HTML was brand new and cool and everybody was exploring what the web could do. There were innumerable websites that tried to do really cool visual layouts like that, Flash sorts of things, and I spent lots of time perusing them because I thought, these are neat, wow, look at that. I even made some websites where I tried to do cool visualizations. Then you fast-forward a couple of decades, and the natural selection, the Darwinian evolution of the internet, is that we’ve streamlined it and simplified it down to WordPress. So if you’re a Darwinian, you might just believe that what we’re arriving at right now is the efficient way of presenting information for the population as a whole, because we had all those tricks and cool things back then, and people stopped doing them because we weren’t getting as much value out of them.

Well, the internet was an incredibly exciting time, because in 1999, me and a guy called Thomas Bradley ran a website called dhtmlcentral.com, and we were at the forefront of doing exactly that; many of those guys went on to become very senior people at Google in the subsequent years. I probably visited your site and studied it for a tutorial. You probably did! The problem was, even then it was difficult, and it’s still difficult now: coding is difficult, it hasn’t got easier. And as you say, on the balance of things it’s much more efficient, especially given that we’ve got certain distribution mechanisms over multiple channels, just to use text and to focus on the social. So where is the place for this stuff?

To expand on that a little further: it’s not just that it’s more efficient, it’s that bad writing is far more forgiving than bad code. If you’ve got badly coded or badly supported code, you have unusable information, whereas if you’ve got something that’s poorly slapped together, poorly phrased, that’s at least salvageable; you can do something with it.

Maybe this is where Neuralink comes in, because we keep coming back to this: plain natural language, well written in a way that creates visuals within your mind, storytelling, is the most efficient thing human beings communicate with right now. That’s the problem. We need to hook up some wires in there so we can just input some more efficient form of media, and then we won’t even need interactive websites, because it’ll just be an interactive wire. I know, but Chollet said that’s a load of rubbish, because if you increase the reading speed, the cognition goes down; there are hard limitations. I know we debated the other week whether maybe you could represent it in a different way, but I think the brain is much slower than we realize. Yeah, the meatspace bottleneck is I/O. Not even I/O, it’s comprehending the information; we can already read information far faster than we can comprehend it. Yeah, that’s fair, that’s fair.
Maybe it’s slow because what it’s doing is actually quite difficult: it’s creating these concepts, these ideas, these structures, all the things we keep saying we would love neural networks to do and that they utterly fail at doing. Our brain is doing that, and it’s a difficult computational problem; because the brain is an efficiently designed temporal, parallel computation device, it just takes a certain amount of time to calculate and organize that stuff.

And on that note, gentlemen, it’s been an absolute pleasure. Likewise, thank you. And thank you for joining us so late in your evening, Alex, really appreciate it. It’s a lovely evening; I had a good excuse to sit up and have a nice cup of tea, so why not. What time is it there, Alex? 12:30, not too late. Oh my god, oh my god, and you’ve got another one with us tomorrow! Yeah, but that one’s easy, that’s 5 p.m., I can do that one. I’m at my best at midnight; this is my prime coding time. Well-written code, I hope? Oh, you know, minimal bug fixing tomorrow morning, let’s say. Well, Tony Hoare has this brilliant quote: there are two ways to do software design, you either make it so simple there are obviously no errors, or you make it so complicated there are no obvious errors. Bug or feature? You decide. Oh, that is brilliant. I think we need to have that as the intro to the show. See you, guys, thank you.