Numerics of ML 6 — Solving Ordinary Differential Equations — Nathanael Bosch
Hi everyone, I think we can start. Welcome to today’s lecture in the Numerics of Machine Learning; we’re in lecture six. My name is Nathanael, I’m also a PhD student with Philipp, and I prepared this lecture together with Jonathan. We’re in lecture six, but we’re also in the second lecture of our block on simulation. Last week Jonathan taught us about state-space models: he provided us with a language to describe the evolution of a state in a probabilistic way, and to describe observations that depend on the state, also in a probabilistic way, and then gave us a set of tools to estimate states from the observations. We talked about that in a very general sense, but we also looked at a practical algorithm, the so-called Kalman filter and Kalman smoother, or Rauch–Tung–Striebel smoother, which solves this problem exactly in the linear Gaussian case. We then extended it to the non-linear case and got an approximate inference algorithm which is very fast and works well in many situations. Today we do a similar thing, but we talk about it in a very different way. We also talk about states, we also talk about dynamical systems, but we use a very different language, because for the most part things will not be probabilistic. We assume that we have a way to describe the evolution of a state in a completely exact, deterministic way, but even then things are complicated enough, and we do need to approximate the simulation. So while before we did approximate inference, this time we have to approximate the simulation, because doing it perfectly is just not feasible. The language that we use is ODEs, or ordinary differential equations, and the algorithms that we use are numerical ODE solvers.
We start with a small intro to ODEs, especially from the perspective of machine learning, and I give some examples of where ODEs appear in machine learning. Then, in the main part, we talk about numerical solvers from a constructive view and look at some examples and their differences and properties. In the last half hour we have a block on parameter inference, where it’s not really a simulation problem anymore; it’s rather that we want to learn about the system that generates the simulation, and how that fits together we will see in the last part. Before I go to the examples, let me define an ordinary differential equation. An ordinary differential equation, or ODE, is this thing here in the middle: ẋ(t) = f(x(t), t). Here x is the unknown function. It has a similar meaning as the state from last week: it’s a thing that evolves over time, and it is a function, so it is continuous; x is defined for all t in some time domain. I also use t and talk about time because that’s the easiest interpretation of an ODE. x really has a one-dimensional input, only a scalar, and evolves along this one dimension; that’s also why it’s called an ordinary differential equation. Then f, on the right-hand side, is called the vector field, and it allows us to describe the derivative of x given x. Typically we want to solve the ODE on some time domain, not on the whole real line, but starting from zero up to some time T. So in general, an ODE describes the derivative of the function, but not the function itself, and the question is: what is x? This thing only implicitly describes x, so we have to do some work in order to find x.
Now to the examples where ODEs appear in machine learning. The first one is, I guess, super well known by now: everyone who at least has a Twitter account knows about diffusion models, and even in the first lecture we had a lot of images, I think, that were generated by one. Diffusion models have been quite hyped for the whole year, and they fundamentally rely on differential equations. I’m cheating a little bit with this example, because for the biggest part stochastic differential equations, or SDEs, do the main work, and if you really want to get into those models you need to understand SDEs. But at the same time, in order to understand SDEs you need to understand ODEs first, because SDEs are just strictly more complicated. We will not talk about SDEs in this lecture, nor the next, but ODEs are important in order to get there. Also, one detail: there are algorithms that perform the mapping from noise to data that use ODEs. There are papers published last year that make this transformation by numerically solving an ODE with some Runge-Kutta solver. This mapping from noise to data has also been done before diffusion models, because people cared about modeling complex distributions, like the one here on the right. I mean, there is a distribution of images that are natural, and to model that you need to describe a probability distribution, which is complicated to work with, whereas the thing on the right is just i.i.d. Gaussian noise, which is easy to work with. And this mapping from distributions that are easy to work with to distributions that are complicated to work with is something that normalizing flows do; a good part of the work on normalizing flows, maybe half of it, does this in a continuous way by using ODEs again.
Then you might have heard about neural ODEs. This relates to the two points before, but there was a nice paper published in 2018 that sparked quite a lot of hype in that field. It reinterpreted residual neural networks as discretizations of a more continuous thing: instead of thinking of a residual neural network layer by layer, you think of it as a continuous transformation that is then discretized by a numerical algorithm, and one specific algorithm leads to the thing that we know. In general we get a continuous formulation, and that again sparked a whole lot of research; there are still many papers being written on neural ODEs, and I would also argue that it accelerated the two points above, because the diffusion model work often also relates to neural ODEs. To give something completely different, in case you are more interested in optimization: it turns out that in order to understand what happens in gradient descent, you can do something similar to what I just presented for neural ODEs, and reinterpret gradient descent as a discretization of a continuous object. This continuous thing is called a gradient flow, and it’s much easier to write a theorem about what the gradient flow does than about what the discretized thing does. I took this graphic from a very nice blog post by Francis Bach, who does a lot of great work on optimization; he has posts on gradient flows and also extends this to SDEs and stochastic gradient descent.
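As a small illustration of that last point (my own sketch, not from the lecture slides): gradient descent with learning rate h is exactly a forward Euler discretization of the gradient flow ẋ(t) = −∇f(x(t)), here shown for the toy objective f(x) = x²/2.

```python
# Gradient descent as forward Euler on the gradient flow x'(t) = -f'(x(t)).
# Toy example (an assumption for illustration): f(x) = x^2 / 2, so f'(x) = x.

def grad_f(x):
    return x  # derivative of f(x) = x^2 / 2

def gradient_descent(x0, lr, steps):
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)  # classic gradient descent update
    return x

def forward_euler_gradient_flow(x0, h, steps):
    x = x0
    for _ in range(steps):
        x = x + h * (-grad_f(x))  # Euler step on x' = -f'(x)
    return x

# The two iterations are identical: the learning rate plays the role of the step size h.
print(gradient_descent(1.0, 0.1, 50) == forward_euler_gradient_flow(1.0, 0.1, 50))
```

The point is only the identification of learning rate with step size; everything the lecture says below about Euler’s error and stability therefore has a direct reading for gradient descent.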
The one example we will do in this lecture is parameter inference, which maybe relates more to the machine learning settings that we’re used to: we have some data and we want to learn from it, and if we know about the process that generates the data, maybe we can learn something more meaningful. In the example that we see later, we learn four lines even though our scatter plot shows only two of them, and in order to do that we need some assumptions on the structural properties of these four lines. A small spoiler for next lecture: the act of solving an ODE is itself a process of learning something, learning an unknown function in this case, and so next lecture we will interpret numerical ODE solvers as machine learning algorithms. All right, so I hope that you now know why we’re sitting here, why we’re talking about ODEs, and why I am excited about them. Coming back to the definition of an ordinary differential equation, the question for the rest, or the remaining two thirds, of the lecture is: how do we solve ODEs? How can we find x such that it satisfies this differential equation? We can write down the solution of an ODE in the following way: x(t) = x(0) + ∫₀ᵗ f(x(τ), τ) dτ. That is a true statement; if you compute the derivative, the thing on the top comes out. But it’s also not very helpful, for two reasons. First, we have an integration problem now, and integration is also non-trivial; that’s why there will be a lecture on it. Second, the integration problem is particularly annoying because x appears inside it: we kind of need all the values of x from 0 to t, but we don’t know x. So it’s a true statement, and it will motivate a lot of the algorithms that we talk about today, maybe even all of them, but it’s not yet a solution that we can actually work with. There’s one piece of insight though, which is that the solution depends
on the initial value. So the problems that we actually care about are so-called initial value problems, which are basically an ODE together with an initial value x(0) = x₀. Whenever I say ODE, I probably mean an initial value problem, with an ODE and an initial value, and when I say solving an ODE I mean solving the corresponding initial value problem. Let’s look at an example that visualizes why an ODE can be a helpful tool for modeling, and how its solutions look. It’s a very simple, one-dimensional example, the so-called logistic ODE, and we can motivate it from a biological perspective. We don’t use x here anymore, we use p, because we want to describe the population size of something, and the change in population is given by ṗ = r·p·(1 − p/K). Here r is a growth rate and p is the population, so if we ignore the part in parentheses, it basically means the population grows proportionally to its size, which makes sense: if ten people have five babies, then a hundred people probably have fifty, so the growth is larger if the population is larger. The part on the right describes an upper bound, because if p/K approaches one, then that factor becomes zero and we stop growing. So K is effectively an upper bound for the population, also called the carrying capacity, or a bottleneck. The solution of the logistic ODE is given by this formula. Today we talk about numerical solutions, so we don’t really care about how to solve ODEs by hand; this solution you can get by doing some derivations by hand, but that is explicitly not the goal today. We want strategies with which we can approach general, complicated problems with a computer, which is why I just give you the solution. You don’t really have to trust me on this, because you can compute the derivative yourself; it’s one of those nice cases where at least verifying a solution is very easy, even if coming up with it is more complicated. So yes, this function solves the ODE. We
can plot it. This also maybe visualizes once more what the problem setting is: the arrows in the background are a visualization of the vector field, because they kind of move the solution around, and then here are seven lines that solve this ODE for different initial values, with some chosen parameters K = 1 and r = 1. We get multiple solutions for different initial values; if we specify an initial value, we end up with one solution. So the goal of solving an initial value problem is basically: given the starting point down here somewhere and the arrows in the background, come up with this plot. That is the task that we are interested in. In general, please feel free to interrupt me at any point if you have questions; if anything’s unclear, we can talk about that. All right, so the main question is: how can we solve ODEs in general? If no one gives us the solution, what can we do? In most cases it’s actually impossible to give the solution in closed form, but we still want a strategy to find x, or at least find something that is close to x. For that we come back to the definition and the solution formula that we’ve seen so far. The strategy in numerical ODE solvers is basically to extrapolate step by step, which means that instead of trying to compute x(t) directly from x(0), the question is rather: if we have something at time t, how can we get to time t + h? The formula is of course exactly the same as before, so this thing is still the true solution if you have the true solution at time t, and the difficulty is also exactly the same as before: in order to do this we still need to solve the integral, which is non-trivial, and we need to handle this x inside, which is the main difficulty in this problem. The motivation, though, is that if h is small, so t and t + h are not so far away from each other, then the approximate algorithms that we come up
with for dealing with this are probably fine, compared to trying to jump ahead directly. That’s the motivation. So the main question is how we compute this quantity, and all the algorithms that we show will be different ways of computing it; as a result we get numerical ODE solvers. The first thing we will try is using Taylor series expansions, or Taylor series approximations. You’ve actually seen this last lecture, when we tried to push Gaussians through non-linear functions and used Taylor series expansions to linearize the function. In general, a Taylor series expansion takes some function g, which here for simplicity maps ℝ to ℝ just to have easier notation, and says that g(τ) = g(t₀) + g′(t₀)(τ − t₀) + ½ g″(t₀)(τ − t₀)² + …, and so on in this series representation. Note that this is exact if you actually let the sum run to infinity. What we did last lecture was basically ignore everything beyond the linear term and only look at the left part; then this thing is an approximation, and we obtain a linear function, because τ only appears with some slope in front of it. That’s the first-order Taylor series approximation. Today, for our first algorithm, we’ll do something even simpler: we will not even go to first order, we’ll do a zeroth-order Taylor series approximation, which basically means we discard everything from the linear term onwards. That is a very bad approximation, but it leads to a numerical algorithm that everyone knows and that surprisingly many people use, and we will see it in a second. So, zeroth-order Taylor series approximation. The reason we do this is that we still want to approximate the integral on top, so we do a Taylor series approximation of the thing inside the integral, the integrand, where the
relation to the notation above is that you can interpret f(x(τ), τ) as a function of τ, if you just merge f and x together, and then you’re in the notation here; this is why I used g(τ). When we do the zeroth-order Taylor series approximation around t, we approximate f(x(τ), τ) as f(x(t), t), and we have some error term. This error term comes from the fact that τ − t is at most h. When we plug this into our integrand, it means that instead of integrating a function that depends on τ, we suddenly integrate a thing that does not depend on τ anymore, so we can pull it out; we integrate from t to t + h the constant-one function, which gives just h. So we managed to find an approximation of this integral, and if we plug that into the top, we get our very first numerical ODE solver, which is the forward Euler method, and I imagine you’ve seen that before. The scheme is: you take your estimate at time t and adjust it by adding h times f evaluated at your estimate at time t, which is the local slope. It’s an explicit method because it explicitly describes how to compute x̂ at time t + h; it’s a formula, you can literally write it into your favorite programming language and compute the next point in time. All right, so that’s the very first solver, and it is actually used in some software; it appears in dynamical simulations, and I think if you look at what OpenAI Gym does under the hood, it basically does this to simulate the systems. Now let’s try to change one thing and get another algorithm: choose a different location for the zeroth-order Taylor series approximation. The motivation is that here we used the local information of the point where we are in order to extrapolate; maybe a way to get more information would be to use the slope at the point where we want to go, which basically means we do a
Taylor series approximation around τ = t + h. The math still works out, so we get the same approximation as before, only that it now says t + h, and we get the so-called backward Euler method, given by x̂(t + h) = x̂(t) + h · f(x̂(t + h), t + h). It’s again very easy to write down, but this is not an explicit equation anymore, and maybe now it’s clear why I said “explicit” earlier: this equation describes the quantity that we want, but it doesn’t explicitly tell you how to compute it. So, similar to the thing we started with, an ODE, which describes the thing we want but requires some work, here again the equation describes the thing we want, but we do have to do some work. One way to do that is to move everything to one side, so we have zero equals something, and that is a root-finding problem; we will talk about that again in the next few slides. So now we have our first two ODE solvers: forward Euler, an explicit method, and backward Euler. Let’s look at code quickly to see how easy it is to solve ODEs. It’s 13 lines of code, but most of it is there because I wanted it to be copy-pasteable; if you copy-paste this into your terminal, you actually get a plot that looks like this. The algorithm itself is basically only in line nine; all the rest is problem definition in lines three and four, importing a plotting package at the top, and specifying a step size. Maybe you’re confused about this double slash here; it’s a way to write a rational number, so it’s not a floating-point 0.1 but really one divided by ten, which makes things a bit nicer. It doesn’t matter a lot. And then here you have the explicit Euler step, and it’s really the same as I wrote it down: the new x, where I just overwrite the variable, is x
plus h times f(x, t). And it works: we get a solution of the logistic ODE that looks reasonably fine. It’s conceptually the same as before; I didn’t plot the true solution here because we will talk about error later. Backward Euler in code is also similar, in the sense that we can write it in one line, but as I said, even though we can write it down in one line, you need to do some work in order to solve it. I chose to implement it using the Roots package, which has a function find_zero that does the thing I talked about: it solves the root-finding problem for me and spits out a new x that satisfies the implicit equation. Again we get a solution that looks fine, and the code overall looks quite easy to read. A few comments about this, though. We’ve seen in the first three lectures how numerical methods appear when you try to do something with a computer, like GP regression: suddenly, when you start looking into what it actually means to invert a matrix, you open up a whole field of research. This is exactly one of those points where, again, we can start digging. Our numerical method calls a numerical method, and maybe we should think about that; we can look at what this function does. I did check out the docstring, and it turns out that find_zero is actually quite complicated: there’s a whole heuristic algorithm that decides, from a set of something like 50 different methods, what the best root-finding method is, and so on.
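The slide’s code is in Julia (it uses Roots.find_zero) and isn’t reproduced in this transcript; here is a rough Python sketch of both solvers on the logistic ODE with r = 1, K = 1. Two assumptions on my part: I replace find_zero with a simple fixed-point iteration, and I use the standard textbook closed form of the logistic solution as a reference, since the slide’s formula isn’t in the transcript.

```python
import math

r, K = 1.0, 1.0
f = lambda x, t: r * x * (1 - x / K)  # logistic vector field

def forward_euler(x0, h, n):
    xs, x, t = [x0], x0, 0.0
    for _ in range(n):
        x = x + h * f(x, t)  # explicit update
        t += h
        xs.append(x)
    return xs

def backward_euler(x0, h, n):
    xs, x, t = [x0], x0, 0.0
    for _ in range(n):
        t += h
        y = x  # solve y = x + h*f(y, t); fixed-point iteration stands in for find_zero
        for _ in range(100):
            y = x + h * f(y, t)
        x = y
        xs.append(x)
    return xs

def exact(t, x0):
    # standard closed-form solution of the logistic ODE (textbook formula)
    return K * x0 * math.exp(r * t) / (K + x0 * (math.exp(r * t) - 1.0))

h, n, x0 = 0.1, 100, 0.1
fe, be = forward_euler(x0, h, n), backward_euler(x0, h, n)
print(abs(fe[-1] - exact(n * h, x0)))  # small error at t = 10
print(abs(be[-1] - exact(n * h, x0)))
```

The fixed-point loop only converges because h·r is small here; for harder problems you would want a proper root finder, which is exactly the point made next about what find_zero hides.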
Today we’re not going to dig deeper there, but because we think about numerical methods, we are aware of this happening. For example, I know that in some packages, implicit differential equation solvers have an interface for the user to specify how this inner problem should be solved, and because implicit methods have this additional complexity, you can actually tell them to compute the inverses with conjugate gradients, for example. That is actually what people do, and if you look at documentation on how to solve stiff ODEs, you need to care about these things. But we want to get to Runge-Kutta, so we will not care about them today; you see the theme, I think. One other question in this context: this find_zero does something complicated, whereas forward Euler does something easy, so why would we want to do the complicated thing? In particular, this find_zero function is an iterative algorithm, maybe it uses a Newton iteration, and it is computed many times in order to get to the quantity that you care about. So it is more expensive, it is more annoying to code, it is more annoying to understand; there’s some extra cost involved. Why would we want to do that?
Well, the answer is that backward Euler is a different algorithm from forward Euler, and there’s one specific property that we will look at that visualizes this difference between explicit and implicit methods, which is stability. Stability is the word that pops up when you look at differences between explicit methods in general and implicit methods. There are different notions of stability, so if you open up an ODE book, this is not all there is; people cover way more than this one slide that I’ll show, but I think it visualizes explicit and implicit Euler already. Also, if you’ve heard about ODEs being stiff, then this is the context. To compare forward and backward Euler, we will look at a very simple ODE: ẋ = λx. It’s scalar, so it will also be easy to plot, and λ will be a number lower than 0; here I chose −21, and we’ll see in a second why. The true solution is this dashed line; it’s an exponential decay to 0, and it decays quite fast, as you can see, because the plot goes from 0 to 1 but it’s visually 0 already at around 0.2. Now, when we solve that ODE with forward Euler with step size 0.025, we get these dots here, with lines in between. If we zoom in, then around here you see that there’s an approximation error, as there always is, but conceptually it does the right thing: the true solution goes to 0, and forward Euler goes to 0 in the limit, so it’s fine. The question is: for which step sizes does it go to 0? We can look at that by just plugging in the vector field, so x̂(t + h) = x̂(t) + h·λ·x̂(t), which gives us a simpler formulation where we just multiply (1 + hλ) onto the previous estimate. And now the requirement becomes clear: if we want this thing to converge to 0, we suddenly know something about this
thing here in the middle: |1 + hλ| should be smaller than 1. If we check that for λ = −21 and h = 0.025, it holds. But we can also ask the opposite question: for which step sizes does it not hold anymore? I basically just have a counterexample, but we could also solve for it: if we plug in a step size of 0.1, we get 1 + 0.1·(−21) = 1 − 2.1 = −1.1, whose absolute value is greater than 1, so the condition does not hold anymore. Just from what we derived here, it should not converge to 0 anymore, and we can try that out. Of course what we wrote down is correct, and when we run the code, we see that it actually starts oscillating: at each step it overshoots, then the gradient information it gets is even steeper, it overshoots again, and so on; there’s this whole feedback loop happening, and it actually diverges. This basically means that for specific problems we need to take small steps, otherwise bad things happen, like this divergence behavior, even though the ODE is actually quite simple to solve. The goal was to compare forward Euler and backward Euler, so we can do the same math for backward Euler, and we see that when we reorder the equation, the requirement is |1/(1 − hλ)| smaller than 1. When we plug in the numbers, that means 1/3.1, which is indeed smaller than 1, and when we run the code, it actually converges to 0. You can check for yourself for which h this works, but with that negative λ, I think it should always converge to 0. That’s a fundamental difference: backward Euler has properties that forward Euler does not have, and that is why sometimes you might actually want to do the expensive, more complicated thing instead of the easy thing.
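The experiment above can be reproduced in a few lines; this is my own Python sketch of what the slides show. For this linear ODE the backward Euler update can be solved in closed form, so no root finder is needed.

```python
lam = -21.0  # x' = lam * x; the true solution decays to 0

def forward_euler_linear(x0, h, n):
    x = x0
    for _ in range(n):
        x = (1 + h * lam) * x  # explicit update: multiply by (1 + h*lam)
    return x

def backward_euler_linear(x0, h, n):
    x = x0
    for _ in range(n):
        x = x / (1 - h * lam)  # implicit update, solved in closed form
    return x

# h = 0.025: |1 + h*lam| = |1 - 0.525| = 0.475 < 1, so forward Euler decays
print(forward_euler_linear(1.0, 0.025, 40))
# h = 0.1: |1 + h*lam| = |-1.1| = 1.1 > 1, so forward Euler oscillates and diverges
print(forward_euler_linear(1.0, 0.1, 40))
# h = 0.1: |1/(1 - h*lam)| = 1/3.1 < 1, so backward Euler still decays
print(backward_euler_linear(1.0, 0.1, 40))
```

The same three numbers, 0.475, −1.1, and 1/3.1, are exactly the amplification factors derived above, so the code just confirms the algebra.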
That’s not the complete story, though, because one backward Euler iteration is more expensive, so this is not a final answer of “for this problem you should always use backward Euler”; probably, when you sit in front of a computer, the thing you care about is just how long you have to wait until the ODE is solved. But roughly speaking, when you expect your differential equation to have very steep areas, you should consider using an implicit solver. Or, phrasing it differently: when your solve diverges, maybe you have a problem with the stiffness of the problem, and then you should consider using an implicit solver. It also shows us that we talked about two algorithms, both of them super simple, and we already face the problem of which one to choose; this is one partial answer to that. In that regard it doesn’t get better, because we will now talk about a whole class of solvers. This opens up basically 150 new algorithms that you can find in software packages, but it’s also a class of solvers that is very widely used: the default solver when you use SciPy is probably a Runge-Kutta solver, a specific one that you saw in the very first lecture and that we will revisit in a few slides. The problem is still the same: we still try to extrapolate from some time t to some time t + h, and the difficulty is also still the same, which is that we need to solve an integral. That is annoying in general, but particularly annoying because inside the integral we have x.
The motivation for Runge-Kutta is now a different one. I said there are two problems, one with the integral and one with x appearing inside it; here we tackle the integral problem by numerically solving the integral. Numerical quadrature is the term for numerically solving integrals, and it typically works in the following way: when we have an integral of some function g, possibly vector-valued, with some lower and upper bound, we approximate it by discretizing in a specific way. In a specific way, because we have to choose where to evaluate the function g, which gives the so-called quadrature nodes, and when we sum the evaluations together, we can give more or less weight to different ones of them, which gives the so-called quadrature weights. You will talk about integration in more detail in three lectures; there are many ways of doing this, and you could probably fill whole semester courses on how to do quadrature, so this is really just a conceptual motivation for the kind of operation that Runge-Kutta does. Runge-Kutta methods, or the explicit Runge-Kutta methods I will talk about today, do a similar operation: we replace the integral with a sum of s terms, with specific quadrature weights and quadrature nodes. Only, as I said before, there’s this additional second difficulty: we still need to find x̂ at time t + τᵢ, because it’s not given automatically, we have to construct it somehow. And this is basically the difference between different Runge-Kutta methods: if you look at five different Runge-Kutta methods, they’ll give you five different answers to how to choose the wᵢ and τᵢ, and how to construct the x̂ at those locations. All right, formally Runge-Kutta looks like this. If we compare that to the last slide: here I used w and τ, on the next slide I use bᵢ’s and replace the whole f term with the k’s, because that is the
notation that Runge-Kutta methods use. Yes, you have a question? [Student question, partly inaudible.] Yes, that will be the very first Runge-Kutta method on those slides, so yes, you are completely right. All right, so in general, Runge-Kutta methods look like this; that is basically what I had on the last slide. When you code them up, you implement how to compute k₁, then how to compute k₂, then k₃, because k₂ depends on k₁, k₃ depends on k₁ and k₂, and so on; at the end you update x̂ by extrapolating again from x̂(t) with step size h, while summing up the different kᵢ’s with the appropriate quadrature weights. Because this is a bit tedious (I mean, this is the way you implement it, and if you write code you probably want formulas like this), there is a way to talk about Runge-Kutta methods that is a bit more compact and more readable: so-called Butcher tableaus. When we look at these equations, structurally they all look the same, and I told you that Runge-Kutta methods differ by choosing different weights bᵢ, different quadrature nodes c₂, c₃, and so on, and by constructing the x̂’s in different ways. A general way to write that down is with Butcher tableaus, because all the coefficients that need to be chosen can be packed into a table: this row here describes how to construct k₂, this row describes how to construct k₃, and so on up until kₛ, and at the end we sum everything up in the correct way in order to get the next step. If you go to Wikipedia and look up Runge-Kutta methods, you find an article with a list, and they don’t write down equations at all; it’s basically a list of Butcher tableaus. So in some sense, if someone gives you a Butcher tableau, you can
also just implement it, or even write code that handles general Butcher tableaus. All right, so let’s look at some example Runge-Kutta methods; this slide is just so you can do some pattern matching with the stuff written below, but it’s the content from the last slide. As was already said: the very first Runge-Kutta method that we can talk about, the very simplest one, is again forward Euler. So I kind of tricked you into going through all this only to show the same algorithm again, but if we just replace the f term in the extrapolation with a k₁, then this actually satisfies the equation above, and we can also write down a Butcher tableau. It’s just a not-very-special one: if we insert some zeros, we get this tableau, and if you go to the Wikipedia list of Runge-Kutta methods, this is, unsurprisingly, the first one they give. Okay, I didn’t define implicit Runge-Kutta methods, but you can imagine that they exist, and backward Euler fits the definition of an implicit Runge-Kutta method: normally there needs to be a zero here for things to be explicit, and there isn’t, so it’s an implicit method. There will not be a formal definition of implicit methods today. Was there a question? Sorry, maybe I can make an attempt, because I could imagine the question is: what if we split the integral into different terms and then just take steps in between, so instead of going from t to t + h we go to the midpoint and then to the next point, and so on? I’m still not completely sure that I got it right. But if the idea is, for example, to do an Euler step and then do a second Euler step from right there: this is conceptually not what happens.
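To make that distinction concrete with numbers (my own illustration; the explicit midpoint rule itself is introduced just below): one midpoint-rule step is not the same as two forward Euler steps of size h/2, even though both evaluate f twice.

```python
r, K = 1.0, 1.0
f = lambda x: r * x * (1 - x / K)  # logistic vector field (autonomous, so t is dropped)

def midpoint_step(x, h):
    k1 = f(x)
    k2 = f(x + h / 2 * k1)  # slope estimated at the midpoint
    return x + h * k2       # extrapolate from x itself, using the midpoint slope

def two_euler_half_steps(x, h):
    x = x + h / 2 * f(x)    # first half-step
    return x + h / 2 * f(x) # second half-step, from the intermediate point

x0, h = 0.1, 0.5
print(midpoint_step(x0, h), two_euler_half_steps(x0, h))  # close, but not equal
```

The difference is exactly the one stated below: the midpoint rule still extrapolates from the original point x, whereas two chained Euler steps extrapolate from the intermediate point, which yields a different method with different properties.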
That would be a different algorithm with different properties. So I guess the observation is correct that the b_i's sum to one; they need to, probably, something like that. But you actually have quite a lot of freedom in choosing these numbers. I will show two or three examples that are not forward or backward Euler, where we see Butcher tableaus; maybe we can revisit your question after you've seen those, and you can check whether they are a counterexample to your intuition or not. Yes? [Another question, deferred.] All right. Yeah, so let's jump to the next example. This is maybe the second-easiest method that we can still try to interpret intuitively in terms of how it was constructed. So, okay. Well, earlier I said we can take the derivative either at the point where we are or at the point where we want to go. If it's the point where we want to go, we're implicit. So maybe we can take a better derivative in the middle, in an explicit way, because we don't want to be implicit, which would amount to this equation here at the top. But then the question is how do we estimate the state in the middle, and the easiest way of doing that would be taking an Euler step. And that leads us to the so-called explicit midpoint rule, which is the third Runge-Kutta method in the Wikipedia list. But if you check, it's not the same as doing two Euler steps, because you still extrapolate from the point down here, and K1 is internally multiplied with half the step size. So you kind of reuse the thing that forward Euler did in order to get a better extrapolation, whereas if you wrote down two Euler steps, something else would come out, with different properties; we'll talk about them in a little bit. One more, for the fun of it: there's the classic fourth-order Runge-Kutta method.
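To back up the point that the explicit midpoint rule is not the same as two half-size Euler steps, here is a quick numerical check on x' = x (a sketch of my own, not lecture code):

```python
import numpy as np

def euler_step(f, x, t, h):
    return x + h * f(x, t)

def midpoint_step(f, x, t, h):
    # estimate the derivative at the interval midpoint via a half Euler step,
    # then extrapolate from the ORIGINAL point x with that better derivative
    k1 = f(x, t)
    x_mid = x + 0.5 * h * k1
    return x + h * f(x_mid, t + 0.5 * h)

f = lambda x, t: x          # x' = x, exact solution exp(t)
h, x0 = 0.1, 1.0
two_euler = euler_step(f, euler_step(f, x0, 0.0, h / 2), h / 2, h / 2)
midpoint = midpoint_step(f, x0, 0.0, h)
exact = np.exp(h)
```

Both variants cost two evaluations of f per step, but the midpoint step (1.105) is noticeably closer to exp(0.1) than two half Euler steps (1.1025), which is exactly the difference in convergence order discussed below.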
I think it's called that because it's the original method by Runge and Kutta together. These are the equations. I'm sure we could find some intuitive understanding of them, and there are also nice plots on Wikipedia on how the method extrapolates with the gradients and adjusts that in some way, but I think we skip that. The thing is, the Butcher tableau is still readable enough; the code is still readable enough. You can implement this. And maybe at this point: Runge and Kutta came up with this method, so they derived the numbers on the right, and there is a derivation that is readable. You can check books like this 'Solving Ordinary Differential Equations' book, which includes a derivation of these coefficients. I think by now it's clear that we cannot choose them completely however we want, but it's also clear that there is some freedom; otherwise there wouldn't be a list of 150 Runge-Kutta methods on Wikipedia. So feel free to check out the derivation; it is understandable, and you can do that if you're interested. But yeah. So, there's a question. Sorry. [Do the b_i's, the quadrature weights here on the last line, always sum up to one?] Yes, I think so. [Follow-up, unclear.] I think I got a bit confused, sorry; let's talk afterwards. Yes? [Am I correct in understanding that the tableau arises from the quadrature of the integral that we saw at the very beginning?] Yes. So the way to derive these would be: choose a family of polynomials, derive the quadrature rule, and derive the tableau from the quadrature rule. Hmm, actually I'm not so sure about that anymore, because it's not just an integration problem; it's also about constructing these intermediate values. So I'm not sure if you have a direct mapping from integration rules to Runge-Kutta methods.
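As an aside on the audience question about the weights: the b_i of a consistent Runge-Kutta method do sum to one (that is the order-1 consistency condition), and we can check it on the methods discussed in this lecture. A small sketch of my own; the Dormand-Prince rows below are the standard published coefficients, including the embedded fourth-order row:

```python
import numpy as np

# quadrature weights (the last tableau row) of the methods from the slides
b_euler    = [1.0]
b_midpoint = [0.0, 1.0]
b_rk4      = [1 / 6, 1 / 3, 1 / 3, 1 / 6]
# Dormand-Prince: the 5th-order weights and the embedded 4th-order weights
b_dopri5 = [35/384, 0.0, 500/1113, 125/192, -2187/6784, 11/84, 0.0]
b_dopri4 = [5179/57600, 0.0, 7571/16695, 393/640, -92097/339200, 187/2100, 1/40]

# order-1 consistency: every weight row must sum to one
all_sum_to_one = all(
    np.isclose(sum(b), 1.0)
    for b in (b_euler, b_midpoint, b_rk4, b_dopri5, b_dopri4)
)
```

Note that individual entries can be zero or even negative (Dopri5 has a negative weight); only the sum is constrained by consistency, which is where the freedom in designing methods comes from.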
[I think the points at which you evaluate the quadrature rule are given by the zeros of the polynomial family. So it's built into the nodes of the quadrature rule, and they should directly come from the top row of the tableau.] Okay, I'm not sure if that's exactly your point. [But from a tableau one can read off a quadrature rule?] Yes, pretty much. And when I looked at the 'Solving Ordinary Differential Equations' book, sometimes the text was motivated exactly like that: there are these quadrature rules, can we find a similar thing that solves ODEs? And then the follow-up question is how do we construct these things in between. That's why my impression was that it's not a direct mapping, but they're very much motivated by quadrature rules. Thanks for the question. So, we were here. I guess the next one to jump to, which you know from the very first lecture, is the Dormand-Prince method, or Dopri5. And at the latest here, it gets, well, it's still fractional numbers, so it could be worse, but we see that it's hard to interpret what's going on, and the goal should not be to try to understand what each entry does. In this book there are explanations of what properties these coefficients need to satisfy, and then Dormand and Prince came up with a method that does a lot of things, works well, and satisfies the properties. And there's a question in the back. [Why are there two lines at the end?] Yeah, I was coming to that. So the tableau is so complicated because the method does a whole lot of things. First of all, things get more complicated when you make the whole tableau bigger, and that makes it less trivial; we've seen before that this one was harder to interpret than this one. So that's already one thing. But then the question was about these two lines. And Dopri5 does one cool thing, which is that it gives you two solutions at each step. Because, I mean, this last line was about how to sum up the K's that you've computed before.
And there are two different ways of summing them up, which give you two Runge-Kutta methods at the same time. That's why it looks this way: they tried to have an efficient algorithm that gives you two Runge-Kutta methods at the same time, but with different degrees of accuracy. One of them is more accurate than the other, and there are theorems that say that the orders of accuracy differ between the one and the other. So basically you can use the more accurate one to estimate the error of the less accurate one. And again, that's why this thing is so complicated. Error estimation is in turn helpful because of one thing we didn't talk about yet, but which I will talk about now and on the next slide, and which you will see in the homework: I never discussed how to choose the step size h. If you have an estimate for the error, then you can use that in order to adjust your step size h; your problem changes from solving the ODE with a given step size to solving the ODE while satisfying some local error tolerance. Yes, there's a question. [Which of the two rows is the more accurate one?] I actually don't know which one is the more accurate one. One of them is always the more accurate one. Intuitively, this zero here surprises me; it seems like it contains less information, so I would assume that this one is less accurate, but I wouldn't guarantee it, because I never implemented it by itself. Yes, another question. [Why are there two zeros here; why are those stages even computed?] It's because above, those entries are not zero, so you need them as intermediate values. Exactly. So, in the very first lecture you saw us switch to Python code and scroll through the Dopri5 Fortran code on GitHub; I just copy-pasted one specific part out of it. But the rest of the code is still quite hard to read, because it's still Fortran.
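The idea of getting two estimates of different accuracy from the same stages can be shown with a much smaller embedded pair than Dopri5. A sketch of my own (the pair is Heun's method with an Euler error estimate, not the lecture's example), where K1 is reused by both weight rows:

```python
import numpy as np

def heun_euler_step(f, x, t, h):
    """One step of a simple embedded pair: Heun's 2nd-order method together
    with a 1st-order Euler estimate. Both reuse the same stage k1, so the
    local error estimate comes almost for free."""
    k1 = f(x, t)
    k2 = f(x + h * k1, t + h)
    x_high = x + h * (0.5 * k1 + 0.5 * k2)   # 2nd-order weight row
    x_low = x + h * k1                        # 1st-order weight row (Euler)
    err = np.abs(x_high - x_low)              # local error estimate
    return x_high, err

x_high, err = heun_euler_step(lambda x, t: x, 1.0, 0.0, 0.1)
```

On x' = x with h = 0.1, the estimate err = 0.005 is close to Euler's true local error exp(0.1) - 1.1 ≈ 0.00517, and it is exactly this kind of cheap estimate that drives step size adaptation.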
But at least this huge list of numbers that we talked about in the very beginning got a bit more interpretable, because now at least you know that this hard-to-read thing maps to this slightly less hard-to-read thing. So a65 is, well, a65, somewhere here, and you can find all of those numbers on the left side. And again, it shows that, given the Butcher tableau, implementing the method basically amounts to writing down all of these numbers and then doing the generic Runge-Kutta thing. And as I mentioned before, the Fortran code is so annoying to read not just because it's Fortran, but also because all of this other stuff happens in there. It doesn't just compute the next step: it computes two estimates, it does local error estimation, it chooses a step size. There are all these bells and whistles on top that get called when you call SciPy's ODE solver. Yeah, and once more, I refer to this book very often, so you can imagine that it actually is this thick; in chapter 2.4 there is a section on error estimation where how this works is actually explained. All right. So we come to the intermediate summary and to the end of this block on how to solve ODEs. I think the main takeaway is: we talked about Runge-Kutta methods and I showed a bunch of them, and we ended up with Dopri5. There are a lot of them, and they have different properties. The most complicated one we saw today is Dopri5, but you can actually get larger methods whose tableaus are 15 lines long, with similar readability. But there are differences between the solvers, and these differences are interpretable in some way. We talked about stability, so explicit versus implicit methods. I only talked about forward and backward Euler, but there are implicit Runge-Kutta methods; I didn't show their tableaus, but they do exist.
They exist in Python, in Julia, in whatever your favorite solver library is. So if things diverge, maybe try an implicit solver; or if you know that your problem is stiff, maybe try an implicit solver. We also briefly talked about order. I didn't really have that as the focus in the last few slides, so let's do it quickly here, because there was this comment about whether it's the same to do many small intermediate steps or to do this complicated construction. And the difference between doing the complicated thing and doing the simple thing is actually that you have a different convergence rate of your local error. Okay, it's not defined here, but the local error is basically the one-step error. So the question is: how much error do you make if right now you have the true, correct solution, but for the next point you use the estimate? And the order of a Runge-Kutta method is defined via the local truncation error having order h to the p plus 1. It basically also shows you that if you make h small, then your local error gets small; as h goes to zero, the local error converges to zero, so that is good. But yeah. The solvers that we looked at: forward Euler has order 1, and you will prove that on the exercise sheet. The explicit midpoint method was the next more complicated thing, and it is more complicated precisely because it has somewhat better convergence, order 2. The fourth-order Runge-Kutta method: it's in the name, order 4. And Dopri5 is called Dopri5 because it has order 5. So maybe this is the point: say you did four Euler steps instead of one step of the fourth-order Runge-Kutta method, each at a quarter of the step size. With Euler, the error would be something times h to the 1, whereas with the fourth-order Runge-Kutta method the error is something times h to the 4. And for, say, h equal to one tenth, 10 to the minus 4 is a lot less than 10 to the minus 1.
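These orders can be verified empirically: on x' = x, halving h should shrink the local (one-step) error by roughly 2 to the p+1, so by about 4 for Euler and about 32 for the classic fourth-order method. A sketch of my own, not lecture code:

```python
import numpy as np

def euler_step(f, x, t, h):
    return x + h * f(x, t)

def rk4_step(f, x, t, h):
    # classic fourth-order Runge-Kutta step
    k1 = f(x, t)
    k2 = f(x + h / 2 * k1, t + h / 2)
    k3 = f(x + h / 2 * k2, t + h / 2)
    k4 = f(x + h * k3, t + h)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def local_error(step, h):
    # one-step ("local") error on x' = x, x(0) = 1, exact solution exp(t)
    return abs(step(lambda x, t: x, 1.0, 0.0, h) - np.exp(h))

# local error is O(h^(p+1)), so halving h divides it by about 2^(p+1)
ratio_euler = local_error(euler_step, 0.1) / local_error(euler_step, 0.05)
ratio_rk4 = local_error(rk4_step, 0.1) / local_error(rk4_step, 0.05)
```

The measured ratios come out near 4 = 2^2 for Euler (order 1) and near 32 = 2^5 for RK4 (order 4), matching the stated local error rates h^2 and h^5.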
Thanks a lot. Yes, another question? [Doesn't the error estimate compare two estimates that are both wrong?] Well, it uses the difference between the two estimates, and that's also the trick of how step size selection works: you know that the more accurate one is also not the true solution, but you can still use the difference, because you know both of their convergence rates. Also, to add to the explanation that we just heard: this does not mean it's always better to use a higher-order method, because otherwise we would use order-13 methods or something all the time. There's a hidden constant in front. So unfortunately, same as with stability, order only gives you a heuristic: if you want to solve something in a precise way, you probably want a higher order, and Dopri5 is probably an okay default setting. But if you only need something very coarse, things can be faster and more stable with only an order-two or order-three method or something. And then, I mentioned that there are many bells and whistles happening when you call SciPy or similar. The most important one is probably step size selection, which uses this information about order and local error to choose a step size. And again, that's something we don't have time to cover in the slides, but it's a whole question of its own: instead of taking h as a given, the question becomes how to choose h such that my local error stays below some tolerance. We have an exercise on that on the exercise sheet. Also, just for completeness: sometimes you even have automatic solver selection happening, which tries to do all of the above automatically, and which makes the implementation of everything way more complicated again, because it only uses heuristics. But as always with automatic things: as soon as things break, you probably need to understand the internals again. So it's good to know about stability and order.
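The standard step-size controller can be sketched in a few lines: given a local error estimate and a tolerance, rescale h using the known convergence rate. This is a sketch under common conventions; the safety factor 0.9 and the doubling rule for a zero error estimate are typical choices, not something fixed by theory:

```python
def propose_step_size(h, err, tol, order, safety=0.9):
    """Propose the next step size from a local error estimate: since the
    local error scales like h^(order+1), rescale h so that the estimated
    error matches the tolerance, damped by a safety factor."""
    if err == 0.0:
        return 2.0 * h  # error estimate vanished: grow the step (heuristic)
    return safety * h * (tol / err) ** (1.0 / (order + 1))

# example: order-1 method, error 100x above tolerance -> shrink the step
h_new = propose_step_size(0.1, err=1e-4, tol=1e-6, order=1)
```

Here (tol/err)^(1/2) = 0.1, so the proposal is 0.9 * 0.1 * 0.1 = 0.009: the controller shrinks the step aggressively when the error estimate overshoots the tolerance, and would grow it when the estimate is far below.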
Alright, that concludes the how-to-solve-ODEs part, the simulation part. Are there any more questions on that? Otherwise we also have time at the end. Okay, so now we basically flip the problem setting, because up to now I assumed that we are given an initial value problem: we have an initial value, we have a vector field, we know everything exactly and perfectly, and still we have to approximate because it's a hard problem. Now we assume we don't know f, and maybe we don't even know the initial value. But instead we do have data, which, well, we like to have. And the question is how we estimate f and the initial value from it. And yeah, the motivational example from the first slide was basically this one. We have a scatter plot like here, a green and a red scatter plot, and if we just put on our machine learning hats, then we do curve fitting: we use a neural network, or after this course we use a GP, we get a very nice interpolation, and then we argue about the priors. But of course, if we know more about the generating process of this data, then we can get a more meaningful answer. And this links this whole idea of ODEs together with the problem setting that we often encounter, where we just try to fit data. In this block we use the ODE as information, because it helps us understand the process that did generate the data, and we can write down the generative model and do inference in that system. In this example, and you've seen flavors of this multiple times already, I apologize for not using the same ODE here as in later slides; it's because in this paper we used a different ODE than elsewhere. Here it's a so-called SEIR model. It's an epidemiological model where the letters stand for susceptible, exposed, infected, and recovered. So we don't model the dead; we just say if you die or recover, you're 'recovered'.
You could add a D at the end, and then we would model that too. And, well, it's also quite interpretable. It basically says that, with a contact rate beta, susceptible people meeting exposed people move from here to here. Oh, there's also a typo: this should be an E divided by N, pardon for that, so that the mass moves from there to there. Then with some rate, exposed people become infected, and with some rate, infected people recover. You can also have an additional term that moves people from susceptible to infected directly; if you look this up on Wikipedia, that is the case, but we set the second beta to zero in this experiment, which made it easier to write down. And the parameter inference problem is basically finding beta, gamma, and lambda such that the ODE solution fits the data, which, well, here is not the case: the lines in the back are generated with wrong parameters, and the lines at the bottom are generated with inferred parameters. And having a good solution basically means having a good fit through the data, but at the same time also having parameters that we can kind of cross-check against what we think they should be, because compared to the weights of a neural network, we actually know what lambda means. And if it says that, I don't know, the recovery rate is huge or super small, then maybe something's wrong, because maybe we have some additional information from the literature about it. Yeah, so it's not just a simulation problem, though there is one hidden in there; it is rather an estimation problem for the vector field f and the initial value. To write that down, we write down the generative model for the state x, and then we will write down the observation model. And the generative model for the state is in general an ODE, where now we have a parameter theta entering the vector field f, and theta potentially also influencing the initial value.
But all the rest is the same, which also means: if we know theta, it's a simulation problem, and we can just solve it with Dopri5. If we don't know theta, then it's not a simulation problem but an inference problem. And the thing we want to learn theta from are noisy observations: a data set consisting of y_i's and t_i's, observed as follows. So y_i depends on x at time t_i, for generality in a linear way. You can imagine this H, for example, selecting only the infected people. Like in the example before: this we can measure, but this and that we cannot. So H would be a linear map that maps the four-dimensional thing to, say, the last two dimensions, like a projection. That's what makes sense to be in there. Then we assume Gaussian noise. And the general goal for this parameter inference problem would be to estimate theta from the data, which via Bayes' rule means solving this thing. Well, as you probably know from before, actually getting the posterior is not quite trivial. There are a bunch of things we could do: we could use MCMC, or Laplace approximations, or something. Today we will talk about kind of a sub-goal, but it really contains most of the complexity that we need to talk about today: computing the maximum likelihood estimate, which requires, well, computing the likelihood. But the main thing that needs to be done is actually thinking about what this term is; we will do that on the next slide. And since this term also appears up here, this is really the thing we need to understand. And, I mean, we can always add a prior afterwards and then think about how to extend this to get the actual posterior. All right. We can write down the likelihood because we can write down the generative model, and we know that the data here is just, well, a linear map plus Gaussian noise.
So the likelihood of the whole data set is just, well, a product of Gaussians, evaluated at y_i, with mean H times x_theta of t_i and covariance sigma. Now, it's easy to write down, and this is actually the correct likelihood, but we cannot work with it, because our fundamental assumption here is that y_i is measured from the true x. And we spent all this time today talking about the fact that we cannot actually get the true x and have to, well, estimate it somehow. So we cannot actually evaluate this likelihood, and we have to approximate it somehow, for exactly the same reasons for which we needed to do all the Runge-Kutta machinery before: we cannot compute the true x. So that's a problem, but we also talked about what to do in this case. And at least from the perspective of today, the correct thing to do, or one thing that we can do, is approximate the true solution with a numerical estimate of the solution, which is essentially what we do anyway when we call a Runge-Kutta solver: we take the output and just continue working with it. So we just assume that our solver is good enough, while of course keeping in mind that the solver makes errors. So yeah, if you use it with a huge step size, maybe this is not a good assumption. But we can do this approximation and then plug it into our likelihood from above. We get out this term; all that changed is the hat here. And this term we can now evaluate, so we can actually compute the likelihood itself and the maximum likelihood estimate. And as you know, maximizing the likelihood is equivalent to minimizing the negative log-likelihood, and we get basically a mean squared error: in particular, if we assume that this covariance is sigma times an identity matrix with sigma fixed, then it is actually a mean squared error. So this we can compute. It's a loss function that takes a parameter and gives us a loss, and we need to minimize it. And, I mean, we minimize loss functions all the time, so we'll just do that.
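To make the "solver inside the likelihood" point concrete, together with the huge-step-size caveat, here is a toy version on x' = -theta x (a sketch of my own; `euler_solve` and `approx_nll` are hypothetical names, and a crude Euler solver stands in for Dopri5):

```python
import numpy as np

def euler_solve(theta, ts, n_sub):
    # fixed-step Euler solve of x' = -theta*x, x(0) = 1, evaluated at times ts
    x, t, out = 1.0, ts[0], [1.0]
    for t_next in ts[1:]:
        h = (t_next - t) / n_sub
        for _ in range(n_sub):
            x += h * (-theta * x)
            t += h
        out.append(x)
    return np.array(out)

ts = np.linspace(0.0, 2.0, 9)
ys = np.exp(-2.0 * ts)  # noise-free "data" from the true solution, theta = 2

def approx_nll(theta, n_sub, sigma=0.1):
    # plug-in negative log-likelihood (up to constants): the true x is
    # replaced by the numerical solution, so solver error enters the loss
    return 0.5 * np.sum((ys - euler_solve(theta, ts, n_sub)) ** 2) / sigma**2

fine = approx_nll(2.0, n_sub=1000)  # accurate solve: tiny loss at the truth
coarse = approx_nll(2.0, n_sub=1)   # huge steps: large loss even at the truth
```

With an accurate solve, the plug-in loss is essentially zero at the true parameter, while with one giant Euler step per observation it is large even there: the solver error masquerades as data misfit, which is exactly why "the solver is good enough" is an assumption worth remembering.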
So let's look at an example. And I warned you in advance: now we don't have an SEIR model anymore, but an SIRD model. The E disappeared, which was the intermediate stage between susceptible and infected, but instead we start counting the dead people. We have a time span, and we have an initial value that has a free parameter: we say that the number of infected people in the beginning, we don't know, so we need to estimate it. And we have the model parameters beta, gamma, and eta, where beta is the contact rate. We generate a data set by choosing some parameters, simulating this model, and then taking noisy samples from it. Note also that the noise level will not be estimated; it will not be part of the loss, but it totally could be, we could estimate a noise term, it's just that here I didn't. And the true parameters are, well, a very low number of infected people at the start, and then some contact rate, gamma, and eta. If we have this, we can also start writing down a loss. This is exactly the thing from the last slide; it's a mean squared error, and there's the 0.1 coming from up here. It doesn't really change much, same as this one-half. But we can implement it and we can evaluate it. And then, well, if we want to optimize it, we start with some initial guess. This is a very bad one, because it's orders of magnitude wrong here. And if we plot that initial guess, we also see these lines. We only see three lines because R and D are stacked on top of each other, since those two numbers are the same. But it's a very bad fit. And what the loss does is: given the parameter guess, compute the approximate solution, so these lines, and then compute the L2 loss, which means comparing this line to this scatter plot and that line up there to the blue scatter plot down here. And now we can run an optimizer. I mean, it's an optimization problem, so the question of which optimizer to run is the same as in general.
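The whole pipeline (simulate with free parameters, build the squared-error loss, optimize) can be sketched on a one-dimensional toy problem. The lecture used L-BFGS; to keep this sketch of mine dependency-free, I use a brute-force grid instead, which is only viable because there is a single parameter:

```python
import numpy as np

def solve_at(theta, ts, n_sub=100):
    # crude fixed-step Euler solve of x' = -theta*x, x(0) = 1, at times ts
    x, t, out = 1.0, ts[0], [1.0]
    for t_next in ts[1:]:
        h = (t_next - t) / n_sub
        for _ in range(n_sub):
            x += h * (-theta * x)
            t += h
        out.append(x)
    return np.array(out)

# synthetic data: simulate with the "true" theta = 2, add observation noise
rng = np.random.default_rng(0)
ts = np.linspace(0.0, 2.0, 20)
ys = np.exp(-2.0 * ts) + 0.01 * rng.standard_normal(ts.shape)

def loss(theta):
    # least-squares loss, i.e. the negative log-likelihood up to constants,
    # with the ODE solver sitting inside
    return float(np.sum((ys - solve_at(theta, ts)) ** 2))

thetas = np.linspace(0.5, 4.0, 71)
theta_hat = thetas[int(np.argmin([loss(th) for th in thetas]))]
```

The recovered `theta_hat` lands close to the true value 2, and, as in the SIRD example, it is a number one can sanity-check against prior knowledge rather than an opaque weight vector.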
And I think there will be time dedicated to that later in this lecture course. I used L-BFGS because it worked. And, well, when we run that, we see how it iterates. I take a bit larger steps here, and at some point, after 150 iterations, we have what visually seems to be a perfect fit. But the cool thing is that we don't just care about fitting the data and seeing that the lines are fine: we get an output that we can actually interpret. Like, from this problem we inferred that the fraction of initially infected people was 10 to the minus 5; and since the population is scaled from zero to one, scale it with the population count to get an actual number. We get a contact rate of 0.5, a recovery rate of 0.06, and a death rate of 0.002. And these numbers we can interpret; we can compare them with the literature and so on. Which is why it's pretty nice not to fit this with just a neural network or even a GP: even if at this point we don't get uncertainty, we get something that we can interpret. Alright, so this is the general picture, and the strategy I explained here also scales to parameter inference problems in more complicated settings. It's really about having a model of the dynamical system with some free parameters, having the data, and then constructing a loss that internally calls your ODE solver. Now we come to an example that you know from the very first lecture. But in the very first lecture you saw a solution that we don't yet know how to compute, so we'll do something else. The motivation in the very first lecture was that we take the SIRD model from the last slide, and if we compare that to actual COVID data, that is, the black dots up here, then we see that the curves from here would probably not be a good explanation for it. And even if we just look at this model, it's because the model itself would also not be a good explanation for it.
And one way of making it better is, for example, to make the contact rate time-varying. I think that this is still a very reasonable assumption. It's still only a model, because, I mean, it still only has these four categories and so on; I'm not claiming this is THE model, but we made it more complex at one specific point. And so the question is now: how can we do parameter inference with this slightly more complex model? Or actually quite a lot more complex, since we made something continuous instead of just a scalar. And what we've seen in lecture one is that, by the end of next week's lecture, we will end up with this plot, where we find a time-varying beta that gives us an estimate of the contact rate, together with an estimate of the state of the solution, which has a good interpolation, and with uncertainty estimates. We don't have the tools for this today, so this is the motivation for what I'm going to show instead. We do, however, have the tools today to just model beta of t with a neural network and do parameter inference on the neural network weights. I'm not claiming this is the smart thing to do, and as you will see in the results, it doesn't give you something as nice as here. But in principle, we have the tools. So the estimation problem is one where, for this COVID data here, I assume that gamma is known and eta is known; we don't estimate those anymore, we just take them from the literature, which is also what was done to get this plot here. And all the parameters that we need to find are the parameters of the neural network here. But we can still write this loss function that takes the parameters and maps them to something. And what it does is basically: given the parameters, solve the ODE, compute an L2 loss, and then we try to minimize that.
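A minimal version of "beta of t is a neural network" can be sketched as follows. This is entirely my own construction, not the lecture's code or architecture: a tiny random MLP maps t to a positive contact rate via a softplus, and it is plugged into the SIRD right-hand side. In the real experiment, the weights W1, b1, W2, b2 would be the parameters theta entering the loss; here I only simulate forward. A nice sanity check is that S+I+R+D stays constant, since the right-hand side sums to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# tiny MLP t -> beta(t); these weights are the free parameters to be learned
W1, b1 = rng.standard_normal((16, 1)), np.zeros(16)
W2, b2 = 0.1 * rng.standard_normal((1, 16)), np.zeros(1)

def beta(t):
    hidden = np.tanh(W1 @ np.array([t]) + b1)
    z = (W2 @ hidden + b2)[0]
    return np.log1p(np.exp(z))  # softplus keeps the contact rate positive

def sird_rhs(x, t, gamma=0.06, eta=0.002):
    # SIRD vector field with a time-varying contact rate beta(t);
    # gamma and eta are held fixed, as in the lecture's setup
    S, I, R, D = x
    dS = -beta(t) * S * I
    dI = beta(t) * S * I - gamma * I - eta * I
    return np.array([dS, dI, gamma * I, eta * I])

# forward-Euler simulation of the model with the (untrained) random beta(t)
x = np.array([0.99, 0.01, 0.0, 0.0])
h = 0.1
for k in range(500):
    x = x + h * sird_rhs(x, k * h)
```

Training would wrap this simulation in the squared-error loss from the previous slides and optimize the four weight arrays; the design choice worth noting is the softplus, which enforces beta(t) > 0 without constraining the optimizer.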
And as the difference in quality between this plot and this plot shows... well, I prepared this for this lecture, and it's not trivial to explain everything that happens here. So, yeah, it could have been a nicer plot. We have S, I, R, and D in four different panels because they vary so much in scale; I wanted to look at all of them, not only I, hence all of those numbers on the left. And the thing I cared most about is beta, which is why it takes up so much space on the right. I will show an animation of how it looked, in my experience, to try to train this thing, and you will see that it kind of works okay. I'm not a deep learning person; I don't really know how to choose neural network architectures, so I chose hyperparameters after trying around for a while in preparation for this lecture. And I get something that, I mean, is not comparable to the fit that I had with the GP approach, but it gives us the two bumps here and here; it fails to fit all of this complex part over there. And it gives us a contact rate. It's way less beautiful than if we had uncertainties around it, and it's way harder to interpret, but it's a point estimate of a contact rate. And at the same time, I say it's hard to interpret, but we're using neural networks, right? And typically, neural network things are hard to interpret, while here we can plot the line and we know what it means. So, compared to typical neural network outputs, this is super nice to interpret. It's just that, yeah, I don't actually know how good this contact rate is and how well it matches the actual contact rate that we would have observed. Also, the training kind of stops at some point; I just ran it until convergence. It's an optimizer with line search happening, which is why sometimes it's faster and sometimes slower. Yeah, and this is the final result.
So, yeah, it's okay, but it could be better. And in particular, maybe one thing to mention: this was the result, and here on the left there's this peak happening. If we try to interpret this, then basically I would wonder what's going on here. But if we compare that to the other result that we have, then until January or April, that would be from here to here, that method is completely uncertain about everything. So maybe I should not spend half an hour trying to interpret this peak. And the reason why that estimate is so uncertain there is something we can also interpret: if the numbers are low... there's a typo, this E should be an I, it moved from one side to the other... so, if the number of infected people is small, then this gradient is small, there's little change, and beta influences basically nothing. Like, if you multiply a very small number with something, then it doesn't really matter whether the something is big or not. And that's kind of why there's so much uncertainty here for beta, and also here, because here the infection numbers are low as well. Whereas in this plot, we don't really get that information, and it's kind of a drastic change to double the contact rate. But again, the big disclaimer: I did this in preparation for this lecture. Conceptually, I think this is a valid approach, with all the pros and cons that you have when you use neural networks. By doing exactly this, you do not get uncertainty quantification. But as I said before, we talked about maximum likelihood estimates because that's the thing that we needed to learn how to do; if you have likelihoods, you can add priors, you can start sampling, and so on. I'm not claiming that this is the best thing to do in this context, but it's not impossible to add uncertainty quantification and still do conceptually this, namely using a neural network or another parametric model.
And actually, I gave you this code; you can find it on ILIAS, and you can play around with it for the exercises this week. And, yeah, what we will do next week, as a final thing, is this. Because we had so many lectures on GPs, and many of the things I just mentioned, about the uncertainty and about what's happening on the left, would be solved by using that other class of models. Which might be particularly helpful here because we map a scalar to a scalar, and that's not a very common neural network task, let's say. But it is a common task: we saw many plots where we did exactly that with a GP. So that might be the more natural choice, and we actually can do inference there, and that's what we're going to do next week. With that, we are at the end of the lecture. We talked about ODEs in general and saw some examples from machine learning. And I hope I gave enough examples for you to see that ODEs appear in core machine learning research, in papers that are submitted to conferences right now; two weeks ago I reviewed papers that had ODEs in them, again, as always. So ODEs play an important role in machine learning, in core machine learning papers, but also when you start applying machine learning to real-world problems, you often have structure to write down, because you did actually think about the problem before calling your machine learning algorithm. In general, solving ODEs is non-trivial. That's why we spent about an hour talking about numerical solvers, and we only scratched the surface. But we've seen a bunch of examples, and I hope you got a better feeling for the difficulties and complexities that arise when solving ODEs with a computer.
And, well, we cannot just simulate ODEs; we can also learn ODEs and their parameters from data, which, again, I think is particularly interesting in the machine learning context, even when there are neural networks in there. One final remark, to be a bit more general and zoom out: at this point, you basically know what neural ODEs are and how to work with them, because in the neural ODE context, the difference is just that you don't assume structure, maybe because you didn't think about your problem beforehand, and you make everything a neural network and try to train that. But conceptually, it's basically the same loss function and the same training procedure. So, yeah, there we go: neural ODEs. All right, that's the end. Next week we talk about probabilistic numerical ODE solvers. For the first half, same problem setting, but a different point of view, because we know about probabilistic state estimation, and so we want to apply that. And then at the end, we do parameter inference again. So, thanks a lot. Please fill out the feedback form. Thank you.