# Numerics of ML 7 – Probabilistic Numerical ODE Solvers – Nathanael Bosch

Welcome everyone to lecture number seven in the Marrix of Machine Learning. So it’s lecture number seven but it’s also lecture number three on the block of simulation. So this is the third one in that row. It started two weeks ago with Jonathan that told us about state space models and he introduced the language to describe dynamical systems but in a probabilistic way. So we have an estimate for a state and how it revolves over time and we also have a language to describe an observation process that depends on that. So the point of view is one where there’s uncertainty involved in the evolution and in the observation and we try to reconstruct the state. And the specific tools that we talked about are common filters and smoothers or extended common filters and smoothers for nonlinear problems. Then last week we also talked about dynamical systems but with a very different language and we talked about ordinary differential equations where the point of view is not really anymore that we have a state that evolves in a probabilistic way and an observation process but it’s kind of a very precise way to formulating that we really know how the state is going to evolve with a differential equation. But then the problem is that when we do that we still have like we want to answer the question of what is the state then given this differential equation and doing that with a computer suddenly is difficult again and still does estimation. And we were only able to get algorithms that do solve ODE’s approximately. And so today we combine those two ideas. We will look at ordinary differential equations and ways to solve them but through the lens of Bayesian state estimation. And so this will be the biggest part of the lecture and the main part is looking at ODE’s, interpreting them as state estimation problems and then solving them as state estimation problems. And by doing that we get probabilistic numerical solvers because the algorithms that we use are, while they give us distributions over the state or over the ODE solution. And so the output is strictly richer than from a classic solver but we will see that in the lecture. Before I start with the actual content, I will have a one slide recap for each of these topics. The first one is the numerical ordinary differential equation solvers. So the stuff we did last week. And that was what we looked at differential equations or ordinary differential equations which is this equation in the middle and big, which describes the derivative of x by some vector field f that can be easy in the examples that we looked at but can be hard in practice. And then vaguely speaking, when you are given an ordinary differential equation, the task is to find x. And then you use an numerical algorithm in order to simulate starting from x0. But it’s kind of like there’s a mismatch between this task of finding x and the thing we actually do. And one example that popped up last week and we will look at this a lot today again. It’s the logistic ODE. That’s one of those cases where the differential equation is quite easy because we can actually solve it by hand. And that makes it nice for us to evaluate the methods we have. And while visually speaking, the problem states that we have these errors in the back, which are the vector fields, so this f, we have the blue thing on the left, which is the initial condition, and we want to find the black line. And then last week we looked at numerical methods that do that. And while we looked at these three, the first two are methods. And the third one is a whole class of methods. But each of them had the approach that we have some estimate, at some time t. And we want to compute an estimate at time t plus h. And we do that in a specific way. And the simplest one we could come up with is forward Euler. Similarly, simple is backward Euler, but there’s the tricky implicit part. And then the explicit one-year-cutter methods got, if you recall, these butchers have lost. They got, well, way larger and more complicated. But conceptually, it’s this simple formula. You just need to define what bi should be, c i, and then there are these other a’s hidden in the x tilde in here. But so there’s a mismatch between what we do and what we want. Because I said we want to just find x. And what we do is we call it x hat now because it’s wrong. So fundamentally, these methods do they try to find a quantity. They cannot find the quantity. So they approximate it. Or when we switch up the vocabulary a little bit, they estimate the true solution. And they provide a single x hat as an estimate. And they don’t just estimate it randomly. They estimate it by, well, from some information, for example, by evaluating the vector field. So you evaluate the vector field. And thereby, you update or you extend your estimate to a future time point. That’s the point of view that we will have today. And estimates really means there’s an error. And the error depends on the step size. I think we’ve seen that last week also. So if we make the age small enough, then the solution gets better. If we would be able to make the age infinitely small, then we get the actual true solution, but we cannot. And well, depending on the age, we have either a very noticeable error, as you on the top. Here in the middle, we still see it. Here on the bottom, we cannot see it anymore in this plot. But there’s always some errors somewhere. And so fundamentally, in the Merkle-ODs of us try to estimate a quantity. They will be wrong also. So they will never be completely correct about the thing they return. There will be an error. And they estimate x by using some information, in this case, evaluations of the vector field. And this kind of once more sets the whole motivation for, well, when we see that Merkle-ODs are as do estimation anyways, we might want to phrase it as an estimation problem and solve it as an estimation problem. And the tool we use is exactly what we did two weeks ago. Special set estimation. And just to briefly recap it, like refresh the vocabulary and the notation that we had, we looked at nonlinear Gaussian set estimation problems, where x is the state, the lower case x here, and y are observations. And well, we first write down kind of a generator model of how we think x and y behave. And we do that by specifying an initial distribution and the prior or dynamics model that describes how x evolves over time. Then we specify a likelihood or a measurements model, or observation model sometimes, that explains how y depends on x. And then we are given some data. And the task is now to do inference, which basically means, say, well, describe what we think x is, given the fact that we observed some data. And there were kind of four, well, two tasks that we looked at last week. But then there are two more things hidden in the algorithm. Basically, we have the filtering distribution that we looked at a lot, which is basically estimating the current states, given all information that we have access to right now. And the smoothing posterior, which is estimating x, given all the data. And then, well, to compute these, at some point, we do compute these prediction estimates. So basically estimating a future state, given current information. And there are these likelihood terms hidden in there, which gives us an estimate of the next observation that we might see. And these quantities on the right, we can compute with quite simple algorithms, so the extended comment filter and extended comment smoother, which have some, well, the building blocks of a predict step and an update step. We don’t need to go through a formula. But it’s quite nice to see that the predict step is really a two-line thing. And the update step is a five-line thing. You could even make it less, of course, and just make things less readable. But it’s conceptually quite a simple algorithm and simple to implement. And this moving is similarly complex. And so, yeah, I said it before. And I will repeat this quite often today. But we combine these two approaches. So we combine the methods from the last slide that we saw in order to solve the problem from the first slide to get probabilistic numerical OD solutions. And one motivational quantity is this one, because that kind of describes the kind of thing we want to have in the end. And it also shows you how to phrase the numerical OD solution as an inference problem. Because instead of saying that we want to find an x that fulfills the OD everywhere, which we cannot compute anyway, might as well specify the thing that we might be more likely to compute. And that is we want a posterior of an x of t. That satisfies, well, the initial condition, because we look at initial value problems. So we want x0 to match the given initial condition, the x subscript 0 is given. And we want to fulfill, well, satisfy the ODE. But we relax the whole thing and satisfy it only on a discrete set of points. And the motivation for that might be go back again. We only evaluate the ODE on a discrete set of points anyways. So all of these methods that we looked at, we always go from t to t plus h in some sense. We never look at the continuous thing or condition on the continuous thing, which is why we make it part of our problem statement to say, well, given the discretization, I want an estimate of the output. And because I told you that we’re going to use the stuff from two weeks ago, well, that motivates the fact that we want to do Gaussian filtering and smoothing. And the reason for that is, of course, not only because we did it in lecture, but because it’s a fast algorithm. It does inference in OAuth n time. And well, Runga Kutta does that too. So we want to be a bit comparable. And in order to end up with something where we can do Gaussian filtering and smoothing, or extend it kind of in filtering and smoothing, we need to construct a state space model. Because, well, as you’ve seen on the recap, once we have the state space model, we know what to do. So the main job will be to build up a state space model that, once we solve it, gives us a quantity of this kind. And for that, we will need to define a prior model, a likelihood and data model, or a way of informing the state. And once we have that, we can do inference. So let’s start with a prior. And well, on the Bayesian filtering and smoothing slide, I use this lowercase x to describe the state. And I also use the lowercase x in the ODE context. But in some sense, let’s still ask the question, what is the state, actually, that we can consider? And well, just using this lowercase x through ODE solution doesn’t satisfy the thing we wanted the state to be two lectures ago. I think on the slides, we had something like a quantity that fully describes the dynamical system. And just having x of t is not enough. It’s like, you take the logistic equation, it’s a single point, it could evolve in any direction. But one thing that would, if we could, fully describe the thing, would be to have x of t and all the riveters. Might be infinitely many. But we’ve seen Taylor series expansions last lecture, and two lectures ago. So I hope these are familiar. But basically, here, the point is really, if we had access to x of t, it’s derivative, it’s second derivative, and so on, potentially an infinite sum, then it’s annoying to work with. But if we had those, that would fully describe the system. And well, once more, we do truncate the Taylor series expansions. So we want something that is very informative, even if it’s not perfect, which we can get when we truncate the whole thing. We only track the first q derivative, where x superscript 0 means the 0 of the derivative, and so on, in this notation, which introduces an error term. And now I will get a bit vague for the rest of the slide, and then very precise on the next slide again. But in some sense, the prior that we had two weeks ago, what was always like a probabilistic thing, right? There was uncertainty when you go from one step to the next. And that is really motivated from this term. If we would do this approximation, then there is an error in there, which we don’t know if we only do this approximation. And so to get a probabilistic model basically means, well, this error is now a random variable, and suddenly we have a transition model that I will write down on the next slide. So the goal here is really to find out what quantity can we use as a state in an ordinary differential equation. And my claim here is that tracking q derivatives is good, and then we have an error term that we can describe. So that is the state. We will, instead of just looking at x of t, we will look at x of t. It’s derivative, it’s second derivative, and so on up until the cube derivative. q might just be one, then we have a lot of noise. In the, I will show some code later where all we will track are the first two derivatives. So it’s not like we want 100 derivatives or something. But yeah, by tracking a limited number of derivatives, we get to a probabilistic state model. All right, that’s the motivation, but the notation doesn’t yet match what we did two weeks ago. And so I give you something that does match the motivation, the notation from two weeks ago. And well, that’s on this slide. The thing I just described has a name. It’s a stochastic process. It’s a so-called Q times integrated Vina process. And there’s a bit more history to this thing and a bit more context than we have time to talk about today, because it’s actually a continuous thing. But we will work with it in discrete time so that we can apply all the tools from two weeks ago. So we have this uppercase x that I will use now whenever I talk about this stochastic process that has a structure that goes from x0 to xq. And to be a bit more precise, I replace it now. Like I don’t write the lowercase x anymore, because the lowercase x we don’t know, right? That’s the true thing we would like to find. But the uppercase x is here. I know the random process that we are currently defining. But the idea is that, of course, this random process is supposed to model the true solution. This Q times integrated Vina process prior, this one specific process that I motivated and that we will choose to do today has transitions of this form. I think the most interesting lines, maybe this one, which says that if we have x at time t, we can get x at time t plus h with a Gaussian model, which is actually linear because we multiply some matrix a of h with x. And we have some covariance matrix Q of h. And the Q times integrated Vina process, apart from the Taylor series approximation motivation, is also quite convenient because this a of h and Q of h are known in closed form. The formulas look a bit annoying, but you can implement this and you can build up these matrices. So it’s actually quite convenient. And so this object is now something that we can work with and that we will. And the a h of Q h, yeah, I told you they, like the formula looks a bit annoying, but when you look at the thing in two dimensions, then suddenly it becomes quite structured. And a h is this upper triangular matrix with, well, actually only diagonal terms. So once everywhere here, h, h squared half, which might remind you of the Taylor series coefficients. And that’s where h comes from. And Q h, well, has a similar structure, not the same. But again, like written like this, it’s quite easy to write the code for it. And we will actually see the code for it. All right, so this is our prior model. We can plot it, so this is the, okay, maybe before I show the plot. It’s a two times integrated Vina process prior, because, well, we track two derivatives. And so the second, yeah, maybe it’s better with the plot. But the thing on the right, you see a h is a three times three matrix, because we track the zero of the derivative, the first and the second. So it’s a three-dimensional thing. If I plot this process x of t, then I get these three subplots basically. Like here, I think there are five lines in each or six. And then, well, five. So these are five draws from the process I defined. And maybe that visualizes why it’s called a two times integrated Vina process. Because the second derivative is a Vina process, which you might have heard before, maybe you’ve heard Brown in motion before, it’s the same thing. So this is this, we had random walk. If you integrate that version, you get these. And if you integrate that one, you get the top plot. Or going the other way around, if we take the green line from the top plot, we take computed derivative, then it is actually this green line here. If you compute it’s derivative, then we get the, well, not so nice line down here. And well, that is why it’s called a two times integrated Vina process. And if we draw more samples, then we maybe get a better feeling for, well, it’s statistical properties. Because at each point in time, it’s actually a Gaussian. And again, we will not talk about it as a continuous object a lot. But just so, like you have the reference, these are Gaussian processes, actually. So they satisfy the properties of a Gaussian process. And they are macovian because we have this. So it’s so called Gaussian macov process. All right. And then maybe one other thing to mention at this point, because we will need it later, if we have the whole stack capital X, then accessing, well, for example, the first derivative, is very easy. It’s basically just looking up the correct entry, which, well, instead of implementing as a slicing a num-py or something, is like a multiplication with a projection matrix. So accessing an entry is a linear operation. So it’s very convenient to give them this whole stack, access the first derivative, second derivative, or something. And we will need that. And if you slice. All right. To summarize all of this stuff I had to define here, we found, we defined a prior that we will work with. And it’s the Q-times integrated Vienna process prior. And it’s nice to work with because we can write down these transition densities, which we need to do Gaussian filtering and soothing later. And it’s convenient because A and Q are a given in closed form. And so the other part that we need, it’s two bullet points formally, but like conceptually it’s really the same and we will see why. The other part is this likelihood and data combination that will inform the prior so that it actually does the thing we wanted to do on the top. And to get there, we will look at the ODE again because, well, unsurprisingly, we didn’t use the ODE yet. And so the other half of the state-fetch model then probably has to include it. And so we will define a likelihood model and data motivated by the ODE itself. And well, the thing we were given in the very beginning was this ordinary French equation with the lower case X and the goal that we want to find X. And that’s the perfect thing we want, but it’s intractable. Now we defined this X of t process in the last slides. And the motivation for X of t was that it should model the thing we want to have in the end. And also I said that X1 is the derivative of X0 and they all correspond to one of the derivatives that we want to model. And so using the language of the thing we just defined, the equivalent statement to solving the ODE would be that we want our process to satisfy this equation. If it does, then it also solves the ODE. And then we can reformulate that to put all the X’s to one side and define a measurement function or sometimes also called information operator that takes X. It is defined in the following way and it should be zero. This is the ideal thing we would like to have in a perfect world where we have infinite compute. Now we already talked about discretization. So we already relaxed the goal a few slides ago that instead of trying to solve it everywhere, we try to do things on a discrete grid. And so the requirement basically becomes satisfying this ODE on a discrete set of points. Or equivalently, with the thing we just defined, that means that we want M of X now evaluated only at discrete points t i to be zero. And like conceptually, it’s still kind of a thing we want, right? That’s the information that we have and that we want X to satisfy. And the tool that we use in order to make it satisfy the thing is by making this an observation model or measurement model or maybe even information model and then doing inference into slides. And so this quantity motivates an absolute modelled model in the following way. So in the state space model description, I had Y’s. Now I use Z’s. I will say Y in a second. But we have this observation Z that conditionally on X is given by, well, a Gaussian, where basically we use this measurement function that I defined above. And we have data that says that this thing is actually zero. And zero is precisely the reason why it’s a Z. And so it’s really like this is the desired thing. We define this observation model. And then in the inference, you will see why this actually helps us satisfy the desired thing. I introduced one thing here. I don’t know if you realize, but there’s an R. There’s actually no reason for an R to be there because there’s no noise in the measurement process. We defined this function, like this measurement function above exactly. So there’s, in our problem setting, there’s no reason to assume that F gives us something random. Also there’s no reason to assume that we don’t actually want to satisfy the OD. We do want to satisfy the OD exactly. So there’s no noise on either side. So it’s actually a noiseless, like this model, which more formally would be a Dirac likelihood. Though this one is actually quite convenient because if you do, if you have the Kalman filter update in mind, there’s an R somewhere in there. And set, like just plugging in the zero does the correct thing. So that is actually what happens when you condition on a Dirac likelihood. You just, like in your code, you really plug in the zero as observation noise. All right. So again, we define a lot of stuff and this likelihood model. The good thing is we can have a look at it and like get a better feeling for the thing I just defined here. Like before that, I defined a prior for the X and how it evolves. Now I defined a generative model for a Z. And so we can look at it the same way as we did look at the prior. I can draw samples. This time for X, I just used the once integrated Vina process, which means that X1 is a Vina process and X0 is the integrated version. I mostly did that to save some space. And Z of T is defined in a specific way, right? So the concrete example is again the logistic ODE. And for this given ODE, when you plug in the ODE on the left-hand side everywhere, then Z of T is basically defined as X1 minus, so Z of T is X1 of T minus X0 of T times 1 minus X0 of T. And that is literally how I computed these lines. I take a line from here, that’s the X1, and I subtract a line from the top times 1 minus the line from the top to get these sample draws. So it’s a generative model and we can simulate it. Now the question, I mean, maybe to understand why we did all of that, let’s look a bit deeper at the solution and how that relates to the whole inference stuff. I can plot the solution in there too. We don’t really see much because I’m zoomed out a lot. So if I zoom in at the solution, then we see, well, the standard logistic ODE solution that we know from all the slides, this is the correct derivative. We can also compute it in closed form for this ODE, that’s why we look at it. And then Z of T, like if I take this black line minus the black line on the top times 1 minus the black line on the top, then I get exactly 0. By construction, because Z encodes the fact that the X satisfies the ODE. So it has to be 0. But maybe seeing this again, like explains why we have a noiseless measurement model. This line doesn’t wiggle around the 0 or anything. It’s actually 0 if you give it the true solution. And now, like looking ahead, so this is our generative model. And now looking ahead of what the inference process will do, the data lies, well, in this plot, not in the other ones, it’s in the space where the Z is lie. And it’s kind of a discretized version of the Z. And then the thing we will do next, and in the code N in the homework, is basically going from the generative model to a posterior, which basically says we want this thing to collapse around the V. It’s not perfect, it’s approximate inference. And by doing that, we also get estimates for the X, and we see that they relate to the solution. So I hope that this clarifies why we go through the trouble of setting up this generative model in this way, because it allows us to go from here to here. And the thing we get out is kind of an ODE solution. And that’s what we will do in a bit more detail on the next few slides. But basically, so there were two parts. We defined a prior in a specific way, which is the Q times integrated Vina process prior. It’s a class of prior in some sense, because there’s a Q that we can choose. But I didn’t give you the most general thing that we could work with. But this is a very general way to do these things. And we have the likelihood model that encodes the differential equation. So this is the object that relates the prior to the thing we want to have in the end. This likelihood model and data combination. And I mean the data, they are just zeros, right? But in order to really get to this language that we introduced two weeks ago, we have to define all these quantities so that we can do inference. So that is actually the main part. So we wrote down a state space model and we can now just solve it. Yeah, there will be so many of these things in this lecture. So let’s look at it again. So that’s like the whole thing. I think I didn’t talk about the initial distribution, but you can just use a zero mean standard covariance Gaussian. We have a dynamics model that I defined. We have a likelihood model or information model that encodes the ODE so that depends on F. And we have the data. There’s one thing I didn’t talk about and that we still need to do, or to add in order to actually solve the problem that we’re given. Does anyone see what’s missing? So the only information that we include right now is the ordinary differential equation. But there are many solutions to an ordinary differential equation. So there’s one additional piece of information. Yes? Exactly. We’re not using the initial value. Thanks. So precisely, that’s it. The initial value is missing. And so the question is how do we include this initial value? I skipped it earlier, basically, because I think this is the more tricky and more interesting part. How can we include the initial value? Yes? Maybe we can include the initial value? We may be setting mu to the initial value. That does actually work and is a sensible thing to do. There’s a more general way to talking about it. And it’s really, it results in the same because it’s a noiseless observation model. But this really drives the point, like motivates once more, that observation models or information models maybe even, the objects that allows us to define the things we want. And we will do that a lot. So the way we can include the initial values by just defining another measurement that depends only on x0. So on x, the full stack of all the derivatives that we track, it actually evaluates only the zero derivative. And the data point we have is the initial value itself. And because it’s noiseless, it’s kind of setting it to the correct value, only in a formulae a bit more precise way. So yeah, it was super easy to give a model, just add one more thing we want. All right. Okay, so on the next slide, I will write down some pseudo code of the extended command of the filter. There are building blocks that we need. So for completeness, I define this Kalman filter prediction and the EKF update. It’s the same equations as we had in the second actual slide, I think, only that it’s pseudo code now instead of equations. We will also use a Kalman filter update, but it’s the same only that, like, computing the Jacobian is through you all. So these are the building blocks that we need in order to implement the extended command of the filter. And then the algorithm itself is given, like, by this. Let’s go through it line by line. So it’s a procedure or a function that takes in basically the states by model that we wrote down before, which has different blocks, which is, well, the initial distribution, mu zero minus and sigma zero minus. The transition model A and Q, the vector field and the initial value, which implicitly defines an observation model, and then time discretization. And then the algorithm, well, first does this initial value, because while we start at time zero, and then we have this, we need to satisfy the initial value first, which is a Kalman filter update, so that maybe that explains also why we have mu zero minus and sigma zero minus, because, well, there is an update that we do first. This E zero is the linear observation, because I defined E zero to take the state x and map it to a zero of entry. This zero here is the observation noise, which is zero. And then we have the data point x zero. So again, that matches the notation here on the right. If you go through the slides again at home, then I think these things should always match the things that you see here. And then inside the filtering loop, this is just the standard filtering loop, because there’s a prediction and an update happening. The prediction uses the transition model and I mean, A and Q depend on the step size the way I defined them. So in the line before, we need to compute the step size. And then the EKF update does not have observation noise, and the data point is zero. And then I had to define this measurement function in order to kind of hide the T, because otherwise, it doesn’t formally match what I did on the slides before. And then we just return the whole thing. It also, I think I hope that from this description, just doing the extended common order is smoother, it’s kind of straightforward. We take all of that and we smooth. And then instead of returning the filtering distribution, we return the smoothing distribution. All right. I will show a lot of code today to really prove that these 10 lines with the building blocks and setting up the model correctly is enough in order to solve an ODE, because sometimes the formulas seem more complicated than it actually is. So I will go and show some code. This is Julia again. I think you probably, while we’ve seen a little bit of it last week, but a bit more two weeks ago, I know that most of you are probably more familiar with Python, but I think the notation should be simple enough today that it’s easy to follow. And so that was actually my motivation, because if you look at this Kalman filter predict, for example, on the top, is it big enough for everyone, by the way? No, okay. A command plus. Yeah. Okay, then I have to scroll a bit more, but it should be better, maybe. All right, so this Kalman filter predict that I wrote down here. I mean, it’s two lines in the return statement and it takes four arguments and like, it’s really, it’s exactly the same as in the slides. And this will also be the same in the code mostly. So I have these building blocks, and maybe, so you don’t, like, I don’t hide stuff, there’s some imports above. But yeah. And then I have these building blocks that I need to define first, not that to build the extended Kalman-Laudier filter. There are these Kalman filter prediction, Kalman filter update, extended Kalman filter update. Maybe if you come from Python, then this is one of the things that might be a little bit surprising at first, but it’s just a transpose. And this divided by S, I think you’ve also seen two weeks ago at multiplying with the inverse, but in an numerically more stable way because it’s also linear system. And I use a package to do Jacobians. Yeah. But so these are the building blocks that we have. And then the actual algorithm that will, like, that will be run in this notebook is this extended Kalman-Laudier filter. So it has the same name as in the slides. It takes in the initial value, initial distribution, the transition model, the initial value problem basically. So the vector field f and initial value x0, and a time discretization, I also pass in the e0 and e1 because from the inside it’s like not really, like it’s, you cannot figure them out from the inside of the function side just passed in them. And then the algorithm itself consists of, well, mostly the blocks that we saw in the slides, which is one update on the initial value here, so on x0 basically. And then inside the loop, there are these four lines that we’ve seen in the slides. So computing a step size, doing a prediction, defining measurement model, doing an update, and then just returning the whole thing. So this is, I think, closely matched the pseudo code that we’ve seen before. And so we will solve a logistic ODE, again, yeah, you’ve seen it before. So I define a vector field, I define the initial value, time span, and the true solution looks like this. Then in order to run the extended Kalman-Laudier filter, we need to define a prior. We did that on the slides already, and the code is actually the one from the slides, so it’s a two times integrated Vina process. And if you look at the slides, there was exactly, like this was a of h, and this is q of h. Once again, maybe this a of h is unfamiliar, it’s a short-hand notation for a function in Julia. So a is a thing that you can evaluate at this step size, and it returns a matrix. And then I define these E0, E1, E2, which just, well, are a one times three matrix that map to either the zero-th derivative, which is the first entry of the vector, the second or the third entry of the vector, or the first and second derivative respectively. So if I take E0 times x, then I, instead of having the three-dimensional object, I only have a one-dimensional object, which corresponds to the zero-th derivative of the thing we want. The initial distribution, I chose arbitrarily here, it’s just zero mean, and standard, like one covariance, and then I choose a discrete time grid also with some step size. And if we run the whole thing, so there’s a bit of code hidden sometimes. Then, so maybe you’re more familiar with Jupiter, in Pluto you have the output above the cell. But this is the code that is run in order to get this thing. The line that I need to show is only this one, because this says, well, we call the function I just defined with the objects I just defined, in order to compute filter estimates, and then below that there’s some annoying plotting code in order to get the thing on the top. But if I hide this again, then you see that, well, the, first of all, we get three plots, because we solved the ODE, but we are state describes as zero of the first and the second derivative. So we get an estimate for xx dot and x dot dot. And second of all, there’s a, well, there’s a dashed line, it’s hidden behind the blue points, because the thing that we output is actually an okay ODE solution. So we get a line that also is roughly at the location where it should be. So it’s pretty cool. I mean, we, we learned about extended current filters two weeks ago, and now we used it to solve an ODE, where, where we did, we, we did the whole lecture on how to solve ODE’s, but we could basically just use the algorithm that we learned about two weeks ago, to get something that for now looks roughly the same. So that’s pretty nice. I have a block here playing around with step sizes, and that is exactly what we’re going to do, not just to showcase Julia’s slide-off, but there’s a, like, we will learn from it. But basically, we can change the step size and make it smaller. So down here, there’s code again that is hidden, which basically just calls the function again. And you see that if we make the step size smaller, then at least in the third subplot, the uncertainties get smaller. And if we make the step size larger, then at some point we will start seeing the uncertainties in the first plot too. Yes, here, for example. So we really see that we get uncertainties. Maybe one thing about these uncertainties also. So you had, part of the homework on classic ODE service was about local error estimates. And maybe one thing that might be surprising is that when you look at this, the uncertainty doesn’t just grow at time, but maybe you actually expected that, well, I mean, if I make an error now and make an error again and so on, then it should just grow forever. And in some sense, like, worst case for any ODE, that is true. Like, if you want to know the error of a solver on any problem in the worst case, then you can prove that they can get larger and larger. But this is not a worst case error. This is an error estimate for this specific problem. And the logistic ODE, if you remember these errors, basically, as long as you started the number larger than zero, you will get pushed to the one eventually. So even if you have a large error at some point, as long as you get to the one eventually, then your actual error does decrease again. And so if the actual error decreases, you also want your error estimate to this crease again. And it does. Like, it shrinks again on the right-hand side in the top plot, also here in the derivative, because the derivative tends to zero. And so, yeah, it is an actual error estimate. It’s not meant to be interpreted as a worst case bound. It tries to model the error that is happening. There’s, I think, one other thing here. So at some point, the solution gets worse. And just to showcase that smoothing generally does improve the result, which also kind of makes sense, because, I mean, if we look at the line, we want to interpret it as the best estimate that we can have. Whereas the line I show right now, without the smoothing, means that this point in the middle is only informed from the left-hand side. And if we don’t have to do this, then why would we only inform it on the left-hand side? So if we look at tragic stories, in general, it always makes sense to smoot. And that matches also the motivation from two lectures ago. And filtering is nice for online stuff or like trying to reason about a point life. But when we have tragic stories, then we can get further improvements by doing the smoothing. All right. So the error estimates are kind of cool because of their structure. There’s something that’s not great about them. And maybe it’s, I don’t know. So can someone tell me what is not nice about these error estimates? Or describe how they could be made even better, maybe even in this specific plot here? So the blue shaded area is supposed to be an estimate for the error, that the numerical error that the solver does. And the numerical error here is, well, the mismatch between the blue line and the black-dash line. So the area is like this, but the actual error is like this. So basically, the error estimate is way too large, in some sense. It would be a much better error estimate if it barely covered the black-dash line. And that’s like coming back to the whole motivation and what we’re doing is part of the motivation is that we want to have error estimates. So we want to have an answer from our algorithm that gives an estimate of where the solution lies. And I mean, the estimate is fine, right? The black line is covered, but it is kind of underconfident. So we would like it to be more confident. Like the blue area could be smaller, and we would still be, like it would still cover the true solution. So yeah, that’s the answer to this thing here. The answer to this is, are too large. They don’t just seem too large. They are actually too large. And that is the thing we will have to fix now, and that we didn’t talk about yet. But so this whole topic of making uncertainties more meaningful is also known as uncertainty calibration. And we will see that we will have to do that in order to get actual uncertainty estimates that mean something. And so for doing uncertainty calibration, there’s a very simple reason on why the uncertainty is where the way they are and why they are too large. And that is that there was one part in the prior that was kind of hidden on the way. And again, one of these things I didn’t talk about so that we can talk about it now. And that is, there’s a prior hyperparameter that we did not set. And I defined the prior first. So the prior of course depends on Q. And that Q needs to be specified. And that is also a question by itself how to choose Q. But there’s one other hyperparameter that again was kind of hidden on the way that directly influences the uncertainties. And that is the sigma squared. There was a, like on the first slide where I introduced the prior, I said there’s a Taylor approximation plus noise and there was a sigma in front of the noise. And that really is a hyperparameter. You can think of it as a GP hyperparameter. Like when you do GPs with RBF kernel, then there are typically like, there’s a length scale and an output scale. That’s basically an output scale. It makes the thing wider or less wide. And because it directly influences how much noise is added at each step. So making this thing a thousand just makes things a thousand times more noisy and making it a 0.00 something just makes things way more precise. So like it’s clear that sigma directly influences the covariances. And we didn’t set it. It’s a hyperparameter. And the question is basically how should we set it so that the posterior gets more informative? And on a high level the motivation is exactly the same as on GPs. That’s why I mentioned them because in Gaussian processes you have the same problem. You have prior hyperparameters that you need to estimate. And in the first three lectures I think you also talked about that. You estimate them by maximizing the likelihood. And there’s a way to compute this likelihood and then well you optimize it by, while calling an optimizer computing gradient somewhere. In our specific context, the, while this formula is not specific to our context but it’s convenient because of the context. Maximizing the likelihood means maximizing the like p of data given the parameter which data was this collection of these. So it’s a bunch of zeros but they have specific numbers to them and they correspond to random variable. So we have z1 to n. And we can decompose this probability here by, well, writing it as z1 given, zigma times zk given, well, other data points before. This is really like if you multiply them up again then you end up with all the z’s on the left side of the bar again. So this is a reformulation. And that is convenient because we have a way to expressing this thing on the right. That is why like when I recalled the quantities that the extended common filter are compute, I also mentioned that we have this estimate of what the next measurement should be given all the current estimates because we need that in order to compute this marginal likelihood which we want to maximize. So we, the extended common filter actually does Gaussian estimates of these. So this p of zk given z1 until k minus 1 is a Gaussian where these things pop out of the updated question. And that allows us to rewrite this thing as a, well, so called quasi maximum likelihood estimate because there’s an approximation happening here. So basically we have an approximation of this, yeah, there should, this should be a approximately a question because, well, this is not correct because of this approximation happening. But we will compute zigma hat as the arc max of this thing on the right because this is tractable whereas this quantity for a nonlinear model is not like, it’s not trivial to get that. And so, so this is really a general way of how to do parameter estimates in extended common filters. In the, like on the next bullet point we will talk about this specific parameter as zigma that is exactly in this location but conceptually, like the, the standard approach part is really just a recap of what you do in GPR regression. This thing here explains to you how to do parameter estimation in extended common filtering in general. So this is the quantity you want. In code, this is the thing you compute, you, you use autodif with and you optimize and so on, like that’s, that’s a reasonable thing to do. In our specific context, like in ODE filters and so on, we can be a bit smarter than that. Instead of computing this quantity, taking a gradient step and so on, we can actually solve for zigma hat in closed form with a bit of derivation. Half of the derivation will be hard of the homework. But you can, well, in the specific model, if you write down the model and write down the inference algorithm that we use, so write down exactly the state space model as we add on the slide and use an extended common filter, then we are able to show that there is, like we can compute the maximum likelihood estimate in closed form by evaluating these things. So z hat is still the same quantity as before and as also, so these two come out of the updated questions, the specific z are zeros. So we can also just ignore them. And so not only can we compute that, but we don’t even have to re-compute the solution again with the updated parameter. And that is basically part of the homework in the homework you will show that zigma directly scales to covariances. So like setting it in hindsight is equivalent to just re-scaling the covariances. And that’s part of the homework. And because of this equivalent, what we can do is we can take our estimates that we had, these bad estimates from here, and we just overwrite them in a rescaled way, and then we get better estimates. Or we get the estimates that correspond to the maximum likelihood estimate zigma. And well, we will visualize that and show that in code is also again conceptually quite simple to implement. We have these, so this is exactly as in the slides. I don’t know why the resolution is a bit off. Yeah, okay. This then get better. Well, so these are the equations from the slides. I just wrote them down again so that you have them for the code. I defined a new update function with this suffix cal for calibration. It’s exactly the same one as before. Only that in this line, I also compute this increment. Because, well, inside of the sum, we can compute this quantity, which I call an increment, because it increases the sum at each time. We can compute that in the update step and just return it. So, otherwise, it’s a copy paste from before. And then, I also just copy pasted the extended Kalman-Ode filter from before and just changed two lines, which is this one where I take this increment and I increase the zigma head, because we want to compute the sum at the end, so I do it in a running way. And I do this divided by n also to, well, because this is the quantity we want to compute. And d is the dimension of the ODE and because it’s a logistic ODE, it’s one. And then at the end, there’s this rescaling happening. And then, yeah, so this is basically with the dot we overwrite the covariances. And that is also what I had here above. So we take the zikmas that we computed and we just rescale them with the zikma head that we just computed. Apart from that, it’s just copy pasted code from before. And then we can do the same thing. We can play around those step sizes. And maybe one thing that you see directly is that down here, we always had some uncertainty before, if we scroll up. Even for small steps, even though the error is actually super small, now we cannot see the uncertainty anymore. So that visually is already an upgrade. And then like looking at the other end, we see that, well, now for these step sizes here, it covers barely the true solution. So it’s much less overconfidence. The posterior is much more informative and much closer to the actual solution, more meaningful. And then if we make the steps too large, well, maybe this is kind of a failure case, but it provides one piece of information, which is uncertainty estimate. So we are still doing approximate inference. Formally when you run an extended current filter, you don’t, like, it’s still trying to approximate something as Gaussian that might not be. So that is one reason and why it is this way. And even just, like, generally, conceptually and philosophically, getting uncertainty estimates that are always perfect all the time is an impossible thing because if you had them, you could just improve the solution. So in some sense, you only, like, the goal is to get uncertainty estimates that are as informative as possible. And well, in many cases that we see here, there’s a meaningful relation between, in particular, the structure of the uncertainty estimate because it gets smaller again and the actual error. In these plots, we don’t really see anything. So one other plot to look at is basically from the top plot, I just plot the error. So instead of looking at the solution and another line, which kind of lie on top of each other, I look at the difference between the lines. And so the error estimates on average is zero, of course. So this top plot is the actual error and the bottom plot is the thing in log scale. And the mean of the error estimate is zero. And then there’s a, where’s the, there’s a dashed line somewhere that we can also see. So for these solutions above, where we see uncertainty, there’s a dashed line. This is the actual error that we are making. So we’re kind of above the true solution and it’s still covered by our error estimate. When we make things smaller again, then yeah, in this plot, we can still not see anything. But if we look at things in log scale, then we see that we are like an order of magnitude of or something. But structurally, it does the correct thing. And well, it’s underconfident, which then like typically is a direction that you prefer to tend towards, if you have to decide between over and underconfident. All right. And like for all the step sizes that we can play around with here, we do get meaningful error estimates that have a similar structure to the thing that they should have. All right. And so with the stuff I defined before and the error, like the uncertainty calibration from this slide here, which also led to code, we do get an algorithm that provides error estimates that really depend on the ODE and its discretization. And meaningfully, describe the error, like the numerical error that we are actually making. So that is kind of the thing I motivated in the beginning of the lecture and the thing we went for. So now we kind of did it in some sense. We have a numerical ODE server. It does provide error estimates. They are useful. They’re not always perfect. The algorithm is fast. We kind of checked all the boxes. But maybe that’s a good point to ask, like, why are we doing all of this? So why are we doing ODE in a probabilistic way? And why don’t we just call it on the quitter? And there are kind of two aspects that come to mind and that influence our research on the group also. That’s this part, basically the one that we talked about so far, which is just fundamentally, we believe it makes sense to have uncertainty quantification whenever you can and to try to quantify errors that you make and to be probabilistic about things. And we’ve seen that we can get uncertainty estimates that are meaningful. We get plots that look nicer than they did before. We get these uncertainties around. We can draw samples from the whole thing. It’s kind of nice to be probabilistic about all of this. In the rest of the lecture, I will talk about a different aspect that is not motivated by the error bars or anything, but by the methods that we use. And basically, I believe it’s actually a convenient thing to use ODE servers that are formulated in a probabilistic way and to go through all the hassle of defining the states based model and so on. Because it’s quite flexible. We were able to, in one question, answer basically decide that we also want to include an initial value. And in the rest of the lecture, we will just include other stuff that we would like to have and see how that leads to algorithms that we can run. Computing things that we didn’t even talk about yet with essentially the same algorithm. And there will be two examples or two main ways where we’ll talk about this. And well, yeah, so leave the space of solving ODE’s a bit and look at all the things that we can solve now that we know how to formulate things in such a way that an extended common filter can solve them. And yeah, so this, this first example is really motivated to looking at other problems that are not quite ODE’s the way we looked at, but similar enough that we can maybe figure out how to solve them. This is the stuff we did so far. So we had an numerically problem, which is an initial value problem, and we defined a state space model, and then we just ran an extended common filter with fairly generic code. What if we actually have this problem? It’s a second order ODE. It does appear actually quite often. When you talk to physicists, they are often able to describe the acceleration of something, because I don’t know the something is a planet and it moves around the sun. And so you have physics that you write down and you get a description for an acceleration, but not for the velocity. So if we have this problem, how could we adjust the state space model to compute the solution to this problem? So anyone has, have a suggestion of what we could do. Yes? To include this x dot, if you mean? Yes, exactly. So that would include this additional piece of information if we make one that looks similar to this. I will not show the next part yet, because there’s still one more thing we want to adjust. Yeah, again? So you mean adjusting the ODE like LioTier? Yes, exactly. So if you recall the derivation for this line here, we started from an ODE, then we replaced the lower case axis with the upper case axis with the correct superscript, and we ended up with this, where this is basically the left-hand side from the equation. That was the right-hand side from the equation. And if instead, in the beginning of the lecture, I would have started talking about this, I would have done, like I would have set every, like all the words would have been the same, only that the thing I would have written down is this one. So we start from the top, and instead of saying it should be inequality, I put all the things to one side and said that their difference should be zero. That’s why there’s a minus. And instead of talking about the lower case axis, I use the prior process that we have here, because the goal is to write down a stats-based model. And so the second ODE from the top becomes a LioTier model, or an information model, that is defined as x2, no? Minus f of x1, x0, tia. The motivation for this quantity is exactly the same as before. Instead of saying, oh, we also need to add the initial value, I would have said we need to add the initial value and the initial derivative. So I just defined this thing to make sure that x1 is actually equal to the initial derivative that we are given. And this is again something we can work with. Like we can just run an extended comment filter with some predict and update steps. And to prove that, I will show code again. Once again, a lot of copy paste from stuff we did before, because not much changed. So in this block, we saw the second order ODE, a specific one, so-called harmonic oscillator. But basically, there are only two lines that are changed from before. Exactly two, which is, we also do an update on the first derivative. If you compare this line and that line, then the only difference is that here the measurement operator is zero, and here is E1, because first we update on x0, and then we update on dx0, just the initial derivative. And then I changed the observation model, but I didn’t even have to change the EKF update call, because the EKF update was defined in a general way anyway. I got this generic update building block that you could use to solve all the exercises from two weeks ago. But I did, however, redefine the measurements model as the difference between E2 times x, minus f of E1 times x, comma E0x, TSI, to match the thing I wrote here in the slides, like x2, x1, x0. And there’s calibration happening again, because we always want to do that to get meaning for results. And now we can look at the result of all of this. There’s code hidden again, but it basically just defines the harmonic oscillator and then calls the extended Kalman or the E filter to. Please don’t write actual software like this, but it’s nice to be able to show that this is actually the only change we needed to do to the function, like copy-pasting stuff. And well, again, we get a probabilistic output of our server that provides an estimate of what the solution to this ODE should be. And if we take a very car script, then again, we make a noticeable error, but the error estimate is kind of in a similar scale. This is also quite nice to see, because the error in this case doesn’t just go down to zero, because there’s not like this attractor that we tend towards anyways. So the error really adjusts the problem setting, like it really depends. Yes, one question? So if we can write this up for the loss, we don’t have to wait. Yes. So it’s completely in the first line of the model? Exactly. Which is conceptually, so very good comment, maybe to repeat it. So the point was that for the velocity, so for x dot, we don’t have an equation. And so it’s just inferred by the model. And yes, exactly. Kind of in the same way as for x itself, we also didn’t have an equation. But we have this measurement model, our information operator, that takes the whole stack and maps it to something we know. Which is, well, formally speaking, it’s a zero. But I mean, the important, like the interesting thing is the definition of this information operator. And now in the second order, ODE, the information operator itself has access to x0, x1, and x2. And so we learn about the whole state in the process. Exactly. And so therefore, the output that we get actually describes x, x dot, and x dot dot. Where, by the way, like, we also had an estimate for the second derivative when we saw the logistic ODE, and that doesn’t even appear anywhere, but because our prior says that the second derivative is the derivative of the first derivative, which sounds obvious, but that’s actually encoded on the prior model. Therefore, we also had an estimate for that earlier. All right. We can basically just solve other things now, with similar motivations. I don’t know if you ever heard about them, but there are differential algebraic equations. They can be written as done here in the middle and green. So it looks like an ODE. We have this M on the left. But we are not able to move the M from the left to the right because it’s singular. So we cannot just multiply it with its inverse. You could, for example, imagine a diagonal matrix that just becomes 0 at some point. So there are some equations that don’t actually describe a derivative. They are algebraic equations. We have what’s called differential algebraic equations, and I think the standard example has always come from chemistry. So there are some chemists to scare about them. But for this example, suggestions on which part of the state-space model we can adjust to get an algorithm that we could just run? Yes? The likelihood. Exactly. Instead of calling it ODE likelihood, we could call it DAE likelihood. And we can encode this equation in here, run it, and we get something that solves the DAE. And this is the one we don’t have a code example for. But we did that on the paper, and it also works. So we can actually solve the AEs in a probabilistic way with the same algorithm, even though I never talked about you of what on DAE is. If you really care about the AEs, probably still make sense to have a full lecture on them and do all the theory stuff. But conceptually, from an algorithmic perspective, it’s very straightforward on how to solve a DAE right now. One more example. Sometimes you have an ODE, but you have also something on top. Maybe the motivation here could be a planet again that moves around a star, where you can write down a differential equation that describes the acceleration of something. But in addition, you also, because when you write that down, you did a physics course before, and you learned that energy has to be conserved. So you can write down a bunch of equations that give you a quantity that needs to be constant, or something that needs to be zero if you just put the constant in the equation also. And so you have an ODE, but you have some additional information. The whole thing is kind of over determined by now, like formally, because you don’t actually need that equation. It follows from the rest. That’s what you learned in the physics class, probably. But it is additional information that we can use and can encode in our model, and also outside of probabilistic numerics. There are algorithms, like people in classic numerical analysis, sit down and define solvers that try to conserve quantities of a specific kind. That’s how research feels. You can find book about them. But from our probabilistic inference perspective, it’s just another piece of information. And so any suggestions on how to include just another piece of information in a well-discretized way inside of the state-space model? Yes? Maybe we could have another observation bubble. Yes, exactly. Thank you. So we add another observation bubble. And while algorithmically, instead of doing one update step, we do two update steps. But putting the inference algorithm aside, like this is now the model that would be motivated by the equations above. So basically, there should be TIs in here, of course. And under discrete grids, we also want to satisfy a G, where again, formally, we have this Europe’s observations that we condition on. And algorithmically, it’s two observation models, so we just do two updates instead. All right, I did this already. There’s a video that maybe motivates this whole concept quantity stuff. This is like, it’s not a planet. So I think it’s a star that moves around the center of a galaxy. And the whole thing is projected to a 2D plane, because that’s how we see it when we look at it. On the left, I have a very accurate solution of how it should look. In the middle and right, I have two extended common filter approaches, where the middle one only has access to the ODE information. The right one has access to the energy and ODE. And they have the same grid size. And so if we run the whole thing, then it looks pretty nice. They’re basically the same. I think if we go back a bit more, then they are more closely the same. Up to here, it’s kind of similar here, maybe. But then if we continue to run the thing, then we see that the middle looks noticeably a bit more different. And also, now it kind of starts to do weird stuff that doesn’t, at least visually, it’s not as appealing as the stuff on left and right. So it seems to us have lost some structure that the other ones have. And that is really what’s happening. Like we are the right hand one. If we compute the energy at each point on time, then the right hand has much more constant energy than the left. The left just leaves the constant energy plane and then does something that we cannot interpret any more from a physical perspective. And instead of showing a video, like maybe that’s also, like if you let the video run for a long time, and in this case, I had other set sizes. But basically, the trajectory you end up with for the true system is kind of the thing on the left, which visually looks quite appealing with a lot of structure and so on. And then the one on the right is not the same, right? The lines are probably wrong. There’s a numerical error. But it keeps the structure that you have much better than the thing here. And if we let things run for long enough, then this here becomes quite degenerate. And it shouldn’t even be able to go here to this middle, right? There should be an empty triangle. So even though mathematically, the thing, like you don’t need to add this conserved quantity, right? Because the ODE fully specifies it already. We’re not able to solve the ODE anyways. So putting in more information actually helps us to get the result that is more meaningful. And there are classic methods that do similar things. But that is kind of the flexibility and convenience point from before. We didn’t have to learn a new algorithm. We just had to think about a problem, formulate the state space model. And then we used the tool that we know already from before, which is one of the standard tools for doing state estimation. All right. There’s one more example. It’s not a new one. You know it before. We have all the tools to solve this problem. And that’s kind of a combination of ODE. So the stuff we talked about in today’s lecture and last lecture. And GP regression are, well, external data. And the examples of course, the COVID data, where we have a scatter plot of the number of infected people in Germany from, well, beginning 2020 to July 2021. And we’ve looked at this last lecture already in the context of ODE’s and parameter inference. Like in general, when I, today we used ODE’s where I only write X and X was always one dimensional. But we can extend the things we did today to higher dimensional ODE’s. In fact, in the homework sheet, we will do a two dimensional ODE. Basically means for each dimension you do things individually. But you will see that in the homework and in the tutorial next week. But so for this epidemic model, of course, we had this SIRD model, where S, the number of, like the susceptible population, are the infected. So the kind of the scatter plot here are the recovered people and the are deceased. And we already motivated that if I just have a scalar B, then the result will look way different. And it’s not a good model for the reality. So one way to improve that model in order to be able to recover this plot would be to make better of T a time varying thing. And last lecture, I showed you a result with neural network, which was okay, you know, maybe. And today we will make beta a Gaussian process, at least that’s what I said last week. More precisely, we make it a Gauss-Marco process, because this is the language that we know to handle in this context of patient state estimation. And like we didn’t really talk about why this is actually a Gaussian process and this continuous and so on. But I guess the intuition is we can specify the stepsize, right? So we can, if you can discretize something at will, might be a continuous object. And so this beta of T, we can write down the same way as we wrote down our state X with a transition model. So it’s a linear model. There’s an A subscript beta now, because it might not be the same. It depends on the stepsize H. And it describes the transition from beta of T to beta of T plus H. And we also have an additional model for the data that explains how this scatterplot depends on the other E solution. And it’s basically Y given X is a Gaussian with some matrix. Maybe this should rather be YI given. And then you have to stack S-I-R-D. And that is, well, only the IS with some Gaussian. Yeah, this is a, it’s one dimensional, so we don’t need a covariance matrix. Exactly, because I mean here, yeah, I didn’t plug in the actual ODE, but X is this, for the matter of the ODE, so it’s this for-dimensional object, S-I-R-D. And we measure only the IS in this plot or in the code we can actually measure more of them. But in general, we have this linear measurement operator. And well, we do the same thing as we did for the last bunch of slides, which is write down a state space model, and then do entrance. And you see these things get a bit more complicated and larger and harder to parse visually, which is why I made these three blocks. So there’s a solving the ODE part where, well, we also assume that we know the initial value. There’s a data part, and then there’s this beta that come in addition to the ODE solution itself. And so the first part is the prior for the ODE, where the X, well, is S-I-R-D. So the capital X is supposed to be a model for the ODE solution, so for S-I-R-D, and we track a bunch of derivatives with the same motivation as before. We have a model for beta, where again, like, it’s the same language as we have above. It’s only that. So this is a Gauss-Marc of process. It might also be a Q-temp integrated linear process prior. We can choose it that way if we want to, with some initial mean and covariance. And now we have a bunch of likelihood models, but two of them we already know before. This is the ODE likelihood. Only that it not only depends on X, but also on beta, because the vector field depends on beta. We have the initial value likelihood, which says that X at time zero should have the specific value. So probably, like, well, the initial count of how many infected people and so on there are. And then we have the data likelihood, which is the thing I wrote down on the slide before, that relates this ODE solution to the scatter plot. And actually, we’ve seen this in the last lecture also in the primates and France context. We said that we measure the solution with some linear matrix H, because maybe we don’t measure all the dimensions or something. And we also said it’s a Gauss-Marc model. So we have two Dirac-like yields and we have this Gauss-Marc-like yield. We can look at it in a bit more visual way. I think we’ve seen a bunch of these in the very first and lecture and two lectures ago. It’s this stuff here, but in circles and arrows. So we have a state that develops over time that is a prior. We have a latent force, which is also like where we also have a prior for it, a Gauss-Marc of prior. Then we have this ODE that depends on both the state and the latent force, because X is an input and beta as an input. And then we have observations only of X. We’re not able to observe the beta. And the inference task that we will use an extended common filter for is basically given the black dots, what are the white dots? That is what an extended common filter does. Only that in the settings we had before, the first row doesn’t exist and the last row doesn’t exist. But in some sense, it’s the same inference I grew up in the last before. In this conserved quantity itself, we already talked about even having two observation models. So we do the same thing. Yes, there’s a question in the back. The question is, if it would also be possible to put beta in the state, very good question, you’re taking away my next slide because the next, I click somewhere, oops. The next slide was basically about how do we implement that? And maybe the one reformulation that we can do to just make this simpler and more close to the thing that we already did before is exactly that. The X and beta kind of look the same. Z depends on X and beta, so might as well merge them. And so what we can do, I introduce X tilde. It’s a stack. Well, in X tilde there’s X and there’s beta, which means that mu zero tilde and Z sigma zero tilde are, well, the stacked version of the means and a block diagonal of the covariance is same for A and Q. But we have, so this X tilde now contains all the things we want to infer. And the measurement model for the ODE now only depends on X tilde, same as the initial value, same as the observations. And to make things similarly visual, I really find E zero, E one and E beta as the linear operations that access the correct entry, right? Because we will want access to the zero-th derivative that we need here in the vector field. We need the first derivative here and we need beta here. So getting them out of X tilde is just a linear operation. Yes, there’s another question. A tilde and Q tilde? A tilde and Q tilde? Yeah, they should be block diagonal. Yes, exactly. I’m very good question. So same like all the, same as the covariance, which is a block diagonal, these A and Q’s are also block diagonal. But maybe it’s good to mention that the filtering distributions we get out are not block diagonal, right? X and beta, they are correlated. There will be covariance after having done the inference. In the prior, they are not. All right. Okay. So this is the thing that you have to write down in order to solve this COVID latent force inference problem in the way that we, well, I’ve seen already in a bunch of lectures now. And the inference algorithm is like really, it’s an extended current filter and smoother. If you look at the paper by, well, Jonathan and then also on Niko Kramer and Philip Henning, then they defined this in the correct way, in a smart way that shows the, well, discretization of everything. But algorithmically, it’s an extended common filter and smoother. So the motivation is still we define a state-based model. We know how to do state-based models because we know how to do extended common filtering and smoothing. And then we end up with this thing, which gives us a posterior distribution of the same quality as an extended common smoother gives always. That, well, we have a posterior both for I because it’s a part of our, well, capital X state. And also for the beta jointly from all of this data where, well, one part of it is the scatter plot and the other part of the data was the OD itself in one joint inference algorithm. And maybe it’s also nice to mention that. So we only talked about inference algorithms, well, the extended common smooth, smoother goes forward, backward, once, and then gives a posterior. And so to get this plot, we had to get, do the forward backward pass once, which, well, comparison to what I’ve shown you last week with this video that goes on for 1000 iterations is pretty nice. All right. So that’s basically the content for today’s lecture. We looked at ODE’s from a probabilistic perspective. And I argued that solving an ODE is actually solving a state estimation problem only that last week we didn’t talk about it as a state estimation problem. But fundamentally, we try to estimate an unknown quantity and there will be an error. So we want to be probabilistic about things. And we have the tools to do state estimation in an invasion way, in an efficient way, even. And so today we defined a way to solve ODE’s with extended common filtering and smoothing, which leads to this whole class of solvers that are sometimes called ODE filters. And then I, well, showcased that there is this motivation of like, we want uncertainty estimates and we want to be Bayesian about things. And the uncertainty estimates that we get are useful. But there’s this whole different aspect that maybe sometimes doesn’t get talked about so much. But it’s also just sometimes very convenient to be Bayesian about things. And in this particular example, we’ve seen that by having this general inference algorithm we can write down a generative model for a whole bunch of problems. Like, you know how to solve a DAE, you know, or at least like, impractals with code, even though you never heard about it before, probably. And that is only because you know how to use an extended common filter and how to implement it. Maybe sometimes in a slightly non-standard way because you have two measurements in the beginning or in the middle or something. But yeah, we know how to do that. And then that not only allows us to do to solve numerical problems, but also to do inference from data in a very efficient way that combines well physics and general external observations. So one summary of maybe this three lecture block that we had that ends with this lecture today is that in our minds, Bayesian filtering and smoothing is really the way that we can model or formulate dynamical systems and model them in a way that is like flexible because we can just add information if we want to. And well, kind of factor out the inference algorithm as something that we can take care of afterwards. So I thank you all for joining lecture. Please take out your phone right now and do the feedback, like right now. I want to see people scan through our codes. And then yeah, thanks for lecture. I’m around if you want to ask questions and then see you next week for partial differential questions.