Numerics of ML 8 – Partial Differential Equations – Marvin Pförtner

Hi everyone, hello again. Today is the last of the simulation lectures, and we'll talk about partial differential equations and, once again, Gaussian process inference, surprise, surprise. First, a short outlook of what we're going to cover today. I'll clarify what a PDE actually is, why these equations are important, and what they are about, essentially. The main part of the lecture will show you how to integrate PDE-based models into machine learning models, specifically probabilistic machine learning models. And we'll do all of that by working through a practical modeling example, just to make things a little more visual.

First of all, PDEs are used as the language of mechanistic knowledge. What do I mean by that? A mechanistic model, as the term mechanism suggests, describes what is going on, for example in the real world, but also in financial markets and similar situations, by describing the mechanism that generated the data. Think of something like "opposite charges attract, equal charges repel". You don't directly know how two protons will interact, for example, but you know that they will repel, and you know the strength of that force. That is what I mean by a mechanism: you don't know the trajectories of the protons, but you know how to get to the trajectories from that mechanistic knowledge. Another example of a physical system described by PDEs is the realm of fluid mechanics, the description of fluids of all sorts, and of fluids in a wider sense: climate models and weather models are also based on these PDEs. The important model here is a very important system of PDEs called the Navier-Stokes equations, which you might have heard of already. They're used to simulate systems like the weather and the climate, but also oceans; you could build a tsunami model which predicts when a tsunami will hit the coast, or whether it will come about at all. You can already see from these examples that the models are typically really large scale: simulating the ocean, or simulating the entire Earth's climate over a long period of time, is a really large system. So the problems we're talking about are typically very large scale, and they're also very difficult to solve in practice. PDEs are also an interesting thing to talk about because the theory and practice of PDEs are still a highly active field of research, after arguably a century, or maybe even centuries, of research going into this field of applied mathematics. It is, in a sense, so difficult that one of the well-known millennium problems is about these Navier-Stokes equations, namely just to state whether the set of equations even has a solution and how smooth that solution is. That maybe already gives you an insight into how difficult it can be to talk about these models. Because general, nonlinear PDEs are quite difficult to understand, and this is a common scheme in mathematics, in this lecture we will restrict ourselves to a simpler class, namely the class of linear PDEs. I'll define what that is in a little bit.
But just know that even if we restrict ourselves to that class, this is still quite a powerful modeling language. Many of the physical processes that happen around us are, or can be, described quite accurately via linear PDEs. For example, thermal conduction, the diffusion of heat in a piece of metal, say, is described by the so-called heat equation, which is one of these linear PDEs. The phenomenon of electromagnetism, I already talked about protons interacting, is described by a system of linear PDEs called Maxwell's equations. Wave mechanics, so essentially the description of water waves and of waves that propagate through air, can be described to a very good approximation by the wave equation, also a linear equation. And the velocities of particles in Brownian motion are described by the so-called Fokker-Planck or Kolmogorov forward equation; the latter also has quite an important place in mathematical statistics, where it relates to stochastic processes. But it's not just physical models that are described by these equations: as already hinted at with financial markets, the famous Black-Scholes equation used in mathematical finance is also such an equation. And finally, if we do work with nonlinear partial differential equations in practice, we can use linear approximations to these nonlinear equations in numerical simulation, essentially by iteratively re-linearizing.

With all that motivation aside: we typically use these models to describe the behavior of a real-world system. We don't know the exact behavior of that system in advance, but we know, as I said, how the mechanism behind the system works. I'm deliberately phrasing the goal of this lecture as a sort of pseudo-graphical model, because we want to fuse the probabilistic models we know with these mechanistic models, with this mechanistic knowledge, in order to gain some of the strengths of machine learning in practice. This is also sometimes called hybrid modeling, because we have empirical knowledge from observational data and we have mechanistic knowledge, and the two tie together quite nicely. So, and this is why it's stated this way: we know the mechanism, and we want to infer the system behavior. However, PDEs usually have some set of parameters, and we usually don't know these parameters in advance. I give a couple of examples here: the strengths and distributions of heat sources, or of charges in electrical problems; material parameters, such as the speed at which heat diffuses through a certain material; or simply forces in classical mechanics. We don't usually know these exactly, but we can measure them. These measurements are typically noisy, and from that we can already see that we need to be able to deal with observation noise in our models, which classical descriptions of these things typically don't handle too well. So we have these measurements of the system parameters, and the system parameters need to be known in order to solve for the system behavior, because the equation is governed by these parameters.
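As an aside for reference before we continue: here are two of the linear PDEs from the list above, written out in their standard textbook forms (the exact placement of the material constants varies between texts, so treat the notation as my own).

```latex
% Heat (diffusion) equation: u is temperature, \alpha > 0 the thermal diffusivity,
% \Delta_x the Laplacian in the spatial variables (defined below)
\frac{\partial u}{\partial t}(x, t) - \alpha \, \Delta_x u(x, t) = 0

% Wave equation: u is the displacement field, c > 0 the wave speed
\frac{\partial^2 u}{\partial t^2}(x, t) - c^2 \, \Delta_x u(x, t) = 0
```

In both cases the left-hand side is a linear combination of partial derivatives of u, which is exactly the structure of a linear PDE that will be defined in a moment.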
Besides the parameters, we sometimes also have measurements of the system itself. Think of a simulation of a heat distribution: we might place a thermometer at a certain point and measure the temperature of a piece of metal, for example. But we can't do that at every point of the simulation, and this is why we need the mechanistic knowledge, to interpolate between our measurements. Question? That depends on the situation. Some of these parameters you can just measure; there are measurement devices for that. Sometimes it's more difficult to measure this stuff, in which case you can use measurements of the system and propagate that knowledge back to the system parameters, which is then called an inverse problem. Nathanael already talked about this quite a lot in the ODE lecture. So that is the flip side of what we're doing here: not just simulating what the system does, but also inferring the causal mechanism underlying it. Our approach here, and really the goal of this lecture, is to use Bayesian statistical estimation to fuse the mechanism that we know, the measurement data, and the uncertainties contained in most of the system parameters. Why Bayesian statistical estimation? Because we have all these uncertainties, and it would be quite a good thing to propagate essentially everything we don't know onto the solution, so that we can give some confidence to the predictions we make.

Alright, so let's jump into the PDE world and first answer the question of what we actually mean by a linear PDE. We want to simulate a physical system, and physical systems typically have some spatial extent; if the system evolves over time, we also have a time span over which we want to simulate it. So first of all, we define this set D, which is called the domain. Then we look for a function that describes the physical system, say a temperature distribution, or the forces generated by a set of electrical charges. This is the unknown function u, the description of the system that we want to simulate. The mechanistic knowledge in these PDE-based models is given by this equation here, where D is a so-called linear differential operator; I'll show you some examples of that. Generally speaking, it is just a linear combination of partial derivatives of the unknown function u. We don't know the function u, but we prescribe a fixed value f, which is called the right-hand side of the equation, for this linear combination. And this turns out to be a very elegant description of a lot of physical processes. Some examples of linear differential operators: probably the most well-known one is the Laplacian, which is just the sum over all non-mixed second partial derivatives, essentially the trace of the Hessian. And another example of a linear PDE is actually an affine ODE.
So in the case where we only have one input variable instead of d input variables to our function, we can construct this differential operator here, which consists of just the one derivative we can build with this function, plus another linear term. If you rearrange terms in the equation, this just gives you the vector field of an affine ODE. So in a sense, affine ODEs are a special case of linear PDEs, and more generally, ODEs are a special case of PDEs. It's not always helpful to think about it this way, because PDEs tend to be quite a bit more difficult to simulate, but it's a nice closure argument.

Now let's talk about some problems we run into if we want to apply these equations in practice. First, we usually do not get an analytic solution. This was already true for a lot of the ODEs we considered, and here, because it's a wider class of models, we inherit these problems. So we need to use numerical solvers to get at these unknown functions u, which, because the function u is an infinite-dimensional object, inherently introduces discretization error, unless you know something special about the problem; we're going to talk about that. Secondly, we already touched on this: the PDE has parameters, and we can pinpoint what these are a little better now. The right-hand-side function f is one of these parameters; the heat sources and charges I mentioned are typically described by the right-hand side of the equation. And material parameters, for example, are the coefficients in the linear combination of the partial derivatives of the function u. We usually don't know these exactly; I already talked about this. Finally, the classical solvers that have been developed over the past century are sometimes quite difficult to embed in computational pipelines, precisely because the parameters are usually not known exactly, yet the solvers need a point estimate of each parameter. So it tends to be quite difficult to propagate these uncertainties through the solvers, which is something we aim to solve with the Bayesian approach we take here.

Now, there's another problem: the PDE itself usually does not identify its solution uniquely. For ODEs, we've already seen that we need to formulate initial value problems, because essentially there's a constant of integration involved when solving these equations. For PDEs it's not much different, but the types of additional conditions we need to impose are a little more complicated than in the ODE case. Let's look at an example first. We look at the so-called Poisson equation, which just prescribes a value for the Laplacian of the function. And let's, for now, consider a solution candidate for this equation which is just a linear function. If we apply the Laplacian to this linear function, and I said the Laplacian is just the trace of the Hessian matrix, then, because a linear function has no second-order term, the Hessian is zero, so the result is zero. This means that linear functions lie in the kernel, the null space, of this differential operator. So for any solution of the Poisson equation, we can add a linear term and still get a solution.
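To make that null-space argument concrete, here is the Poisson example written out (my notation; the lecture shows this on a slide that is not reproduced in the transcript):

```latex
% Poisson equation on a domain D \subset \mathbb{R}^d with right-hand side f
\Delta u = f, \qquad
\Delta u(x) = \sum_{i=1}^{d} \frac{\partial^2 u}{\partial x_i^2}(x).

% A linear candidate u_{\mathrm{lin}}(x) = a^\top x + b has zero Hessian, so
\Delta u_{\mathrm{lin}} = 0,

% and hence, by linearity of the Laplacian, any solution stays a solution
% after adding a linear term:
\Delta\bigl(u + u_{\mathrm{lin}}\bigr) = \Delta u + \Delta u_{\mathrm{lin}} = f.
```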
This works because, and I'm going to talk about this in a little bit, the differential operator is linear. So usually, in the same spirit as in the ODE case, uniqueness can be achieved by requiring an additional condition to hold, which is usually a boundary condition, or in this case, to keep everything linear, a linear boundary condition, where again we have a linear operator; I'll talk about this in more detail in a little bit. The physical intuition for why we prescribe something that happens at the boundary of a simulation domain is that, if you think about the problem you're simulating, there might be interactions coming into your system from outside the simulation domain. And if you don't simulate these outside influences, if you essentially don't model them in your mathematical framework, then anything can happen. If you model the heat distribution in a piece of metal, and a truck crashes into it, then what's going to happen? You don't know. So you need to summarize everything that happens outside of the simulation domain by describing how the outside interacts with your simulation at the boundary, because that is the only influence it can actually have. A PDE together with a set of boundary conditions is usually referred to as a boundary value problem. A specific example of such boundary conditions are Dirichlet boundary conditions, where the operator restricts the function to the boundary, which simply says: we prescribe the value of the function at the boundary. If you think of the function u as a heat distribution in some piece of material, then this just says we know what the temperature is at the boundary. A physical analogy might be a huge bucket of ice water, which will not change temperature whatever happens on the inside of the domain, so you know the boundary of your domain will always stay at zero degrees, essentially.

Alright, I've already circled around this point for quite a bit, but PDEs are statements about functions. We have an unknown function, and functions are typically infinite-dimensional objects. You might already know that when there's an infinity involved, you need to be a little careful. But it turns out that we have quite a convenient structure for functions, because function spaces, sets of functions under certain conditions, are actually vector spaces. From linear algebra classes you are familiar with the vector space R^n, the space of n-dimensional column vectors. As you know, you can add them and you can multiply them with a scalar, and you can do essentially the same thing with functions, just by defining the sum of two functions and the product of a function with a scalar pointwise. That is what's stated here. So we have vector space structure, and that comes with all the nice amenities of vector spaces. There can be bases in these vector spaces: a vector is a linear combination of some set of basis vectors, and for function spaces, you have a linear combination of a set of basis functions. Usually, at least for the function spaces we care about here, these bases are infinite, and sometimes they don't even exist in this form.
This is what I mean by having to be careful. But for this lecture, these bases exist, they are infinite, and they exist as essentially a sequence of basis vectors. You also have linear maps. From finite-dimensional vector spaces, we know that we can represent linear maps by matrices, and the linearity property just means that we can pull the matrix inside a sum and pull scalars out of the matrix-vector product. The linear maps in infinite-dimensional vector spaces, specifically in function spaces, are called linear operators, and this is where the term linear differential operator comes from: a differential operator maps a function, for example, to its derivative, so it's a map between vector spaces of functions, and it has this linearity property, essentially because of the sum rule of differentiation and because you can pull scalars out of a derivative. This is why we talk about linear equations and linear differential operators. And by applying this knowledge, you can see that a linear PDE is nothing but a linear system in an infinite-dimensional vector space. So in this lecture we will apply quite a bit of the intuition about solving linear systems that we developed in the earlier lectures to the case of PDEs. This is a nice analogy you can always keep in mind when working with these systems. Two more details. You can define norms on vector spaces; for example, here we have the maximum norm, or infinity norm, on R^n. The analog of this also exists for certain classes of functions: you can take the supremum of a function, and that turns out to be a norm on function spaces. If we have norms, we can talk about convergence, and if every Cauchy sequence in such a space converges, we call it a Banach space. We know that R^n with the infinity norm is a Banach space, but the space of k-times continuously differentiable functions on some set is also a Banach space with this particular norm. So again we can carry a lot of intuition from the finite-dimensional case over to the infinite-dimensional case, though not always, so we have to be careful in that sense. The same also holds for inner products and Hilbert spaces, with a couple of caveats which I'm not going to go into right now. Just notice that these translations usually involve replacing indices into column vectors by function evaluations, and sums by integrals; that's a fairly straightforward way of deriving these analogs.

All right, let's move on to a more practical topic, namely the toy example that will serve as the main motivating example for the methods we're going to use here: a simple model of the heat distribution in a central processing unit, a computer's CPU. This is what a CPU in a desktop computer usually looks like. The metal piece on top here is not the chip itself; it's just a cover, which is used to extract heat from the chip. The actual silicon chip is just this little black box here. These components are particularly limited by the heat they give off: when current flows through the chip, due to internal resistances and similar effects, it produces quite a lot of heat, and if these systems overheat, they may take damage or stop functioning properly. We want to avoid that.
So it's quite a good thing to know in practice what the temperature distribution in your chip is. You might already guess that this thing is really thin, so we can model it as a two-dimensional object. For now, we'll restrict ourselves to a two-dimensional mathematical model of this system. This is the spatial domain we talk about: just a Cartesian product of intervals, in this case the length and the width of the chip. But since there's some homogeneous geometry here, to keep things simple for this lecture we'll restrict ourselves even further, to modeling the temperature distribution on this line slicing through the chip right here. We're going to see a two-dimensional example towards the end of the lecture, and we'll see that this is actually quite a good modeling assumption for this particular case.

Now let's turn this physical picture into math. First, we already defined the spatial domain we're going to work on, which is just the interval from zero to the length of the CPU, along this coordinate axis right here. When we want to simulate problems involving heat, we need to know where the heat comes from, what generates heat in our system. For now, we'll assume that the GPU built into this CPU is idling, it's not doing anything, while the cores are computing something really hard. So the compute cores of the CPU are working and generating heat. At the same time, these chips are usually built into a computer with a heat sink on top, which transports away the heat generated by the chip to prevent it from overheating. This sketch might be a little misleading: the heat sink, by squashing this so-called thermal interface material, a paste, around these edges, can be assumed to extract heat roughly uniformly from the whole surface of the chip. So the heat leaves the chip via its surface. If we wanted to model where the heat sources are in this chip, and where the heat sinks are, where the heat leaves the system, we could look at a function like this: we place three Gaussian blobs on the cores themselves, which are the heat sources, and you can see the unit here is watts per cubic millimeter, so essentially heat per unit volume per unit time. Then there's a negative constant function superimposed onto these Gaussian blobs, which models the heat sink, everything that is getting pulled out of the CPU. Now you might wonder: we're talking about PDEs here, so what is the PDE that actually models this system? I already mentioned the heat equation, so here it is. It's a linear PDE, and it's also second order. Why is it second order? Anybody? Exactly, yes, the last term contains the non-mixed second partial derivatives; that's where the second order comes from. I'm going to talk about what these individual terms mean a little later, but for now, note that this function u is going to be the temperature distribution in our chip. We can also see that there's a temporal derivative involved, a derivative with respect to the time variable.
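The slide with the equation is not reproduced in this transcript, so as a reference point, here is one common way to write the heat equation with a volumetric source term; take the exact arrangement of the constants as my assumption (u is temperature, k the thermal conductivity, rho the density, c_p the specific heat capacity, and q-dot-V the heat source density mentioned above):

```latex
c_p \, \rho \, \frac{\partial u}{\partial t}(x, t)
  \;=\; k \, \Delta_x u(x, t) \;+\; \dot{q}_V(x, t)
```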
A reasonable assumption to make things simpler for us here is that the temperature distribution stays the same over time. This will eventually happen in a CPU that runs for a while, once the fan control of the heat sink reaches a stationary point. So we'll assume that at some point we reach a stationary temperature distribution, and you can work that into the model by saying: if the temperature doesn't change, then this temporal derivative is just zero. There is no temporal change in temperature anymore, in which case we arrive at the stationary heat equation, which you obtain from this equation by setting the entire first term to zero. And since we restricted ourselves to this one-dimensional subset of the CPU, essentially this line, this turns into this equation here: the Laplacian with d equal to one, with just one dimension, is just the second derivative. Now, what is that? We've actually already seen it in this course. What does this equation remind you of? It's just an ODE. Because we restrict ourselves to a one-dimensional problem here, it's an ODE, and you might say, well, where's the point in talking about PDEs then? It's just so that things are a little more visual for this lecture; everything we're going to talk about in the following also applies to the multi-dimensional case, there are no specializations to the 1D case here.

All right. Now, how do we inject the mechanistic knowledge from this differential equation into our statistical model? A way of thinking about these PDEs is to interpret them as an observation of the unknown function u, in much the same way as we did in the ODE case. There is an unknown quantity, we don't know anything about this function u, but what we can observe is a derived quantity: we can observe its image under the differential operator, and because we want the PDE to hold in our system, we know that this image under the differential operator (think of it as just differentiation for now, just one partial derivative) has to be a certain value, namely the value of the right-hand-side function. In physics, this actually has an interpretation. A lot of the most fundamental laws in physics are conservation laws: they describe the conservation of some fundamental quantity like energy, mass, momentum, or charge, some physical observable, and they are usually expressed as PDEs. The heat equation we've just been talking about (notice that I moved one term over to the other side) is a statement about conservation of energy, specifically heat energy. The left-hand-side term here is proportional to the change in temperature, and it turns out this is also proportional to the change in heat energy: if temperature changes, heat energy changes. Note that temperature and heat energy are not the same thing, but in this case they are proportional to one another via these material parameters. And we said that every change in heat energy has to be explained either by a heat source, and this q-dot-V term is the heat source,
so essentially a known amount of heat entering the system at any point, or by heat flowing into that point from its surroundings via heat conduction. And this makes sense, because the Laplacian computes a curvature estimate of the function; in one dimension it is exactly the curvature, the second derivative. If you have a situation like a parabolic bowl, you would expect heat to flow into the center point, because the surroundings of the center point are hotter than the center point itself, and temperatures tend to equilibrate over time. So this is the statement: every change in energy is explained either by conduction from the surroundings or by a heat source being present there. No energy is lost or gained other than what we can explain. And this is a local statement, because all these derivatives are computed at every point of the domain. As I already said, abstractly speaking this is just a local mathematical property, the value of a derivative at a certain point. We can also write this equation, first of all just as a notational convenience, as the difference of the differential operator applied to u minus the right-hand side being equal to zero. We define this as the operator I, which in probabilistic numerics is also referred to as an information operator. For the specific case of just the derivative, you've already seen these information operators in the ODE lecture: this is exactly what we conditioned on in the probabilistic ODE solver. So this is just a more general framework. It doesn't only apply to differential equations; you can essentially express any piece of information with such an information operator. You can see it as an extension of the notion of a data point: a PDE is arguably not a data point in and of itself, but it still provides information about the problem that you're trying to solve. So it's a generalized notion of data, or really of information.

All right. Now we actually want to solve this differential equation, and we have an unknown function u. What do we do with an unknown quantity in Bayesian statistics? We put a prior over it, in this case, surprise, surprise, a Gaussian process prior on u. You can see that prior up here: it's a Matérn kernel (order 7/2 here) with a constant mean function. And the observations, which replace the usual point observations of vanilla GP inference, are now given by the information operator: we require the differential equation to hold at every point of the domain, with the solution candidate replaced by the GP. Now, how do we actually do this? First, how do we apply such a differential operator to a GP, and what kind of object do we get from that? Put another way, how do we take the derivative of a GP? And second, how do we compute the posterior, how do we condition on this piece of information? It turns out that both of these objects are GPs again, so we have a closure property of GPs under linear observations. But we still have a bit of a problem, because this is actually an infinite set of observations: we want the PDE to hold at every point of the domain.
And the domain, typically an interval for example, is uncountably infinite. Computationally, this is going to be quite a challenge; it's basically impossible in the general case, unless you can say something analytical about the specific problem you're solving. So we relax this piece of information by saying: we don't require it to hold at every point of the domain, but just at a finite set of training points X. These are called collocation points, after classical methods that work similarly; I think Nathanael also mentioned them in the ODE lecture. So it's essentially the same approach as in the ODE lecture, just that we don't use state estimation here, but an actual GP.

All right. Now it turns out that these objects are very similar in form to what you get if you condition a finite-dimensional Gaussian random variable on a linear observation. This already popped up in the earlier lectures; it's essentially the Gaussian inference theorem on R^d, on column vectors. But now recall that functions also form vector spaces, so we can see a GP as a probability measure, or as a sort of random variable, on a function space. We use some GP prior here, and then, as before, we choose a linear operator, in this case a linear operator that maps the sample paths of that GP onto R^n, so a set of n linear observations of that GP. And we introduce a noise variable which is independent of the GP, just independent Gaussian noise, as in the finite-dimensional case. It turns out that the prior predictive, the image of the Gaussian process under this linear operator, is given by a normal distribution whose mean is the image of the Gaussian process mean function under that linear operator. And the covariance matrix is very similar to the covariance matrix in the finite-dimensional case, except that we can't literally write a matrix product here, because this is a linear operator, not a matrix, and the kernel is not a matrix either, but a function. The analog of the A Sigma A-transpose expression in function space is to apply this linear operator, which, remember, acts on functions and returns a vector from R^n, first to the first argument of the kernel function: fix the second argument of the kernel, then you have a univariate function of x which returns a real value, and that is exactly something we can feed into the operator. Then, seeing the result as a function of the second argument again (which, because we index into the i-th output here, maps from the spatial domain to R), we get another one of these functions we can feed into the operator, and we apply it once more. You'll see a concrete example with a concrete differential operator in a little bit that will make this clearer. The posterior then turns out to have a similar structure too: essentially, we replace the matrix-vector multiplication by applying the linear operator to a function, and we replace the A Sigma A-transpose object by the Gram matrix we just saw. I left out a lot of theoretical detail here.
We're going to return to this, at least in rough terms, at the end of the lecture, but know that there is a lot more involved than just writing down these equations, and you need to be really careful when doing this because of all the infinities involved. We'll talk about that a little later. There's a question, roughly: doesn't the differential operator map a function to a function? Yes, and this is exactly why the operator here maps to R^n instead: plain differentiation would map a function to a function, but this operator is the concatenation of differentiation with several point evaluations. Point evaluations are also linear, essentially by definition, because summation of functions is defined pointwise. We're going to see this on the next slide: an example would be to take a derivative and then evaluate it at a point x. Now, if we actually want to compute this object, what does that look like, for this specific choice of linear operator? (You mean this x? It's just some fixed x. Actually, I made a mistake here, this is a different x than that one, so think of it as x-tilde, just some fixed value, say three.) The answer is simple: you differentiate with respect to one argument, and then you differentiate with respect to the other argument. It's just a little awkward to express in the standard linear operator notation: you first fix the second argument to some value; applying the linear operator to the kernel, which is now just a function of the first argument, means differentiating once with respect to the first argument and then inserting x into that. Then you see the result as a function of the second argument and differentiate with respect to that. For differential operators this is quite easy to do. And if you have multiple points x here, then you build the matrix of all pairwise derivatives between these points to get this object, which is an actual matrix. If we enter the case that was just brought up, where we don't point-evaluate, where we have a proper linear operator between function spaces, we also define these covariance kernels; well, this one is a kernel, and these are cross-covariance functions, which are essentially the same thing as defined above, just seen as functions of the point we want to differentiate at afterwards. And what you can show, essentially by applying what you just saw on the previous slide, is that if you have such an operator and it fulfills certain conditions, which we'll talk about later, then the image of a GP under that operator is a GP again. And this is how you get the derivative of a GP: if the operator is just a derivative operator for a scalar GP, differentiate once, then you differentiate the mean function and you differentiate the kernel in both of its arguments to get a symmetric function again. Under certain conditions, that is again a GP. Now let's apply that to our PDE.
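Before we do, here is a minimal code sketch of what "apply the operator to both arguments of the kernel and condition on the collocation observations" looks like for the 1D stationary heat equation -kappa u'' = q. This is not the lecture's implementation: for readability it uses a squared-exponential kernel (whose derivatives have short closed forms) instead of the Matérn prior, a zero prior mean, and made-up values for kappa, the length scale, and the source term.

```python
import numpy as np

# Sketch: symmetric collocation for the 1D stationary heat equation
#     -kappa * u''(x) = q(x)  on  [0, L],
# with a zero-mean GP prior u ~ GP(0, k) and a squared-exponential kernel.
# All parameter values below are made up for illustration.

kappa, L, ell, sigma = 1.0, 1.0, 0.15, 1.0

def k(x, y):
    """Prior covariance k(x, y) (squared-exponential kernel)."""
    return sigma**2 * np.exp(-((x - y) ** 2) / (2 * ell**2))

def Dk(x, y):
    """Operator D = -kappa d^2/dx^2 applied to the first argument of k."""
    r2 = (x - y) ** 2
    return -kappa * k(x, y) * (r2 / ell**4 - 1 / ell**2)

def DkD(x, y):
    """D applied to both arguments of k: covariance of the transformed GP Du."""
    r2 = (x - y) ** 2
    return kappa**2 * k(x, y) * (3 / ell**4 - 6 * r2 / ell**6 + r2**2 / ell**8)

def q(x):
    """Made-up right-hand side: one heat-source blob plus a constant sink."""
    return np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2) - 0.25

X = np.linspace(0.0, L, 20)      # collocation points
Xs = np.linspace(0.0, L, 200)    # evaluation grid

# Gram matrix of the observed quantities (Du)(x_i), plus a small jitter term.
G = DkD(X[:, None], X[None, :]) + 1e-8 * np.eye(X.size)

# Cross-covariance Cov(u(x*), (Du)(x_i)): D applied to the second argument.
# For this kernel it has the same closed form as Dk, by symmetry in (x - y).
kxX = Dk(Xs[:, None], X[None, :])

# Condition on (Du)(x_i) = q(x_i); the prior mean of Du is zero here.
w = np.linalg.solve(G, q(X))
post_mean = kxX @ w
post_var = k(Xs, Xs) - np.einsum("ij,ji->i", kxX, np.linalg.solve(G, kxX.T))

# As in the lecture, no boundary conditions are included yet, so the linear
# degree of freedom of u is only constrained by the prior and post_var stays
# comparatively large away from the collocation information.
```

Adding the Dirichlet or Neumann boundary observations discussed next amounts to stacking further rows, built by applying the boundary operator to the kernel, into the same Gram matrix.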
So we return to the same GP prior and the same observations, and we already saw that applying the differential operator to the GP gives a GP again, with exactly this form for its moments. You can see that GP here: this is the scaled second derivative of the prior GP, because that's the differential operator we're working with. You can tell by looking at the samples, which are much less smooth; we lose degrees of differentiability by differentiating these functions. This is also how you know you're working with a Matérn kernel and not with something like a squared-exponential kernel, because samples from a squared-exponential kernel are smooth, that is, infinitely differentiable. Now that we've seen that, we can apply the same theorem, the same table essentially, to this problem here, where our linear operator is now: first apply D, then evaluate at the set of points. We can compute the posterior Gaussian process in this case, and we know it is a GP. If we apply this to our problem, we first define the set of collocation points, the blue dashed lines here, these x points. The observations are then given by this black function, because it is the right-hand side of the PDE: once we apply the differential operator, the GP has to match the right-hand-side function, and this is why the y-values of these observations in the transformed space are just point evaluations of the right-hand-side function. You can see that in this transformed space we essentially have a normal Gaussian process regression problem, but we propagate the knowledge gained from it back to the original GP, which is connected to it via the differential operator. And we can see that it reacted to the data, but it doesn't seem to have really worked; there's still a lot of uncertainty left. So what's the problem here? Think about what I said about PDEs not being uniquely solvable: the boundary conditions are missing. It actually does work; it's not broken. You can see from this plot that all the individual samples approximately solve the equation. There are just degrees of freedom which the PDE doesn't fix, and this is exactly the linear degree of freedom I was talking about: every sample of this GP, at least approximately, differs from the others by an added linear function, scaled or skewed by exactly this term. And since this posterior is just a GP again, if we now impose Dirichlet boundary conditions, meaning we prescribe the value of the temperature distribution at the boundaries of the interval, the left and right boundary points, which physically you could interpret as measurements from thermometers attached there, then we just have a normal Gaussian process regression problem: the posterior of the previous problem is a GP, which we can take as the prior for a standard GP regression problem, and we observe two values at the boundaries. There's nothing special about this. And it works. There's still a bit of uncertainty left here. Also observe that the right-hand side basically does not change; it's essentially the same as before.
But now the uncertainty collapses, and the remaining uncertainty you can see here is just the approximation error: we didn't require the PDE to hold at every point of the domain, just at these collocation points. This is reflected in the GP's confidence estimate; it says, I'm not certain, because I didn't get all the information. So let's recap a little. What did we do? We've seen that a generalized form of GP inference can produce an approximate solution of the boundary value problem we formulated, and we get an estimate of the approximation error, which is otherwise typically quite difficult to handle. Here we return to the graphical model from the beginning, if you remember. We have the description of a physical system, which we initially don't know, and we condition it on the mechanistic knowledge we have about the system: in this case, we measure the temperature of the CPU at the boundaries, and we know how heat flows through a piece of silicon. Now, what is unfortunately a little unrealistic about this is that the boundary values are usually not known in deployment. These thermometers don't exist on a CPU in your system; there are thermometers on the CPU, but not at the boundaries. So we need to get rid of that assumption in order to have a realistic model. And secondly, the values of the heat sources are not exactly known either. The way we modeled this was: if a core is computing something, let's approximate its heat source distribution with a Gaussian blob, but we don't know that it actually is a Gaussian blob; we haven't measured that. So to get around this, it would be quite useful to be able to add uncertainty both to the boundary values, which are not exactly known, and to the exact values of the heat source distribution.

Okay, so let's return to the case where we just conditioned on the PDE. What is a more realistic, or at least more practical, boundary condition? I already stated that we know heat is extracted approximately uniformly over the entire surface of the CPU, and these boundary points are the side parts of the idealized box that represents the CPU. So you can model these boundary conditions physically: the information about how much heat is extracted through the surface translates into a so-called Neumann boundary condition, which, instead of prescribing the value at a boundary point, prescribes the first derivative at a boundary point. If a lot of heat is being extracted, you can expect the temperature distribution to approach the boundary with a relatively steep declining slope; if a lot of heat is entering, it's a positive slope, at least at the right boundary here. So we model the uncertain values of that derivative (we don't know exactly how much heat leaves through the boundary, but we can add plus or minus some uncertainty to it) with another Gaussian, in this case formally a Gaussian process, though it's really just a Gaussian distribution, because it's just two values instead of a function.
We add an additional information operator, which is given by exactly that derivative at the boundary. Don't worry about this being a directional derivative: in a multi-dimensional domain, say a plane, it's not just the derivative but the derivative in the direction of the normal vector to the boundary, because how much heat flows out through a surface is given in the direction of the exterior normal vector; but think of it as just the first derivative for now. So this is an information operator describing the Neumann boundary condition. In this case it's not a plain Gaussian process regression problem anymore, because we again have an observation through a linear operator; it's just a different one than the differential operator of the PDE, namely this boundary operator B we saw before. If we do that, we see that we got rid of one of the degrees of freedom: instead of a linear degree of freedom where the samples can have an arbitrary additional slope, we fixed the slope, and the only remaining degree of freedom in the samples of this GP is the offset, the translational degree of freedom. This makes sense, because we only said where heat flows to and where it flows from, not at what absolute scale the system operates. This could be a system running at a thousand degrees Celsius, or at two hundred degrees below zero. The way we fix the absolute scale is through the thermometers that actually are contained in the CPU, just not at the boundaries but in the CPU cores. So we have thermal readings from these sensors, which are called digital thermal sensors, at three points along this cut through the CPU. We add those, and the uncertainty collapses. But it doesn't collapse fully, because there is measurement uncertainty on these thermometers, so we can only do so much: what we can learn from these three measurements is essentially the absolute scale.

Now, again, we already said that we don't actually know the heat source term; there should be some uncertainty on it. It was a rough, eyeballed estimate of what might be going on. So how do we fix this? You can maybe guess by now: just add another GP. GPs for everything. Instead of saying the right-hand side of the PDE is some fixed function, we say it's a GP, a probability measure, an uncertain estimate of a function that we would need to know exactly but don't. We model our prior belief about what that function is with another GP, and we can use the same inference technique as before. But there's a bit of a problem coming from the physics: the deterministic right-hand-side function we used before was carefully chosen such that it integrates to zero. This makes sense, because if it didn't integrate to zero, then in total more heat would enter the system than leaves it, and we would never reach thermal equilibrium; the system would just keep heating up, because there is always net energy entering it. So we need this right-hand-side function to integrate to zero. But how do you do that with a GP?
With a GP, if you just add Gaussian noise around that single estimate, then some of the samples will consistently lie above the mean, which was our original estimate of the right-hand-side function, and those samples will have a larger integral. So how do we solve this problem? How do we guarantee that all samples of this GP integrate to zero? Exactly: integration is a linear operator. Integrals are linear, because the integral of a sum of two functions is the sum of the integrals of the two functions. So we can formulate another linear observation, another linear information operator, which expresses exactly this condition; I call it a stationarity condition, because it refers to thermal stationarity. You actually have to add in the boundary effects too, because there is heat leaving via the boundary. The heat that leaves via the boundary, plus the heat that leaves in the interior, which is modeled by the negative term in the right-hand-side function, plus all the heat generated by the CPU cores, has to sum to zero; it needs to balance. By adding this additional constraint, which doesn't involve our function u at all, just the right-hand-side function and the boundary function, q_V and q_A, the prior over these two functions changes. And if you look closely, you can see that samples that start out below the mean here (and the mean already integrated to zero) consistently end up above it later, in such a way that the integral-zero condition holds for all of the sample paths. Now we can apply all of these information operators to this joint GP prior (it's a multi-output GP prior now) and solve the whole system. This is, I would argue, a much more realistic model in terms of the assumptions it makes about the problem than the one we started with. You can see that there's still some uncertainty left, which makes sense, because we don't have a certain right-hand-side function, we don't have certain boundary conditions, and we don't have certain measurements. All of these uncertainties contribute to the posterior uncertainty about the solution. But you can also see that the red shaded area down here (I realize it's quite difficult to discern) lies consistently within the blue shaded area, which is the uncertainty about the right-hand-side function. So essentially all, or most, of our estimates of the curvature, of the image of the GP under the differential operator, are consistent with the uncertainty about the right-hand side. That makes a lot of sense.

All right, let's recap again. We've seen that this GP approach, combined with the notion of an information operator, enables us to integrate prior knowledge about the system's behavior, formulated as a GP, in this case as the assumptions made by a Matérn kernel. We can inject mechanistic knowledge in the form of the linear PDE we've been formulating. We can have uncertain boundary conditions and right-hand sides, which appear quite often in practice, specifically also in the context of inverse problems.
We can use noisy empirical measurements to get rid of some of the uncertainty stemming from the uncertain boundary conditions and right-hand sides, and from the approximation uncertainty. And all of this happens while providing a quantification of the approximation error we incur by not requiring the PDE to hold exactly at every point of the domain, and while propagating the errors from the uncertain estimates of the system parameters, the right-hand side and the boundary conditions. The bigger story behind this is that all of it is only possible because, instead of giving a point estimate of the solution, we relax that and answer with an infinite set of solution candidates, weighted by a probability measure. It's essentially one of the promises of Bayesian inference in a different form. But it's very powerful: instead of a point estimate of the function, use a probability measure over functions, which can give you confidence bands and samples which, in this case, exactly fulfill the conditions you subjected them to. It's also arguably more honest, because instead of claiming to know the right-hand-side function we want to solve the PDE with, you acknowledge that there is uncertainty in it; there is practically always uncertainty in these estimates, you almost never know them exactly, and this is a way of modeling that.

Now, you can simulate exactly the same model in 2D. The approach is exactly the same, except that you replace your one-dimensional grid of points by a two-dimensional grid and change the prior a little bit. You can see that it works. And because the temperature is almost the same across the y-axis here, you could argue that the 1D model was quite a good model in the first place: there's not a lot of variation in that direction, and by positing the 1D model the way we did, we essentially said it's the same temperature along that dimension. So it was probably fine to model it like that, as a proof of concept.

Now, in contrast to the ODE lecture, we didn't really talk about time here. But what is time other than just another spatial dimension? You can consider a space-time domain and apply exactly the same machinery to a 1D version of the heat equation, now the actual heat equation, not the one where we set the temporal derivative to zero, where this axis is the time axis, treated just like another input dimension, and this is the one spatial variable we have. You can see that this in a sense resembles an ODE with an infinite-dimensional state space: think of one state variable per point of the spatial domain, and then an ODE part which describes the evolution over time. And if you slice through this function over time and animate what's happening, you get a physically plausible simulation of heat diffusion, which also comes with all these uncertainties, because it's a GP. There's not a lot of uncertainty here, because I didn't add uncertainty to the right-hand side and the boundary conditions and I used a relatively dense grid, but towards the end, where the grid becomes sparser, you can see that there is more uncertainty, specifically at the edges.
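In this space-time view, the only thing that changes relative to the stationary case is the differential operator inside the information operator; a sketch in my notation, with the same material constants as in the earlier heat-equation sketch:

```latex
% GP prior over u(t, x) on the space-time domain [0, T] x [0, L];
% the PDE information operator now uses the full heat-equation operator
\mathcal{D} u \;=\; c_p \rho \, \frac{\partial u}{\partial t}
   \;-\; k \, \frac{\partial^2 u}{\partial x^2} \;=\; \dot{q}_V ,
% enforced at collocation points (t_i, x_i) on a grid in [0, T] x [0, L].
```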
So you can simulate temporal problems with this approach as well, essentially by saying that time is nothing special; time is just another input dimension. All right, I talked a little bit about classical numerical methods for these PDEs, which have been developed over the past century. How does this approach fit in? It might seem a little odd to use a GP when all this machinery has already been developed, but it turns out that the posterior mean of the method we've been developing, assuming you don't add any uncertainty, so exact boundary conditions and an exact right-hand side, is exactly the point estimate produced by a classical method called symmetric collocation, which is also why these points are called collocation points. But that's just one method, so where is the plethora of other methods? More generally, one can show (not in this lecture) that all so-called weighted residual methods can be realized as posterior means of a Gaussian process, just without the uncertainty quantification, obviously. Collocation methods are an instance of these, but so are finite volume methods, which you may have heard of, and the large class of Galerkin and Petrov-Galerkin methods, which contains probably the most famous method for solving PDEs, the finite element method, as well as spectral methods. The way you get to these methods is essentially by changing the way you discretize the equation. Remember, here we took the PDE residual, this Du - f, and just evaluated it at a couple of points. But why only evaluate? You can project onto any other function, because you usually have a Hilbert space: you have an inner product on functions, and you can put another function in there. By realizing the discretization this way, by carefully choosing the functions you project onto and the prior you work with, you arrive at all these other methods. With a slightly different scheme, you can also show that finite difference discretizations of PDEs can be realized via GP inference; that's another paper from our group. Now you might wonder: why would you even use GPs on top of this plethora of other methods? Well, we get the uncertainty quantification, and what's quite nice is that, because the posterior means of these methods coincide with the classical methods, we can use them as drop-in replacements for the classical methods: you get the same solution, just plus uncertainty quantification. That's quite useful, because you can reuse existing software stacks.

All right, a quick summary. You've seen that GPs can be used to solve linear PDEs, but maybe even more importantly, we've seen that these information operators are quite an elegant language for realizing regression, so function estimation, from very heterogeneous types of information: not just point evaluations, but also all sorts of linear functionals. That turned out to be quite helpful in these hybrid, physics-informed, or mechanistic models. But in the beginning I said that it's incredibly hard to solve these equations. So where is all the hard math that should be necessary, according to that statement, to actually solve them?
And I mean, so far we really only needed derivatives and a little bit of linear algebra to express what we're doing here. So where is all of that? Well, as I said, there are some very important failure cases that we should be aware of, and there are mainly two points. Maybe you can come up with one of the points where what we just did stops working if we formulate the model in the wrong way; there's one very obvious one and one not so obvious one. Does someone have an idea? Think about it: we use derivatives of GPs, right? So what can go wrong with a derivative? Anybody? Well, a function might not be differentiable. If you differentiate a non-differentiable function, the result is simply undefined; you don't know what's happening. So, first of all, we need to make sure that the sample paths of our GPs are actually differentiable, because otherwise what we do here is meaningless; I'll show a small numerical illustration of this failure case in a moment. The second point is that GPs, and specifically evaluations of GPs, are random variables, and you might know from probability theory and statistics that random variables are functions which need to be measurable in order to produce a meaningful, consistent statistical model. Now, if we differentiate a GP, then, because the GP was random, we think of that derivative, and specifically the point-evaluated derivative, as a random variable as well. It should be a random variable, because the quantity we differentiated is random. So that derivative had better be measurable, otherwise we're again in undefined territory and everything we do here breaks down. Fortunately, there is a way of choosing your prior, essentially tuning it to the differential operator you're using, such that all of this works out, but you need to verify some important conditions. I'm going to go over the formal details of that theorem in a little bit; this is actually from an upcoming publication of ours specifically on that topic, which makes everything rigorous.

So first of all, let's talk a bit about the sample paths of the GP. What even is a sample path, formally speaking, and when are they differentiable? Actually, let me say this first, because I'm not going to come back to it: what's called "bounded" here is the statement that a linear operator is continuous. Under the assumption that everything else here holds, this is what gives you the measurability of the derivatives, so I'm not going to talk about it much more. Just recall that matrices are always continuous; in fact, all linear maps on finite-dimensional vector spaces are continuous. In infinite dimensions this no longer holds, and you need to verify it for every particular operator. We do require it here, and we'll come back to it on the last slide, maybe a little bit. So, first of all, let's think about how our GP is even defined. A GP is just a collection of real-valued random variables, one such random variable for every point of the domain of the GP. So if you say, well, I evaluate the GP at a point x1, then that is a random quantity; it's a real-valued random variable.
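Here is the small numerical illustration of the first failure case promised above. It is my own sketch, not from the lecture; the kernels, length scales, grids, and the helper `max_diff_quotient` are assumptions. A GP with a Matérn-1/2 (exponential) kernel has non-differentiable sample paths, so the finite-difference quotients of a draw blow up as the grid spacing shrinks, whereas for a squared-exponential kernel, whose samples are smooth, they stay bounded.

```python
import numpy as np

def matern12(x, xp, ell=0.2):   # exponential kernel: non-differentiable sample paths
    return np.exp(-np.abs(x[:, None] - xp[None, :]) / ell)

def rbf(x, xp, ell=0.2):        # squared-exponential kernel: smooth sample paths
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / ell**2)

def max_diff_quotient(kernel, h, rng):
    """Largest forward difference quotient of one GP draw on a grid with spacing h."""
    x = np.linspace(0.0, 1.0, int(round(1.0 / h)) + 1)
    K = kernel(x, x) + 1e-8 * np.eye(len(x))     # small jitter for numerical stability
    sample = np.linalg.cholesky(K) @ rng.standard_normal(len(x))
    return np.max(np.abs(np.diff(sample) / h))

rng = np.random.default_rng(0)
for h in (1e-1, 1e-2, 1e-3):
    print(f"h = {h:.0e}   Matern-1/2: {max_diff_quotient(matern12, h, rng):9.1f}"
          f"   squared-exp.: {max_diff_quotient(rbf, h, rng):7.2f}")
```

The first column keeps growing (roughly like h to the power -1/2) as h shrinks, which is exactly the "differentiating something non-differentiable" problem, while the second column stabilizes.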
Now, back to the definition: in this collection of random variables that defines the GP, there is specifically one corresponding to the output of the GP at the point x1. Don't be confused by this omega; it just comes from the definition of a random variable. A random variable is a function from the sample space Omega of the probability space to, in this case, a real value, if you fix x. And I guess you can think of this omega as coming from a random number generator: if you sample from a random variable, you first sample an omega from your probability space, which is what the random number generator does, and then you push that omega through the function f, or f of x in this case, to produce a value. If you chose the random number generator and that function f correctly, then you get, for example, a Gaussian random variable. Now, for a GP we a priori only know the distribution of finite collections of evaluations. That's essentially what you use when you sample from a GP: you evaluate the kernel function at all the points you want to sample at, you know from the definition that this gives a multivariate normal distribution, and you sample from that. But we use GPs to model functions, not evaluations of functions. So where do the functions enter this definition? Imagine we were to continuously sample a whole sample path from the GP. We would do it in the following way: we fix an omega, so we essentially tell the random number generator to generate one random event, and then we transform that omega through all of these functions, one for every x. With omega fixed, we get one function by looking at the collective of all these f of x with omega fixed. This is what's called a sample path; it's what we mean when we talk about a sample path of the GP, and it's also why we think of GPs as models of unknown functions. This is obviously something we cannot do in a computer, because there is usually an uncountably infinite number of these x, but conceptually you can think of it as constructing an infinite-dimensional covariance matrix, taking its Cholesky factor, and using that to compute a regular GP sample. And if we look at the collection of all such possible sample paths, by taking one for every possible event the random number generator can produce, we get the GP's space of sample paths. What we need to make sure of, for our purposes, is that all of the sample paths in this space are sufficiently differentiable, because otherwise we're doing something that is undefined. Second of all, if we state that we want to compute Lf, so the application, or the image, of some GP f under a linear operator L, what we actually mean is: first, fix an omega, which gives you a sample path, a function which just maps x to the real numbers; then map that function through the linear operator; and then afterwards we let omega vary again. So it's the composition of the sample path with that operator. And in this case, because we choose the operator to map to R^n again, this is an R^n-valued random variable, so a random vector.
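In symbols (my own condensed notation for what was just said, with X the domain of the GP and Omega the sample space):

```latex
\begin{align*}
  f(\cdot, \omega) &\colon \mathbb{X} \to \mathbb{R}, \quad x \mapsto f(x, \omega)
    && \text{the sample path obtained by fixing } \omega \in \Omega, \\
  (\mathcal{L} f)(\omega) &:= \mathcal{L}\bigl(f(\cdot, \omega)\bigr) \in \mathbb{R}^n
    && \text{the path-wise application of the linear operator } \mathcal{L}.
\end{align*}
```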
Even though there is an infinite-dimensional object in between there, which is this GP, right? And I already said that this is exactly the thing that needs to be a random variable. It needs to be measurable, because we condition the GP on the event that this takes an observed, prescribed value. So, yeah, it had better be measurable; again, I'm not really going into the details here. It turns out that the sample paths of GPs can actually be made to lie in a reproducing kernel Hilbert space by choosing an appropriate kernel. I'm not really going to go over this; it's just for the people who know what an RKHS is. However, usually, and I believe actually always, the problem is that the reproducing kernel Hilbert space of the actual kernel of the GP, so the kernel function that you choose, is not the space from which the samples come. The samples are, informally speaking, usually rougher in terms of differentiability than elements of that original RKHS, and you sort of need to choose a larger space in which these samples are contained. In fact, you can show that a sample from a GP is almost surely not an element of that space; with probability zero you draw a sample that lies in that specific RKHS.

Let's look at a concrete example, which is also very useful in practice; that's actually why I'm ending on it: the Matérn kernel. This somewhat scary expression simplifies quite a bit for specific values of the parameter p here, which is actually an integer, and it turns out that this parameter p controls the differentiability of the GP: the higher p is, the more derivatives you can take of the GP, essentially. More precisely, the RKHS generated by that kernel, so the function space generated by this kernel, contains functions that are p times continuously differentiable. But the problem is that if you have a GP with exactly this kernel, then the samples are not p times differentiable; they have fewer continuous derivatives. Informally speaking, you can show that they have d/2 fewer partial derivatives, where d is the input dimension of the kernel. So, for example, we chose a Matérn-7/2 kernel, whose RKHS, I'll call it the kernel's own RKHS, contains three-times differentiable functions. But since we modeled a d equal to one problem, the functions we draw from that GP are one derivative less smooth, so twice differentiable, which is exactly what we need here. As a rule of thumb, if you need partial derivatives up to order at most m, and remember we have a second-order equation, so for us m is two, then you can use the formula on the slide to compute the parameter p of the Matérn kernel that you actually need in this case. And the nice thing is that if you choose p according to this formula, then you also know that the differential operator applied to the paths of that GP is bounded, so you really do get a random variable, which is what we wanted. And you can generalize this whole construction to multiple input dimensions; I actually should have mentioned that here already, because I keep talking about the input dimension: the absolute values in the Matérn expression are, in general, just Euclidean norms.
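As a summary of this smoothness bookkeeping, here is my reconstruction of the rule of thumb; it is an assumption on my part and not the exact formula from the slide. Writing nu = p + 1/2 for the Matérn smoothness, d for the input dimension, and m for the highest derivative order in the operator:

```latex
% My reconstruction, not the slide's formula: nu = p + 1/2, input dimension d,
% highest derivative order m appearing in the differential operator.
\begin{align*}
  \mathcal{H}_{k_\nu}(\mathbb{R}^d) &\cong H^{\nu + d/2}
    && \text{RKHS: roughly } \nu \text{ continuous derivatives,} \\
  \text{sample paths} &\in H^{\nu - \varepsilon}_{\mathrm{loc}}
    && \text{roughly } \nu - d/2 \text{ continuous derivatives,} \\
  p = m + \bigl\lceil \tfrac{d}{2} \bigr\rceil
    \;&\Rightarrow\; \nu > m + \tfrac{d}{2}
    && \text{e.g. } m = 2,\ d = 1 \;\Rightarrow\; p = 3 \text{, i.e. Matérn-}7/2.
\end{align*}
```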
So instead of just computing an absolute value, you take the Euclidean norm of the difference, and then you can deploy Matérn kernels with an arbitrary number of input dimensions. But specifically for PDEs, and you will actually see that on the exercise sheet this week, it is better to construct a d-dimensional kernel by taking products of 1D Matérn kernels over the dimensions. This is specifically useful if you have mixed orders of derivatives. Remember that in the heat equation there was only a first partial derivative with respect to time, but second partial derivatives with respect to the spatial variables. So it actually makes sense to use a Matérn kernel that gives you two partial derivatives in the spatial part of the kernel, and one that is rougher, maybe exactly one degree of differentiability rougher, for the time dimension. That helps you adapt the prior to the concrete equation you're trying to solve; a small sketch of such a product kernel is included after the closing remarks below.

Yeah, with that I come to an end. We've seen that PDEs are an important and actually quite ubiquitous language for modeling problems, physical problems specifically, but also problems from other domains in the real world. We've seen that we can use our recurring framework of GP inference to actually solve these PDEs, but, more generally, GPs provide this framework in which we can fuse or combine very heterogeneous information sources into a single regression model by using these affine information operators. And in the end, we've seen that you need to take quite some mathematical care so as not to make mistakes in the model construction; this mostly applies to the construction of the prior, and I've given you a shorthand for how to get a suitable prior for a specific equation in practice. Alright, thanks for listening, and I'm happy to answer any questions.
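As referenced above, here is a small sketch of such a product kernel on a space-time domain. It is my own illustration, not from the lecture or the exercise sheet: the particular smoothness choices (Matérn-5/2 in time for one time derivative, Matérn-7/2 in space for two spatial derivatives), the length scales, and the grids are assumptions following the heuristic described above; the closed-form kernel expressions are the standard half-integer Matérn formulas.

```python
import numpy as np

# Closed-form 1D Matérn kernels (standard half-integer expressions).
def matern52(r, ell):
    a = np.sqrt(5.0) * np.abs(r) / ell
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def matern72(r, ell):
    a = np.sqrt(7.0) * np.abs(r) / ell
    return (1.0 + a + 2.0 * a**2 / 5.0 + a**3 / 15.0) * np.exp(-a)

def spacetime_kernel(Z1, Z2, ell_t=0.3, ell_x=0.2):
    """Product kernel on points z = (t, x): rougher in time (one time derivative
    needed for the heat equation), smoother in space (two spatial derivatives)."""
    dt = Z1[:, None, 0] - Z2[None, :, 0]
    dx = Z1[:, None, 1] - Z2[None, :, 1]
    return matern52(dt, ell_t) * matern72(dx, ell_x)

# Prior covariance matrix on a small space-time grid, ready for conditioning.
ts, xs = np.linspace(0.0, 1.0, 10), np.linspace(0.0, 1.0, 15)
T, X = np.meshgrid(ts, xs, indexing="ij")
Z = np.stack([T.ravel(), X.ravel()], axis=-1)
K = spacetime_kernel(Z, Z)
print(K.shape)   # (150, 150)
```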
