Link to paper: Learning Partial Differential Equations From Data Using Neural Networks

Prerequisites

A partial differential equation (PDE) relates a function to its partial derivatives. It can be written in the general form:

$$F\left(\mathbf{x},\, f(\mathbf{x}),\, \frac{\partial f}{\partial x_1},\, \frac{\partial f}{\partial x_2},\, \ldots,\, \frac{\partial^t f}{\partial x_n^t},\, \frac{\partial^2 f}{\partial x_1 \partial x_2},\, \ldots\right) = 0$$

where $\mathbf{x} \in \mathbb{R}^n$ and $t \in \mathbb{Z}$.

If one restricts $F$ to be linear, then the above equation can be represented as

$$\mathbf{D}\phi = \sum_{i=1}^{k}\phi_i D_i = 0$$

where $\mathbf{D}$ is a pre-specified set of functions (a.k.a. dictionary functions) on which you expect the PDE to depend, and $\phi$ is the vector of linear coefficients. It is evident that $\phi \in N(\mathbf{D})$, where $N(\mathbf{D})$ is the null space of $\mathbf{D}$.
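As a concrete example (my own, not taken from the paper), take the 1-D heat equation $f_t - \alpha f_{xx} = 0$. With the dictionary $\mathbf{D} = (f_t,\, f_x,\, f_{xx},\, f,\, f f_x)$, the equation corresponds to the coefficient vector $\phi = (1,\, 0,\, -\alpha,\, 0,\, 0)$:

$$\mathbf{D}\phi = 1\cdot f_t + 0\cdot f_x - \alpha\cdot f_{xx} + 0\cdot f + 0\cdot f f_x = f_t - \alpha f_{xx} = 0$$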

Key points to note
  • Dictionary functions: Set of functions on which you expect your PDE to depend
  • The null space of a matrix is spanned by its right singular vectors whose associated singular value is $0$.
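A quick illustration of the second point (my own, using NumPy): the rows of the `Vt` matrix returned by `np.linalg.svd` are the right singular vectors, and the ones paired with (near-)zero singular values span the null space.

```python
import numpy as np

# Rank-deficient matrix: the third column equals the sum of the first two,
# so A has a one-dimensional null space.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])

U, S, Vt = np.linalg.svd(A)
print(S)             # the last singular value is ~0
null_vec = Vt[-1]    # right singular vector for the smallest singular value
print(A @ null_vec)  # ~[0, 0, 0]
```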

Available Data

Three things are available to us:

$$\mathbf{X},\; f(\mathbf{X}),\; \mathbf{D}$$

where,

$\mathbf{X} = (x_1, x_2, x_3, \ldots, x_n)$ and $x_i \in \mathbb{R}^d$

$f(\mathbf{X}) = (f(x_1), f(x_2), f(x_3), \ldots, f(x_n))$ and $f(x_i) \in \mathbb{R}$

$\mathbf{D} = (D_1, D_2, D_3, \ldots, D_k)$: the dictionary functions; each $D_i$ is either some function of $f$ or relates $f$ to its partial derivative(s).
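To make the shapes concrete, here is a minimal sketch (my own; for illustration only, taking $d = 2$ with coordinates $(x, t)$) of how the available data and the dictionary might be laid out in code. Each dictionary entry is written as a callable over $f$ and its derivatives:

```python
import numpy as np

n, d = 1000, 2                     # n sample points in d dimensions, e.g. (x, t)
X = np.random.rand(n, d)           # inputs x_i in R^d (placeholder values)
f_X = np.random.rand(n)            # observations f(x_i) in R (placeholder values)

# Dictionary functions: each maps f and its derivatives at a point to one
# candidate term of the PDE. Which terms to include is a modelling choice.
dictionary = [
    lambda f, f_x, f_t, f_xx: f_t,       # time derivative
    lambda f, f_x, f_t, f_xx: f_x,       # first spatial derivative
    lambda f, f_x, f_t, f_xx: f_xx,      # second spatial derivative
    lambda f, f_x, f_t, f_xx: f * f_x,   # a simple nonlinear term
]
k = len(dictionary)                # number of dictionary functions
```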

Output we want

  • An approximation of $f$
  • The underlying PDE, expressed as a linear combination of dictionary functions

Proposed Architecture

Since artificial neural networks are universal function approximators (under certain assumptions, of course), one can use them to approximate a given function $f$. Let's call the approximated function $\bar f$.

The only issue with this approach is that it overfits if the data is noisy. And this is where the PDE shines.
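A minimal PyTorch sketch of such an approximator $\bar f$ (the architecture here is my own choice, not necessarily the paper's): a small fully connected network mapping $\mathbb{R}^d \to \mathbb{R}$, with a smooth activation so that the partial derivatives we need later exist everywhere.

```python
import torch
import torch.nn as nn

class FBar(nn.Module):
    """Approximates f: R^d -> R. Tanh keeps the network infinitely
    differentiable, which matters because we differentiate it later."""
    def __init__(self, d, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):          # x: (batch, d)
        return self.net(x).squeeze(-1)
```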

How PDEs will be used for parameter estimation

We add a regularizer term which penalizes the network if the function it generates isn't a solution to the PDE (we'll discuss the penalization aspect later).

But we still haven't figured out how to come up with $\phi$. So let's focus on that.

Finding $\phi$

The approach is as follows:

  1. Sample $m$ points from $\mathbf{X}$. Call this set of points $\bar{\mathbf{X}}$.
  2. Evaluate the dictionary functions $\mathbf{D}$ on $\bar{\mathbf{X}}$. This operation will be represented as $D(\bar f, \bar{\mathbf{X}})$; it is explained further in the paragraphs below.
  3. Since there are $k$ dictionary functions, step (2) yields a matrix of dimension $m \times k$. Let's call this matrix $\bar K$.
  4. Using some technique, find the null vectors of $\bar K$. Note that any null vector of $\bar K$ is a candidate for $\phi$. (A sketch of steps 1-4 follows this list.)
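Putting steps 1-4 together, a sketch (my own; it uses PyTorch autograd for the derivatives and a plain SVD for step 4, which may differ from the paper's exact procedure) could look like this, assuming $d = 2$ with columns $(x, t)$ and the `FBar` network from above:

```python
import torch

def build_K(f_bar, X_bar):
    """Evaluate each dictionary function on the sampled points X_bar (m, 2),
    using autograd for the partial derivatives. Returns an (m, k) matrix."""
    X_bar = X_bar.clone().requires_grad_(True)
    f = f_bar(X_bar)                                          # (m,)
    grad = torch.autograd.grad(f.sum(), X_bar, create_graph=True)[0]
    f_x, f_t = grad[:, 0], grad[:, 1]                         # first derivatives
    f_xx = torch.autograd.grad(f_x.sum(), X_bar, create_graph=True)[0][:, 0]
    # Columns = dictionary functions: (f_t, f_x, f_xx, f * f_x)
    return torch.stack([f_t, f_x, f_xx, f * f_x], dim=1)      # (m, k)

def estimate_phi(K):
    """Right singular vector associated with the smallest singular value of K."""
    _, _, Vh = torch.linalg.svd(K)
    return Vh[-1]                                              # (k,)

# Usage sketch:
# idx   = torch.randperm(len(X))[:m]
# K_bar = build_K(f_bar, torch.as_tensor(X[idx], dtype=torch.float32))
# phi   = estimate_phi(K_bar.detach())
```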

But there is a small catch here. In step 2, we are evaluating dictionary functions on sampled data points, and we don't know the exact representation of $f$. We are relying on our neural network to give us an approximate representation of $f$, and while neural networks are theoretically capable of approximating a function (again, under a few assumptions) to arbitrary accuracy, they can't produce the exact representation. So the matrix $\bar K$ carries some implicit error which can't be removed. This also means that $\bar K$ may be a full-rank matrix (because of that independent implicit error). Hence, we may not find any null vector in step (4).

But if one assumes that every dictionary function is distributive (so that small perturbations of $f$ produce small perturbations of the dictionary values) and that $|f-\bar f| < \epsilon$, where $\epsilon$ is an arbitrary positive number, then one can easily prove that $\bar K$ will approximate $D(f, \bar{\mathbf{X}})$, where $D(f, \bar{\mathbf{X}})$ denotes the dictionary functions evaluated on $\bar{\mathbf{X}}$ using the exact representation of $f$. Thus, the right singular vector associated with the smallest singular value of $\bar K$ will be an approximation of $\phi$. (Remember, singular values are always non-negative, and the null space is spanned by right singular vectors whose associated singular value is $0$. So, if $\bar K$ is an approximation to $D(f, \bar{\mathbf{X}})$, then the right singular vector corresponding to the singular value closest to $0$, i.e. the minimum singular value, will be an approximation to $\phi$.)

Finally, we know what to find and why to find it. We just don't know how. There are several methods for finding the smallest singular value and corresponding right singular vector of a given matrix. The authors of this paper choose to go with the min-max theorem for singular values. Based on the quality and quantity of your data, you may choose some other method.
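For reference, the relevant special case of the min-max theorem (for the smallest singular value) is

$$\sigma_{\min}(\bar K) = \min_{\|v\|_2 = 1} \|\bar K v\|_2,$$

and the minimizing $v$ is the corresponding right singular vector, i.e. the approximation of $\phi$. Phrased this way, finding $\phi$ becomes a small constrained optimization problem rather than a full SVD, which is convenient inside a training loop.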

Loss function

Intuition behind loss function

We want our approximated function to fit the data as well as the underlying PDE. And since it's machine learning, it won't be complete without Occam's razor :). Hence, we can say that

$$L_{net} = G(L_{model}, L_{PDE}, L_{complexity})$$

Assuming the data points ($\mathbf{X}$) are i.i.d. and come from a Gaussian distribution, $L_{model}$ can simply be the MSE. In most practical cases, people tend to ignore the Gaussian prior assumption (never understood why they do it. Humans are complicated, aren't they?). Assuming $n$ data points:

$$L_{model} = \sqrt{\dfrac{1}{2n}\sum_{i=1}^{n}\|f(x_i) - \bar f(x_i)\|_2^2}$$

$L_{PDE}$ is also simple. Assuming you get $\phi^*$ as your coefficients for the underlying PDE, $\|D(\bar f, \bar{\mathbf{X}})\cdot\phi^*\|_2^2$ should give $L_{PDE}$. Why? Because if $\bar f$ approximates $f$, then $D(\bar f, \bar{\mathbf{X}})\cdot\phi^*$ should be (close to) $0$. (Remember, $\phi$ lies in the null space of $\mathbf{D}$.)

$$L_{PDE} = \|D(\bar f, \bar{\mathbf{X}})\cdot\phi^*\|_2^2$$

Coming to $L_{complexity}$, I would say it depends on the data and your inductive bias. Here, the authors assume that you want your underlying PDE to be sparse, i.e. it should depend on as few dictionary functions as possible. Hence, they apply $L_1$ regularization on $\phi$.

$$L_{complexity} = \|\phi\|_1$$

But based on your priors, you can always choose (or invent) a suitable regularizer.

Formulating the loss function

Now, let's think about representing the loss function in terms of the three losses described above. We want the PDE to intervene when the model is overfitting. Generally, a model overfits when the data is noisy, i.e. it contains outliers. Outliers cause a spike in the loss value, thus increasing the mean (remember, $L_{model}$ is MSE) and consequently forcing the model to fit each outlier perfectly. However, in the process of fitting every outlier, the function the model learns becomes less and less smooth. And that means the underlying PDE will have to depend on more and more dictionary functions.

But that can't happen. Why? Because of the $L_1$ regularization applied on $\phi$. (Occam's razor finally strikes!)

In order to make all of this happen, we must ensure that whenever $L_{model}$ is high, $L_{PDE}$ and $L_{complexity}$ dominate the loss function. $L_{complexity}$ will force the underlying PDE to depend on fewer dictionary functions, and $L_{PDE}$ will force the model to learn a smooth function.

Also, when normal data points (i.e. data points which are not outliers) are encountered, $L_{model}$ should dominate the loss function.

Summarizing the above reasoning:

  • When $L_{model}$ is high, we want $L_{PDE}$ and $L_{complexity}$ to dominate the net loss.
  • When $L_{model}$ is low, we want $L_{model}$ to dominate the net loss.

A simple function satisfying the above conditions is:

$$L_{net} = G(L_{model}, L_{PDE}, L_{complexity}) = \lambda_{model}\,L_{model}\,(1 + \lambda_{PDE}L_{PDE} + \lambda_{complexity}L_{complexity})$$

where $\lambda_{model}$, $\lambda_{PDE}$ and $\lambda_{complexity}$ are Lagrange multipliers (in practice, tunable weighting hyperparameters).
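As a sketch (my own naming and structure; treat it as illustrative rather than the paper's exact implementation), the combined loss in PyTorch, reusing `build_K` and `estimate_phi` from the earlier snippets, could look like:

```python
import torch

def net_loss(f_bar, X, f_X, X_bar, phi_star,
             lam_model=1.0, lam_pde=1.0, lam_complexity=1e-3):
    """L_net = lam_model * L_model * (1 + lam_pde * L_PDE + lam_c * L_complexity)."""
    # Data-fit term (the square-root form from the text).
    pred = f_bar(X)                                      # X: (n, d), f_X: (n,)
    l_model = torch.sqrt(((f_X - pred) ** 2).sum() / (2 * len(X)))

    # PDE term: squared residual of the dictionary matrix against phi*.
    K_bar = build_K(f_bar, X_bar)                        # (m, k)
    l_pde = (K_bar @ phi_star).pow(2).sum()

    # Sparsity (complexity) term on the PDE coefficients.
    l_complexity = phi_star.abs().sum()

    return lam_model * l_model * (1 + lam_pde * l_pde + lam_complexity * l_complexity)
```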

Results

The results claimed by the authors seem decent. The error computation is as follows:

  1. Normalize the original PDE coefficients ($\phi$) and the predicted PDE coefficients ($\phi^*$),
  2. Compute their dot product,
  3. Subtract it from 1,
  4. Take the square root of the difference.

Obviously, because of the normalization, the dot product will be between -1 and 1. Hence, in step 3, the difference will never be negative, so the error will always be real.
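A small sketch of that error metric (my own; it assumes both coefficient vectors are NumPy arrays):

```python
import numpy as np

def pde_coefficient_error(phi_true, phi_pred):
    """sqrt(1 - <phi/||phi||, phi*/||phi*||>), as described above."""
    phi_true = phi_true / np.linalg.norm(phi_true)
    phi_pred = phi_pred / np.linalg.norm(phi_pred)
    # max() guards against tiny negative values caused by round-off.
    return np.sqrt(max(0.0, 1.0 - np.dot(phi_true, phi_pred)))
```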

However, the authors haven't reported the MSE values on the train and test data. That would be a good thing to try.

Refer to the paper for exact values.

My two cents

This paper proposes a new way of tackling overfitting. It assumes that as the model encounters more and more noisy data points, overfitting will decrease the smoothness of the function being approximated. This assumption does make sense and holds for all practical cases I've seen so far. In that sense it provides an implicit solution to overfitting, which eliminates the need for overfitting-mitigation methods like cross-validation, manually checking results on validation data, etc.

However, there are two clear disadvantages:

  • Pre-specifying the dictionary functions
  • The assumption that the PDE is a linear combination of dictionary functions

Because of the reasons mentioned above, this method is useless for domains where we don't have any credible information about the nature of the underlying PDE.

However, this method will be very useful in domains like physics and finance, where we already have a set of potential dictionary functions.

Up Next

  • Reproduce their work and extend it to some other PDEs.
  • Report MSE on the train and test sets.
  • Improve hyperparameter search by using ASHA, PBT or other techniques.

If you have any issues/complaints/suggestions, feel free to mail me at rishurai24@gmail.com, or open an issue here.