Link to paper: Learning Partial Differential Equations From Data Using Neural Networks

Prerequisites

A partial differential equation (PDE) relates a function to its partial derivatives. It can be written in the general form:

$$F\left(\mathbf{x},\, f(\mathbf{x}),\, \frac{\partial f}{\partial x_1},\, \frac{\partial f}{\partial x_2},\, \ldots,\, \frac{\partial^t f}{\partial x_n^t},\, \frac{\partial^2 f}{\partial x_1 \partial x_2},\, \ldots\right) = 0$$

where $\mathbf{x} \in \mathbb{R}^n$ and $t \in \mathbb{Z}$.

If one restricts $F$ to be linear, then the above equation can be represented as

$$\mathbf{D}\phi = \sum_{i=1}^{k}\phi_i D_i = 0$$

where $\mathbf{D}$ is a pre-specified set of functions (a.k.a. dictionary functions) on which you expect the PDE to depend, and $\phi$ is the vector of linear coefficients. It is evident that $\phi \in N(\mathbf{D})$, where $N(\mathbf{D})$ is the null space of $\mathbf{D}$.
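As a concrete example (my own, not taken from the paper), take the 1-D heat equation $f_t - \alpha f_{xx} = 0$. With the dictionary $\mathbf{D} = (f_t,\, f_x,\, f_{xx},\, f,\, f f_x)$, the equation corresponds to the coefficient vector $\phi = (1,\, 0,\, -\alpha,\, 0,\, 0)$:

$$\mathbf{D}\phi = 1\cdot f_t + 0\cdot f_x - \alpha\cdot f_{xx} + 0\cdot f + 0\cdot f f_x = f_t - \alpha f_{xx} = 0$$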

Key points to note
  • Dictionary functions: Set of functions on which you expect your PDE to depend
  • The null space of a matrix is spanned by its right singular vectors whose associated singular value is $0$.
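A quick illustration of the second point (my own, using NumPy): the rows of the `Vt` matrix returned by `np.linalg.svd` are the right singular vectors, and the ones paired with (near-)zero singular values span the null space.

```python
import numpy as np

# Rank-deficient matrix: the third column equals the sum of the first two,
# so A has a one-dimensional null space.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])

U, S, Vt = np.linalg.svd(A)
print(S)             # the last singular value is ~0
null_vec = Vt[-1]    # right singular vector for the smallest singular value
print(A @ null_vec)  # ~[0, 0, 0]
```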

Available Data

Three things are available to us:

$$\mathbf{X},\; f(\mathbf{X}),\; \mathbf{D}$$

where,

$\mathbf{X} = (x_1, x_2, x_3, \ldots, x_n)$ and $x_i \in \mathbb{R}^d$

$f(\mathbf{X}) = (f(x_1), f(x_2), f(x_3), \ldots, f(x_n))$ and $f(x_i) \in \mathbb{R}$

$\mathbf{D} = (D_1, D_2, D_3, \ldots, D_k)$: the dictionary functions; each $D_i$ is either some function of $f$ or relates $f$ to its partial derivative(s).
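To make the shapes concrete, here is a minimal sketch (my own; for illustration only, taking $d = 2$ with coordinates $(x, t)$) of how the available data and the dictionary might be laid out in code. Each dictionary entry is written as a callable over $f$ and its derivatives:

```python
import numpy as np

n, d = 1000, 2                     # n sample points in d dimensions, e.g. (x, t)
X = np.random.rand(n, d)           # inputs x_i in R^d (placeholder values)
f_X = np.random.rand(n)            # observations f(x_i) in R (placeholder values)

# Dictionary functions: each maps f and its derivatives at a point to one
# candidate term of the PDE. Which terms to include is a modelling choice.
dictionary = [
    lambda f, f_x, f_t, f_xx: f_t,       # time derivative
    lambda f, f_x, f_t, f_xx: f_x,       # first spatial derivative
    lambda f, f_x, f_t, f_xx: f_xx,      # second spatial derivative
    lambda f, f_x, f_t, f_xx: f * f_x,   # a simple nonlinear term
]
k = len(dictionary)                # number of dictionary functions
```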

Output we want

  • An approximation of $f$
  • The underlying PDE, expressed as a linear combination of dictionary functions

Proposed Architecture

Since artificial neural networks are universal function approximators (under certain assumptions, of course), one can use them to approximate a given function $f$. Let's call the approximated function $\bar f$.

The only issue with this approach is that it overfits if the data is noisy. And this is where the PDE shines.
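A minimal PyTorch sketch of such an approximator $\bar f$ (the architecture here is my own choice, not necessarily the paper's): a small fully connected network mapping $\mathbb{R}^d \to \mathbb{R}$, with a smooth activation so that the partial derivatives we need later exist everywhere.

```python
import torch
import torch.nn as nn

class FBar(nn.Module):
    """Approximates f: R^d -> R. Tanh keeps the network infinitely
    differentiable, which matters because we differentiate it later."""
    def __init__(self, d, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):          # x: (batch, d)
        return self.net(x).squeeze(-1)
```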

How PDEs will be used for parameter estimation

We add a regularizer term which penalizes the network if the function it generates isn't a solution to the PDE (we'll discuss the penalization aspect later).

But we still haven't figured out how to come up with $\phi$. So let's focus on that.

Finding $\phi$

The approach is as follows:

  1. Sample $m$ points from $\mathbf{X}$. Call this set of points $\bar{\mathbf{X}}$.
  2. Evaluate the dictionary functions $\mathbf{D}$ on $\bar{\mathbf{X}}$. This operation will be represented as $D(\bar f, \bar{\mathbf{X}})$; it is explained further in the paragraphs below.
  3. Since there are $k$ dictionary functions, step (2) yields a matrix of dimension $m \times k$. Let's call this matrix $\bar K$.
  4. Using some technique, find the null vectors of $\bar K$. Note that any null vector of $\bar K$ is a candidate for $\phi$. (A sketch of steps 1-4 follows this list.)
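Putting steps 1-4 together, a sketch (my own; it uses PyTorch autograd for the derivatives and a plain SVD for step 4, which may differ from the paper's exact procedure) could look like this, assuming $d = 2$ with columns $(x, t)$ and the `FBar` network from above:

```python
import torch

def build_K(f_bar, X_bar):
    """Evaluate each dictionary function on the sampled points X_bar (m, 2),
    using autograd for the partial derivatives. Returns an (m, k) matrix."""
    X_bar = X_bar.clone().requires_grad_(True)
    f = f_bar(X_bar)                                          # (m,)
    grad = torch.autograd.grad(f.sum(), X_bar, create_graph=True)[0]
    f_x, f_t = grad[:, 0], grad[:, 1]                         # first derivatives
    f_xx = torch.autograd.grad(f_x.sum(), X_bar, create_graph=True)[0][:, 0]
    # Columns = dictionary functions: (f_t, f_x, f_xx, f * f_x)
    return torch.stack([f_t, f_x, f_xx, f * f_x], dim=1)      # (m, k)

def estimate_phi(K):
    """Right singular vector associated with the smallest singular value of K."""
    _, _, Vh = torch.linalg.svd(K)
    return Vh[-1]                                              # (k,)

# Usage sketch:
# idx   = torch.randperm(len(X))[:m]
# K_bar = build_K(f_bar, torch.as_tensor(X[idx], dtype=torch.float32))
# phi   = estimate_phi(K_bar.detach())
```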

But there is a small catch here. In step 2, we are evaluating dictionary functions on sampled data points, and we don't know the exact representation of $f$. We are relying on our neural network to give us an approximate representation of $f$, and while neural networks are theoretically capable of approximating a function (again, under a few assumptions) to arbitrary accuracy, they can't produce the exact representation. So the matrix $\bar K$ carries some implicit error which can't be removed. This also means that $\bar K$ may be a full-rank matrix (because of that independent implicit error). Hence, we may not find any null vector in step (4).

But if one assumes that every dictionary function is distributive (so that small perturbations of $f$ produce small perturbations of the dictionary values) and that $|f-\bar f| < \epsilon$, where $\epsilon$ is an arbitrary positive number, then one can easily prove that $\bar K$ will approximate $D(f, \bar{\mathbf{X}})$, where $D(f, \bar{\mathbf{X}})$ denotes the dictionary functions evaluated on $\bar{\mathbf{X}}$ using the exact representation of $f$. Thus, the right singular vector associated with the smallest singular value of $\bar K$ will be an approximation of $\phi$. (Remember, singular values are always non-negative, and the null space is spanned by right singular vectors whose associated singular value is $0$. So, if $\bar K$ is an approximation to $D(f, \bar{\mathbf{X}})$, then the right singular vector corresponding to the singular value closest to $0$, i.e. the minimum singular value, will be an approximation to $\phi$.)

Finally, we know what to find and why to find it. We just don't know how. There are several methods for finding the smallest singular value and corresponding right singular vector of a given matrix. The authors of this paper choose to go with the min-max theorem for singular values. Based on the quality and quantity of your data, you may choose some other method.
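For reference, the relevant special case of the min-max theorem (for the smallest singular value) is

$$\sigma_{\min}(\bar K) = \min_{\|v\|_2 = 1} \|\bar K v\|_2,$$

and the minimizing $v$ is the corresponding right singular vector, i.e. the approximation of $\phi$. Phrased this way, finding $\phi$ becomes a small constrained optimization problem rather than a full SVD, which is convenient inside a training loop.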

Loss function

Intuition behind loss function

We want our approximated function to fit the data as well as the underlying PDE. And since it's machine learning, it won't be complete without Occam's razor :). Hence, we can say that

$$L_{net} = G(L_{model}, L_{PDE}, L_{complexity})$$

Assuming the data points ($\mathbf{X}$) are i.i.d. and come from a Gaussian distribution, $L_{model}$ can simply be the MSE. In most practical cases, people tend to ignore the Gaussian prior assumption (never understood why they do it. Humans are complicated, aren't they?). Assuming $n$ data points:

$$L_{model} = \sqrt{\dfrac{1}{2n}\sum_{i=1}^{n}\|f(x_i) - \bar f(x_i)\|_2^2}$$

$L_{PDE}$ is also simple. Assuming you get $\phi^*$ as your coefficients for the underlying PDE, $\|D(\bar f, \bar{\mathbf{X}})\cdot\phi^*\|_2^2$ should give $L_{PDE}$. Why? Because if $\bar f$ approximates $f$, then $D(\bar f, \bar{\mathbf{X}})\cdot\phi^*$ should be (close to) $0$. (Remember, $\phi$ lies in the null space of $\mathbf{D}$.)

$$L_{PDE} = \|D(\bar f, \bar{\mathbf{X}})\cdot\phi^*\|_2^2$$

Coming to $L_{complexity}$, I would say it depends on the data and your inductive bias. Here, the authors assume that you want your underlying PDE to be sparse, i.e. it should depend on as few dictionary functions as possible. Hence, they apply $L_1$ regularization on $\phi$.

$$L_{complexity} = \|\phi\|_1$$

But based on your priors, you can always choose (or invent) a suitable regularizer.

Formulating the loss function

Now, let's think about representing the loss function in terms of the three losses described above. We want the PDE to intervene when the model is overfitting. Generally, a model overfits when the data is noisy, i.e. it contains outliers. Outliers cause a spike in the loss value, thus increasing the mean (remember, $L_{model}$ is MSE) and consequently forcing the model to fit each outlier perfectly. However, in the process of fitting every outlier, the function the model learns becomes less and less smooth. And that means the underlying PDE will have to depend on more and more dictionary functions.

But that can't happen. Why? Because of the $L_1$ regularization applied on $\phi$. (Occam's razor finally strikes!)

In order to make all of this happen, we must ensure that whenever $L_{model}$ is high, $L_{PDE}$ and $L_{complexity}$ dominate the loss function. $L_{complexity}$ will force the underlying PDE to depend on fewer dictionary functions, and $L_{PDE}$ will force the model to learn a smooth function.

Also, when normal data points (i.e. data points which are not outliers) are encountered, $L_{model}$ should dominate the loss function.

Summarizing the above reasoning:

  • When $L_{model}$ is high, we want $L_{PDE}$ and $L_{complexity}$ to dominate the net loss.
  • When $L_{model}$ is low, we want $L_{model}$ to dominate the net loss.

A simple function satisfying the above conditions is:

$$L_{net} = G(L_{model}, L_{PDE}, L_{complexity}) = \lambda_{model}\,L_{model}\,(1 + \lambda_{PDE}L_{PDE} + \lambda_{complexity}L_{complexity})$$

where $\lambda_{model}$, $\lambda_{PDE}$ and $\lambda_{complexity}$ are Lagrange multipliers (in practice, tunable weighting hyperparameters).
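As a sketch (my own naming and structure; treat it as illustrative rather than the paper's exact implementation), the combined loss in PyTorch, reusing `build_K` and `estimate_phi` from the earlier snippets, could look like:

```python
import torch

def net_loss(f_bar, X, f_X, X_bar, phi_star,
             lam_model=1.0, lam_pde=1.0, lam_complexity=1e-3):
    """L_net = lam_model * L_model * (1 + lam_pde * L_PDE + lam_c * L_complexity)."""
    # Data-fit term (the square-root form from the text).
    pred = f_bar(X)                                      # X: (n, d), f_X: (n,)
    l_model = torch.sqrt(((f_X - pred) ** 2).sum() / (2 * len(X)))

    # PDE term: squared residual of the dictionary matrix against phi*.
    K_bar = build_K(f_bar, X_bar)                        # (m, k)
    l_pde = (K_bar @ phi_star).pow(2).sum()

    # Sparsity (complexity) term on the PDE coefficients.
    l_complexity = phi_star.abs().sum()

    return lam_model * l_model * (1 + lam_pde * l_pde + lam_complexity * l_complexity)
```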

Results

The results claimed by the authors seem decent. The error computation is as follows:

  1. Normalize the original PDE coefficients ($\phi$) and the predicted PDE coefficients ($\phi^*$),
  2. Compute their dot product,
  3. Subtract it from 1,
  4. Take the square root of the difference.

Obviously, because of the normalization, the dot product will be between -1 and 1. Hence, in step 3, the difference will never be negative, so the error will always be real.
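A small sketch of that error metric (my own; it assumes both coefficient vectors are NumPy arrays):

```python
import numpy as np

def pde_coefficient_error(phi_true, phi_pred):
    """sqrt(1 - <phi/||phi||, phi*/||phi*||>), as described above."""
    phi_true = phi_true / np.linalg.norm(phi_true)
    phi_pred = phi_pred / np.linalg.norm(phi_pred)
    # max() guards against tiny negative values caused by round-off.
    return np.sqrt(max(0.0, 1.0 - np.dot(phi_true, phi_pred)))
```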

However, the authors haven't reported the MSE values on the train and test data. That would be a good thing to try.

Refer to the paper for exact values.

My two cents

This paper proposes a new way of tackling overfitting. It assumes that as the model encounters more and more noisy data points, overfitting will decrease the smoothness of the function being approximated. This assumption does make sense and holds for all practical cases I've seen so far. In that sense it provides an implicit solution to overfitting, which eliminates the need for overfitting-mitigation methods like cross-validation, manually checking results on validation data, etc.

However, there are two clear disadvantages:

  • Pre-specifying the dictionary functions
  • The assumption that the PDE is a linear combination of dictionary functions

Because of the reasons mentioned above, this method is useless for domains where we don't have any credible information about the nature of the underlying PDE.

However, this method will be very useful in domains like physics and finance, where we already have a set of potential dictionary functions.

Up Next

  • Reproduce their work and extend it to some other PDEs.
  • Report MSE on the train and test sets.
  • Improve hyperparameter search by using ASHA, PBT or other techniques.

If you have any issues/complaints/suggestions, feel free to mail me at rishurai24@gmail.com, or open an issue here.