Neural networks are universal function approximators

According to the universal approximation theorem, a neural network with a single hidden layer can approximate any continuous function on a bounded interval arbitrarily well, provided it has enough neurons and its weights and biases are chosen correctly. A given choice of parameters can be represented as a point in a multidimensional space, each parameter (weight or bias) of the neural network corresponding to one dimension. Finding the best approximation of a given function then amounts to gradient descent in that space: just as you can find the lowest point of a terrain by always walking downhill, you repeatedly move the parameters in the direction that decreases the error.
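
To make the "always walking downhill" idea concrete, here is a minimal J sketch (an illustrative toy, not part of the program below; all names in it are mine) that takes one descent step on a single parameter, estimating the slope with the same central-difference trick used later in the full program:

NB. Illustrative toy (not part of the program below): one gradient
NB. descent step on a single parameter of a loss with its minimum at 3.
h =: 3 : '*: y - 3'           NB. toy loss: (y-3)^2
q =: 10                       NB. starting point
d =: 0.0001                   NB. half-width of the finite difference
slope =: (2*d) %~ (h q + d) - h q - d   NB. central-difference slope: 14 here
q =: q - 0.1 * slope          NB. step downhill: q moves from 10 to 8.6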

Here is an example: a function with one input and one output, approximated by a neural network with one intermediate (hidden) layer of 10 neurons.
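
Concretely, with input weights w1, biases b and output weights w2, such a network computes

g(x) = +/ w2 * 0 >. (w1 * x) + b

that is, the sum over the 10 neurons of w2_i * max(0, (w1_i * x) + b_i): a ReLU hidden layer followed by a linear output (0 >. is J's maximum-with-zero). This is exactly the expression implemented by the verb g in the code below.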

Here is the J code implementing this network:


NB. Neural network universal function approximator

load 'plot'

X =: 0.1 * _40 + i. 81  NB. Input values from _4 to 4 with step 0.1

f =: 3 : '(0.5*y^3) + (y^2) + (_5*y) + 6'  NB. Function to approximate

N =: 10                       NB. Number of intermediate neurons

NB. Initial random values
b =: 0.1 * _100 + ? N # 200   NB. Biases
w1 =: 0.1 * _100 + ? N # 200  NB. Input weights 
w2 =: 0.1 * _100 + ? N # 200  NB. Output weights

g =: (3 : '+/ w2 * 0 >. (w1 * y) + b')"0  NB. Neural approximation: ReLU hidden layer, linear output

loss =: +/ *: (g X) - (f X)  NB. Sum-of-squares loss at the initial random parameters

NB. Loss as a function of one flattened parameter vector: b, w1, w2
NB. Usage: loss =: computeloss b, w1, w2
NB. The rank "_ 1 also lets it apply row by row to a table of such vectors
computeloss =: 3 : 0"_ 1
 b =. (i. N) { y                   NB. First N entries: biases
 w1 =. (N + i. N) { y              NB. Next N: input weights
 w2 =. ((2*N) + i. N) { y          NB. Last N: output weights
 t =. f X                          NB. Target values
 p =. +/ w2 * 0 >. (w1 */ X) + b   NB. Predictions for all of X at once
 +/ *: p - t                       NB. Sum of squared errors
)
 
eps =: 0.0001  NB. Half-width of the central finite difference

NB. Gradient descent on the flattened parameter vector
descent =: 3 : 0
 P =. b, w1, w2
 nP =. # P
 step =. 0
 for. i. 10000 do.
  step =. step + 1
  loss =. computeloss P
  NB. (i. nP) =/ i. nP is the nP-by-nP identity matrix; each row of
  NB. Pplus (Pminus) is P with eps added to (subtracted from) one parameter
  Pplus =. |: P + eps * (i. nP) =/ i. nP
  Pminus =. |: P - eps * (i. nP) =/ i. nP
  NB. Central-difference estimate of the gradient of the loss
  gradient =. (1 % 2*eps) * (computeloss Pplus) - (computeloss Pminus)
  echo 'Step ', (": step), ' : loss = ', (": loss)
  P =. P - 0.00001 * gradient  NB. Learning rate 0.00001
 end.
 P
)

P =: descent 0
NB. Unpack the trained parameters into the globals used by g
b =: (i. N) { P
w1 =: (N + i. N) { P
w2 =: ((2*N) + i. N) { P
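
NB. Optional sanity check (an addition, not in the original script): the
NB. final loss should be far below the value printed at step 1.
echo 'Final loss : ', ": computeloss P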

plot X ; (f X) ,: (g X)  NB. Plot the target f and the approximation g over X
pd 'pdf nnufa.pdf'
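
Since g is an ordinary J verb, the fit can also be checked pointwise; for example (the network's output will vary with the random initialization):

g 1.5   NB. network prediction at x = 1.5
f 1.5   NB. exact value: 2.4375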

After training with gradient descent, the network gives a good approximation of the original function on the sampled interval.
