AI from scratch

My first neural net

Like everybody else I've been playing a lot with ChatGPT recently and I'm gobsmacked by how good it is!  This has led me to start researching how exactly these things work.  In particular I wanted to understand how neural nets - a core component of the technology - are trained.

Running a neural net is simple.  The neural net consists of layers of nodes; each node in one layer is connected to each node in the next by a weight, and each node has a bias.  The first layer is the input layer: setting the activation levels of its nodes causes the next layer to adopt values determined by the weights and biases, that layer in turn determines the activation levels in the layer after it, and so on until the final layer, which is interpreted as the output.

The boxes on the right represent the logistic function $y = 1/(1 + e^{-x})$.  Without these the intermediate layers become redundant.  This is because, however many layers there are, the output $y_{out}$ would always be $A\ y_{in} + b$ for some matrix $A$ and vector $b$.  So neural nets are a nice example of how introducing non-linearity to a system also introduces a lot more complexity in the behaviour.  (NB: other non-linear functions are available.)
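In symbols, each layer-to-layer step is

$$y_{l+1,j} = \sigma\Big(\sum_i w_{l,i,j}\, y_{l,i} + b_{l,j}\Big), \qquad \sigma(x) = \frac{1}{1+e^{-x}},$$

where $y_{l,i}$ is the activation of node $i$ in layer $l$, $w_{l,i,j}$ is the weight on the connection from that node to node $j$ of layer $l+1$, and $b_{l,j}$ is the bias of that node (the same indexing the code below uses).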

The clever bit about neural nets is how you train them.  3Blue1Brown does a fantastic job of explaining the method known as "back-propagation".  In essence, an error function $E(\mathbf{w},\mathbf{b})$ is defined which maps the weights and biases to a non-negative real number.  $E$ is defined using a training set which consists of inputs paired with "expected" outputs.  Its value is half the square of the difference between the expected output and the actual output, summed over all nodes in the output layer and over all cases in the training set.  Then it is simply a case of doing a gradient descent to minimise $E$.
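Concretely, if $d^{(c)}_i$ is the expected value of output node $i$ for case $c$ in the training set and $y^{(c)}_{L-1,i}$ is the value the net actually produces, then

$$E = \tfrac{1}{2} \sum_c \sum_i \left(y^{(c)}_{L-1,i} - d^{(c)}_i\right)^2.$$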

The term back-propagation comes from the fact that, as part of the gradient descent, we need to work out $\partial E_c/\partial y_{l,i}$ for each case $c$ in the training set.  Doing this requires working out the derivatives for the output layer ($l=L-1$) first, then for the layer immediately before that, and so on until you reach the input layer.
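Since the derivative of the logistic function is $\sigma'(x) = \sigma(x)(1-\sigma(x))$, the chain rule gives the recurrence

$$\frac{\partial E_c}{\partial y_{l,i}} = \sum_j \frac{\partial E_c}{\partial y_{l+1,j}}\; y_{l+1,j}\,(1 - y_{l+1,j})\; w_{l,i,j},$$

starting from $\partial E_c/\partial y_{L-1,i} = y_{L-1,i} - d_i$ at the output layer.  The derivatives with respect to the weights and biases then follow from one more application of the chain rule, and that is all the code below needs.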

After watching the 3Blue1Brown vids I wondered if I could write my own neural net.  I wanted to do it from scratch rather than use one of the many libraries, as this is the only way to be really sure you've got it.  However, trying to reproduce the neural net described in that video would be quite hard because of the large number of neurons and the huge training set.  (The application was optical character recognition.)  I really wanted to find a minimal application just for the purpose of checking my own understanding.  Fortunately I came across just the thing in the seminal 1986 paper by Rumelhart, Hinton, and Williams that introduced back-propagation.

Rumelhart, D., Hinton, G. & Williams, R. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).

The neural net above can be trained using a training set with just 64 cases.  That's because its purpose is to determine whether a 6-bit input is symmetric.  The picture from the paper shows the weights above the connecting lines, and the biases inside the nodes.  These weights and biases successfully output 1 (approximately) whenever a symmetric word like 0b110011 is presented to the input, and 0 (approximately) whenever a non-symmetric word like 0b110001 is presented.

It's easy to show that this problem genuinely requires an intermediate layer, making it one of the smallest non-trivial neural nets, and one of the easiest to train.  That makes it a perfect target for a lazy programmer in a hurry to try to reproduce.  And here's what I got after training a neural net written from scratch in Python (I know!) with 5000 applications of the training set (0b000000, 1), (0b000001, 0), ....

The resulting weights and biases do actually work, in that they successfully identify which inputs are symmetric and which are not.  (That surprised me, actually.)  Also, although the values are slightly different to the ones obtained by Rumelhart et al., the key features are the same.  In particular the 2 weights from each node in the input layer are negatives of one another.  They are also negatives of the 2 weights from the other input node that it is paired with when symmetry-testing.  Furthermore, the absolute values have a 1:2:4 ratio.  The output node is essentially on by default, but if there is any asymmetry at least one of the two intermediate nodes gets activated, suppressing the output.  The 1:2:4 ratio of the weights between the 1st and 2nd layers makes it impossible for multiple asymmetric pairs in the input array to cancel.  Although these weights and biases were arrived at by nothing more than gradient descent, it's hard not to get the feeling something designed them!
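To see why the 1:2:4 ratio matters, note that (roughly, and up to an overall scale and sign) each hidden node receives a net input of the form $w(x_1 - x_6) + 2w(x_2 - x_5) + 4w(x_3 - x_4)$, where $x_1, \ldots, x_6$ are the input bits.  Each bracketed difference is $-1$, $0$ or $+1$, so the total is $w$ times a signed sum of 1, 2 and 4, which can only be zero if every difference is zero, i.e. if the input is symmetric.  Whatever combination of pairs mismatches, the net input cannot cancel to zero, so at least one of the two intermediate nodes (whose weights have opposite signs) gets activated.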

I did learn something from the exercise.  Firstly, when training the net I found that it very quickly reached $E\approx 4.0$ and spent a long time hovering around there before starting to converge towards zero.  Later I realized this is because 56 of the 64 training entries have an expected output of 0.0, so a good first approximation is to just always output zero.  The conclusion from this observation is that the net would train more quickly if each batch used to train it were half symmetric and half non-symmetric (chosen randomly of course, as in the sketch below).  Secondly, it wasn't obvious exactly how to do the gradient descent.  Determining $\nabla E$ was straightforward enough, and it was clear that the weights and biases should change by $-\epsilon \nabla E$, but what value of $\epsilon$ should be used?  I found that for this particular net and training set the value 1.0 worked fine, but I would have been happier if I had found a more general answer. (See POSTSCRIPT.)
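Going back to that first observation, a balanced batch could be drawn something like this (just a sketch; the balanced_batch helper isn't used in the actual training code below):

import random

def balanced_batch(k=8):
    """Sketch: k random symmetric and k random non-symmetric 6-bit cases, shuffled."""
    symmetric = [c for c in range(64) if f"{c:06b}" == f"{c:06b}"[::-1]]
    asymmetric = [c for c in range(64) if c not in symmetric]
    cases = random.sample(symmetric, k) + random.sample(asymmetric, k)
    random.shuffle(cases)
    # Return (input, desired_output) pairs ready to feed to batch_update()
    return [([(c >> n) & 1 for n in range(6)], [float(c in symmetric)])
            for c in cases]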

Here's my code.

import numpy as np


class NeuralNet:
    """
    Usage:
      nn = NeuralNet([3,2,3]) # 3 input nodes, 2 intermediate nodes, 3 output nodes

      # randomize weights
      nn.randomize()

      # training: repeat the following 3 steps for each batch of inputs and expected output
      nn.batch_start()
      nn.batch_update([1,0,0.5], [.3,.4,.4])  # input and expected output (all values between 0 and 1)
      nn.batch_update([.3,.1,.2], [.3,.2,.1]) # ""
      nn.batch_update([1,1,0.5], [.3,.4,.7])  # ""
      nn.batch_complete()                     # updates weights

      # use of pre-trained neural net
      nn.set_input([1,0,0.5])
      nn.get_output()                      # returns e.g. [.3,.4,.4]
    """

    def __init__(self, n: list[int]):
        """
        Create a neural net with

            * n[0] nodes in input layer
            * n[1] nodes in next layer
              ...
            * n[-1] nodes in output layer

        All weights and biases are initially zero
        """
        # L = number of layers
        self._L = len(n)
        self._n = n
        n_max = max(n)
        L = self._L

        # y[l,i] = activation level of ith node in lth layer (l<L, i<n[l])
        self._y = np.zeros((L, n_max))

        # w[l,i,j] = signal weight for ith node of lth layer to jth node of l+1st layer
        self._w = np.zeros((L - 1, n_max, n_max))

        # b[l,j] = bias level of jth node in l+1st layer
        self._b = np.zeros((L - 1, n_max))

        # E = cumulative error in this batch
        self._E = 0.0

        # dEdw[l,i,j] = partial derivative of E (0.5 x sum of squares of errors in batch)
        self._dEdw = np.zeros((L - 1, n_max, n_max))

        # dEdb[l,j] = partial derivative of E
        self._dEdb = np.zeros((L - 1, n_max))

        # dEdy[l,i] = partial derivative of E (only used in batch_update() - needn't persist between calls)
        self._dEdy = np.zeros((L, n_max))

        # w,b are updated by -epsilon * dEdw, -epsilon * dEdb on batch_complete
        self._epsilon = 1.0

    @staticmethod
    def _logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def randomize(self, loc=0, scale=1):
        self._w = np.random.normal(loc, scale, size=self._w.shape)
        self._b = np.random.normal(loc, scale, size=self._b.shape)
        for l in range(self._L - 1):
            n = self._n[l]
            n2 = self._n[l + 1]
            self._w[l, n:, :] = 0
            self._w[l, :, n2:] = 0
            self._b[l, n2:] = 0

    def set_input(self, x: list[float]):
        """:param x: input values all between 0 and 1"""
        n = self._n[0]
        y = self._y[0]
        assert len(x) == n
        y[0:n] = x
        L = self._L

        for l in range(L - 1):
            n = self._n[l]
            n2 = self._n[l + 1]
            w = self._w[l, :n, :n2]
            y = self._y[l]
            y2 = self._y[l + 1]
            b = self._b[l, :n2]

            # Update layer l+1
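            # x[j] = sum_i w[i,j] * y[i] + b[j]  (the built-in sum() adds up the rows of the n x n2 array)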
            x = sum(w * y[:n, None]) + b
            y2[:n2] = self._logistic(x)

    def get_output(self):
        """:return: output values all between 0 and 1"""
        n = self._n[-1]
        return self._y[-1, :n]

    def batch_start(self):
        self._dEdw[:, :, :] = 0.0
        self._dEdb[:, :] = 0.0
        self._E = 0.0

    def batch_update(self, i: list[float], d: list[float]):
        """
        :param i: input node values (should all be between 0 and 1)
        :param d: desired output node values (ditto)
        """
        # Update all the y[l,i]
        self.set_input(i)

        # Use back propagation to set all dEdy[l,i] starting at l=L-1, stopping at l=1
        L = self._L
        l = L - 1
        n = self._n[l]
        y = self._y[l]

        # Update cumulative error E
        delta = y[:n] - d
        self._E += 0.5 * sum(delta * delta)

        # Calculate dE/dy for nodes in the output layer and this case only
        dEdy = self._dEdy
        dEdy[l, :n] = delta

        # Update dE/dy for nodes in earlier layers and this case only
        while l > 0:
            l -= 1
            n2 = n
            y2 = y
            n = self._n[l]
            y = self._y[l]
            w = self._w[l]
            for k in range(n):
                # TODO get rid of loop!
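                # chain rule: dE/dy[l,k] = sum_j dE/dy[l+1,j] * y[l+1,j]*(1-y[l+1,j]) * w[l,k,j]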
                dEdy[l, k] = sum(dEdy[l + 1, :n2] * y2[:n2] * (1 - y2[:n2]) * w[k, :n2])

        # Update all the dEdw[l,i,j]
        dEdw = self._dEdw
        for l in range(L - 1):
            y2 = self._y[l + 1]
            n2 = self._n[l + 1]
            n = self._n[l]
            for i in range(n):
                # TODO get rid of loop!
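                # dE/dw[l,i,j] accumulates dE/dy[l+1,j] * y[l+1,j]*(1-y[l+1,j]) * y[l,i] for this case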
                y = self._y[l, i]
                dEdw[l, i, :n2] += dEdy[l + 1, :n2] * y2[:n2] * (1 - y2[:n2]) * y

        # Update all the dEdb[l,j]
        dEdb = self._dEdb
        for l in range(L - 1):
            y2 = self._y[l + 1]
            dEdb[l, :] += dEdy[l + 1, :] * y2 * (1 - y2)

    def batch_complete(self):
        # Update all the weights using gradient descent
        self._w += -self._epsilon * self._dEdw
        self._b += -self._epsilon * self._dEdb

    def graph(self):
        s = [
            "digraph NeuralNet {",
            "    overlap=false;",
        ]
        for l in range(self._L - 1):
            for i in range(self._n[l]):
                suffix = f"\nbias {self._b[l-1,i]:.02f}" if l > 0 else ""
                i_label = f"y_{l}_{i}\n{self._y[l,i]:.02f}{suffix}"
                for j in range(self._n[l + 1]):
                    suffix = f"\nbias {self._b[l,j]:.02f}"
                    j_label = f"y_{l+1}_{j}\n{self._y[l+1,j]:.02f}{suffix}"
                    s.append(
                        f'    "{i_label}" -> "{j_label}" [label="{self._w[l,i,j]:.02f}"];'
                    )
        s.append("}")
        return "\n".join(s)


if __name__ == "__main__":
    import os

    nn = NeuralNet([6, 2, 1])
    nn.randomize(0, 0.3)

    # train it up
    for batch in range(5000):
        nn.batch_start()
        for case in range(64):
            fwd = f"{case:06b}"
            rev = fwd[-1::-1]
            _input = [(case >> n) & 1 for n in range(6)]
            desired_output = [float(fwd == rev)]
            nn.batch_update(_input, desired_output)
        nn.batch_complete()
        print(f"batch {batch} E={nn._E}\r", end="")

    # test it out
    print("\ntesting cases: ", end="")
    failed = []
    for case in range(64):
        fwd = f"{case:06b}"
        rev = fwd[-1::-1]
        _input = [(case >> n) & 1 for n in range(6)]
        desired_output = float(fwd == rev)
        nn.set_input(_input)
        if abs(nn.get_output()[0] - desired_output) >= 0.5:
            failed.append(case)
    print(f"{64-len(failed)} successful {len(failed)} failed, failed cases={failed}")

    # Plot dot graph
    graph = nn.graph()
    open("/tmp/1.dot", "w").write(graph)
    os.system("neato -Tpng -o/tmp/1.png /tmp/1.dot && open /tmp/1.png")

POSTSCRIPT

After a bit of digging I found a solution to the problem of how to choose the coefficient for the gradient descent.  To recap: if we define $\mathbf{x} = (\mathbf{w},\mathbf{b})$ then $E=E(\mathbf{x})$ and gradient descent involves stepping by $d\mathbf{x} = -\epsilon \nabla E$.  But what value should be used for $\epsilon$?  An answer is provided by the Adam Optimizer.  This solves the problem in three steps.  The following explanation simplifies these steps slightly, but gets the gist:
  1. Normalize $\nabla E$ by replacing it with the unit vector $\mathbf{u} = \nabla E/\left|\nabla E\right|$.  This step ensures that the step size is the same each time and not dependent on the current value of $\mathbf{x}$.
  2. Pass $\mathbf{u}$ through a low pass filter $\hat{\mathbf{u}}_{t+1} = \beta \hat{\mathbf{u}}_{t} + (1-\beta)\mathbf{u}_t$ where $\beta = 0.9$.  This step creates momentum preventing $\mathbf{x}$ getting stuck in local minima with small basins of attraction.
  3. Choose a value $\alpha$ and at each step update $\mathbf{x}$ by $d\mathbf{x} = -\alpha \hat{\mathbf{u}}$.  Since $\mathbf{u}$ is a unit vector it is reasonable to interpret $\alpha$ as a tolerance for the error in $\mathbf{x}$.
I modified the above code to use an Adam Optimizer with $\alpha =0.1$ and found it converged 10 times faster.
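For completeness, here is a minimal sketch of that simplified scheme (normalise, low-pass filter, then step by a fixed $\alpha$).  It is only an illustration of the idea, not the full Adam algorithm (which also rescales each coordinate using a running average of the squared gradient):

import numpy as np

class SimplifiedAdam:
    """Sketch of the simplified scheme described above: normalise the gradient,
    low-pass filter the direction, then step by a fixed amount alpha."""

    def __init__(self, alpha=0.1, beta=0.9):
        self.alpha = alpha
        self.beta = beta
        self.u_hat = None  # filtered direction, stored as a (dEdw, dEdb) pair of arrays

    def step(self, w, b, dEdw, dEdb):
        # 1. Normalise grad E, treating x = (w, b) as one big vector
        norm = np.sqrt(np.sum(dEdw ** 2) + np.sum(dEdb ** 2))
        if norm == 0.0:
            return w, b
        u = (dEdw / norm, dEdb / norm)
        # 2. Low-pass filter: u_hat <- beta * u_hat + (1 - beta) * u
        if self.u_hat is None:
            self.u_hat = u
        else:
            self.u_hat = tuple(self.beta * h + (1 - self.beta) * g
                               for h, g in zip(self.u_hat, u))
        # 3. Step by a fixed amount alpha in the filtered direction
        return w - self.alpha * self.u_hat[0], b - self.alpha * self.u_hat[1]

With something like this, the body of batch_complete() just becomes self._w, self._b = opt.step(self._w, self._b, self._dEdw, self._dEdb).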
