Monday, January 6, 2014

Commonality between neural network algorithms

Neurons are utterly remarkable.
These days, researchers, aided by wits and math, are able to study deep neural network models in great detail, calculating probability distributions over some esoteric energy function or, for example, using gradients to minimize the L2 norm of the difference between the training data and the output of supervised neural networks. They can do it because they have the whole picture: the eagle-eyed, all-encompassing view of the problem domain. And they know math.
The humble neuron, alone, knows nothing of the sort. All it "knows", the only data it has available, are the electrochemical activations of its neighboring neurons, a minuscule patch of cells within an unfathomably vast agglomerate of neurons. And yet, from such appalling lack of perspective, it still manages to find the right strengths for its synapses and bring about that emergent property which we call "intelligence". It does all that without knowing math.
How does the neuron manage such a feat? Whatever rule set it follows, it must be simple, both in its computation and in the input data it requires.
In this article, we'll take a cursory look into three distinct network learning methods and see how, given proper approximations, each results in similar update rules at the neuron level.

Prerequisites

The reader should already be familiar with artificial neural networks, at least with the standard back-propagation algorithm on typical multi-layer feed-forward networks. For further reference, please consult Wikipedia's entry on the subject. Also, we will use matrices, so basic linear algebra knowledge is needed.

Common elements

We separate the neurons into two groups: the visible units (v), which receive the data input, and all the others, called hidden units (h). The strength of the connections between units (and, consequently, the network's topology) may be represented in matrix form as $W = [w_{ij}]$, where $i$ and $j$ are the indexes of the units being connected.

The state of the network can be represented in vector form as $a = (a_1, \dots, a_N)$, where $a_i$ is the activation value for unit $i$. If we have a set of such activation state vectors, we can represent them in matrix form as $A$, with one row for each vector in our list. For simultaneously activated networks, the activation at stage $k$ can thus be written as

$$A_{k+1} = f\!\left(A_k\, W\right)$$

where $f$ is the typical non-linear activation function.

There is usually an additive bias term added to the computation of each unit's activation. Such a bias term may (usually) be safely incorporated into the weight matrix, as long as we clamp a constant value to an extra, purpose-built unit. If we wish to refer to it explicitly, it becomes incorporated in the activation formula as

$$A_{k+1} = f\!\left(A_k\, W + B\right)$$

where $B$ is the matrix obtained by repeating the bias row vector $b$ vertically. But I'll just ignore the bias term from now on.
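To make the matrix form concrete, here is a minimal NumPy sketch of one such simultaneous activation step; the names `f` and `activate` are my own, and the bias is handled with the extra clamped unit described above:

```python
import numpy as np

def f(x):
    """A typical non-linear activation function (logistic sigmoid here)."""
    return 1.0 / (1.0 + np.exp(-x))

def activate(A, W):
    """One simultaneous activation step; each row of A is a network state.

    The last column of A is the extra bias unit, kept clamped to 1 so that
    the last row of W plays the role of the additive bias term."""
    A_next = f(A @ W)
    A_next[:, -1] = 1.0          # re-clamp the bias unit
    return A_next

# Tiny usage example: 3 states of a 4-unit network, plus 1 bias unit.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 5))
A0 = np.hstack([rng.random((3, 4)), np.ones((3, 1))])
A1 = activate(A0, W)
```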

In every case, the learning methods studied will be unsupervised, meaning that no labels will be presented during the training phase. Instead, the networks will be expected to extract regularities from the data through a dimensionality reduction approach, where the hidden units, albeit smaller in number, should be able to reconstitute the training data up to a given approximation.

Hopfield networks

The learning method that most immediately brings to light the nature of the synapse construction rule is clearly the Hopfield network. Built to more naturally represent the processes underlying biological neuron firing and growth, Hopfield networks differ from the typical neural networks we're used to working with in the following ways:

- The update rule is randomly sequential (units are chosen at random, one at a time, for updating).

- The units are usually discrete (each unit may assume only binary values).

- The Heaviside step function H is used as the activation function (again with the trick of merging the bias term within the weight matrix), which forces each unit to assume binary values.

- Every unit (and usually every unit is a visible unit) is initialized with the desired training pattern before the network is allowed to drift (other than the unit assigned to the bias factor, which always has its value clamped to 1).

The only restriction on the weight matrix is that it must be symmetric with a zero diagonal. This restriction implies that, after enough iterations of the activation process, each unit will have converged to a fixed value that minimizes a certain energy function. Therefore, each training pattern is initially applied to every unit and then the network is allowed to converge to a final fixed output state using the activation rules described above. It turns out that, given a properly trained network, even if only a partial training pattern is initially presented, the output network state will have "filled in the gaps" and reconstructed the full original input. An obvious use in classification problems would be to train the network by presenting training patterns together with their corresponding labels; the classification stage would then present the training patterns alone, letting the network drift and stabilize in order to obtain the expected label.

How do we train it? This is where it gets interesting: there are several variations of the training procedure, but they all boil down to the same motto: reinforce the connection between units that seem to activate in synchrony, and dampen connections where the opposite happens. This mimics what's observed in biological neurons - connections are reinforced between synchronous neurons, and pruned if they are out of rhythm.

The usual rule, the Hebbian learning rule, named after Donald O. Hebb, is something as simple as

$$\Delta w_{ij} = \eta\, a_i\, a_j$$

and results from the attempt to provide a simplistic neuron activation model based on the observation of actual biological processes.

When both units $i$ and $j$ are in an active state, their connection is reinforced; otherwise it remains unchanged. And it works, but just barely.
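As a rough sketch of the whole procedure under the assumptions above (binary 0/1 units, Hebbian increments, randomly sequential updates; `train_hebbian` and `recall` are my own names, and the bias/threshold handling is glossed over):

```python
import numpy as np

def train_hebbian(patterns, eta=1.0):
    """Hebbian training: reinforce w_ij whenever units i and j are both active."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:                 # each p is a binary (0/1) state vector
        W += eta * np.outer(p, p)      # delta w_ij = eta * a_i * a_j
    np.fill_diagonal(W, 0.0)           # symmetric weights, zero diagonal
    return W

def recall(W, state, steps=1000, rng=None):
    """Randomly sequential updates with a step-function activation."""
    if rng is None:
        rng = np.random.default_rng(0)
    state = state.copy()
    for _ in range(steps):
        i = rng.integers(len(state))               # pick one unit at random
        state[i] = 1 if W[i] @ state > 0 else 0    # threshold at zero
    return state

# Usage: store one pattern, then recover it from a corrupted copy.
pattern = np.array([1, 0, 1, 1, 0, 1, 0, 0])
W = train_hebbian(pattern[None, :])
noisy = pattern.copy(); noisy[0] = 0
print(recall(W, noisy))   # converges back to the stored pattern
```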

Hopfield networks aren't very useful*:

- The update rule is slow and very often leads to local minima of the energy function;

- the learning rule is very coarse and neglects symmetrically-activated units;

- the subsequent states of the network are ignored during the weight training process;

- it allows for only binary activation values.  
 
* note: there have been significant advances to the field in the last few years, making this statement less valid.

The next section presents a learning method that improves upon some of these issues.

Boltzmann Machines

Boltzmann machines, while in many aspects similar to Hopfield networks, improve upon the latter through the use of extra (hidden) units to simulate unknown model parameters, a more sophisticated stochastic activation function, and a better training algorithm in which weights are updated according to the gradient of a measure of the information gain

$$G = \sum_{v} P^{+}(v)\, \ln \frac{P^{+}(v)}{P^{-}(v)}$$

where $P^{+}(v)$ is the probability of the training vector $v$ appearing on the visible units, dependent exclusively on the underlying distribution of the training data, and $P^{-}(v)$ is the probability of $v$ under the distribution the system converges to when running freely. In other words, an approximation of the difference in information entropy between the two distributions.

$G$ essentially measures how far the system drifts from the original training data set, which makes minimizing it a sensible scheme.

One finds that the gradient of $G$ can be written as

$$\frac{\partial G}{\partial w_{ij}} \;\propto\; -\left( p^{+}_{ij} - p^{-}_{ij} \right)$$

where $p^{+}_{ij}$ is the averaged probability that units $i$ and $j$ are both activated, measured once the network converges while having the training vectors always clamped on the visible units, and $p^{-}_{ij}$ is the same quantity, except the training data isn't clamped.

In other words, we compare the stable state of the network while confronted with the training data against its stable state when allowed to drift.

a) If, when presented with the training data, both units tend to be active together but drift apart once the data is removed, their connection is reinforced (in the hope that it will help them stay together when drifting, even when the training data is no longer present).

b) If they were not active together under the training data but end up activating together when drifting freely, their connection is stunted so as to keep them out of synch.

c) Otherwise the connections remain unchanged.

Notice how a concept somewhat similar to (a) is also used in the trivial Hopfield training rule, while (b) is a new improvement.
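As a rough illustration of how such an update could be computed, assuming we have already collected binary equilibrium state samples from the clamped and the free-running phases (all names here are my own):

```python
import numpy as np

def boltzmann_update(clamped_states, free_states, eta=0.1):
    """Weight update proportional to (p+_ij - p-_ij).

    clamped_states: (n_samples, n_units) binary equilibrium states, sampled
                    while the training data is clamped on the visible units.
    free_states:    same, but sampled while the network runs freely."""
    p_plus = (clamped_states.T @ clamped_states) / len(clamped_states)
    p_minus = (free_states.T @ free_states) / len(free_states)
    dW = eta * (p_plus - p_minus)
    np.fill_diagonal(dW, 0.0)   # no self-connections
    return dW
```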

Boltzmann machines still aren't very useful, since the duration of the training process grows exponentially with the problem size.

Restricted Boltzmann Machines


The Restricted Boltzmann Machine model offers significant improvements upon the BM.
From now on we'll use the logistic sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

as the activation function, and will apply it freely to both scalars and matrices (element-wise). Note how $\sigma(\lambda x) \approx H(x)$ for large enough $\lambda$, so we can see it as a generalization of the discrete activation function from the previously studied stochastic networks.

Using the same energy function as in the Hopfield model and in unrestricted Boltzmann Machines, we construct from it the probability distribution

$$p(v, h) = \frac{e^{-E(v, h)}}{Z}$$

where $Z$ is a normalizing term required for a proper probability measure. RBMs find the weight configurations that maximize the (log of the) probability of the training data (basically navigating the space of all possible visible and hidden unit values, looking for the configuration that both maximizes the probability of the training data and minimizes the probability of samples generated by the model), by doing stochastic gradient descent on the negative logarithm of the distribution above.

In practice, the calculation of such a gradient is intractable, so in RBMs we force both the weights between visible-visible units and the weights between hidden-hidden units to zero. In this case the full weight matrix (besides the bias terms) becomes

$$\begin{pmatrix} 0 & W \\ W^{T} & 0 \end{pmatrix}$$

and we'll just represent it directly by $W$ (while adding one last row and one last column for the biases). In other words, the visible units become conditionally independent of one another given the hidden units, and vice versa.

Let the set of input data points be represented in matrix form as $A_0^{(j)}$ (the batch index $j$ is often dropped when it's implied by the context), where each row is a data vector. The number of rows in $A_0^{(j)}$ is the number of samples in batch $j$, while the number of columns matches the number of visible units, plus 1 for the constant bias contribution. Thus, the activation values of the hidden units are given by

$$A_{k} = \sigma\!\left(A_{k-1}\, W\right) \quad \text{for odd } k,$$

where $A_0$ is the data matrix defined above, and the activation values of the visible units are given by

$$A_{k} = \sigma\!\left(A_{k-1}\, W^{T}\right) \quad \text{for even } k > 0$$

(remember to always clamp 1 to the bias column).
Since the index $k$ (odd for hidden activations, even for visible ones) fully identifies which units are associated with each activation matrix, we'll simply write $A_k$ without any v/h superscript.
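Here is a minimal NumPy sketch of this alternation (biases omitted for brevity; `sigma`, `rbm_chain`, `A0` and `W` are my own placeholder names):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_chain(A0, W, last_k):
    """Return [A_0, A_1, ..., A_last_k] for the deterministic alternation:
    odd k  -> hidden activations  A_k = sigma(A_{k-1} @ W)
    even k -> visible activations A_k = sigma(A_{k-1} @ W.T)
    Biases are ignored here for brevity."""
    A = [A0]
    for k in range(1, last_k + 1):
        A.append(sigma(A[-1] @ (W if k % 2 == 1 else W.T)))
    return A

# Tiny usage example: a batch of 5 data vectors, 6 visible and 3 hidden units.
rng = np.random.default_rng(0)
A0 = rng.random((5, 6))
W = rng.normal(scale=0.1, size=(6, 3))
A = rbm_chain(A0, W, last_k=3)   # A[1], A[3]: hidden; A[2]: visible
```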

Consequently, the energy function simplifies to

$$E(v, h) = -\, v\, W\, h^{T}$$

giving rise to a simple weight update rule:

$$\Delta w_{ij} \;\propto\; \langle v_i\, h_j \rangle_{\text{data}} - \langle v_i\, h_j \rangle_{\text{model}}$$

(the angle brackets are used to denote expectations under the corresponding distribution). Estimating those expectation values becomes easier using another approximation technique called Contrastive Divergence:

$$\Delta W \;\propto\; A_0^{T} A_1 \;-\; A_n^{T} A_{n+1}, \qquad n \text{ even}.$$
The choice of $n$ determines CD's level of approximation to the gradient estimate. Since we're usually interested in minimizing the reconstruction error between the original training pattern and its final model estimate, it turns out that $n$ larger than 2 typically results in worse reconstructions, since the actual gradient being followed measures something other than the reconstruction error.
CD is originally a stochastic algorithm in which probabilities are replaced by samples from their respective distributions, but we described the version where such replacement doesn't take place, resulting in a fully deterministic algorithm.
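Combining the alternation above with this update rule, one deterministic CD-n weight update could be sketched roughly as follows (it reuses `rbm_chain`, `A0` and `W` from the previous sketch; `cd_update` and the learning rate `eta` are my own names):

```python
def cd_update(A0, W, n=2, eta=0.1):
    """One deterministic CD-n step: compare the initial visible/hidden
    co-activations with the co-activations after n back-and-forth passes."""
    assert n % 2 == 0, "n is taken to be even, as in the text"
    A = rbm_chain(A0, W, last_k=n + 1)
    positive = A[0].T @ A[1]        # co-activation with the data clamped
    negative = A[n].T @ A[n + 1]    # co-activation after the model drifts
    return W + eta * (positive - negative) / len(A0)

W = cd_update(A0, W, n=2)
```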

From the rather approximate and simplified CD update formula above, we may conclude that the RBM method can be expressed by the following heuristic:

1) For a given weight between a visible unit v and a hidden unit h, we'll compare v's original data value with h's respective activation given that original training pattern. From such a comparison, it may result that both are active, i.e., originally in synch.

2) We now move the original training pattern back and forth from hidden to visible units an even number n of times, which will result in new values for the activation of visible units and subsequently for their connected hidden units. We're basically letting the model's activation pattern converge, as per the standard Hopfield model.

3) We again compare the activations as per 1), finding whether the visible and hidden units for the given weight remain in synch or not. We end up with 4 basic possibilities:

- They were initially in synch and so remained, which results in no changes to the weight.

- They were in synch, but no longer remain so, which results in a strengthening of the weight between them, in the hope that the synchronicity becomes preserved as the model runs. 

- They were not originally in synch, but end up synchronized. It's bad, so we stunt the connection weight.

- Otherwise, we change nothing.

Unsurprisingly, these rules are very similar to the generic BM, since we're minimising measures based on the same underlying energy function.

RBMs work rather well when stacked into multiple hidden layers and trained using slightly more sophisticated schemes.

One has to wonder what rule would result from minimizing the reconstruction error, instead of minimizing the log-likelihood of the distribution associated with the Hopfield model's energy.

To that effect, in the next section we'll minimize the L2 distance of the reconstruction error under the exact same architecture as the RBM, and compare the update rule at the connection/neuron level with those from the previous methods.

Approximate Matrix Back Propagation

In this section we'll derive an approximate method for calculating the gradient of the reconstruction error under the same network topology as an RBM, which we'll call Approximate Matrix Back Propagation (AMBP).

Using the notation introduced in the previous section, and keeping the same activation function, we'll start by considering a typical 3-layer feed-forward network configuration, initially represented by two weight matrices $W_1$ and $W_2$.

In order to study the reconstruction error, we'll assign the training data to the output layer, resulting in a typical dimensionality reduction problem. Again, we embed the bias terms within the weight matrices.

- Based on the matrix form of BP, we define the activation result matrices as

$$A_k = \sigma\!\left(A_{k-1}\, W_k\right) \quad \text{for } k > 0,$$

with $A_0$ set to the training data matrix.
- Since our training pattern is also the output pattern, we will minimize the reconstruction error

$$E = \tfrac{1}{2}\, \left\| A_L - A_0 \right\|_2^2$$
- Backpropagating the error, we have

$$\delta_L = \left( A_L - A_0 \right) \odot g(A_L), \quad \text{and}$$

$$\delta_k = \left( \delta_{k+1}\, W_{k+1}^{T} \right) \odot g(A_k)$$

where $L$ is the number of layers, $\odot$ is the Hadamard (element-wise) product, and $g$ is such that

$$\sigma'(x) = g\!\left(\sigma(x)\right), \qquad \text{i.e. } g(y) = y\,(1 - y) \text{ for the logistic function.}$$
Finally, $E$'s gradient contribution to $W_k$ becomes

$$\frac{\partial E}{\partial W_k} = A_{k-1}^{T}\, \delta_k\,, \quad \text{for } k = 1, \dots, L.$$
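A compact NumPy sketch of this matrix-form backpropagation for the 3-layer case (function and variable names are my own; biases are again omitted):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def g(y):
    # derivative of the logistic function, expressed in terms of its output
    return y * (1.0 - y)

def bp_gradients(A0, W1, W2):
    """Matrix-form gradients of E = 0.5 * ||A2 - A0||^2 for a 3-layer autoencoder."""
    A1 = sigma(A0 @ W1)               # hidden activations
    A2 = sigma(A1 @ W2)               # reconstruction of the input
    d2 = (A2 - A0) * g(A2)            # delta at the output layer
    d1 = (d2 @ W2.T) * g(A1)          # delta backpropagated to the hidden layer
    return A0.T @ d1, A1.T @ d2       # dE/dW1, dE/dW2

# Usage: 6 visible units, 3 hidden units, batch of 5 samples.
rng = np.random.default_rng(0)
A0 = rng.random((5, 6))
W1 = rng.normal(scale=0.1, size=(6, 3))
W2 = rng.normal(scale=0.1, size=(3, 6))
dW1, dW2 = bp_gradients(A0, W1, W2)
```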

At this point, we'll replicate an RBM layout by enforcing

$$W_1 = W, \qquad W_2 = W^{T},$$

which results in

$$\nabla_W E = A_0^{T}\, \delta_1 + \left( A_1^{T}\, \delta_2 \right)^{T} = A_0^{T}\, \delta_1 + \delta_2^{T}\, A_1.$$

It's now time to introduce some approximations, in an attempt to clarify the underlying mechanisms behind the gradient computation.

Let's assume our activation function approximates a linear function near the origin. In the case of the logistic function, its order-3 Taylor expansion near the origin has no quadratic term, so such an approximation is legitimate. It therefore turns out that the operator

$$X \;\mapsto\; X \odot g(A_k)$$

approximates an element-wise product by a constant ($\sigma'(0) = 1/4$ for the logistic function) for small enough pre-activation values.

We may therefore write

$$\delta_2 \approx \tfrac{1}{4}\left(A_2 - A_0\right) \quad \text{and} \quad \delta_1 \approx \tfrac{1}{4}\,\delta_2\, W \approx \tfrac{1}{16}\left(A_2 - A_0\right) W,$$

which leads to an increment (with 1/2 included as an averaging term) on $W$ of

$$\Delta W \;\propto\; -\tfrac{1}{2}\left[\,\tfrac{1}{4}\, A_0^{T}\left(A_2 - A_0\right) W \;+\; \left(A_2 - A_0\right)^{T} A_1\,\right].$$

Due to $\sigma$'s linearity near the origin, we can further simplify $\tfrac{1}{4}\left(A_2 - A_0\right)W \approx A_3 - A_1$ (the constant offsets cancel in the difference), leading to

$$\Delta W \;\propto\; A_0^{T} A_1 \;-\; \tfrac{1}{2}\left(R + J\right), \qquad R = A_2^{T} A_1, \quad J = A_0^{T} A_3.$$

Notice the differences and similarities with the RBM's update formula.

- Both methods compare the initial synchronicity between visible and hidden units against a posterior measure of synchronicity when the units are allowed to drift to an equilibrium state.

- In the case of the RBM, both units drift simultaneously, while AMBP averages the original value of a unit against its opposite unit's drifted value.

It can be seen experimentally that AMBP exhibits very good convergence for a series of different problems, and its delta values end up being very much in step with the RBM's deltas.
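For concreteness, a single AMBP update under these approximations could be sketched as follows (this reuses `rbm_chain` from the RBM sketch above; constant factors are folded into the learning rate `eta`, and all names are mine):

```python
def ambp_update(A0, W, eta=0.1):
    """One AMBP step on a 3-layer net with tied weights (W and W.T)."""
    A = rbm_chain(A0, W, last_k=3)     # A[1], A[3]: hidden; A[2]: visible
    positive = A[0].T @ A[1]           # initial visible/hidden co-activation
    R = A[2].T @ A[1]                  # drifted visible vs. original hidden
    J = A[0].T @ A[3]                  # original visible vs. drifted hidden
    return W + eta * (positive - 0.5 * (R + J)) / len(A0)

W = ambp_update(A0, W)
```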

Interestingly, the two terms R and J complement each other very nicely. Remove either one, and the method will no longer converge to an approximate minimum of the reconstruction error.

Interpretation of common elements

One could fairly argue that, since they share the same energy function and associated probability distribution, the Hopfield and RBM models ought to have similar weight update rules (which they do), but the same can't be said of AMBP since its foundations are entirely different. And yet the weight update scheme is similar.

However you look at it (from biological observations, from energy distribution minimization or from reconstruction error minimization), it seems that a good way for artificial neurons to extract patterns from data is to follow that same rule:

- If, over time, it turns out that neuron A's activations keep being one of the main driving forces behind B's activation (they exhibit strong correlation), then their connections should be strengthened. Otherwise the connection's strength should decay.

This last rule is eerily similar to one of Donald O. Hebb's conclusions:

"When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell."

In biology?

The hard truth is that all these considerations don't explain much about the origin of intelligence in the brain.

First, it's a theoretical framework, utterly simplifying the behaviour of that extraordinarily complex analogue processing unit, the neuron, and its intricate web of interconnections.

Second, it studies only immediately neighbouring neurons and offers no explanation whatsoever for the layout of that vast network of inter-area connections within the brain. Those veritable information highways connect neurons located huge distances apart, in totally distinct areas of the brain. The rules that gave birth to such connections obviously can't be explained by electrochemical signals in a neuron's immediate surroundings.
