I have a friend who is an expert in bond math, and he publishes a prominent mathematical finance blog where he occasionally posts interesting math puzzles relevant to financial modeling. One of his more popular games also serves as a lesson in the benefits of deep-layered neural networks.
In the past, many people believed that deep learning was unnecessary because a neural network with a single hidden layer can approximate highly complex functions when given enough neurons to work with.
In more recent times, researchers have come to appreciate the efficiency and effectiveness of stacking neural network nodes into deep-layered networks. (See the excellent book Hands-On Machine Learning with Scikit-Learn and TensorFlow for a discussion of the usefulness of multilayered neural networks.)
The following puzzle shows the advantages of using multiple neuron levels for problems that rely on a combination of inputs.
Are the Data Random?
Suppose you are given a data set to analyze. You don’t know anything about it except that three features, which all range in value continuously from 0 to 1, lead to a discrete output that is either 0 or 1.
You explore the data by plotting various combinations of the feature set onto two-dimensional charts, but everything looks like Figure 1. No matter how you slice and dice the data, everything seems random.
Next, you add a dimension to your plots by color-coding the points so that the colors represent the values of an additional feature. Your charts now look like Figure 2: they show a clear but unexplained pattern.
Expanding the plots into three dimensions illustrates the answer. It turns out that the output is 1 if exactly one or all three of the inputs are greater than 0.5, and the output is 0 if two or none of the inputs are greater than 0.5. Equivalently, the label is the parity of the number of inputs above 0.5, which makes this a three-input XOR.
In other words, there is a dividing line at 0.5, and any value larger than this is above the divide. The label is activated to a value of 1 only when an odd number of inputs (one or three) are above 0.5.
Figure 3 displays this pattern. Looking at the cube from any of its sides will show a random scattering of data points. But rotating the plot in three dimensions reveals that this is like a 3D binary Sudoku problem.
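To see why every two-dimensional view looks random, here is a minimal sketch that measures how often the label is 1 once you condition on one feature, or on a pair of features, being above or below 0.5. The seed, the sample size, and the names single_rates and pair_rates are my own choices for illustration and are not part of the original study.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the sketch is repeatable
x = rng.uniform(size=(100_000, 3))    # three features in [0, 1)
above = x > 0.5                       # which side of the 0.5 divide each value is on
y = above.sum(axis=1) % 2             # 1 if one or three features are above 0.5

# Condition on a single feature: the label is 1 about half the time
# whether that feature is above or below the divide.
single_rates = {}
for i in range(3):
    single_rates[(i, True)] = y[above[:, i]].mean()
    single_rates[(i, False)] = y[~above[:, i]].mean()

# Condition on a pair of features (one face of the cube): still about 50/50,
# because the third feature flips the label half of the time.
pair_rates = {}
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a in (False, True):
        for b in (False, True):
            mask = (above[:, i] == a) & (above[:, j] == b)
            pair_rates[(i, j, a, b)] = y[mask].mean()

for key, rate in pair_rates.items():
    print(key, round(float(rate), 3))
```

Every rate should come out close to 0.5, so no single feature or pair of features carries any information about the label. Only the three features taken together determine it.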
Can a neural network unlock this puzzle? If so, does a deep network help to solve this math game?
Here, I will show the function that constructs the data set. For some people, seeing this code might make the nature of the puzzle clearer:
import numpy as np

# Make data for the math puzzle that is to be solved here.
#
# There are 3 feature indices. All values are between 0 and 1.
#
# The label is TRUE if 1 or 3 of the features are greater than 0.5.
# The label is FALSE if 0 or 2 of the features are greater than 0.5.
#
# This can be thought of as a hypothetical 3D binary Sudoku.
#
def create_data(n_obs):
    # Features: n_obs rows of 3 uniform draws on [0, 1).
    x = np.random.uniform(size=(n_obs, 3))
    # Flag each feature value that lies above the 0.5 divide.
    up_val = np.zeros((n_obs, 3))
    up_val[np.where(x > 0.5)] = 1.
    # Count how many features per row are above the divide.
    up_val_sum = np.sum(up_val, axis=1)
    # Label is 1 when that count is odd (1 or 3), otherwise 0.
    y = np.zeros((n_obs, 1))
    y[np.where((up_val_sum == 1) | (up_val_sum == 3))] = 1.
    return x, y
Solving the Puzzle
Each fully connected (i.e., dense-layered) neural network is trained for 2,000 epochs with the Adam optimizer on 4,096 training observations in batches of 1,024. A separate set of 4,096 validation observations is generated independently. The dense layers all use batch normalization.
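For a feel of what this setup implies, the sketch below works out the number of optimizer updates and runs one training-mode forward pass through a dense layer with batch normalization, using plain NumPy. The layer width of 8, the weight initialization, the placement of batch normalization directly after the dense layer, and the function name dense_batchnorm are all assumptions made for illustration. The original study does not state these details or the framework used.

```python
import numpy as np

n_train, batch_size, n_epochs = 4096, 1024, 2000
steps_per_epoch = n_train // batch_size       # 4 batches per epoch
total_updates = steps_per_epoch * n_epochs    # 8,000 optimizer steps in all

def dense_batchnorm(x, w, b, gamma, beta, eps=1e-5):
    """One dense layer followed by training-mode batch normalization."""
    z = x @ w + b                              # dense layer pre-activations
    mean = z.mean(axis=0)                      # per-node mean over the batch
    var = z.var(axis=0)                        # per-node variance over the batch
    z_hat = (z - mean) / np.sqrt(var + eps)    # standardize each node
    return gamma * z_hat + beta                # learned rescale and shift

rng = np.random.default_rng(0)
batch = rng.uniform(size=(batch_size, 3))      # one batch of puzzle features
w = rng.normal(size=(3, 8))                    # hypothetical layer of 8 nodes
b = np.zeros(8)
out = dense_batchnorm(batch, w, b, gamma=np.ones(8), beta=np.zeros(8))

print(total_updates, out.shape)
```

With gamma set to 1 and beta set to 0, each node's output has roughly zero mean and unit variance across the batch, which is the property that keeps the inputs to the next layer on a consistent scale during training.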
The different neural network configurations give these accuracy results for the validation sample:
Some conclusions are:
The validation-set results posted here are a significant improvement over the same study using only 1,024 observations. This implies that the neural networks are still on a learning curve and may require more data to approach 100% accuracy.
All three activations show the utility of deep learning because they give good results using two to six layers with a relatively small number of nodes per layer. Trying to solve this puzzle with one layer requires many more total neurons.
The hyperbolic tangent activation performs well in small networks but gets lost with a larger number of layers and nodes per layer. This may be one reason the Tanh activation has largely been replaced by newer alternatives.
On the other hand, the Tanh function is the only one that is able to obtain good results using the same number of nodes per layer (3) as there are features.
The ReLU activation requires many more nodes per layer than the other options. This could be because the ReLU function can suffer from the "dying ReLU" problem in which a neuron's activation value goes irreversibly to zero.
When this happens, its gradient is zero and the optimizer is no longer able to update the weights flowing into the neuron.
This implies that switching to ReLU from another activation may require reconfiguring a neural network—possibly even doubling the number of neurons per layer.
The ELU activation offers good results early and remains stable as the neural networks increase in size. A trade-off might be that the calculation for the ELU is more involved than the simple ReLU max function.
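The ReLU and ELU points above can be made concrete with a small NumPy sketch. It defines both activations and their derivatives, then builds a single neuron whose pre-activation is negative for every possible input, which is the "dying ReLU" situation. The weights and the bias of -2.0 are hypothetical values picked so the neuron is stuck, and the upstream gradient is set to 1 for simplicity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(float)

def elu(z, alpha=1.0):
    # Clip to non-positive values before expm1 so large positive z cannot overflow.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def elu_grad(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

rng = np.random.default_rng(0)
x = rng.uniform(size=(1024, 3))       # one batch of features in [0, 1)
w = np.array([0.5, -0.3, 0.2])        # hypothetical weights
bias = -2.0                           # hypothetical bias that keeps z below zero
z = x @ w + bias                      # at most 0.5 + 0.2 - 2.0 = -1.3

# Gradient of the summed activation with respect to the weights,
# taking the upstream gradient as 1 for every observation.
relu_w_grad = (relu_grad(z)[:, None] * x).sum(axis=0)
elu_w_grad = (elu_grad(z)[:, None] * x).sum(axis=0)

print(relu_w_grad)   # all zeros: the optimizer has nothing to work with
print(elu_w_grad)    # small but nonzero: the neuron can still recover
```

The ReLU neuron outputs zero on every observation and passes back a zero gradient, so its weights cannot move. The ELU neuron still passes back a nonzero gradient for negative inputs, at the cost of an exponential in place of a simple max.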
This blog post presented a math puzzle that requires a neural network to unlock a specific pattern. The deep learning network needs to solve for distinct combinations of feature values, and this may be a step up in intelligence from basic curve fitting.
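As an illustration of what "solving for distinct combinations of feature values" can look like inside a network, here is a hand-wired sketch with two small hidden ReLU layers that reproduces the puzzle's rule. Nothing here is trained, and it is not the configuration from the study. The weights, the steepness k, and the function name hand_wired_net are assumptions chosen by hand to show that a shallow stack of a few nodes can represent the pattern.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hand_wired_net(x, k=1000.0):
    # Hidden layer 1 (6 ReLU units, two per feature). The difference
    # relu(d + 0.5) - relu(d - 0.5) is a steep ramp from 0 to 1 around x = 0.5.
    d = k * (x - 0.5)
    h1 = np.concatenate([relu(d + 0.5), relu(d - 0.5)], axis=1)
    # Linear combination of layer 1: approximately the number of features above 0.5.
    s = h1 @ np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    # Hidden layer 2 (3 ReLU units) with biases 0, -1, -2.
    h2 = relu(s[:, None] - np.array([0.0, 1.0, 2.0]))
    # Output weights turn counts 0, 1, 2, 3 into 0, 1, 0, 1.
    out = h2 @ np.array([1.0, -2.0, 2.0])
    return (out > 0.5).astype(int)

rng = np.random.default_rng(0)
x = rng.uniform(size=(20_000, 3))
y_true = (x > 0.5).sum(axis=1) % 2
y_pred = hand_wired_net(x)
accuracy = (y_pred == y_true).mean()
print(accuracy)
```

The first layer decides which side of the divide each feature is on, and the second layer converts the count of features above the divide into the odd-or-even label. The only errors come from points that fall inside the narrow ramp around 0.5, which shrinks as k grows.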
There are countless real-world situations that resemble the simulated example shown here. For instance, a set of blood tests might lead to very different diagnoses based on various mixtures of results. This is the type of analysis that traditional models might find difficult to untangle, but which deep learning neural networks can solve smoothly and efficiently.
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). November 23, 2015. Published as a conference paper at ICLR 2016.
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 1st edition (April 9, 2017).
Win Smith, Win Analytics LLC
CS231n: Convolutional Neural Networks for Visual Recognition