Recurrent neural networks (RNNs) have achieved impressive results on difficult tasks such as language translation. But RNNs do have limitations, such as how much information they can encode into a single cell state.

To help these neural networks evaluate long sequences effectively, researchers invented an attention mechanism that allows RNNs to focus on the parts of a series that are relevant to a specific prediction.

This blog post applies an LSTM model with attention to U.S. government bond yields and shows that the attention mechanism gives heightened focus to key events during the 2013 bond market “taper tantrum.”

Rolling Forecasts

In many practical applications, it is necessary to estimate a model, obtain a prediction, and repeat the process as additional information becomes available. A common approach uses linear regressions that are moved forward through time.

A disadvantage of rolling regressions is that every new observation modifies the total feature dataset by only a small amount. This means that a forecasting model might not adapt quickly enough to changes in how the features (the X data) affect the dependent variable (y).

Weighted linear regressions rolled forward through time.

One enhancement is to use weighted regressions that place greater emphasis on more recent observations. Placing more weight on newer data can turn a slow-moving rolling regression into a responsive forecasting tool that adapts as the relationships in the data evolve through time.
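
As a concrete illustration, here is a minimal sketch of a weighted rolling regression using scikit-learn; the 52-week window and the exponential half-life are illustrative choices, not values from the analysis below.

import numpy as np
from sklearn.linear_model import LinearRegression

def weighted_rolling_forecast(X, y, window=52, half_life=13):
    # Exponential weights: the newest observation in each window gets weight 1.0,
    # and weights halve every `half_life` observations going back in time.
    weights = 0.5 ** (np.arange(window)[::-1] / half_life)
    preds = []
    for t in range(window, len(y)):
        reg = LinearRegression().fit(X[t - window:t], y[t - window:t],
                                     sample_weight=weights)
        # One-step-ahead prediction from the newest row of features.
        preds.append(reg.predict(X[t:t + 1])[0])
    return np.asarray(preds)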

Although weighted regressions can improve on a basic linear model, they still don't learn from how the data points relate to one another as a sequence. For instance, a series that rises from -2 to 2 should produce a different prediction than one that falls from 2 to -2. Autoregressive moving average models (ARMA, ARIMA, ARIMAX) are often used to forecast time series in which the ordering of the observations carries information about how the series will progress.
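
For comparison, a rolling ARIMA forecast along these lines can be sketched with statsmodels; the (1, 1, 1) order here is purely illustrative, not a fitted specification.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def rolling_arima_forecast(y, window=52, order=(1, 1, 1)):
    # Re-estimate the model on each trailing window and forecast one step ahead.
    preds = []
    for t in range(window, len(y)):
        res = ARIMA(y[t - window:t], order=order).fit()
        preds.append(res.forecast(steps=1)[0])
    return np.asarray(preds)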

Into the Deep Learning Era

Recurrent neural networks that use an attention mechanism seem to have the skills we would want to bring to sequence analysis. These models evaluate how observations interact with each other, how the data changes through time, and how the dataset should be weighted to give the best prediction.

Recurrent neural network with attention.

This LSTM model is augmented with an intermediate fully connected neural network that acts as an attention mechanism. This attention network allows the model to focus on different regions of the feature set.

The LSTM model with attention is like a weighted regression, except the weighting scheme is not a simple, pre-specified transformation. Instead, the attention weights come from a learned function of the inputs, and that function is evaluated separately for every output variable.
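
Conceptually, the attention step scores every timestep using the encoder outputs and the decoder's previous hidden state, normalizes the scores with a softmax, and takes the attention-weighted average of the encoder outputs. A bare-bones NumPy sketch of that idea (with a single one-node scoring layer and made-up parameter names) looks like this:

import numpy as np

def attention_context(encoder_seq, h_prev, w, b):
    # encoder_seq: (timesteps, units); h_prev: (units,)
    # w: (2 * units, 1) and b: scalar form a one-node dense scoring layer.
    repeated = np.tile(h_prev, (encoder_seq.shape[0], 1))
    scores = np.concatenate([encoder_seq, repeated], axis=1) @ w + b
    alphas = np.exp(scores - scores.max())
    alphas = alphas / alphas.sum()                 # softmax over the time axis
    context = (alphas * encoder_seq).sum(axis=0)   # attention-weighted average
    return context, alphas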

U.S. government bond yields 1993–2018.

This study uses an attention model to evaluate U.S. government bond rates from 1993 through 2018. The feature set consists of ten constant-maturity interest rate time series published by the Federal Reserve Bank of St. Louis. These interest rates, which come from the U.S. Treasury’s yield curve calculations, vary in maturity from three months to 30 years and indicate broad interest rate movements rather than the yield for any individual bond.
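
The post does not list the exact FRED series codes, but the ten constant-maturity series can be pulled with pandas_datareader along the following lines; the daily DGS* codes and the weekly resampling are assumptions, not details taken from the original analysis.

import pandas_datareader.data as web

SERIES = ['DGS3MO', 'DGS6MO', 'DGS1', 'DGS2', 'DGS3',
          'DGS5', 'DGS7', 'DGS10', 'DGS20', 'DGS30']

# Daily constant-maturity Treasury yields from FRED, resampled to weekly observations.
rates = web.DataReader(SERIES, 'fred', start='1993-01-01', end='2018-12-31')
weekly_rates = rates.resample('W').last().dropna()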

This model samples weekly interest rate data in 52-week windows to deliver a single prediction (for week 53) or a four-week pattern of predictions (for weeks 53–56). The predicted variable is the ten-year interest rate, and this means that the ten-year series appears in both the X data matrix and the y prediction vector.
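
A minimal sketch of that windowing step, assuming weekly_rates is a (num_weeks, 10) array with the ten-year series in column 7 (the column position is an assumption):

import numpy as np

def make_windows(weekly_rates, window=52, predict=1, target_col=7, use_changes=False):
    data = np.diff(weekly_rates, axis=0) if use_changes else np.asarray(weekly_rates)
    x, y = [], []
    for t in range(window, data.shape[0] - predict + 1):
        x.append(data[t - window:t])               # weeks t-51 ... t (all ten series)
        y.append(data[t:t + predict, target_col])  # weeks t+1 ... t+predict (ten-year rate)
    return np.asarray(x), np.asarray(y)            # x: (obs, 52, 10), y: (obs, predict)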

Simple Forecasting vs. Investment Backtesting

The purpose of this blog post is to examine the inner workings of an RNN attention model to see how it concentrates on data to arrive at a prediction.

The model presented here isn’t intended to be used as a forecasting input to a trading system. A professional-quality investment methodology would consider not only the signal-to-noise ratio of a predictive model, but also how the model’s accuracy interrelates to trading costs, leverage, risk tolerance, and many other considerations.

Attention Model Code

The following Python routine builds an LSTM+attention neural network using Keras and TensorFlow. It is a class module with methods for building, training, and saving the model. It can also extract the weights from the attention mechanism and plot them as a heatmap.

# Copyright (c) 2018 Matthew J. Hergott
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this library
# except in compliance with the License. You may obtain a copy of the License at
#
# www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the
# License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
# either express or implied. See the License for the specific language governing permissions and
# limitations under the License.


import numpy as np
import tensorflow as tf
import os

from tensorflow import keras

K = keras.backend

from tensorflow.python.keras.models import Model
from tensorflow.python.keras.layers import RepeatVector, Concatenate, Activation
from tensorflow.python.keras.layers import Reshape, Input, Dense, Dot, LSTM
from tensorflow.python.keras import regularizers

from tensorflow.python.keras.models import load_model as keras_load_model

import matplotlib.pyplot as plt
import pylab


# Sometimes helpful to implement own softmax activation function to
# better manage calculations along specific axes.
def softmax_activation(x):
    e = K.exp(x - K.max(x, axis=1, keepdims=True))
    s = K.sum(e, axis=1, keepdims=True)
    return e / s


class AttentionModel(object):

    def __init__(self, x, y,
                 layer_1_rnn_units,
                 attn_dense_nodes=0,
                 epochs=100,
                 batch_size=128,
                 shared_attention_layer=True,
                 chg_yield=False,
                 float_type='float32',
                 regularization=(0.00001, '00001'),
                 window=52,
                 predict=1):
        K.clear_session()
        tf.reset_default_graph()
        self.set_learning(True)

        # Scientific computing uses 'float64' but
        # machine learning works much faster with 'float32'.
        self.float_type = float_type
        K.set_floatx(self.float_type)

        # Capture inputs to instance variables.
        self.x = x
        self.y = y
        self.epochs = epochs
        self.batch_size = batch_size
        self.shared_attention_layer = shared_attention_layer

        self.layer_1_rnn_units = layer_1_rnn_units
        self.layer_2_rnn_units = self.layer_1_rnn_units
        self.attn_dense_nodes = attn_dense_nodes

        self.num_obs = self.x.shape[0]
        self.input_len = self.x.shape[1]
        self.input_dims = self.x.shape[2]
        self.num_outputs = self.y.shape[1]

        self.regularization = regularization[0]

        assert self.x.shape[0] == self.y.shape[0]

        # Set the directory structure.
        self.model_dir = f'models//window_{window}_predict_{predict}//'

        self.model_name = f'{"yield_changes" if chg_yield else "yield_levels"}//' \
                          f'model_{layer_1_rnn_units}_rnn_{attn_dense_nodes}_dense_attn_' \
                          f'{epochs}_epochs_' \
                          f'{batch_size}_batch_' \
                          f'{"shared_attention" if shared_attention_layer else ""}_' \
                          f'{"change_yield" if chg_yield else "level_yield"}_' \
                          f'{regularization[1]}_reg'

        # Activation function for the attention mechanism dense layer(s).
        self.attn_dense_activation = 'selu'
        self.attn_dense_initializer = 'lecun_normal'
        
    def delete_model(self):
        try:
            os.remove(f'{self.model_dir}{self.model_name}.h5')
        except Exception as e:
            print(e)

    def load_model(self):
        try:
            self.model = keras_load_model(f'{self.model_dir}{self.model_name}.h5',
                                          custom_objects={'softmax_activation': softmax_activation})
        except Exception as e:
            print(e)
            return False
        return True

    def save_model(self):
        try:
            self.model.save(f'{self.model_dir}{self.model_name}.h5')
        except Exception as e:
            print(e)
            return False
        return True

    def set_learning(self, learning):
        if learning:
            self.is_learning_phase = 1
            K.set_learning_phase(self.is_learning_phase)
            tf.keras.backend.set_learning_phase(True)
        else:
            self.is_learning_phase = 0
            K.set_learning_phase(self.is_learning_phase)
            tf.keras.backend.set_learning_phase(False)

    # Method that constructs shared layers. A shared layer means its learned parameters
    # are the same no matter where the layer is used in the neural network.
    #
    def make_shared_layers(self):
        if self.regularization > 0.:
            self.kernel_reg = regularizers.l2(self.regularization)
            self.bias_reg = regularizers.l2(self.regularization)
            self.recurrent_reg = regularizers.l2(self.regularization)
            self.recurrent_dropout = 0.1
        else:
            self.kernel_reg = self.bias_reg = self.recurrent_reg = None
            self.recurrent_dropout = 0.0

        if self.shared_attention_layer:

            # This is an optional intermediate dense layer in the attention network.
            # If it is not present, the attention mechanism goes straight from inputs to weights.
            if self.attn_dense_nodes > 0:
                self.attn_middle_dense_layer = Dense(self.attn_dense_nodes,
                                                     kernel_regularizer=self.kernel_reg,
                                                     bias_regularizer=self.bias_reg,
                                                     activation=self.attn_dense_activation,
                                                     kernel_initializer=self.attn_dense_initializer,
                                                     name='attention_mid_dense_shared')

            # This is the layer in the attention mechanism that gives the attention weights.
            self.attention_final_dense_layer = Dense(1,
                                                     kernel_regularizer=self.kernel_reg,
                                                     bias_regularizer=self.bias_reg,
                                                     activation=self.attn_dense_activation,
                                                     kernel_initializer=self.attn_dense_initializer,
                                                     name='attention_final_dense_shared')

        # Output-level LSTM cell.
        self.layer_2_LSTM_cell = LSTM(self.layer_2_rnn_units,
                                      kernel_regularizer=self.kernel_reg,
                                      recurrent_regularizer=self.recurrent_reg,
                                      bias_regularizer=self.bias_reg,
                                      recurrent_dropout=self.recurrent_dropout,
                                      return_state=True,
                                      name='layer_2_LSTM')

        # Final output (i.e., the prediction).
        self.dense_output = Dense(1,
                                  kernel_regularizer=self.kernel_reg,
                                  bias_regularizer=self.bias_reg,
                                  activation='linear',
                                  name='dense_output')

    # Builds the neural network. An LSTM+attention model doesn't need this much code.
    # This method is long because it sets lots of layer parameters and because
    # it handles four contingencies: (1) whether the attention mechanism is
    # always the same or is different for every prediction node, and (2) whether or
    # not the attention mechanism has an intermediate dense layer.
    #
    def build_attention_rnn(self):
        self.make_shared_layers()

        inputs = Input(shape=(self.input_len, self.input_dims), dtype=self.float_type)

        X = LSTM(self.layer_1_rnn_units,
                 kernel_regularizer=self.kernel_reg,
                 recurrent_regularizer=self.recurrent_reg,
                 bias_regularizer=self.bias_reg,
                 recurrent_dropout=self.recurrent_dropout,
                 return_sequences=True)(inputs)

        X = Reshape((self.input_len, self.layer_2_rnn_units))(X)

        h_start = Input(shape=(self.layer_2_rnn_units,), name='h_start')
        c_start = Input(shape=(self.layer_2_rnn_units,), name='c_start')
        h_prev = h_start
        c_prev = c_start

        outputs = list()

        # This section constructs the attention mechanism and the output-level LSTM
        # layer that leads to the predictions.
        #
        # There is an extra LSTM cell that is not attached to any prediction but
        # which begins the output-level RNN sequence. This avoids sending in a bunch
        # of zero values to the first usage of the attention mechanism.
        #
        # One way to avoid this extra LSTM cell might be to set the LSTM intial state
        # tensors "h_start" and "c_start" as trainable (instead of zeros).
        #
        for t in range(self.num_outputs + 1):
            h_prev_repeat = RepeatVector(self.input_len)(h_prev)
            joined = Concatenate(axis=-1)([X, h_prev_repeat])

            if self.attn_dense_nodes > 0:
                if self.shared_attention_layer:
                    joined = self.attn_middle_dense_layer(joined)
                else:
                    joined = Dense(self.attn_dense_nodes,
                                   kernel_regularizer=self.kernel_reg,
                                   bias_regularizer=self.bias_reg,
                                   activation=self.attn_dense_activation,
                                   kernel_initializer=self.attn_dense_initializer,
                                   name=f'attention_mid_dense_{t}')(joined)

            if self.shared_attention_layer:
                e_vals = self.attention_final_dense_layer(joined)
            else:
                e_vals = Dense(1,
                               kernel_regularizer=self.kernel_reg,
                               bias_regularizer=self.bias_reg,
                               activation=self.attn_dense_activation,
                               kernel_initializer=self.attn_dense_initializer,
                               name=f'attention_final_dense_{t}')(joined)

            alphas = Activation(softmax_activation, name=f'attention_softmax_{t}')(e_vals)
            attentions = Dot(axes=1)([alphas, X])

            h_prev, _, c_prev = self.layer_2_LSTM_cell(attentions, initial_state=[h_prev, c_prev])

            if t > 0:
                out = self.dense_output(h_prev)
                outputs.append(out)

        self.model = Model(inputs=[inputs, h_start, c_start], outputs=outputs)
        self.model.compile(loss='mse', optimizer='adam', metrics=['mse'])

        self.model.summary()

    def fit_model(self):
        self.set_learning(True)

        h_start = np.zeros((self.num_obs, self.layer_2_rnn_units))
        c_start = np.zeros((self.num_obs, self.layer_2_rnn_units))

        y_split = np.split(self.y, indices_or_sections=self.num_outputs, axis=1)

        self.model.fit([self.x, h_start, c_start],
                       y_split,
                       epochs=self.epochs,
                       batch_size=self.batch_size,
                       shuffle=True,
                       verbose=2,
                       validation_split=0.1)

    def calculate_attentions(self, x_data):
        self.set_learning(False)

        softmax_layer_names = [f'attention_softmax_{t}' for t in range(self.num_outputs + 1)]
        softmax_layers = list()

        for i, layer_name in enumerate(softmax_layer_names):
            if i == 0:
                continue
            intermediate_layer = Model(inputs=self.model.input,
                                       outputs=self.model.get_layer(layer_name).output)
            softmax_layers.append(intermediate_layer)

        num_obs = x_data.shape[0]
        attention_map = np.zeros((num_obs, self.num_outputs, self.input_len))

        h_start = np.zeros((1, self.layer_2_rnn_units))
        c_start = np.zeros((1, self.layer_2_rnn_units))

        for t in range(num_obs):
            print(t)
            for l_num, layer in enumerate(softmax_layers):
                softmax_results = layer.predict([np.expand_dims(x_data[t], axis=0),
                                                 h_start,
                                                 c_start])
                softmax_results = softmax_results[0, :, 0]
                attention_map[t, l_num, :] = softmax_results

        return attention_map

    def heatmap(self, data, title_supplement=None):
        plt.rcParams['axes.labelweight'] = 'bold'
        plt.rcParams['axes.labelsize'] = 22
        plt.rcParams['axes.titlesize'] = 22
        plt.rcParams['axes.titleweight'] = 'bold'
        plt.rcParams['xtick.labelsize'] = 18
        plt.rcParams['ytick.labelsize'] = 18
        plt.rcParams['axes.titlepad'] = 12
        plt.rcParams['axes.edgecolor'] = '#000000'  # '#FD5E0F'

        # Other common color schemes: 'viridis'  'plasma'  'gnuplot'
        color_map = 'inferno'
        pylab.pcolor(data, cmap=color_map, vmin=0.)
        pylab.colorbar()

        num_predictions = data.shape[0]
        num_timesteps = data.shape[1]

        if num_predictions == 4:
            pylab.yticks([0.5, 1.5, 2.5, 3.5], ['t+1', 't+2', 't+3', 't+4'])
            pylab.ylabel('y: t+1 to t+4')

            plt.axhline(y=1., xmin=0.0, xmax=51.0, linewidth=1, color='w')
            plt.axhline(y=2., xmin=0.0, xmax=51.0, linewidth=1, color='w')
            plt.axhline(y=3., xmin=0.0, xmax=51.0, linewidth=1, color='w')

        elif num_predictions == 1:
            pylab.yticks([0.5], ['t+1'])
            pylab.ylabel('y: t+1')

        assert num_timesteps == 52

        pylab.xticks([1.5, 11.5, 21.5, 31.5, 41.5, 51.5],
                     ['t-50', 't-40', 't-30', 't-20', 't-10', 't'])
        pylab.xlabel('x: t-51 to t')

        pylab.title(f'{self.model_name} {title_supplement}')

        mng = plt.get_current_fig_manager()
        mng.window.showMaximized()
        pylab.show()
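
A hypothetical end-to-end run of the class might look like the following; the layer size and the make_windows helper from earlier are illustrative choices, not settings reported in this post.

import os

x, y = make_windows(weekly_rates, window=52, predict=1, use_changes=True)

am = AttentionModel(x, y,
                    layer_1_rnn_units=128,   # illustrative size
                    attn_dense_nodes=0,
                    epochs=100,
                    batch_size=128,
                    chg_yield=True,
                    window=52,
                    predict=1)

# The save/load path is a nested directory, so create it first.
os.makedirs(os.path.dirname(f'{am.model_dir}{am.model_name}'), exist_ok=True)

if not am.load_model():
    am.build_attention_rnn()
    am.fit_model()
    am.save_model()

attn = am.calculate_attentions(x)   # shape: (num_obs, num_outputs, 52)
am.heatmap(attn.max(axis=0), title_supplement='max attention')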

Interest Rate Levels vs. Changes in Rates

The most basic type of forecast uses 52 weeks of data (time t-51 to t) from all ten bond series to give a prediction for the 10-year rate over the subsequent week (time t+1). One question is whether to use interest rate levels or changes in interest rates.

The following chart shows the maximum attention given to each week of the 52-week windows over the entire dataset when using yield levels:

Maximum attention given to each week of the 52-week window over the entire dataset when using yield levels.

The model places nearly all its concentration on the last one or two yield levels prior to the forecast.

A basic random walk.

This is consistent with a random walk model, under which the best forecast of a stock or bond price is its most recent price plus random noise. (This model uses yields, but the point is the same: stock prices and yield levels are nonstationary in the sense that their mean values change through time.)
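
In code, the random-walk benchmark simply carries the latest value forward; assuming a one-dimensional array of weekly ten-year yields called ten_year:

import numpy as np

naive_forecast = ten_year[:-1]    # the forecast for week t+1 is the week-t value
actual = ten_year[1:]
rw_mse = np.mean((actual - naive_forecast) ** 2)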

When the interest rate data is changed to use the weekly change in yields, attention is distributed much more evenly across the 52-week windows:

Maximum attention given to each week of the 52-week window over the entire dataset when using yield changes.

Taper Tantrum

One of the most dramatic recent events in U.S. bond markets was when then-Fed Chair Ben Bernanke announced on May 22, 2013, that the Federal Reserve would begin to wind down its program of massive bond purchases, which were called large-scale asset purchases or quantitative easing.

The anticipation of a multi-trillion-dollar buyer disappearing from the market caused a significant selloff in bonds, and this came to be known as the “taper tantrum.”

When the LSTM+attention neural network is asked to give a simple t+1 forecast using the 2013 data, the attention mechanism concentrates exactly on the week of the Fed’s announcement:

Attention model concentration for a single forecast.

There could be pros and cons to this focus. On the one hand, this model is trying to make a prediction for the first week of 2014, so should it really be concentrating on data seven months before the forecast?

On the other hand, this was the defining event for these interest rates in 2013, and the 10-year rate continued to rise into early 2014. Therefore, focusing on May of 2013 could be appropriate for this prediction.

Sequence Prediction

A sequence prediction.

The attention mechanism was originally developed to help recurrent neural networks translate long sentences. There is no exact equivalent in financial forecasting, since reliably predicting an extended sequence of market moves would be extraordinary.

Nonetheless, the model here can be set to produce multiple consecutive predictions. Doing so shows how the neural network arrives at a forecast that is not so much a point estimate as a form of pattern recognition:

Attention model concentration for a sequence forecast.

The attention shifts away from the Fed announcement toward a region whose pattern resembles the period when the forecast is being made. In other words, the two months leading up to the region of attention have a similar shape to the two months before the prediction date.

This could indicate that the attention mechanism formulates a sequence by finding parts of the data that match the data at the time of the forecast.

Conclusion

Recurrent neural networks have been spectacularly successful at bringing sequence analysis into the new world of deep learning. But RNNs have limitations, such as how much information they can encode in a single cell state. To help these networks handle long sequences, researchers created an attention mechanism that allows an RNN to concentrate on the parts of a sequence that are most relevant to a specific prediction.

This blog post applied an LSTM attention model to ten interest rate time series to see where the attention mechanism places focus on a financial time series.

One lesson relates to the difference between prices (or yields) and changes in those prices:

  • Using yield levels, the attention mechanism concentrates on the last data point. This is consistent with a random walk model in which the best forecast is centered around the last price (or interest rate).
  • When yield changes are used, the attention model has a much wider dispersion of focus. This implies it is searching for more complex patterns in the data.

Another lesson is that an attention model can look for different things depending on what type of result it is asked to produce:

  • When the model gives a single forecast using data from the bond market "taper tantrum" of 2013, the attention focuses closely on the time surrounding the Fed announcement.
  • But when the model generates a sequence of forecasts, the attention mechanism appears to look for patterns that match the data at the time of the prediction.

Attention in neural networks has tremendous potential, and innovative techniques are being developed rapidly. A team of researchers at Google even developed a sequence model that relies solely on attention and eliminates the recurrent neural network entirely.

What seems clear is that attention models will play a key role in the future of deep learning, and they will help to open up new frontiers in the analysis of language, time series, and many different types of actions and events.


References

Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio
September 1, 2014

Board of Governors of the Federal Reserve System (US), FRED, Federal Reserve Bank of St. Louis
3-Month Treasury Constant Maturity Rate
6-Month Treasury Constant Maturity Rate
1-Year Treasury Constant Maturity Rate
2-Year Treasury Constant Maturity Rate
3-Year Treasury Constant Maturity Rate
5-Year Treasury Constant Maturity Rate
7-Year Treasury Constant Maturity Rate
10-Year Treasury Constant Maturity Rate
20-Year Treasury Constant Maturity Rate
30-Year Treasury Constant Maturity Rate

Long short-term memory
Sepp Hochreiter and Jürgen Schmidhuber
Neural Computation: 9 (8): 1735–1780.
1997

Taper Tantrum
Investopedia

The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy
May 21, 2015

Treasury Yield Curve Methodology
U.S. Department of the Treasury

Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin
June 12, 2017

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio
February 10, 2015