Automating Keras hyperparameter optimization with Talos
A Deep Learning model is NOT a black box. It requires tuning for good performance.
In the two previous posts I showed you my first steps with Keras. I used examples found on the internet and changed the dataset into something trivial, meaning I generated the data myself and knew the expected values. But I also told you that I had no idea why parameters like neurons, epochs and batch_size had the values they had.
So what we have is not really a black box: on the outside there are switches and screws that very much need our attention. In this post I use Talos, 'Hyperparameter Optimization for Keras, TensorFlow (tf.keras) and PyTorch', see links below, which is intended to automate the process of selecting the optimal parameters.
At the end of this post is the code so you can try it yourself.
Hyperparameters
Parameters like the number of neurons, epochs and batch_size are called hyperparameters, and tuning them is essential for good model performance. There are some interesting articles on the internet about how to tune them. You could call hyperparameter optimization the holy grail of neural networks; according to Merriam-Webster, a holy grail is 'an object or goal that is sought after for its great significance'. Unless you know what you are doing, it is easy to select suboptimal, or even totally wrong, values.
Loss functions
To perform optimization we need values that indicate how well our model performs. These values are calculated by loss functions, and we use different loss functions for regression and classification; see for example the article 'Overview of loss functions for Machine Learning' in the links below. A short sketch of how this looks in Keras follows after the lists.
Regression loss functions:
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- Huber
- Log-Cosh
- Quantile
Classification loss functions:
- Binary Cross Entropy
- Multi-Class Cross Entropy
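In Keras you select the loss function when compiling the model. A minimal sketch, not taken from the example below, just to show where the loss function goes (the layer sizes are arbitrary):
from keras.models import Sequential
from keras.layers import Dense

# regression: mean squared error loss, mean absolute error as an extra metric
reg_model = Sequential([Dense(1, input_shape=(2,), activation='linear')])
reg_model.compile(optimizer='adam', loss='mse', metrics=['mean_absolute_error'])

# binary classification: binary cross entropy loss
clf_model = Sequential([Dense(1, input_shape=(2,), activation='sigmoid')])
clf_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])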
Talos summary
I like the sentence in the article 'Hyperparameter Optimization with Keras', see links below:
Make no mistake; EVEN WHEN WE DO GET THE PERFORMANCE METRIC RIGHT (yes I'm yelling), we need to consider what happens in the process of optimizing a model.
With Talos we parameterize our model. The number of combinations of epochs, batch_size, etc. can be huge. Talos randomly picks a number of combinations and for each one builds and trains a new model with training and validation data. This can take minutes, but also hours or even days.
Once finished, we can use the scores, change parameter values and/or add parameters, and run again. In the meantime we can do other things like drinking coffee, talking to a friend or, even better, trying to learn more about Machine Learning optimization. Let's see how this works.
Example
I will run Talos for a very simple Neural Network model based on 'Keras 101: A simple (and interpretable) Neural Network model for House Pricing regression', see links below.
The dataset is generated with this function:
# define the target function for the dataset
def fx(x0, x1):
    y = x0 + 2*x1
    return y
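The full code at the end of this post builds the dataset by calling this function on a small grid of integer inputs:
import numpy as np

# generate the inputs and expected outputs
X_items = []
y_items = []
for x0 in range(0, 18, 3):
    for x1 in range(2, 27, 3):
        X_items.append([x0, x1])
        y_items.append(fx(x0, x1))
X = np.array(X_items).reshape((-1, 2))
y = np.array(y_items)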
The model:
model = Sequential()
# add layers
model.add(Dense(100, input_shape=(2,), activation='relu', name='layer_input'))
model.add(Dense(50, activation='relu', name='layer_hidden_1'))
model.add(Dense(1, activation='linear', name='layer_output'))
# compile
model.compile(optimizer='adam', loss='mse', metrics=['mean_absolute_error'])
And the fit function:
history = model.fit(X_train, y_train, epochs=100, validation_split=0.05)
Like I told you before, I have no idea why the parameters have these values. Ok, well a little bit.
Talos
To use this in Talos we replace the hardcoded parameters with variables that Talos can control. First we create a parameter dictionary with the values we want to vary. Do not start with all the parameters you want to change; before you know it, you will be waiting and waiting. I also started, for example, the batch_size parameter with two values that are far apart to get a first impression. With the parameters below there are 2 x 2 x 2 x 2 = 16 combinations; Talos did 16 runs and the total time on my PC (without GPU) was 42 seconds.
# parameters
p = dict(
    first_neuron=[24, 192],
    activation=['relu', 'elu'],
    epochs=[50, 200],
    batch_size=[8, 32]
)
Then we change the model for Talos:
model = Sequential()
# add layers
model.add(Dense(
    params['first_neuron'],
    input_shape=(2,),
    activation=params['activation'],
    name='layer_input')
)
model.add(Dense(
    50,
    activation=params['activation'],
    name='layer_hidden_1')
)
model.add(Dense(
    1,
    activation='linear',
    name='layer_output')
)
# compile
model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mean_absolute_error'],
)
And the fit function becomes:
history = model.fit(
    x=x_train,
    y=y_train,
    validation_data=[x_val, y_val],
    epochs=params['epochs'],
    batch_size=params['batch_size'],
    verbose=0,
)
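Talos does not call this code directly; it calls a single function that receives the training data, the validation data and one parameter combination, and that must return both the history and the model. In the full code below this is the method model2scan; as a standalone sketch (assuming get_model_for_params wraps the model code shown above):
def model2scan(x_train, y_train, x_val, y_val, params):
    # build the model for this parameter combination
    model = get_model_for_params(params)
    # fit it and return (history, model), which is what Talos expects
    history = model.fit(
        x=x_train,
        y=y_train,
        validation_data=[x_val, y_val],
        epochs=params['epochs'],
        batch_size=params['batch_size'],
        verbose=0,
    )
    return history, model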
Run Talos and analyze
Time to run! After the run I print the full table of results.
# perform scan
so = talos.Scan(
    x=X,
    y=y,
    model=dlm.model2scan,
    params=p,
    experiment_name='model2scan',
    val_split=0.3,
)
print('scan details {}'.format(so.details))
print('analyze ...')
a = talos.Analyze(so)
a_table = a.table('val_loss', sort_by='val_loss', exclude=['start', 'end', 'duration'])
print('a_table = \n{}'.format(a_table))
Here is the table. Note that it is sorted by val_loss from worst to best, meaning that the last row is the winner:
activation epochs round_epochs first_neuron val_mean_absolute_error mean_absolute_error batch_size val_loss loss
4 relu 50 50 24 7.786795 10.516150 32 94.053894 143.701889
12 elu 50 50 24 5.021943 2.713378 32 30.090796 10.598367
13 elu 50 50 192 2.880891 2.392172 32 10.355382 7.829942
5 relu 50 50 192 2.306647 1.435794 32 8.680595 3.228125
8 elu 50 50 24 2.205534 1.599179 8 5.929257 3.529320
9 elu 50 50 192 1.564395 0.991934 8 3.059430 1.329003
1 relu 50 50 192 0.813473 0.418985 8 1.315141 0.284857
14 elu 200 200 24 0.934512 0.557581 32 1.210560 0.448432
0 relu 50 50 24 0.680375 0.463401 8 0.818936 0.343270
15 elu 200 200 192 0.773358 0.466824 32 0.776512 0.313476
6 relu 200 200 24 0.510728 0.256091 32 0.515720 0.105524
7 relu 200 200 192 0.473588 0.219744 32 0.471352 0.076440
2 relu 200 200 24 0.562183 0.213781 8 0.431688 0.075045
3 relu 200 200 192 0.261547 0.062857 8 0.179677 0.006133
10 elu 200 200 24 0.327835 0.207104 8 0.140866 0.063864
11 elu 200 200 192 0.218479 0.116624 8 0.071611 0.028682
Looking at the data we see that epochs=200 gives much better results than epochs=50; we can try 100 to see if that is also good. batch_size=8 also gives better results than batch_size=32; we can try 16. We can also try other values for first_neuron. The new parameters for Talos:
p = dict(
    first_neuron=[128, 192],
    activation=['relu', 'elu'],
    epochs=[100, 200],
    batch_size=[8, 16]
)
Let's rerun with these values. The result:
batch_size activation epochs val_mean_absolute_error val_loss loss round_epochs first_neuron mean_absolute_error
13 16 elu 100 0.884498 1.053809 0.726826 100 192 0.706334
12 16 elu 100 0.779756 0.834097 0.669265 100 128 0.696979
4 16 relu 100 0.415210 0.235131 0.124713 100 128 0.291552
9 8 elu 100 0.395605 0.204896 0.157592 100 192 0.321956
5 16 relu 100 0.290187 0.109819 0.064319 100 192 0.211642
15 16 elu 200 0.234876 0.108920 0.070559 200 192 0.220270
14 16 elu 200 0.274542 0.107709 0.075080 200 128 0.216383
0 8 relu 100 0.280294 0.104701 0.049495 100 128 0.188951
1 8 relu 100 0.268457 0.100130 0.041598 100 192 0.160658
6 16 relu 200 0.228168 0.079410 0.033531 200 128 0.138397
11 8 elu 200 0.175497 0.070203 0.041988 200 192 0.154247
10 8 elu 200 0.150712 0.057377 0.021882 200 128 0.108372
8 8 elu 100 0.205943 0.055572 0.048045 100 128 0.187398
2 8 relu 200 0.182463 0.046856 0.018731 200 128 0.096890
7 16 relu 200 0.135524 0.025975 0.010142 200 192 0.073783
3 8 relu 200 0.078327 0.009304 0.004181 200 192 0.042721
The best results improved, but not by much. epochs=200 is still the best, as is batch_size=8. The final parameters are:
p = dict(
    first_neuron=192,
    activation='relu',
    epochs=200,
    batch_size=8,
)
Full code
Below is the code in case you want to try it yourself. Set run_optimizer = True to run Talos. The dataset is split into training data and test data, and the training data is split again into training data and validation data.
After running the optimizer you can plug the new parameter values into the model, generate graphs, run evaluation against test data and run some predictions.
# optimizing keras hyperparameters with talos
from keras.models import Sequential, load_model
from keras.layers import Dense
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
import talos
# use plotly or pyplot
from matplotlib import pyplot
# your input: select talos optimizer or normal operation
run_optimizer = False
#run_optimizer = True
# your input: train model or use saved model
use_saved_model = False
#use_saved_model = True
# create dataset
def fx(x0, x1):
    y = x0 + 2*x1
    return y

X_items = []
y_items = []
for x0 in range(0, 18, 3):
    for x1 in range(2, 27, 3):
        y = fx(x0, x1)
        X_items.append([x0, x1])
        y_items.append(y)
X = np.array(X_items).reshape((-1, 2))
y = np.array(y_items)
print('X = {}'.format(X))
print('y = {}'.format(y))
X_data_shape = X.shape
print('X_data_shape = {}'.format(X_data_shape))
class DLM:

    def __init__(
        self,
        model_name='my_model',
    ):
        self.model_name = model_name
        self.layer_input_shape = (2,)
        # your input: final model parameters
        self.params = dict(
            # layers
            first_neuron=192,
            activation='relu',
            # compile
            # fit
            val_split=0.3,
            epochs=200,
            batch_size=8,
            verbose=0,
        )
    def data_split_train_test(
        self,
        X,
        y,
    ):
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size=0.3, random_state=1)
        print('self.X_train = {}'.format(self.X_train))
        print('self.X_test = {}'.format(self.X_test))
        print('self.y_train = {}'.format(self.y_train))
        print('self.y_test = {}'.format(self.y_test))
        print('training data row count = {}'.format(len(self.y_train)))
        print('test data row count = {}'.format(len(self.y_test)))
        X_train_data_shape = self.X_train.shape
        print('X_train_data_shape = {}'.format(X_train_data_shape))
    def get_model(
        self,
    ):
        self.model = self.get_model_for_params(self.params)
        return self.model

    def get_model_for_params(self, params):
        model = Sequential()
        # add layers
        model.add(Dense(
            params['first_neuron'],
            input_shape=self.layer_input_shape,
            activation=params['activation'],
            name='layer_input')
        )
        model.add(Dense(
            50,
            activation=params['activation'],
            name='layer_hidden_1')
        )
        model.add(Dense(
            1,
            activation='linear',
            name='layer_output')
        )
        # compile
        model.compile(
            optimizer='adam',
            loss='mse',
            metrics=['mean_absolute_error'],
        )
        return model
    def model_summary(
        self,
        model,
    ):
        model.summary()
    def fit(
        self,
        model,
        plot=False,
    ):
        # split training data
        X_train, X_val, y_train, y_val = train_test_split(self.X_train, self.y_train, test_size=self.params['val_split'], random_state=1)
        print('X_train = {}'.format(X_train))
        print('y_train = {}'.format(y_train))
        print('X_val = {}'.format(X_val))
        print('y_val = {}'.format(y_val))
        print('training data row count = {}'.format(len(y_train)))
        print('validation data row count = {}'.format(len(y_val)))
        history = self.fit_model(model, X_train, y_train, X_val, y_val, self.params)
        if plot:
            fig = go.Figure()
            fig.add_trace(go.Scattergl(y=history.history['loss'], name='Train'))
            fig.add_trace(go.Scattergl(y=history.history['val_loss'], name='Valid'))
            fig.update_layout(height=500, width=700, xaxis_title='Epoch', yaxis_title='Loss')
            fig.show()
            fig = go.Figure()
            fig.add_trace(go.Scattergl(y=history.history['mean_absolute_error'], name='Train'))
            fig.add_trace(go.Scattergl(y=history.history['val_mean_absolute_error'], name='Valid'))
            fig.update_layout(height=500, width=700, xaxis_title='Epoch', yaxis_title='Mean Absolute Error')
            fig.show()
            pyplot.plot(history.history['loss'], label='Train')
            pyplot.plot(history.history['val_loss'], label='Valid')
            pyplot.legend()
            #pyplot.show()
            pyplot.savefig('ex_dl_loss.png')
        return history
    def fit_model(self, model, x_train, y_train, x_val, y_val, params):
        history = model.fit(
            x=x_train,
            y=y_train,
            validation_data=[x_val, y_val],
            epochs=params['epochs'],
            batch_size=params['batch_size'],
            verbose=0,
        )
        return history
    def model2scan(self, x_train, y_train, x_val, y_val, params):
        model = self.get_model_for_params(params)
        history = self.fit_model(model, x_train, y_train, x_val, y_val, params)
        return history, model
    def evaluate(
        self,
        model,
    ):
        score = model.evaluate(self.X_test, self.y_test)
        print('test data - loss = {}'.format(score[0]))
        print('test data - mean absolute error = {}'.format(score[1]))
        return score
    def predict(
        self,
        model,
        x0,
        x1,
        fx=None,
    ):
        x = np.array([[x0, x1]]).reshape((-1, 2))
        predictions = model.predict(x)
        expected = ''
        if fx is not None:
            expected = ', expected = {}'.format(fx(x0, x1))
        print('for x = {}, predictions = {}{}'.format(x, predictions, expected))
        return predictions
    def save_model(
        self,
        model,
    ):
        model.save(self.model_name)

    def load_saved_model(
        self,
    ):
        self.model = load_model(self.model_name)
        return self.model
dlm = DLM()
if not run_optimizer:
    # create & save, or use saved model
    if use_saved_model:
        model = dlm.load_saved_model()
    else:
        dlm.data_split_train_test(X, y)
        model = dlm.get_model()
        # remove plot=True for no plot
        dlm.fit(model, plot=True)
        dlm.evaluate(model)
        dlm.save_model(model)
    # predict
    dlm.predict(model, 4, 17, fx=fx)
    dlm.predict(model, 23, 79, fx=fx)
    dlm.predict(model, 40, 33, fx=fx)
    dlm.predict(model, 140, 68, fx=fx)
else:
    # talos
    # your input: parameters run 1
    p = dict(
        first_neuron=[24, 192],
        activation=['relu', 'elu'],
        epochs=[50, 200],
        batch_size=[8, 32]
    )
    # your input: parameters run 2 (change p2 to p)
    p2 = dict(
        first_neuron=[128, 192],
        activation=['relu', 'elu'],
        epochs=[100, 200],
        batch_size=[8, 16]
    )
    # perform scan
    so = talos.Scan(
        x=X,
        y=y,
        model=dlm.model2scan,
        params=p,
        experiment_name='model2scan',
        val_split=0.3,
    )
    print('scan details {}'.format(so.details))
    print('analyze ...')
    a = talos.Analyze(so)
    # dump table
    a_table = a.table('val_loss', sort_by='val_loss', exclude=['start', 'end', 'duration'])
    print('a_table = \n{}'.format(a_table))
Some thoughts
Are we really converging in the right direction? This is a very complex problem: we have an N-dimensional world with highs and lows everywhere, which means the starting point can be very important. In the end you should always do a run with as many parameters as possible. If you have a large dataset you should start by reducing it, for example by picking random samples.
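A minimal sketch of such subsampling, assuming the data is in NumPy arrays X and y as in the example above (the sample size of 1000 is arbitrary):
import numpy as np

# pick a random subset of the rows before running the optimizer
rng = np.random.default_rng(seed=1)
sample_size = min(1000, len(X))
idx = rng.choice(len(X), size=sample_size, replace=False)
X_sample = X[idx]
y_sample = y[idx]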
I hope it is clear that you need a very good understanding of what you are doing. A system that is optimized for speed (an expensive GPU) also helps a lot.
Summary
Talos is a very nice tool that does a lot of work for you. The documentation could be improved, but I am not complaining. It has many more features, but I did not try them all. I could not make my (multi-step) univariate LSTM examples work with it because of the more complex nature of the dataset; I must look into this more.
Can we optimize the optimizer? Of course. As a next step we could take the results table, analyze it automatically and derive a new selection of parameters for the next run.
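A rough sketch of that idea, assuming the results are available as a pandas DataFrame with the same columns as the tables above (the helper name next_params is hypothetical):
import pandas as pd

def next_params(results: pd.DataFrame) -> dict:
    # take the row with the lowest validation loss and build a new
    # parameter dictionary around it for the next run
    best = results.sort_values('val_loss').iloc[0]
    return dict(
        first_neuron=[int(best['first_neuron'])],
        activation=[best['activation']],
        epochs=[int(best['epochs'])],
        batch_size=[int(best['batch_size'])],
    )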
Links / credits
10 Hyperparameter optimization frameworks
https://towardsdatascience.com/10-hyperparameter-optimization-frameworks-8bc87bc8b7e3
5 Regression Loss Functions All Machine Learners Should Know
https://heartbeat.comet.ml/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
How do I choose the optimal batch size?
https://ai.stackexchange.com/questions/8560/how-do-i-choose-the-optimal-batch-size
How to Develop LSTM Models for Time Series Forecasting
https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting
How to get reproducible results in keras
https://stackoverflow.com/questions/32419510/how-to-get-reproducible-results-in-keras
How to Manually Optimize Machine Learning Model Hyperparameters
https://machinelearningmastery.com/manually-optimize-hyperparameters
How to tune the number of epochs and batch_size in Keras-tuner?
https://kegui.medium.com/how-to-tune-the-number-of-epochs-and-batch-size-in-keras-tuner-c2ab2d40878d
Hyperparameter Optimization for Keras, TensorFlow (tf.keras) and PyTorch
https://github.com/autonomio/talos
Hyperparameter Optimization with Keras
https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
Keras 101: A simple (and interpretable) Neural Network model for House Pricing regression
https://towardsdatascience.com/keras-101-a-simple-and-interpretable-neural-network-model-for-house-pricing-regression-31b1a77f05ae
Overview of loss functions for Machine Learning
https://medium.com/analytics-vidhya/overview-of-loss-functions-for-machine-learning-61829095fa8a
What is batch size in neural network?
https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network