Chapter 3
Models, Layers and Activations
In this chapter, we will introduce how to organize the components of neural network models.
0. Setup for this section
import time
import numpy as np
import pandas as pd
import tensorflow as tf
from functools import reduce
from matplotlib import pyplot as plt
from pprint import pprint
print(tf.__version__)
tf.random.set_seed(42)
true_weights = tf.constant(list(range(5)), dtype=tf.float32)[:, tf.newaxis]
x = tf.constant(tf.random.uniform((32, 5)), dtype=tf.float32)
y = tf.constant(x @ true_weights, dtype=tf.float32)
2.1.0
1. Models
A model is a set of parameters and the computation methods using these parameters. Since these two elements are tied together, we could organize them better by encapsulating the two into a class.
Recall our linear regression model. Its parameters and computation are as follows:
weights = tf.Variable(tf.random.uniform((5, 1)), dtype=tf.float32)
y_hat = tf.linalg.matmul(x, weights)
We can write a simple class to do the job.
class LinearRegression(object):

    def __init__(self, num_parameters):
        self._weights = tf.Variable(tf.random.uniform((num_parameters, 1)), dtype=tf.float32)

    @tf.function
    def __call__(self, x):
        return tf.linalg.matmul(x, self._weights)

    @property
    def variables(self):
        return self._weights
Note that we decorated the __call__ method with tf.function, hence a graph will be generated to back the computation.
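As a small aside, here is a sketch of what that means in practice: the Python body runs only while the graph is being traced, and later calls with the same input signature reuse the traced graph.

@tf.function
def traced_square(a):
    print('tracing with shape', a.shape)   # Python side effect: runs only during tracing
    return a * a

_ = traced_square(tf.ones((2, 2)))    # traces a graph, so the print fires
_ = traced_square(tf.zeros((2, 2)))   # same signature: graph is reused, no print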
With this class, we can rewrite the previous training code as:
model = LinearRegression(5)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        y_hat = model(x)
        loss = tf.reduce_mean(tf.square(y - y_hat))
    gradients = tape.gradient(loss, model.variables)
    model.variables.assign_add(tf.constant([-0.05], dtype=tf.float32) * gradients)
    return loss

t0 = time.time()
for iteration in range(1001):
    loss = train_step()
    if not (iteration % 200):
        print('mean squared loss at iteration {:4d} is {:5.4f}'.format(iteration, loss))
pprint(model.variables)
print('time took: {} seconds'.format(time.time() - t0))
mean squared loss at iteration    0 is 18.7201
mean squared loss at iteration  200 is 0.0325
mean squared loss at iteration  400 is 0.0017
mean squared loss at iteration  600 is 0.0002
mean squared loss at iteration  800 is 0.0000
mean squared loss at iteration 1000 is 0.0000
<tf.Variable 'Variable:0' shape=(5, 1) dtype=float32, numpy=
array([[-3.1365452e-03],
       [ 1.0065703e+00],
       [ 2.0000944e+00],
       [ 3.0032609e+00],
       [ 3.9934685e+00]], dtype=float32)>
time took: 0.40303897857666016 seconds
Here, we still decorated the train_step function with tf.function. The reason is that the loss and gradient calculation can also benefit from graphs.
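To get a feel for the benefit, here is a rough sketch of the same update step without the tf.function decorator (note that model(x) is still graph-backed through the decorated __call__; exact timings will vary by machine):

def eager_step():
    with tf.GradientTape() as tape:
        y_hat = model(x)
        loss = tf.reduce_mean(tf.square(y - y_hat))
    gradients = tape.gradient(loss, model.variables)
    # same update rule as train_step, just executed eagerly
    model.variables.assign_add(tf.constant([-0.05], dtype=tf.float32) * gradients)
    return loss

t0 = time.time()
for _ in range(1001):
    eager_step()
print('eager time took: {} seconds'.format(time.time() - t0))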
This simple model class works fine, but it can be better if we subclass from tf.keras.Model.
class LinearRegression(tf.keras.Model):

    def __init__(self, num_parameters, **kwargs):
        super().__init__(**kwargs)
        self._weights = tf.Variable(tf.random.uniform((num_parameters, 1)), dtype=tf.float32)

    @tf.function
    def call(self, x):
        return tf.linalg.matmul(x, self._weights)
This model class has a few differences compared with the plain Python version. First, we implemented the call method rather than the dunder version of it. Under the hood, tf.keras.Model's __call__ method is a wrapper over this call method, and it performs, among many other things, converting inputs to tensors and building graphs. Second, we dropped the accessor for the variables because we inherited one.
With this subclassed model, we need to modify the training code a bit to make it work. The .variables accessor from tf.keras.Model gives us a collection of references to the model's variables, in order to accommodate complex models with many sets of variables. As a result, the corresponding gradients will be a collection too.
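A quick check of that claim with the subclassed model (just a sketch; the probe name is ours):

probe = LinearRegression(5)
# the inherited accessor returns a list, even though this model owns a single variable
print(type(probe.variables), len(probe.variables))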
model = LinearRegression(5)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        y_hat = model(x)
        loss = tf.reduce_mean(tf.square(y - y_hat))
    gradients = tape.gradient(loss, model.variables)
    for g, v in zip(gradients, model.variables):
        v.assign_add(tf.constant([-0.05], dtype=tf.float32) * g)
    return loss

t0 = time.time()
for iteration in range(1001):
    loss = train_step()
    if not (iteration % 200):
        print('mean squared loss at iteration {:4d} is {:5.4f}'.format(iteration, loss))
pprint(model.variables)
print('time took: {} seconds'.format(time.time() - t0))
mean squared loss at iteration    0 is 18.7201
mean squared loss at iteration  200 is 0.0325
mean squared loss at iteration  400 is 0.0017
mean squared loss at iteration  600 is 0.0002
mean squared loss at iteration  800 is 0.0000
mean squared loss at iteration 1000 is 0.0000
[<tf.Variable 'Variable:0' shape=(5, 1) dtype=float32, numpy=
array([[-3.0575476e-03],
       [ 1.0064486e+00],
       [ 2.0001044e+00],
       [ 3.0031927e+00],
       [ 3.9935682e+00]], dtype=float32)>]
time took: 0.39172911643981934 seconds
By subclassing tf.keras.Model, we also get to use many methods from it, such as printing a summary and the Keras model training/testing methods.
print(model.summary())
model.compile(loss='mse', metrics=['mae'])
print(model.evaluate(x, y, verbose=-1))
Model: "linear_regression" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= Total params: 5 Trainable params: 5 Non-trainable params: 0 _________________________________________________________________ None [4.27835811933619e-06, 0.0016658232]
Let's also try using the .fit API to train our linear regression model.
model = LinearRegression(5)
model.compile(optimizer='SGD', loss='mse')
model.optimizer.lr.assign(.05)
t0 = time.time()
history = model.fit(x, y, epochs=1001, verbose=0)
pprint(history.history['loss'][::200])
pprint(model.variables)
print('time took: {} seconds'.format(time.time() - t0))
[19.127595901489258,
 0.02714746817946434,
 0.0010814086999744177,
 6.190096610225737e-05,
 6.568436219822615e-06,
 1.0572820201559807e-06]
[<tf.Variable 'Variable:0' shape=(5, 1) dtype=float32, numpy=
array([[-1.2103192e-03],
       [ 1.0031627e+00],
       [ 2.0000432e+00],
       [ 3.0014870e+00],
       [ 3.9966750e+00]], dtype=float32)>]
time took: 1.2371714115142822 seconds
Not surprisingly, we recovered the ground truth, but it took significantly more time than our custom training loop. This is probably because the convenient .fit API does a lot more than just update the parameters.
Now, let's spice up the model a bit by adding an additional, useless bias term and initializing it with a large value. Ideally, after training, this bias term would become an insignificantly small number.
class LinearRegressionV2(tf.keras.Model):

    def __init__(self, num_parameters, **kwargs):
        super().__init__(**kwargs)
        self._weights = tf.Variable(tf.random.uniform((num_parameters, 1)), dtype=tf.float32)
        self._bias = tf.Variable([100], dtype=tf.float32)

    @tf.function
    def call(self, x):
        return tf.linalg.matmul(x, self._weights) + self._bias
model = LinearRegressionV2(5)

t0 = time.time()
for iteration in range(1001):
    loss = train_step()
pprint(model.variables)
[<tf.Variable 'Variable:0' shape=(5, 1) dtype=float32, numpy=
array([[0.95831835],
       [0.01680839],
       [0.3156035 ],
       [0.16013157],
       [0.7148702 ]], dtype=float32)>,
 <tf.Variable 'Variable:0' shape=(1,) dtype=float32, numpy=array([100.], dtype=float32)>]
Hmm, something is very wrong: the bias term was not updated at all. Guess where the problem might be? It's in the train_step function. Since the input signature has not changed (there is none), the function got lazy and did not realize it should retrace. Thus it kept training with the old graph, in which the bias term simply does not exist!
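Here is a minimal sketch of the same pitfall in isolation (the names are made up for illustration):

captured = tf.Variable(1.0)

@tf.function
def read_captured():
    # the traced graph holds a reference to the object bound to `captured`
    # at tracing time, not to the Python name
    return captured.read_value()

print(read_captured().numpy())   # 1.0
captured = tf.Variable(100.0)    # rebind the Python name to a new variable
print(read_captured().numpy())   # still 1.0 -- nothing triggered a retrace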
To address this issue, we can make model an input to train_step, so that when the function is invoked with a different model it will create or grab a graph accordingly.
@tf.function
def train_step(model):
    with tf.GradientTape() as tape:
        y_hat = model(x)
        loss = tf.reduce_mean(tf.square(y - y_hat))
    gradients = tape.gradient(loss, model.variables)
    for g, v in zip(gradients, model.variables):
        v.assign_add(tf.constant([-0.05], dtype=tf.float32) * g)
    return loss

model = LinearRegression(5)
for iteration in range(1001):
    loss = train_step(model)
pprint(model.variables)

model = LinearRegressionV2(5)
for iteration in range(5001):
    loss = train_step(model)
pprint(model.variables)

print(train_step._get_tracing_count())
[<tf.Variable 'Variable:0' shape=(5, 1) dtype=float32, numpy=
array([[-3.4687696e-03],
       [ 1.0070834e+00],
       [ 2.0001261e+00],
       [ 3.0035236e+00],
       [ 3.9930055e+00]], dtype=float32)>]
[<tf.Variable 'Variable:0' shape=(5, 1) dtype=float32, numpy=
array([[-2.6756483e-03],
       [ 9.9403673e-01],
       [ 1.9959891e+00],
       [ 2.9956400e+00],
       [ 3.9983556e+00]], dtype=float32)>,
 <tf.Variable 'Variable:0' shape=(1,) dtype=float32, numpy=array([0.00964569], dtype=float32)>]
2
We see that TensorFlow traced two graphs for our training function, one for each model.
2. Layers
We typically use the Model class for the overall model architecture, that is, for how we combine many smaller computational units to do the job. It is OK to code up the small units with tf.keras.Model as well and then combine them, but we won't really need many of the model-specific functionalities for these smaller building blocks (for example, it's unlikely we would ever want to call the .fit method on a single unit), so it is better to use the tf.keras.layers.Layer class. In fact, the Model class is a wrapper over the Layer class.
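A one-line check of that relationship:

print(issubclass(tf.keras.Model, tf.keras.layers.Layer))  # True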
Let's say that we want to 'upgrade' the linear regression model to be a composition of a few linear transformations. We can code up the linear transformation as a Layer class and then combine several instances of it in one model.
class Linear(tf.keras.layers.Layer):

    def __init__(self, num_inputs, num_outputs, **kwargs):
        super().__init__(**kwargs)
        self._weights = tf.Variable(tf.random.uniform((num_inputs, num_outputs)), dtype=tf.float32)

    @tf.function
    def call(self, x):
        return tf.linalg.matmul(x, self._weights)

class Regression(tf.keras.Model):

    def __init__(self, num_inputs_per_layer, num_outputs_per_layer, **kwargs):
        super().__init__(**kwargs)
        self._layers = [Linear(num_inputs, num_outputs)
                        for (num_inputs, num_outputs) in zip(num_inputs_per_layer, num_outputs_per_layer)]

    @tf.function
    def call(self, x):
        for layer in self._layers:
            x = layer(x)
        return x
In the Linear class definition above, we swapped the superclass and generalized it with the option to specify the output size. In the Regression model class, we can use either one Linear layer or a chain of them. One obvious benefit of this setup is that we have now separated the concern of how an individual computing unit works from the overall architecture design of the model. We use Layer to handle the former, and Model to tackle the latter.
Let’s see if the model is trainable.
model = Regression([5, 3], [3, 1])
for iteration in range(1001):
    loss = train_step(model)
print('Mean absolute error is: ', tf.reduce_mean(tf.abs(y - model(x))).numpy())
Mean absolute error is: 1.2218952e-06
One problem with this Linear layer is that it needs the complete sizing information upfront and allocates resources for all the variables right away. Ideally, we want it to be a bit lazy: it should calculate variable sizes and occupy resources only when needed. To achieve this, we implement the build method, which handles the variable initialization. The build method can be called explicitly, or it will be invoked automatically the first time data flows through the layer. With this, the constructor only stores the hyperparameters for the layer.
class Linear(tf.keras.layers.Layer):

    def __init__(self, units, **kwargs):
        super(Linear, self).__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        self._weights = self.add_weight(shape=(input_shape[-1], self.units))
        super().build(input_shape)

    @tf.function
    def call(self, x):
        output = tf.linalg.matmul(x, self._weights)
        return output

class Regression(tf.keras.Model):

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self._layers = [Linear(unit) for unit in units]

    @tf.function
    def call(self, x):
        for layer in self._layers:
            x = layer(x)
        return x
model = Regression([3, 1])
pprint(model.variables)  # should be empty

for iteration in range(1001):
    loss = train_step(model)
print('Mean absolute error is: ', tf.reduce_mean(tf.abs(y - model(x))).numpy())
pprint(model.variables)
[]
Mean absolute error is:  8.5681677e-07
[<tf.Variable 'linear_2/Variable:0' shape=(5, 3) dtype=float32, numpy=
array([[ 0.31275296, -0.34405178,  0.3254243 ],
       [-0.00516788,  0.7107771 ,  0.00381865],
       [ 1.0885493 ,  0.08547033,  0.54993755],
       [ 1.5066153 ,  0.12321236,  0.01402206],
       [ 1.1230973 ,  1.2026962 , -0.6002657 ]], dtype=float32)>,
 <tf.Variable 'linear_3/Variable:0' shape=(3, 1) dtype=float32, numpy=
array([[ 1.8777134 ],
       [ 1.4221834 ],
       [-0.30101135]], dtype=float32)>]
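As a quick aside, the lazy behaviour described above can also be checked on the Linear layer in isolation (a sketch; the probe name is ours):

probe = Linear(3)
print(probe.variables)                       # [] -- nothing allocated yet
probe.build((None, 5))                       # build explicitly with a known input shape
print([v.shape for v in probe.variables])    # [TensorShape([5, 3])]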
TensorFlow ships with a lot of layer options; we will cover a sample of them in later, application-specific chapters. For now, we will just quickly check whether its linear layer (called Dense) works the same.
class Regression(tf.keras.Model):

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self._layers = [tf.keras.layers.Dense(unit, use_bias=False) for unit in units]  # the only change

    @tf.function
    def call(self, x):
        for layer in self._layers:
            x = layer(x)
        return x
model = Regression([3, 1])
for iteration in range(1001):
    loss = train_step(model)
print('Mean absolute error is: ', tf.reduce_mean(tf.abs(y - model(x))).numpy())
pprint(model.variables)
Mean absolute error is:  1.206994e-06
[<tf.Variable 'dense/kernel:0' shape=(5, 3) dtype=float32, numpy=
array([[ 0.3859551 , -0.5379335 , -0.3105871 ],
       [-0.3746996 , -0.28152603,  0.45071942],
       [ 1.0012726 , -0.08671893,  0.7469904 ],
       [ 0.2887405 , -0.875283  ,  1.0489686 ],
       [ 0.43472984, -0.5105482 ,  1.6707851 ]], dtype=float32)>,
 <tf.Variable 'dense_1/kernel:0' shape=(3, 1) dtype=float32, numpy=
array([[ 0.41473213],
       [-0.86908275],
       [ 2.020607  ]], dtype=float32)>]
Indeed, it is working as expected.
3. Activations
So our newest model is a composition of two linear transformations, but the composition of two linear transformations is just another linear transformation.
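In symbols, this is just associativity of matrix multiplication:

$$(x W_1) W_2 = x (W_1 W_2) = x W, \qquad W = W_1 W_2,$$

so the two stacked weight matrices collapse into a single (5, 1) matrix $W$. We can check this numerically: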
reduced_model = reduce(tf.linalg.matmul, model.variables)
print(reduced_model)
print(tf.reduce_all(tf.abs(model(x) - x @ reduced_model) < 1e-6))
tf.Tensor(
[[2.2053719e-06]
 [9.9999624e-01]
 [1.9999999e+00]
 [2.9999964e+00]
 [4.0000048e+00]], shape=(5, 1), dtype=float32)
tf.Tensor(True, shape=(), dtype=bool)
Without anything interesting in between, this is just adding unnecessary complexity. The simplest remedy is to apply some non-linear, element-wise transformation to the intermediate results; these transformations are called activations.
Let’s add activations as layers.
class ReLU(tf.keras.layers.Layer):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    @tf.function
    def call(self, x):
        return tf.maximum(tf.constant(0, x.dtype), x)

class NeuralNetwork(tf.keras.Model):

    def __init__(self, units, last_linear=True, **kwargs):
        super().__init__(**kwargs)
        layers = []
        n = len(units)
        for i, unit in enumerate(units):
            layers.append(Linear(unit))
            if i < n - 1 or not last_linear:
                layers.append(ReLU())
        self._layers = layers

    @tf.function
    def call(self, x):
        for layer in self._layers:
            x = layer(x)
        return x
model = NeuralNetwork([3, 1])
for iteration in range(1001):
    loss = train_step(model)
print('Mean absolute error is: ', tf.reduce_mean(tf.abs(y - model(x))).numpy())
pprint(model.variables)

reduced_model = reduce(tf.linalg.matmul, model.variables)
print(reduced_model)
print(tf.reduce_any(tf.abs(model(x) - x @ reduced_model) < 1e-6))
Mean absolute error is:  0.005225517
[<tf.Variable 'linear_4/Variable:0' shape=(5, 3) dtype=float32, numpy=
array([[ 0.10789682, -0.31562465, -0.00532613],
       [-0.05060532, -0.46457097,  0.42605096],
       [-0.61450577,  0.45144582,  0.8922052 ],
       [ 0.439215  , -0.59760475,  1.2671946 ],
       [ 0.36582235,  0.14377609,  1.7026857 ]], dtype=float32)>,
 <tf.Variable 'linear_5/Variable:0' shape=(3, 1) dtype=float32, numpy=
array([[ 0.13811348],
       [-0.3345939 ],
       [ 2.3246977 ]], dtype=float32)>]
tf.Tensor(
[[0.10812643]
 [1.138893  ]
 [1.8381848 ]
 [3.206461  ]
 [3.960648  ]], shape=(5, 1), dtype=float32)
tf.Tensor(False, shape=(), dtype=bool)
With our underlying ground truth being a simple linear model, the added non-linear activations are not really helpful; in fact, they make the optimization harder.
Since activation functions usually follow immediately after linear transformations, we can fuse the two together so that the model code can be simpler.
class Linear(tf.keras.layers.Layer):

    def __init__(self, units, use_bias=True, activation='linear', **kwargs):
        super(Linear, self).__init__(**kwargs)
        self.units = units
        self.use_bias = use_bias
        self.activation = activation

    def build(self, input_shape):
        self._weights = self.add_weight(shape=(input_shape[-1], self.units))
        if self.use_bias:
            self._bias = self.add_weight(shape=(self.units,), initializer='ones')
        super().build(input_shape)

    @tf.function
    def call(self, x):
        output = tf.linalg.matmul(x, self._weights)
        if self.use_bias:
            output += self._bias
        if self.activation == 'relu':
            output = tf.maximum(tf.constant(0, x.dtype), output)
        return output

class NeuralNetwork(tf.keras.Model):

    def __init__(self, units, use_bias=True, last_linear=True, **kwargs):
        super().__init__(**kwargs)
        layers = [Linear(unit, use_bias, 'relu') for unit in units[:-1]]
        layers.append(Linear(units[-1], use_bias, 'linear' if last_linear else 'relu'))
        self._layers = layers

    @tf.function
    def call(self, x):
        for layer in self._layers:
            x = layer(x)
        return x
model = NeuralNetwork([3, 1])
for iteration in range(1001):
    loss = train_step(model)
print('Mean absolute error is: ', tf.reduce_mean(tf.abs(y - model(x))).numpy())
pprint(model.variables)
Mean absolute error is:  0.0075564235
[<tf.Variable 'linear_6/Variable:0' shape=(5, 3) dtype=float32, numpy=
array([[ 0.13394988,  0.51800126,  0.09352156],
       [-0.15037392,  0.25468782,  0.49990052],
       [-0.69800115, -0.42260593,  0.8586105 ],
       [-0.08989831, -0.33655766,  1.2849052 ],
       [ 1.0911363 , -0.32361698,  1.6646731 ]], dtype=float32)>,
 <tf.Variable 'linear_6/Variable:0' shape=(3,) dtype=float32, numpy=array([ 0.5917124,  0.5500173, -0.576148 ], dtype=float32)>,
 <tf.Variable 'linear_7/Variable:0' shape=(3, 1) dtype=float32, numpy=
array([[ 0.14004707],
       [-0.45114964],
       [ 2.2325916 ]], dtype=float32)>,
 <tf.Variable 'linear_7/Variable:0' shape=(1,) dtype=float32, numpy=array([1.4508461], dtype=float32)>]
There are also many activation functions that ship with TensorFlow for us to choose from. We will do a survey of them later.
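For instance, a few of the standard ones can be looked up by name and applied directly (a quick sketch):

for name in ['relu', 'sigmoid', 'tanh', 'elu']:
    fn = tf.keras.activations.get(name)
    print(name, fn(tf.constant([-1.0, 0.0, 1.0])).numpy())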
4. Fully Connected Networks
With the code above, we have just made a fully connected network, historically also called a multilayer perceptron (without any actual perceptrons in it) or a feedforward neural network. It is essentially a sequence of linear transformations with element-wise non-linear activations sandwiched in between. We usually think of the initial layers as feature extractors performing some kind of implicit feature engineering and selection, and of the last layer as a regressor or classifier, depending on the task.
Note how we are using a list to host the layers and applying them sequentially in the call method. Let's quickly implement a quality-of-life model class called Sequential to do this. It is pretty much a watered-down version of tf.keras.Sequential.
class Sequential(tf.keras.Model):

    def __init__(self, layers, **kwargs):
        super().__init__(**kwargs)
        self._layers = layers

    @tf.function
    def call(self, x):
        for layer in self._layers:
            x = layer(x)
        return x

class MLP(tf.keras.Model):

    def __init__(self, num_hidden_units, num_targets, hidden_activation='relu', **kwargs):
        super().__init__(**kwargs)
        if type(num_hidden_units) is int:
            num_hidden_units = [num_hidden_units]
        self.feature_extractor = Sequential([tf.keras.layers.Dense(unit, activation=hidden_activation)
                                             for unit in num_hidden_units])
        self.last_linear = tf.keras.layers.Dense(num_targets, activation='linear')

    @tf.function
    def call(self, x):
        features = self.feature_extractor(x)
        outputs = self.last_linear(features)
        return outputs
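For comparison, the built-in tf.keras.Sequential can express essentially the same architecture directly (a sketch; not trained here):

builtin_mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear'),
])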
Let's try applying our MLP model to a real regression problem: the Boston housing dataset shipped with TensorFlow. The dataset is split into two sets, a training set and a testing set. We will train the model on the training set only, but record the loss on both sets to see whether the reduction in training loss is in line with the reduction on the unseen testing set.
(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.boston_housing.load_data()
y_tr, y_te = map(lambda x: np.expand_dims(x, -1), (y_tr, y_te))
x_tr, y_tr, x_te, y_te = map(lambda x: tf.cast(x, tf.float32), (x_tr, y_tr, x_te, y_te))
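As a quick sanity check on shapes (the default split gives 404 training and 102 test examples with 13 features each):

print(x_tr.shape, y_tr.shape)   # (404, 13) (404, 1)
print(x_te.shape, y_te.shape)   # (102, 13) (102, 1)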
@tf.function
def train_step(model, x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(y - model(x)))
    gradients = tape.gradient(loss, model.variables)
    for g, v in zip(gradients, model.variables):
        v.assign_add(tf.constant([-0.01], dtype=tf.float32) * g)
    return loss

@tf.function
def test_step(model, x, y):
    return tf.reduce_mean(tf.square(y - model(x)))

def train(model, n_epochs=1000, his_freq=10):
    history = []
    for iteration in range(1, n_epochs + 1):
        tr_loss = train_step(model, x_tr, y_tr)
        te_loss = test_step(model, x_te, y_te)
        if not iteration % his_freq:
            history.append({
                'iteration': iteration,
                'training_loss': tr_loss.numpy(),
                'testing_loss': te_loss.numpy(),
            })
    return model, pd.DataFrame(history)
mlp, mlp_history = train(MLP(4, 1))
pprint(mlp_history.tail())
ax = mlp_history.plot(x='iteration', kind='line', logy=True)
fig = ax.get_figure()
fig.savefig('ch3_plot_1.png')
    iteration  training_loss  testing_loss
95        960      84.622253     83.714134
96        970      84.622231     83.713562
97        980      84.622299     83.713020
98        990      84.622231     83.712601
99       1000      84.622269     83.712318
It may seem that our model has converged nicely, since there is not much discrepancy between the training set and testing set performance. However, if we look at the numbers more closely, they are pretty bad: a simple constant prediction has an MSE of around 83. What could have gone wrong? We will look at other optimizers in the next chapter.
print(tf.reduce_mean(tf.square(y_te - tf.reduce_mean(y_te))))
tf.Tensor(83.24384, shape=(), dtype=float32)