Unsupervised Learning – Deep Learning with TensorFlow 2 and Keras – Second Edition


Unsupervised Learning

This chapter delves into unsupervised learning models. In the previous chapter we explored autoencoders, novel neural networks that learn via unsupervised learning. In this chapter we will delve deeper into some other unsupervised learning models. In contrast to supervised learning, where the training dataset consists of both the input and the desired labels, unsupervised learning deals with the case where the model is provided only the input: the model learns the inherent input distribution by itself, without any desired label guiding it. Clustering and dimensionality reduction are the two most commonly used unsupervised learning techniques, and in this chapter we will learn about different machine learning and neural network techniques for both. We will cover the techniques required for clustering and dimensionality reduction, go into the details of Boltzmann machines, and finally cover the implementation of these techniques using TensorFlow. The concepts covered will be extended to build Restricted Boltzmann Machines (RBMs). The chapter will include:

  • Principal component analysis
  • K-Means clustering
  • Self-organizing maps
  • Boltzmann machines
  • RBMs

Principal component analysis

Principal component analysis (PCA) is the most popular multivariate statistical technique for dimensionality reduction. It analyzes training data consisting of several dependent variables, which are, in general, inter-correlated, and extracts the important information from the training data in the form of a set of new orthogonal variables called principal components. We can perform PCA using one of two methods: eigen decomposition or singular value decomposition (SVD).

PCA reduces the n-dimensional input data to r-dimensional data, where r < n. In the simplest terms, PCA involves translating the origin and rotating the axes such that one of the axes (the principal axis) has the highest variance with the data points. A reduced-dimensions dataset is obtained from the original dataset by performing this transformation and then dropping (removing) the orthogonal axes with low variance. Here we employ the SVD method for PCA dimensionality reduction. Consider X, the n-dimensional data with p points; that is, X is a matrix of size p × n. From linear algebra we know that any real matrix can be decomposed using singular value decomposition:

X = U Σ Vᵀ

Where U and V are orthonormal matrices (that is, U·Uᵀ = V·Vᵀ = I) of size p × p and n × n respectively. Σ is a diagonal matrix of size p × n. The U matrix is called the left singular matrix, and V the right singular matrix, and Σ, the diagonal matrix, contains the singular values of X as its diagonal elements. Here we assume that the X matrix is centered. The columns of the V matrix are the principal components, and the columns of UΣ are the data transformed by the principal components.

Now, to reduce the dimensions of the data from n to k (where k < n), we select the first k columns of U and the upper-left k × k part of Σ. The product of the two gives us our reduced-dimensions matrix:

Y = U_k Σ_k

The data Y thus obtained is of reduced dimensions. Next we implement PCA in TensorFlow 2.0.
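Before moving to TensorFlow, the reduction formula can be verified with a small NumPy sketch (a toy example with made-up data, not part of the chapter's code): the product U_k Σ_k equals the projection of the centered data onto the first k principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)               # center the data, as PCA requires

# Economy-size SVD: X = U @ np.diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Y = U[:, :k] @ np.diag(s[:k])        # reduced-dimensions data: U_k Sigma_k

# Identical to projecting X onto the first k principal components (rows of Vt)
assert np.allclose(Y, X @ Vt[:k].T)
```

The assertion confirms that keeping U_k Σ_k is exactly the projection X·V_k, which is why dropping the low-variance components loses as little information as possible.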

PCA on the MNIST dataset

Let us now implement PCA in TensorFlow 2.0. Besides TensorFlow, we will also need NumPy for some elementary matrix calculations, and Matplotlib, the Matplotlib toolkits, and Seaborn for plotting:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

Next we load the MNIST dataset. Since we are doing dimension reduction using PCA, we do not need a test dataset or even labels; however, we are loading labels so that after reduction we can verify the PCA performance. PCA should cluster similar datapoints in one cluster, hence if we see the clusters formed using PCA are similar to our labels it would indicate that our PCA works:

((x_train, y_train), (_, _)) = tf.keras.datasets.mnist.load_data()

Before we do PCA we should preprocess the data. We first normalize it so that all data has values between 0 and 1, and then reshape the image from being 28 × 28 matrix to a 784-dimensional vector, and finally center it by subtracting the mean:

x_train = x_train / 255.
x_train = x_train.astype(np.float32)
x_train = np.reshape(x_train, (x_train.shape[0], 784))
mean = x_train.mean(axis = 0)  # per-pixel mean over all images
x_train = x_train - mean

Now that our data is in the right format, we make use of TensorFlow's powerful linear algebra (linalg) module to calculate the SVD of our training dataset. TensorFlow provides the function svd() defined in tf.linalg to perform this task. We then use the diag function to convert the sigma array (s, a list of singular values) into a diagonal matrix:

s, u, v = tf.linalg.svd(x_train)
s = tf.linalg.diag(s)

This provides us with a diagonal matrix s of size 784 × 784, a left singular matrix u of size 60000 × 784, and a right singular matrix v of size 784 × 784. This is because the argument full_matrices of the function svd() is set to False by default. As a result, it does not generate the full U matrix (in this case, of size 60000 × 60000); instead, if the input X is of size m × n, it generates U of size m × p, where p = min(m, n).

The reduced-dimension data can now be generated by multiplying the respective slices of u and s. We reduce our data from 784 to 3 dimensions; we could choose any dimension less than 784, but we chose 3 here so that it is easier to visualize later. We make use of tf.Tensor.__getitem__ to slice our matrices in the Pythonic way:

k = 3
pca = tf.matmul(u[:,0:k], s[0:k,0:k])

A comparison of the original and reduced data shape is done in the following code:

print('original data shape',x_train.shape)
print('reduced data shape', pca.shape)
       original data shape (60000, 784)
       reduced data shape (60000, 3)
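As an aside, the singular values also tell us how much variance each principal component captures, namely sᵢ² / Σ s². A quick sketch on synthetic data (with MNIST, the values would come from the s computed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data whose features have increasing variance
X = rng.normal(size=(200, 10)) * np.arange(1, 11)
X = X - X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)    # singular values, largest first

k = 3
explained = (s[:k]**2).sum() / (s**2).sum()
print(f'first {k} components explain {explained:.1%} of the variance')
```

This ratio is a common way to decide how many components to keep.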

Finally let us plot the data points in the three-dimensional space.

Set = sns.color_palette("Set2", 10)
color_mapping = {key:value for (key,value) in enumerate(Set)}
colors = list(map(lambda x: color_mapping[x], y_train))
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(pca[:, 0], pca[:, 1],pca[:, 2], c=colors)

You can see that the points corresponding to the same color and hence same label are clustered together. We have therefore successfully used PCA to reduce the dimensions of MNIST images. Each original image was of size 28 × 28. Using the PCA method we can reduce it to a smaller size. Normally for image data, dimensionality reduction is necessary. This is because images are large in size and contain a significant amount of redundant data.

TensorFlow Embedding API

TensorFlow also offers an Embedding API through which one can find and visualize PCA and t-SNE [1] clusters using TensorBoard. You can see live PCA on MNIST images here: http://projector.tensorflow.org. The following image is reproduced for reference:

Figure 1: A visualization of a principal component analysis, applied to the MNIST dataset

You can process your data using TensorBoard. It contains a tool called Embedding Projector that allows one to interactively visualize embeddings. The Embedding Projector tool has three panels:

  • Data Panel: It is located at the top left, and you can choose the data, labels, and so on in this panel.
  • Projections Panel: Available at the bottom left, you can choose the type of projections you want here. It offers three choices: PCA, t-SNE, and custom.
  • Inspector Panel: On the right-hand side, here you can search for particular points and see a list of nearest neighbors.

Figure 2: Screenshot of the Embedding Projector tool

K-means clustering

K-means clustering, as the name suggests, is a technique to cluster data, that is, to partition data into a specified number of clusters. It is an unsupervised learning technique that works by identifying patterns in the given data. Remember the sorting hat of Harry Potter fame? What it does in the books is clustering: dividing new (unlabeled) students into four different clusters: Gryffindor, Ravenclaw, Hufflepuff, and Slytherin.

Humans are very good at grouping objects together; clustering algorithms try to give a similar capability to computers. There are many clustering techniques available, such as Hierarchical, Bayesian, or Partitional. K-means clustering belongs to partitional clustering; it partitions the data into k clusters. Each cluster has a center, called the centroid. The number of clusters k has to be specified by the user.

The k-means algorithm works in the following manner:

  1. Randomly choose k data points as the initial centroids (cluster centers)
  2. Assign each data point to the closest centroid; there can be different measures to find closeness, the most common being the Euclidean distance
  3. Recompute the centroids using current cluster membership, such that the sum of squared distances decreases
  4. Repeat the last two steps until convergence is met

In the previous TensorFlow versions the KMeans class was implemented in the Contrib module; however, the class is no longer available in TensorFlow 2.0. Here we will instead use the advanced mathematical functions provided in TensorFlow 2.0 to implement k-means clustering.

K-means in TensorFlow 2.0

To demonstrate k-means in TensorFlow, we will use randomly generated data in the code that follows. Our randomly generated data will contain 200 samples, and we will divide them into three clusters. We start by importing all the required modules and defining the variables: the number of sample points (points_n), the number of clusters to be formed (clusters_n), and the number of iterations we will perform (iteration_n). We also seed the random number generators to ensure that our work is reproducible:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
points_n = 200
clusters_n = 3
iteration_n = 100
seed = 123
np.random.seed(seed)
tf.random.set_seed(seed)

Now we randomly generate data and from the data select three centroids randomly:

points = np.random.uniform(0, 10, (points_n, 2))
centroids = tf.slice(tf.random.shuffle(points), [0, 0], [clusters_n, -1])

You can see the scatter plot of all the points and the randomly selected three centroids in the following graph:

plt.scatter(points[:, 0], points[:, 1], s=50, alpha=0.5)
plt.plot(centroids[:, 0], centroids[:, 1], 'kx', markersize=15)

Figure 3: Randomly generated data, from three randomly selected centroids, plotted

We define the function closest_centroids() to assign each point to the centroid it is closest to:

def closest_centroids(points, centroids):
    distances = tf.reduce_sum(tf.square(tf.subtract(points, centroids[:,None])), 2)
    assignments = tf.argmin(distances, 0)
    return assignments

We create another function move_centroids(). It recalculates the centroids such that the sum of squared distances decreases:

def move_centroids(points, closest, centroids):
    return np.array([points[closest==k].mean(axis=0) for k in range(centroids.shape[0])])

Now we call these two functions iteratively for 100 iterations. We have chosen the number of iterations arbitrarily; you can increase and decrease it to see the effect:

for step in range(iteration_n):
    closest = closest_centroids(points, centroids)
    centroids = move_centroids(points, closest, centroids)

In the following graph, you can see the final centroids after 100 iterations. We have also colored the points based on which centroid they are closest to. The yellow points correspond to one cluster (nearest the cross in its center), and the same is true for the purple and green cluster points:

plt.scatter(points[:, 0], points[:, 1], c=closest, s=50, alpha=0.5)
plt.plot(centroids[:, 0], centroids[:, 1], 'kx', markersize=15)

Figure 4: Plot of the final centroids after 100 iterations

Please note that the plot command works in Matplotlib 3.1.1 or higher versions.

In the preceding code we decided to limit the number of clusters to 3, but in most cases with unlabeled data, one is never sure how many clusters exist. One can determine the optimal number of clusters using the elbow method. The method is based on the principle that we should choose a cluster number that gives a low sum of squared error (SSE) distance. If k is the number of clusters, then as k increases the SSE decreases, reaching SSE = 0 when k equals the number of data points, since then each point is its own cluster. We want a low value of k such that SSE is also low. For the famous Fisher's Iris dataset, if we plot SSE for different k values, we can see from the plot below that the drop in SSE is steepest up to k=3; after that it levels off, thus the elbow point is k=3:

Figure 5: Plotting SSE against the Number of Clusters, using Fisher's Iris data set
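The elbow computation itself can be sketched in a few lines. The following is a toy NumPy version on synthetic blob data (not the chapter's code, and not the Iris set):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated 2D blobs of 60 points each
points = np.concatenate([rng.normal(c, 0.5, size=(60, 2))
                         for c in ([0, 0], [5, 5], [0, 5])])

def kmeans_sse(points, k, iterations=50):
    # Plain k-means; returns the final sum of squared errors
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        d = ((points[:, None] - centroids[None])**2).sum(-1)
        closest = d.argmin(1)
        centroids = np.array([points[closest == j].mean(0) if np.any(closest == j)
                              else centroids[j] for j in range(k)])
    return ((points - centroids[closest])**2).sum()

sse = [kmeans_sse(points, k) for k in range(1, 7)]
# SSE drops steeply up to k=3 and flattens afterward: the elbow is at k=3
```

Plotting sse against k for this data produces the characteristic elbow shape described above.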

K-means clustering is very popular because it is fast, simple, and robust. It also has some disadvantages, however: the biggest is that the user has to specify the number of clusters. Second, the algorithm does not guarantee a global optimum; the results can change if the initial randomly chosen centroids change. Third, it is very sensitive to outliers.

Variations in k-means

In the original k-means algorithm each point belongs to a specific cluster (centroid); this is called hard clustering. However, we can have one point belong to all the clusters, with a membership function defining how much it belongs to a particular cluster (centroid). This is called fuzzy clustering or soft clustering. This variation was proposed in 1973 by J. C. Dunn and later improved upon by J. C. Bezdek in 1981. Though soft clustering takes longer to converge, it can be useful when a point can be in multiple classes, or when we want to know how similar a given point is to different clusters.

The accelerated k-means algorithm was created in 2003 by Charles Elkan. He exploited the triangle inequality relationship (that is, that a straight line is the shortest distance between two points). Instead of just doing all distance calculations at each iteration, he also kept track of the lower and upper bounds for distances between points and centroids.

In 2006, David Arthur and Sergei Vassilvitskii proposed the k-means++ algorithm. The major change they proposed was in the initialization of centroids. They showed that if we choose centroids that are distant from each other, then the k-means algorithm is less likely to converge on a suboptimal solution.

Another alternative can be that at each iteration we do not use the entire dataset, instead using mini-batches. This modification was proposed by David Sculley in 2010.
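The core of the mini-batch variant is a per-centroid learning rate that decays with the number of samples each centroid has absorbed. Here is a minimal NumPy sketch of the idea (an illustration, not Sculley's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, (1000, 2))

k, batch_size, steps = 3, 50, 100
centroids = points[rng.choice(len(points), k, replace=False)]
counts = np.zeros(k)                    # samples absorbed by each centroid

for _ in range(steps):
    batch = points[rng.choice(len(points), batch_size, replace=False)]
    closest = ((batch[:, None] - centroids[None])**2).sum(-1).argmin(1)
    for x, j in zip(batch, closest):
        counts[j] += 1
        eta = 1.0 / counts[j]           # per-centroid learning rate decays over time
        centroids[j] = (1 - eta) * centroids[j] + eta * x
```

Because each step touches only a small batch, this scales to datasets far too large for the full-batch updates used earlier in this section.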

Self-organizing maps

Both k-means and PCA can cluster the input data; however, they do not maintain topological relationships. In this section we will consider self-organizing maps (SOMs), sometimes known as Kohonen networks or winner-take-all units (WTUs), which do maintain the topological relationship. SOMs are a very special kind of neural network, inspired by a distinctive feature of the human brain: in our brain, different sensory inputs are represented in a topologically ordered manner. Unlike in other neural networks, the neurons are not all connected to each other via weights; instead, they influence each other's learning. The most important aspect of a SOM is that its neurons represent the learned inputs in a topographic manner. SOMs were proposed by Teuvo Kohonen in 1989 [2].

In SOMs, neurons are usually placed at the nodes of a (1D or 2D) lattice. Higher dimensions are also possible but are rarely used in practice. Each neuron in the lattice is connected to all the input units via a weight matrix. The following diagram shows a SOM with 6 × 8 neurons (48 neurons) and 5 inputs. For clarity, only the weight vectors connecting all inputs to one neuron are shown. In this case, each neuron will have a weight vector of five elements, resulting in a combined weight matrix of size 48 × 5:

A SOM learns via competitive learning. It can be considered as a nonlinear generalization of PCA and thus, like PCA, can be employed for dimensionality reduction.

In order to implement a SOM, let's first understand how it works. As a first step, the weights of the network are initialized either to some random values or by taking random samples from the input. Each neuron occupying a space in the lattice is assigned a specific location. Now, as an input is presented, the neuron with the least distance from the input is declared the winner (WTU). This is done by measuring the distance between the weight vectors (W) and the input vector (X) for all neurons:

d_j = √( Σ_i (x_i − w_ij)² )

Here, d_j is the distance of the weights of neuron j from input X. The neuron with the lowest d value is the winner.

Next, the weights of the winning neuron and its neighboring neurons are adjusted in a manner to ensure that the same neuron is the winner if the same input is presented next time.

To decide which neighboring neurons need to be modified, the network uses a neighborhood function Γ(d); normally, the Gaussian Mexican hat function is chosen as the neighborhood function. The neighborhood function is mathematically represented as follows:

Γ(d) = exp(−d² / 2σ²)

Here, σ is a time-dependent radius of influence of a neuron and d is its distance from the winning neuron. Graphically the function looks like a hat (hence its name), as you can see in the following figure:

Figure 6: The "Gaussian Mexican hat" function, visualized in graph form

Another important property of the neighborhood function is that its radius reduces with time. As a result, in the beginning many neighboring neurons' weights are modified, but as the network learns, eventually only a few neurons' weights (at times, only one or none) are modified in the learning process. The change in weight is given by the following equation:

dW = η(t) Γ(d, t) (X − W)

The process is repeated for all the inputs for a given number of iterations. As the iterations progress, we reduce the learning rate and the radius by a factor dependent on the iteration number.
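The steps above can be traced numerically. The following toy NumPy sketch (with a hypothetical lattice size and input) performs one SOM update: find the winner, compute the Gaussian neighborhood, and move the weights toward the input:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, dim = 4, 4, 3                      # 4 x 4 lattice, 3-dimensional inputs
W = rng.random((m * n, dim))             # one weight vector per neuron
locs = np.array([[i, j] for i in range(m) for j in range(n)])

x = rng.random(dim)                      # one input vector
eta, sigma = 0.5, 1.0                    # both decay as iterations progress

# 1. Winner: the neuron whose weights are closest to the input
d = np.linalg.norm(W - x, axis=1)
winner = d.argmin()

# 2. Gaussian neighborhood on the lattice distance to the winner
lattice_d2 = ((locs - locs[winner])**2).sum(1)
gamma = np.exp(-lattice_d2 / (2 * sigma**2))

# 3. Weight update: dW = eta * gamma * (X - W)
W += eta * gamma[:, None] * (x - W)
```

After the update, the winner (and, to a lesser extent, its lattice neighbors) is closer to the input, which is exactly what guarantees it wins again for the same input.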

SOMs are computationally expensive and thus are not really useful for very large datasets. Still, they are easy to understand, and they can very nicely find the similarity between input data. Thus, they have been employed for image segmentation and to determine word similarity maps in NLP [3].

Colour mapping using SOM

Some of the interesting properties of the feature map of the input space generated by SOM are:

  • The feature map provides a good representation of the input space. This property can be used to perform vector quantization so that we may have a continuous input space, and using SOM we can represent it in a discrete output space.
  • The feature map is topologically ordered, that is, the spatial location of a neuron in the output lattice corresponds to a particular feature of the input.
  • The feature map also reflects the statistical distribution of the input space; the domain that has the largest number of input samples gets a wider area in the feature map.

These features of SOM make them the natural choice for many interesting applications. Here we use a SOM to cluster a range of given R, G, and B pixel values into a corresponding color map. We start by importing the required modules:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

The main component of the code is our class WTU. The class __init__() function initializes the various hyperparameters of our SOM: the dimensions of our 2D lattice (m, n), the number of features in the input (dim), the neighborhood radius (sigma), the initial weights, and the topographic information:

# Define the Winner Take All units
class WTU(object):
    def __init__(self, m, n, dim, num_iterations, eta = 0.5, sigma = None):
        """
        m x n : The dimension of the 2D lattice in which neurons are arranged
        dim : Dimension of the input training data
        num_iterations : Total number of training iterations
        eta : Learning rate
        sigma : The radius of the neighbourhood function
        """
        self._m = m
        self._n = n
        self._neighbourhood = []
        self._topography = []
        self._num_iterations = int(num_iterations)
        self._learned = False
        self.dim = dim
        self.eta = float(eta)
        if sigma is None:
            sigma = max(m, n)/2.0    # Constant radius
        self.sigma = float(sigma)
        print('Network created with dimensions', m, n)
        # Weight matrix and the topography of neurons
        self._W = tf.random.normal([m*n, dim], seed = 0)
        self._topography = np.array(list(self._neuron_location(m, n)))
The most important function of the class is the training() function, where we use the Kohonen algorithm as discussed before to find the winner unit and then update the weights based on the neighborhood function:

def training(self, x, i):
    m = self._m
    n = self._n
    # Finding the winner and its location
    d = tf.sqrt(tf.reduce_sum(tf.pow(self._W - tf.stack([x for i in range(m*n)]), 2), 1))
    self.WTU_idx = tf.argmin(d, 0)
    slice_start = tf.pad(tf.reshape(self.WTU_idx, [1]), np.array([[0, 1]]))
    self.WTU_loc = tf.reshape(tf.slice(self._topography, slice_start, [1, 2]), [2])
    # Change learning rate and radius as a function of iterations
    learning_rate = 1 - i/self._num_iterations
    _eta_new = self.eta * learning_rate
    _sigma_new = self.sigma * learning_rate
    # Calculating the neighbourhood function
    distance_square = tf.reduce_sum(tf.pow(tf.subtract(
        self._topography, tf.stack([self.WTU_loc for i in range(m * n)])), 2), 1)
    neighbourhood_func = tf.exp(tf.negative(tf.math.divide(tf.cast(
        distance_square, "float32"), tf.pow(_sigma_new, 2))))
    # Multiply learning rate with the neighbourhood function
    eta_into_Gamma = tf.multiply(_eta_new, neighbourhood_func)
    # Shape it so that it can be multiplied to calculate dW
    weight_multiplier = tf.stack([tf.tile(tf.slice(
        eta_into_Gamma, np.array([i]), np.array([1])), [self.dim])
        for i in range(m * n)])
    delta_W = tf.multiply(weight_multiplier,
        tf.subtract(tf.stack([x for i in range(m * n)]), self._W))
    new_W = self._W + delta_W
    self._W = new_W

The fit() function is a helper function that calls the training() function iteratively and stores the centroid grid for easy retrieval:

def fit(self, X):
    """Function to carry out training"""
    for i in range(self._num_iterations):
        for x in X:
            self.training(x, i)
    # Store a centroid grid for easy retrieval
    centroid_grid = [[] for i in range(self._m)]
    self._Wts = list(self._W)
    self._locations = list(self._topography)
    for i, loc in enumerate(self._locations):
        centroid_grid[loc[0]].append(self._Wts[i])
    self._centroid_grid = centroid_grid
    self._learned = True

Then there are some more helper functions to find the winner and generate a 2D lattice of neurons, and a function to map input vectors to the corresponding neurons in the 2D lattice:

def winner(self, x):
    idx = self.WTU_idx, self.WTU_loc
    return idx
def _neuron_location(self, m, n):
    """Function to generate the 2D lattice of neurons"""
    for i in range(m):
        for j in range(n):
            yield np.array([i, j])
def get_centroids(self):
    """Function to return a list of 'm' lists, with each inner list containing
    the 'n' corresponding centroid locations as 1-D NumPy arrays."""
    if not self._learned:
        raise ValueError("SOM not trained yet")
    return self._centroid_grid
def map_vects(self, X):
    """Function to map each input vector to the relevant neuron in the lattice"""
    if not self._learned:
        raise ValueError("SOM not trained yet")
    to_return = []
    for vect in X:
        min_index = min([i for i in range(len(self._Wts))],
                        key=lambda x: np.linalg.norm(vect - self._Wts[x]))
        to_return.append(self._locations[min_index])
    return to_return

We will also need to normalize the input data, so we create a function to do so:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result.astype(np.float32)
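To see what normalize() does, here is a quick standalone check on a toy DataFrame of RGB values (the function is repeated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result.astype(np.float32)

# RGB values in the 0-255 range are rescaled column-wise to the 0-1 range
df = pd.DataFrame({'R': [255, 0, 128], 'G': [0, 255, 128], 'B': [0, 0, 255]})
out = normalize(df)
```

Each column is rescaled independently, so after normalization every channel spans exactly [0, 1].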

Let us read the data. The data contains Red, Green, and Blue channel values. Let us normalize them:

## Reading input data from file
import pandas as pd
df = pd.read_csv('colors.csv')  # The last column of data file is a label
data = normalize(df[['R', 'G', 'B']]).values
name = df['Color-Name'].values
n_dim = len(df.columns) - 1
# Data for Training
colors = data
color_names = name

Let us create our SOM and fit it:

som = WTU(30, 30, n_dim, 400, sigma=10.0)
som.fit(data)

Now, let's look at the result of the trained model. In the following code, you can see the color map in the 2D neuron lattice:

# Get output grid
image_grid = som.get_centroids()
# Map colours to their closest neurons
mapped = som.map_vects(colors)
# Plot
plt.imshow(image_grid)
plt.title('Color Grid SOM')
for i, m in enumerate(mapped):
    plt.text(m[1], m[0], color_names[i], ha='center', va='center',
             bbox=dict(facecolor='white', alpha=0.5, lw=0))

Figure 7: A plotted color map of the 2D neuron lattice

You can see that neurons that win for similar colors are closely placed.

Restricted Boltzmann machines

The RBM is a two-layered neural network: the first layer is called the visible layer and the second layer is called the hidden layer. It is called a shallow neural network because it is only two layers deep. RBMs were first proposed in 1986 by Paul Smolensky (he called them Harmony Networks [1]); later, Geoffrey Hinton proposed Contrastive Divergence (CD) as a method to train them. All neurons in the visible layer are connected to all the neurons in the hidden layer, but there is a restriction: no two neurons in the same layer can be connected. All neurons in the RBM are binary in nature.

RBMs can be used for dimensionality reduction, feature extraction, and collaborative filtering. The training of RBMs can be divided into three parts: forward pass, backward pass, and then compare.

Let us delve deeper into the math. We can divide the operation of RBMs into two passes:

Forward pass: The information at the visible units (V) is passed via weights (W) and biases (c) to the hidden units (h0). A hidden unit may fire or not depending on the stochastic probability σ, which is basically the sigmoid function:

p(h_i = 1 | V) = σ(c_i + Σ_j w_ji v_j)

Backward pass: The hidden unit representation (h0) is then passed back to the visible units through the same weights, W, but a different bias, b, where they reconstruct the input. Again, the input is sampled:

p(v_j = 1 | h) = σ(b_j + Σ_i w_ji h_i)

These two passes are repeated for k steps or until the convergence [4] is reached. According to researchers, k=1 gives good results, so we will keep k = 1.
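A single CD step with k = 1 can be traced on a toy RBM; the sizes and weights below are made up for illustration, mirroring the forward and backward passes just described:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Tiny RBM: 3 visible units, 2 hidden units
W = rng.normal(size=(3, 2))      # weights
b = np.zeros(3)                  # visible bias
c = np.zeros(2)                  # hidden bias

v0 = np.array([1.0, 0.0, 1.0])   # an input vector

# Forward pass: sample h0, firing with probability sigmoid(c + v0 W)
p_h = sigmoid(c + v0 @ W)
h0 = (p_h > rng.random(2)).astype(float)

# Backward pass: reconstruct v1, firing with probability sigmoid(b + h0 W^T)
p_v = sigmoid(b + h0 @ W.T)
v1 = (p_v > rng.random(3)).astype(float)
```

The TensorFlow implementation below performs exactly these two passes, batched over many input vectors at once.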

The joint configuration of the visible vector V and the hidden vector h has an energy given as follows:

E(v, h) = −bᵀv − cᵀh − vᵀWh

Also associated with each visible vector V is the free energy, the energy that a single configuration would need to have in order to have the same probability as all of the configurations that contain V:

F(v) = −bᵀv − Σ_i log(1 + exp(c_i + (vᵀW)_i))

Using the Contrastive Divergence objective function, that is, Mean(F(Voriginal)) − Mean(F(Vreconstructed)), the change in weights is given by:

dW = η [ (vᵀh)input − (vᵀh)reconstructed ]

Here, η is the learning rate. Similar expressions exist for the biases b and c.

Reconstructing images using RBM

Let us build an RBM in TensorFlow 2.0. The RBM will be designed to reconstruct handwritten digits like the Autoencoders did in Chapter 9, Autoencoders. We import TensorFlow, NumPy, and Matplotlib libraries:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

We define a class RBM. The class __init__() function initializes the number of neurons in the visible layer (input_size) and the number of neurons in the hidden layer (output_size). The function initializes the weights and biases for both the hidden and visible layers. In the following code we have initialized them to zero; you can try random initialization as well:

# Class that defines the behavior of the RBM
class RBM(object):
    def __init__(self, input_size, output_size, lr=1.0, batchsize=100):
        """
        input_size: Number of neurons in the visible layer
        output_size: Number of neurons in the hidden layer
        """
        # Defining the hyperparameters
        self._input_size = input_size   # Size of the visible layer
        self._output_size = output_size # Size of the hidden layer
        self.learning_rate = lr         # The step used in gradient descent
        self.batchsize = batchsize      # The amount of data used per sub-iteration of training
        # Initializing weights and biases as matrices full of zeroes
        self.w = tf.zeros([input_size, output_size], np.float32) # Creates and initializes the weights with 0
        self.hb = tf.zeros([output_size], np.float32) # Creates and initializes the hidden biases with 0
        self.vb = tf.zeros([input_size], np.float32) # Creates and initializes the visible biases with 0

We define methods to provide the forward and backward passes:

    # Forward Pass
    def prob_h_given_v(self, visible, w, hb):
        # Sigmoid 
        return tf.nn.sigmoid(tf.matmul(visible, w) + hb)
    # Backward Pass
    def prob_v_given_h(self, hidden, w, vb):
        return tf.nn.sigmoid(tf.matmul(hidden, tf.transpose(w)) + vb)

We create a function to generate random binary values. This is needed because both hidden and visible units are updated stochastically: each unit fires with a probability determined by its input (bottom-up input for hidden units, top-down input for visible units):

   # Generate the sample probability
    def sample_prob(self, probs):
        return tf.nn.relu(tf.sign(probs - tf.random.uniform(tf.shape(probs))))
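The one-liner above deserves a note: with u drawn uniformly from [0, 1), relu(sign(p − u)) equals 1 exactly when p > u and 0 otherwise, i.e. it is a Bernoulli sample that fires with probability p. A NumPy sketch of the same trick:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random((4, 5))              # firing probabilities
u = rng.random(probs.shape)             # uniform noise, like tf.random.uniform

# relu(sign(p - u)) is 1 where p > u and 0 elsewhere
sample = np.maximum(np.sign(probs - u), 0)

assert np.array_equal(sample, (probs > u).astype(float))
```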

We will need functions to reconstruct the input:

def rbm_reconstruct(self,X):
    h = tf.nn.sigmoid(tf.matmul(X, self.w) + self.hb)
    reconstruct = tf.nn.sigmoid(tf.matmul(h, tf.transpose(self.w)) + self.vb)
    return reconstruct

To train the RBM we define the train() function. It calculates the positive and negative gradient terms of contrastive divergence and uses the weight update equation to update the weights and biases:

# Training method for the model
def train(self, X, epochs=10):
    loss = []
    for epoch in range(epochs):
        # For each step/batch
        for start, end in zip(range(0, len(X), self.batchsize), range(self.batchsize, len(X), self.batchsize)):
            batch = X[start:end]
            # Initialize with sample probabilities
            h0 = self.sample_prob(self.prob_h_given_v(batch, self.w, self.hb))
            v1 = self.sample_prob(self.prob_v_given_h(h0, self.w, self.vb))
            h1 = self.prob_h_given_v(v1, self.w, self.hb)
            # Create the gradients
            positive_grad = tf.matmul(tf.transpose(batch), h0)
            negative_grad = tf.matmul(tf.transpose(v1), h1)
            # Update weights and biases
            self.w = self.w + self.learning_rate * (positive_grad - negative_grad) / tf.dtypes.cast(tf.shape(batch)[0], tf.float32)
            self.vb = self.vb + self.learning_rate * tf.reduce_mean(batch - v1, 0)
            self.hb = self.hb + self.learning_rate * tf.reduce_mean(h0 - h1, 0)
        # Find the error rate
        err = tf.reduce_mean(tf.square(batch - v1))
        print('Epoch: %d' % epoch, 'reconstruction error: %f' % err)
        loss.append(err.numpy())
    return loss

Now that our class is ready, we instantiate an object of RBM and train it on the MNIST dataset:

(train_data, _), (test_data, _) =  tf.keras.datasets.mnist.load_data()
train_data = train_data/np.float32(255)
train_data = np.reshape(train_data, (train_data.shape[0], 784))
test_data = test_data/np.float32(255)
test_data = np.reshape(test_data, (test_data.shape[0], 784))
#Size of inputs is the number of inputs in the training set
input_size = train_data.shape[1]
rbm = RBM(input_size, 200)
err = rbm.train(train_data,50)
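The learning curve shown next can be plotted from the list of reconstruction errors returned by train(). A minimal sketch (the err values below are placeholders standing in for the real training output):

```python
import matplotlib
matplotlib.use('Agg')            # headless backend so no window is needed
import matplotlib.pyplot as plt

# err would be the list returned by rbm.train(); placeholder values shown here
err = [0.084, 0.062, 0.055, 0.051, 0.049]

plt.plot(err)
plt.xlabel('epochs')
plt.ylabel('reconstruction error')
plt.savefig('rbm_learning_curve.png')
```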

In the following graph, you can see the learning curve of our RBM:


Figure 8: Learning curve for the RBM model

And the reconstructed images:

out = rbm.rbm_reconstruct(test_data)
# Plotting original and reconstructed images
row, col = 2, 8
idx = np.random.randint(0, 100, row * col // 2)
f, axarr = plt.subplots(row, col, sharex=True, sharey=True, figsize=(20,4))
for fig, row in zip([test_data,out], axarr):
    for i,ax in zip(idx,row):
        ax.imshow(tf.reshape(fig[i],[28, 28]), cmap='Greys_r')

Figure 9: Image reconstruction using an RBM

What do you think? Are RBMs better than Autoencoders? Try training the RBM on noisy input and see how good it is with reconstructing noisy images.

Deep belief networks

Now that we have a good understanding of RBMs and know how to train them using contrastive divergence, we can move toward the first successful deep neural network architecture, the deep belief network (DBN). DBNs were proposed in 2006 by Hinton and his team in the paper A fast learning algorithm for deep belief nets. Before this model it was very difficult to train deep architectures, not only because of limited computing resources, but also, as discussed in Chapter 9, Autoencoders, because of the vanishing gradient problem. DBNs demonstrated for the first time how deep architectures can be trained via greedy layer-wise training.

In the simplest terms, DBNs are just stacked RBMs. Each RBM is trained separately using contrastive divergence. We start by training the first RBM layer. Once it is trained, we train the second RBM layer: when the input data is fed to the first RBM, the output of its hidden units becomes the input to the visible units of the second RBM. The procedure is repeated with each added RBM layer.

Let us try stacking our RBM class. To build the DBN we need to define one more function in the RBM class, since the output of the hidden units of one RBM needs to be fed to the next RBM:

    #Create expected output for our DBN
    def rbm_output(self, X):
        out = tf.nn.sigmoid(tf.matmul(X, self.w) + self.hb)
        return out

Now we can just use the RBM class to create a stacked RBM structure. In the following code we create an RBM stack: the first RBM will have 500 hidden units, the second will have 200 hidden units, and the third will have 50 hidden units:

RBM_hidden_sizes = [500, 200, 50]  # create 3 layers of RBM with sizes 500, 200, and 50
#Since we are training, set input as training data
inpX = train_data
#Create list to hold our RBMs
rbm_list = []
#Size of inputs is the number of inputs in the training set
input_size = train_data.shape[1]
#For each RBM we want to generate
for i, size in enumerate(RBM_hidden_sizes):
    print ('RBM: ',i,' ',input_size,'->', size)
    rbm_list.append(RBM(input_size, size))
    input_size = size
RBM:  0   784 -> 500
RBM:  1   500 -> 200
RBM:  2   200 -> 50

For the first RBM, the MNIST data is the input. The output of the first RBM is then fed as input to the second RBM, and so on through the consecutive RBM layers. Each RBM is trained on its input before its output is passed on:

#For each RBM in our list
for rbm in rbm_list:
    print ('New RBM:')
    #Train a new one
    rbm.train(inpX)
    #Return the output layer
    inpX = rbm.rbm_output(inpX)

Our DBN is ready. The three stacked RBMs are now trained using unsupervised learning. DBNs can also be used for supervised tasks: to do so, we fine-tune the weights of the trained RBMs and add a fully connected layer at the end.
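A minimal sketch of that supervised fine-tuning, assuming hypothetical trained RBM (weight, hidden-bias) pairs, with random arrays standing in for them here: each RBM initializes a sigmoid Dense layer, and a softmax classification layer is added on top, after which the whole stack can be trained with backpropagation on labeled data.

```python
import numpy as np
import tensorflow as tf

# Hypothetical trained RBM parameters (weight matrix, hidden bias) for a
# 784 -> 500 -> 200 -> 50 stack; random arrays stand in for them here.
rbm_params = [(np.random.randn(784, 500).astype(np.float32) * 0.01, np.zeros(500, np.float32)),
              (np.random.randn(500, 200).astype(np.float32) * 0.01, np.zeros(200, np.float32)),
              (np.random.randn(200, 50).astype(np.float32) * 0.01, np.zeros(50, np.float32))]

model = tf.keras.Sequential()
for w, hb in rbm_params:
    model.add(tf.keras.layers.Dense(w.shape[1], activation='sigmoid'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))  # supervised head
model.build(input_shape=(None, 784))

# Initialize the hidden layers from the trained RBMs
for layer, (w, hb) in zip(model.layers, rbm_params):
    layer.set_weights([w, hb])

# Fine-tune the whole stack with backpropagation on labeled data
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

Calling model.fit(train_data, train_labels) would then adjust all the pretrained weights jointly with the classification head.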

Variational Autoencoders

Like DBNs and GANs, variational autoencoders are also generative models. Variational Autoencoders (VAEs) mix the best of neural networks and Bayesian inference. They are one of the most interesting neural networks and have emerged as one of the most popular approaches to unsupervised learning. They are Autoencoders with a twist. Along with the conventional encoder and decoder networks of Autoencoders (see Chapter 8, Autoencoders), they have additional stochastic layers. The stochastic layer after the encoder network samples the data using a Gaussian distribution, and the one after the decoder network samples the data using a Bernoulli distribution. Like GANs, VAEs can be used to generate images and figures based on the distribution they have been trained on. VAEs allow one to set complex priors in the latent space and thus learn powerful latent representations. The following diagram describes a VAE:

The encoder network qΦ(z|x) approximates the true but intractable posterior distribution p(z|x), where x is the input to the VAE and z is the latent representation. The decoder network takes the d-dimensional latent variables (also called the latent space) as its input and generates new images following the same distribution as P(x). As you can see from the preceding diagram, the latent representation z is sampled from z|x ~ qΦ(z|x), modeled as N(μz|x, Σz|x), and the output of the decoder network samples x|z from the distribution p(x|z).

Now that we have the basic architecture of a VAE, the question arises of how it can be trained, since the maximum likelihood of the training data and the posterior density are intractable. The network is trained by maximizing the lower bound of the log data likelihood. Thus, the loss term consists of two components: generation loss, which is obtained from the decoder network through sampling, and the Kullback–Leibler (KL) divergence term, also called the latent loss.
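The bound in question is the evidence lower bound (ELBO); maximizing it is equivalent to minimizing the two loss terms:

```latex
\log p(x) \ge \mathbb{E}_{q_{\Phi}(z|x)}\big[\log p(x|z)\big] - D_{KL}\big(q_{\Phi}(z|x)\,\|\,p(z)\big)
```

The first (expectation) term gives rise to the generation loss, and the KL term is the latent loss.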

Generation loss ensures that the image generated by the decoder and the image used to train the network are similar, and latent loss ensures that the posterior distribution qΦ(z|x) is close to the prior p(z). Since the encoder uses a Gaussian distribution for sampling, the latent loss measures how closely the latent variables match a unit Gaussian.
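For a Gaussian posterior qΦ(z|x) = N(μ, σ²I) and a unit Gaussian prior N(0, I), the latent loss has a closed form, which is exactly what the training code later computes from mu and log_var:

```latex
D_{KL}\big(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big)
  = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```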

Once the VAE is trained, we can use just the decoder network to generate new images. Let us try coding a VAE. This time we are using the Fashion-MNIST dataset; you learned about this dataset in Chapter 5, Advanced Convolutional Neural Networks. The dataset contains Zalando's article images (https://github.com/zalandoresearch/fashion-mnist). The train-test split is exactly the same as for MNIST, that is, 60,000 training images and 10,000 test images. The size of each image is also 28 × 28, so you can easily replace code running on the MNIST dataset with the Fashion-MNIST dataset. The code in this section has been adapted from https://github.com/dragen1860/TensorFlow-2.x-Tutorials. As a first step, we import all the necessary libraries as usual:

import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt

Let us fix the seeds for the random numbers, so that the results are reproducible. We can also add an assert statement to ensure that our code runs on TensorFlow 2.0 or above:

np.random.seed(25)
tf.random.set_seed(25)
assert tf.__version__.startswith('2.'), "TensorFlow Version Below 2.0"

Before going ahead with making the VAE, let us also explore the Fashion-MNIST dataset a little. The dataset is available in the TensorFlow Keras API:

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train.astype(np.float32)/255., x_test.astype(np.float32)/255.
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
(60000, 28, 28) (60000,)
(10000, 28, 28) (10000,)

Let us view some sample images:

number = 10  # how many digits we will display
plt.figure(figsize=(20, 4))
for index in range(number):
    # display a sample image
    ax = plt.subplot(2, number, index + 1)
    plt.imshow(x_train[index], cmap='gray')

Figure 10: Sample images from the Fashion-MNIST dataset

Before we start, let us declare some hyperparameters like learning rate, dimensions of the hidden layer and the latent space, batch size, epochs, and so on:

image_size = x_train.shape[1]*x_train.shape[2]
hidden_dim = 512
latent_dim = 10
num_epochs = 80
batch_size = 100
learning_rate = 0.001

We use the TensorFlow Keras Model API to build a VAE model. The __init__() function defines all the layers that we will be using:

class VAE(tf.keras.Model):
    def __init__(self, dim, **kwargs):
        h_dim = dim[0]
        z_dim = dim[1]
        super(VAE, self).__init__(**kwargs)
        self.fc1 = tf.keras.layers.Dense(h_dim)  # encoder hidden layer
        self.fc2 = tf.keras.layers.Dense(z_dim)  # mean (mu) of the latent distribution
        self.fc3 = tf.keras.layers.Dense(z_dim)  # log variance of the latent distribution
        self.fc4 = tf.keras.layers.Dense(h_dim)  # decoder hidden layer
        self.fc5 = tf.keras.layers.Dense(image_size)  # decoder output layer

We define the functions to give us the encoder output and decoder output, and to reparameterize. The implementation of the encoder and decoder functions is straightforward; however, we need to delve a little deeper into the reparameterize function. As you know, VAEs sample from a random node z, which is approximated by the posterior qΦ(z|x). Now, to learn the parameters we need to use backpropagation, but backpropagation cannot flow through a random node. Using reparameterization, we introduce a new parameter eps, sampled from a unit Gaussian, and rewrite z as mu + eps * std; the randomness is moved into eps, so z becomes a deterministic function of mu and std through which backpropagation can flow (https://arxiv.org/pdf/1312.6114v10.pdf):

def encode(self, x):
    h = tf.nn.relu(self.fc1(x))
    return self.fc2(h), self.fc3(h)
def reparameterize(self, mu, log_var):
    std = tf.exp(log_var * 0.5)
    eps = tf.random.normal(std.shape)
    return mu + eps * std
def decode_logits(self, z):
    h = tf.nn.relu(self.fc4(z))
    return self.fc5(h)
def decode(self, z):
    return tf.nn.sigmoid(self.decode_logits(z))
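A tiny numeric illustration of the trick in plain NumPy (values chosen for illustration): because eps is sampled outside the computation graph, z = mu + eps * std is deterministic given mu and std, yet still has the desired distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, log_var = 0.5, np.log(0.25)     # example encoder outputs
std = np.exp(0.5 * log_var)         # std = 0.5
eps = rng.standard_normal(100000)   # noise sampled outside the graph
z = mu + eps * std                  # differentiable w.r.t. mu and std

# Empirically z ~ N(0.5, 0.25): mean close to 0.5, std close to 0.5
print(z.mean(), z.std())
```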

Lastly, we define the call() function, which will control how signals move through different layers of the VAE:

def call(self, inputs, training=None, mask=None):
    mu, log_var = self.encode(inputs)
    z = self.reparameterize(mu, log_var)
    x_reconstructed_logits = self.decode_logits(z)
    return x_reconstructed_logits, mu, log_var

Now we create the VAE model and declare the optimizer for it. You can see the summary of the model:

model = VAE([hidden_dim, latent_dim])
model.build(input_shape=(4, image_size))
model.summary()
optimizer = tf.keras.optimizers.Adam(learning_rate)

Figure 11: Summary of the VAE model

Now we train the model. We define our loss function, which is the sum of the reconstruction loss and KL divergence loss:

dataset = tf.data.Dataset.from_tensor_slices(x_train)
dataset = dataset.shuffle(batch_size * 5).batch(batch_size)
num_batches = x_train.shape[0] // batch_size
for epoch in range(num_epochs):
    for step, x in enumerate(dataset):
        x = tf.reshape(x, [-1, image_size])
        with tf.GradientTape() as tape:
            # Forward pass
            x_reconstruction_logits, mu, log_var = model(x)
            # Compute reconstruction loss and kl divergence
            # Sum the per-pixel loss and average over the batch
            reconstruction_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=x_reconstruction_logits)
            reconstruction_loss = tf.reduce_sum(reconstruction_loss) / batch_size
            kl_div = - 0.5 * tf.reduce_sum(1. + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
            kl_div = tf.reduce_mean(kl_div)
            # Backprop and optimize
            loss = tf.reduce_mean(reconstruction_loss) + kl_div
        gradients = tape.gradient(loss, model.trainable_variables)
        # Clip gradients to stabilize training; clip_by_norm returns new tensors
        gradients = [tf.clip_by_norm(g, 15) for g in gradients]
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        if (step + 1) % 50 == 0:
            print("Epoch[{}/{}], Step [{}/{}], Reconst Loss: {:.4f}, KL Div: {:.4f}"
            .format(epoch + 1, num_epochs, step + 1, num_batches, float(reconstruction_loss), float(kl_div)))

Once the model is trained it should be able to generate images similar to the original Fashion-MNIST images. To do so we need to use only the decoder network and we will pass to it a randomly generated z input:

z = tf.random.normal((batch_size, latent_dim))
out = model.decode(z)  # decode with sigmoid
out = tf.reshape(out, [-1, 28, 28]).numpy() * 255
out = out.astype(np.uint8)
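The generated batch can then be displayed in a grid. A minimal sketch, with random uint8 arrays standing in for the decoder output out computed above:

```python
import numpy as np
from matplotlib import pyplot as plt

# Random images standing in for the decoder output `out`
# of shape [batch_size, 28, 28]
out = np.random.randint(0, 256, (100, 28, 28), dtype=np.uint8)

plt.figure(figsize=(8, 8))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.imshow(out[i], cmap='gray')
    plt.axis('off')
```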

In the following figure, you can see the result after 80 epochs; the generated images resemble the input space:

Figure 12: Results after 80 epochs


Summary

This chapter covered the major unsupervised learning algorithms. We went through algorithms best suited for dimensionality reduction, clustering, and image reconstruction. We started with the dimensionality reduction algorithm PCA, then performed clustering using k-means and self-organizing maps. After this we studied the restricted Boltzmann machine and saw how we can use it for both dimensionality reduction and image reconstruction. Next the chapter delved into stacked RBMs, that is, deep belief networks, and we trained a DBN consisting of three RBM layers on the MNIST dataset. Lastly, we learned about variational autoencoders, which, like GANs, can generate images after learning the distribution of the input sample space.

This chapter, along with chapters 6 and 9, covered models that were trained using unsupervised learning. In the next chapter, we move on to another learning paradigm: reinforcement learning.


References

  1. https://arxiv.org/abs/1404.1100
  2. http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
  3. http://mplab.ucsd.edu/tutorials/pca.pdf
  4. http://projector.tensorflow.org/
  5. http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
  6. https://www.deeplearningbook.org
  7. Kanungo, Tapas, et al. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE transactions on pattern analysis and machine intelligence 24.7 (2002): 881-892.
  8. Ortega, Joaquín Pérez, et al. Research issues on K-means Algorithm: An Experimental Trial Using Matlab. CEUR Workshop Proceedings: Semantic Web and New Technologies.
  9. A Tutorial on Clustering Algorithms, http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html.
  10. Chen, Ke. On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications. SIAM Journal on Computing 39.3 (2009): 923-947.
  11. https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set.
  12. Least Squares Quantization in PCM, Stuart P. Lloyd (1982), http://www-evasion.imag.fr/people/Franck.Hetroy/Teaching/ProjetsImage/2007/Bib/lloyd-1982.pdf
  13. Dunn, J. C. (1973-01-01). A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics. 3(3): 32–57.
  14. Bezdek, James C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms.
  15. Peters, Georg, Fernando Crespo, Pawan Lingras, and Richard Weber. Soft clustering–Fuzzy and rough approaches and their extensions and derivatives. International Journal of Approximate Reasoning 54, no. 2 (2013): 307-322.
  16. Sculley, David. Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web, pp. 1177-1178. ACM, 2010.
  17. Smolensky, Paul. Information Processing in Dynamical Systems: Foundations of Harmony Theory. No. CU-CS-321-86. COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, 1986.
  18. Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann Machines for Collaborative Filtering. Proceedings of the 24th international conference on Machine learning. ACM, 2007.
  19. Hinton, Geoffrey. A Practical Guide to Training Restricted Boltzmann Machines. Momentum 9.1 (2010): 926.
  20. http://deeplearning.net/tutorial/rbm.html