Tensor Processing Unit – Deep Learning with TensorFlow 2 and Keras – Second Edition

16

Tensor Processing Unit

This chapter introduces the Tensor Processing Unit (TPU), a special chip developed at Google for ultra-fast execution of neural network mathematical operations. As with Graphics Processing Units (GPUs), the idea is to have a special processor focusing only on very fast matrix operations, with no support for all the other operations normally supported by Central Processing Units (CPUs). However, the additional improvement with TPUs is that they remove from the chip any hardware support for the graphics operations normally present in GPUs (rasterization, texture mapping, frame buffer operations, and so on). Think of a TPU as a special-purpose co-processor specialized for deep learning, focused on matrix or tensor operations. In this chapter we are going to compare CPUs and GPUs with the three generations of TPUs and the Edge TPU. All these accelerators are available as of November 2019. The chapter includes code examples of using TPUs. So with that, let's begin.

C/G/T processing units

In this section we discuss CPUs, GPUs, and TPUs. Before discussing TPUs, it will be useful for us to review CPUs and GPUs.

CPUs and GPUs

You are probably somewhat familiar with the concept of a CPU, a general-purpose chip sitting in each computer, tablet, and smartphone. CPUs are in charge of all of the computations: from logical controls, to arithmetic, to register operations, to operations with memory, and much more. CPUs are subject to the well-known Moore's law [1], which states that the number of transistors in a dense integrated circuit doubles about every two years.

Many people believe that we are currently in an era where this trend cannot be sustained for long, and indeed it has already declined during the past few years. Therefore, we need some additional technology if we want to support the demand for faster and faster computation to process the ever-growing amount of data that is available out there.

One improvement came from so-called GPUs: special-purpose chips that are perfect for fast graphics operations such as matrix multiplication, rasterization, frame buffer manipulation, texture mapping, and many others. In addition to computer graphics, where matrix multiplications are applied to the pixels of images, GPUs also turned out to be a great match for deep learning. This is a funny story of serendipity: a great example of a technology created for one goal and then meeting staggering success in a domain completely unrelated to the one it was originally envisioned for.

Serendipity is the occurrence and development of events by chance in a happy or beneficial way.

TPUs

One problem encountered in using GPUs for deep learning is that these chips are made for graphics and gaming, not just for fast matrix computations. This is of course to be expected, given that the G in GPU stands for Graphics! GPUs led to unbelievable improvements for deep learning but, in the case of tensor operations for neural networks, large parts of the chip are not used at all. For deep learning, there is no need for rasterization, no need for frame buffer manipulation, and no need for texture mapping. The only thing that is necessary is a very efficient way to compute matrix and tensor operations. It should be no surprise that GPUs are not necessarily the ideal solution for deep learning, since CPUs and GPUs were designed long before deep learning became successful.

Before going into the technical details, let's first discuss the fascinating genesis of Tensor Processing Unit version 1, or TPU v1. In 2013, Jeff Dean, then head of the Google Brain division, estimated (see Figure 1) that if all the people owning a mobile phone were to talk to it for just three minutes more per day, then Google would have needed two to three times more servers to process this data. This would have been an unaffordable case of success-disaster, that is, a situation where great success leads to problems that cannot be properly managed.

It was clear that neither CPUs nor GPUs were a suitable solution. So, Google decided that they needed something completely new; something that would allow a 10x growth in performance with no significant cost increase. That's how TPU v1 was born! What is impressive is that it took only 15 months from initial design to production. You can find more details about this story in Jouppi et al. [3], which also gives a detailed analysis of the different inference workloads seen at Google in 2013:

Figure 1: Different inference workloads seen at Google in 2013 (source [3])

Let's talk a bit about the technical details. TPU v1 is a special device (or an Application-Specific Integrated Circuit, also known as an ASIC) designed for super-efficient tensor operations. TPUs follow the philosophy of "less is more". This philosophy has an important consequence: TPUs do not have all the graphics components that are needed by GPUs. Because of this, they are both very efficient from an energy consumption perspective, and frequently much faster than GPUs. So far, there have been three generations of TPUs. Let's review them.

Three generations of TPUs and Edge TPU

As discussed, TPUs are domain-specific processors expressly optimized for matrix operations. Now, you might remember that the basic operation of a matrix multiplication is a dot product between a row of one matrix and a column of the other matrix. For instance, given a matrix multiplication Y=X*W, computing Y[i,0] is:

Y[i,0] = X[i,0]*W[0,0] + X[i,1]*W[1,0] + ... + X[i,n-1]*W[n-1,0]

The sequential implementation of this operation is time consuming for large matrices. A brute-force computation has a time complexity of O(n³) for n × n matrices, so it's not feasible for running large computations.
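
To make the cost concrete, here is a minimal Python sketch of the brute-force triple loop (an illustration of ours; NumPy is used only for array handling):

import numpy as np

def matmul_brute_force(X, W):
    # Naive O(n^3) matrix multiplication: every output element Y[i, j]
    # is the dot product of row i of X with column j of W.
    n, k = X.shape
    k2, m = W.shape
    assert k == k2, "inner dimensions must match"
    Y = np.zeros((n, m), dtype=np.float32)
    for i in range(n):
        for j in range(m):
            for p in range(k):
                Y[i, j] += X[i, p] * W[p, j]
    return Y

X = np.random.rand(4, 3).astype(np.float32)
W = np.random.rand(3, 2).astype(np.float32)
assert np.allclose(matmul_brute_force(X, W), X @ W, atol=1e-5)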

First-generation TPU

The first-generation TPU (TPU v1) was announced in May 2016 at Google I/O. TPU v1 [3] supports matrix multiplication using 8-bit arithmetic. TPU v1 is specialized for deep learning inference, but it does not work for training. For training there is a need to perform floating-point operations, as discussed in the following paragraphs.

A key function of TPU v1 is the "systolic" matrix multiplication. Let's see what this means. Remember that the core of deep learning is a matrix product Y=X*W, where, for instance, the basic operation to compute Y[i,0] is:

Y[i,0] = X[i,0]*W[0,0] + X[i,1]*W[1,0] + ... + X[i,n-1]*W[n-1,0]

"Systolic" matrix multiplication allows multiple Y[i, j] values to be computed in parallel. Data flows in a coordinated manner and, indeed, in medicine the term "systolic" refers to heart contractions and how blood flows rhythmically in our veins. Here systolic refers to the data flow that pulses inside the TPU. It can be proven that a systolic multiplication algorithm is less expensive than the brute force one [2]. TPU v1 has a Matrix Multiply Unit (MMU) running systolic multiplications on 256×256 cores so that 64l,000 multiplications can be computed in parallel in one single shot. In addition, TPU v1 sits in a rack and it is not directly accessible. Instead, a CPU acts as the host controlling data transfer and sending commands to the TPU for performing tensor multiplications, for computing convolutions, and for applying activation functions.

The communication CPU ↔ TPU v1 happens via a standard PCIe 3.0 bus. From this perspective, TPU v1 is closer in spirit to a floating-point unit (FPU) coprocessor than it is to a GPU. However, TPU v1 has the ability to run whole inference models, reducing its dependence on the host CPU. Figure 2 represents TPU v1, as shown in [3]. As you can see in the figure, the processing unit is connected via a PCIe port, and it fetches weights from standard DDR3 DRAM chips. Multiplication happens within the MMU with systolic processing, and activation functions are then applied to the results in a dedicated area of the chip. The MMU and the unified buffer for activations take up a large amount of the die:

Figure 2: TPU v1 design schema (source [3])

TPU v1 is manufactured on a 28 nm process with a die size ≤ 331 mm², a clock speed of 700 MHz, 28 MB of on-chip memory, 4 MB of 32-bit accumulators, and a 256×256 systolic array of 8-bit multipliers. Since each multiplier performs a multiply-accumulate (two operations) per cycle, we get 700 MHz × 65,536 multipliers × 2 ≈ 92 tera-operations per second. This is amazing performance for matrix multiplications. In addition, TPU v1 has 8 GB of dual-channel 2133 MHz DDR3 SDRAM offering 34 GB/s of bandwidth. This external memory is standard, and it is used to store and fetch the weights used during inference. Notice also that TPU v1 has a thermal design power of 28-40 Watts, which is certainly low consumption compared to GPUs and CPUs. Moreover, TPU v1 boards fit into the slots normally used for SATA disks, so they do not require any modification to the host server [3]. Up to 4 cards can be mounted in each server. Figure 3 shows the TPU v1 card and the flow of data during the systolic matrix multiplication performed by the MMU:

Figure 3: On the left you can see a TPU v1 board, and on the right an example of how the data is processed during the systolic computation

If you want to have a look at TPU performance compared to GPUs and CPUs, we can refer to Jouppi et al. [3] and see (in a log-log scale graph) that the performance is two orders of magnitude higher than that of a Tesla K80 GPU.

The graph shows a "roofline" of performance, which grows until it reaches its peak and then stays constant. The higher the roofline, the better the performance.

Figure 4: TPU v1 peak performance can be up to 3x higher than a Tesla K80

Second-generation TPU

The second-generation TPUs (TPU2) were announced in 2017. In this case, the memory bandwidth is increased to 600 GB/s and performance reaches 45 TFLOPS. 4 TPU2s are arranged in a module with 180 TFLOPS performance. Then 64 modules are grouped into a pod with 11.5 PFLOPS of performance. TPU2s adopt floating-point arithmetic and therefore they are suitable for both training and inference.

TPU v2 has an MMU performing matrix multiplications on a 128×128 array, and a Vector Processing Unit (VPU) for all other tasks, such as applying activations. The VPU handles float32 and int32 computations, while the MMU operates in a mixed-precision 16-32-bit floating-point format.

Each TPU v2 chip has two cores, and up to 4 chips are mounted in each board. In TPU v2, Google adopted a new floating-point format called bfloat16. The idea is to sacrifice some resolution while still being very good for deep learning. This reduction in resolution improves the performance of the v2 TPUs, which are also more power-efficient than v1. Indeed, it can be proven that a smaller mantissa helps reduce the physical silicon area and the multiplier power. bfloat16 uses the same 8-bit exponent as the standard IEEE 754 single-precision floating-point format, but truncates the mantissa field from 23 bits to just 7 bits. Preserving the exponent bits allows the format to keep the same range as 32-bit single precision, which also allows for relatively simple conversion between the two data types.
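
To see what the truncation means in practice, here is a small sketch (an illustration of ours, not the TPU's conversion routine, and using truncation rather than the rounding a real converter may apply) that keeps only the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a float32 value:

import numpy as np

def truncate_to_bfloat16(values):
    # View the float32 bits as uint32 and zero the low 16 mantissa bits:
    # what remains is the bfloat16 layout (1 sign + 8 exponent + 7 mantissa
    # bits), re-expanded to float32 so we can print the rounded value.
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

print(truncate_to_bfloat16([3.14159265, 0.1, 12345.678]))
# The values keep their order of magnitude (same exponent range as
# float32) but lose precision in the last decimal digits.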

Figure 5: Cloud TPU v3 and Cloud TPU v2

Google offers access to TPU v2 and TPU v3 devices via Google Compute Engine (GCE) and Google Kubernetes Engine (GKE). In addition, it is possible to use them for free via Colab.

Third-generation TPU

The third-generation TPUs (TPU3) were announced in 2018 [4]. TPU3s are 2x faster than TPU2 and they are grouped in 4x larger pods. In total, this is a performance increase of 8x. Cloud TPU v3 pods can deliver more than 100 PetaFLOPS of computing power.

On the other hand, Cloud TPU v2 pods released in alpha in 2018 can achieve 11.5 PetaFLOPS; another impressive improvement. As of 2019 both TPU2 and TPU3 are in production with different prices:

Figure 6: Google announced TPU v2 and v3 Pods in beta at the Google I/O 2019

A TPU v3 board has 4 TPU chips, 8 cores, and liquid cooling. Google has adopted ultra-high-speed interconnect hardware derived from supercomputer technology to connect thousands of TPUs with very low latency. Each time a parameter is updated on a single TPU, all the others are informed via an all-reduce algorithm typically adopted for parallel computation. So, you can think of a TPU v3 pod as one of the fastest supercomputers available today for matrix and tensor operations, with thousands of TPUs inside it.

Edge TPU

In addition to the three generations of TPUs already discussed, in 2018 Google announced a special generation of TPUs running on the edge. This TPU is particularly appropriate for Internet of Things (IoT) and for supporting TensorFlow Lite on mobile and IoT. With this we conclude the introduction to TPU v1, v2, and v3. In the next section we will briefly discuss performance.

TPU performance

Discussing performance is always difficult because it is important to first define the metrics that we are going to measure, and the set of workloads that we are going to use as benchmarks. For instance, Google reported an impressive linear scaling for TPU v2 used with ResNet-50 [4] (see Figure 7).

Figure 7: Linear scalability in the number of TPUs v2 when increasing the number of images

In addition, you can find online a comparison of ResNet-50 [4] where a Full Cloud TPU v2 Pod is >200x faster than a V100 Nvidia Tesla GPU for ResNet-50 training:

Figure 8: A Full Cloud TPU v2 Pod is >200x faster than a V100 Nvidia Tesla GPU for training a ResNet-50 model

In December 2018, the MLPerf initiative was announced. MLPerf [5] is a broad ML benchmark suite created by a large set of companies. The goal is to measure the performance of ML frameworks, ML accelerators, and ML cloud platforms.

How to use TPUs with Colab

In this section, we show how to use TPUs with Colab. Just point your browser to https://colab.research.google.com/ and change the runtime type from the Runtime menu, as shown in Figure 9:

Figure 9: Setting TPU as runtime in Colab

Checking whether TPUs are available

First of all, let's check if there is a TPU available by using this simple code fragment, which returns the IP address assigned to the TPU. Communication between the CPU and the TPU happens via gRPC:

import os
try:
    device_name = os.environ['COLAB_TPU_ADDR']
    TPU_ADDRESS = 'grpc://' + device_name
    print('Found TPU at: {}'.format(TPU_ADDRESS))
except KeyError:
    print('TPU not found')
Found TPU at: grpc://10.91.166.82:8470

We've confirmed that a TPU is available! Now, we'll continue to explore how we can make use of it.

Loading data with tf.data

Our goal is to implement a simple CNN on MNIST data (see Chapter 4, Convolutional Neural Networks). Then we want to run the model on a TPU. To do this, we must load the data with the tf.data API. Hence, we need to define a training and a test input function (see Chapter 2, TensorFlow 1.x and 2.x), as shown in the following code:

# training input function
def train_input_fn(batch_size=1024):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((x_train,y_train))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.cache() # Loads the data into memory 
    dataset = dataset.shuffle(1000, reshuffle_each_iteration=True)
    dataset = dataset.repeat() 
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset
# testing input function
def test_input_fn(batch_size=1024):
    dataset = tf.data.Dataset.from_tensor_slices((x_test,y_test))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.cache()
    dataset = dataset.shuffle(1000, reshuffle_each_iteration=True)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset

Here, (x_train, y_train), (x_test, y_test) = mnist.load_data(). Note that drop_remainder=True is an important parameter that forces the batch method to produce the fixed shapes expected by the TPUs. Note that TPU v2 has an MMU with 128 × 128 multipliers. Usually, you get the best performance by setting the batch size to 128 per TPU core; with the 8 cores of a Cloud TPU device, for instance, this gives the batch size of 1,024 used above.
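
The loading and preprocessing of MNIST is not shown in the snippet above; a minimal sketch consistent with the shapes reported in the training log below ((1024, 28, 28, 1) float32 images and (1024, 10) one-hot labels) could look like this:

import numpy as np
import tensorflow as tf

num_classes = 10
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Reshape to (28, 28, 1) float32 images normalized to [0, 1].
x_train = x_train.reshape(-1, 28, 28, 1).astype(np.float32) / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype(np.float32) / 255.0
# One-hot encode the labels to match the categorical_crossentropy loss.
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)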

Building a model and loading it into the TPU

As of November 2019, TensorFlow 2.0 does not fully support TPUs. TPUs are supported in TensorFlow 1.x (1.15.0 at the time of writing) and in the nightly builds of 2.x. Let's first see an example with TensorFlow 1.x; the example with the nightly build will be shown later.

Note that full support for TPUDistributionStrategy is planned for TensorFlow 2.1; 2.0 has limited support, and the issue is tracked at https://github.com/tensorflow/tensorflow/issues/24412.

So, let's define a standard CNN model made up of three convolutional layers, alternated with max-pooling layers and followed by two dense layers with a dropout in the middle. For the sake of brevity, the definitions of input_shape and batch_size are omitted. In this case, we use the functional tf.keras API (see Chapter 2, TensorFlow 1.x and 2.x):

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

Inp = tf.keras.Input(name='input', shape=input_shape, batch_size=batch_size, dtype=tf.float32)
x = Conv2D(32, kernel_size=(3, 3), activation='relu', name='Conv_01')(Inp)
x = MaxPooling2D(pool_size=(2, 2), name='MaxPool_01')(x)
x = Conv2D(64, (3, 3), activation='relu', name='Conv_02')(x)
x = MaxPooling2D(pool_size=(2, 2), name='MaxPool_02')(x)
x = Conv2D(64, (3, 3), activation='relu', name='Conv_03')(x)
x = Flatten(name='Flatten_01')(x)
x = Dense(64, activation='relu', name='Dense_01')(x)
x = Dropout(0.5, name='Dropout_02')(x)
output = Dense(num_classes, activation='softmax', name='Dense_02')(x)
model = tf.keras.Model(inputs=[Inp], outputs=[output])

Let's now use the Adam optimizer and compile the model:

#Use a tf optimizer rather than a Keras one for now
opt = tf.train.AdamOptimizer(learning_rate)
model.compile(
      optimizer=opt,
      loss='categorical_crossentropy',
      metrics=['acc'])

Then, we call tf.contrib.tpu.keras_to_tpu_model to convert the model into a TPU model, using a TPUDistributionStrategy for running on the TPU. It's as simple as that: we just need to provide the appropriate strategy with TPUDistributionStrategy(), and all the rest is done transparently on our behalf:

tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)))
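
The training call itself is not reproduced here. A plausible invocation, assuming the converted model accepts the dataset-returning input functions defined earlier and matching the 60 steps per epoch visible in the log below (about 60,000 MNIST images / 1,024 images per batch), might be:

# Hypothetical training call: the exact arguments accepted by the
# tf.contrib.tpu Keras model depend on the TensorFlow 1.x version in use.
history = tpu_model.fit(
    train_input_fn,       # function returning the tf.data training dataset
    steps_per_epoch=60,   # ~60,000 MNIST images / 1,024 images per batch
    epochs=10)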

The execution is super-fast on a TPU: after the first epoch, which includes the compilation step, each epoch takes only 2-3 seconds:

Epoch 1/10
INFO:tensorflow:New input shapes; (re-)compiling: mode=train (# of cores 8), [TensorSpec(shape=(1024,), dtype=tf.int32, name=None), TensorSpec(shape=(1024, 28, 28, 1), dtype=tf.float32, name=None), TensorSpec(shape=(1024, 10), dtype=tf.float32, name=None)]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for input
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 2.567350149154663 secs
INFO:tensorflow:Setting weights on TPU model.
60/60 [==============================] - 8s 126ms/step - loss: 0.9622 - acc: 0.6921
Epoch 2/10
60/60 [==============================] - 2s 41ms/step - loss: 0.2406 - acc: 0.9292
Epoch 3/10
60/60 [==============================] - 3s 42ms/step - loss: 0.1412 - acc: 0.9594
Epoch 4/10
60/60 [==============================] - 3s 42ms/step - loss: 0.1048 - acc: 0.9701
Epoch 5/10
60/60 [==============================] - 3s 42ms/step - loss: 0.0852 - acc: 0.9756
Epoch 6/10
60/60 [==============================] - 3s 42ms/step - loss: 0.0706 - acc: 0.9798
Epoch 7/10
60/60 [==============================] - 3s 42ms/step - loss: 0.0608 - acc: 0.9825
Epoch 8/10
60/60 [==============================] - 3s 42ms/step - loss: 0.0530 - acc: 0.9846
Epoch 9/10
60/60 [==============================] - 3s 42ms/step - loss: 0.0474 - acc: 0.9863
Epoch 10/10
60/60 [==============================] - 3s 42ms/step - loss: 0.0418 - acc: 0.9876
<tensorflow.python.keras.callbacks.History at 0x7fbb3819bc50>

As you can see, running a simple MNIST model on TPUs is extremely fast: each epoch takes around 2-3 seconds, even though our CNN has three convolutional layers followed by two dense layers.

Using pretrained TPU models

Google offers a collection of models pretrained with TPUs, available in the GitHub tensorflow/tpu repo (https://github.com/tensorflow/tpu). Models include image recognition, object detection, low-resource models, machine translation and language models, speech recognition, and image generation. Whenever possible, my suggestion is to start with a pretrained model [6], and then fine-tune it or apply some form of transfer learning. As of September 2019, the following models are available:

Image Recognition, Segmentation, and more

  • Image Recognition: AmoebaNet-D, ResNet-50/101/152/200, Inception v2/v3/v4
  • Object Detection: RetinaNet, Mask R-CNN
  • Image Segmentation: Mask R-CNN, DeepLab, RetinaNet
  • Low-Resource Models: MnasNet, MobileNet, SqueezeNet

Machine Translation and Language Models

  • Machine Translation (transformer-based)
  • Sentiment Analysis (transformer-based)
  • Question Answering: BERT

Speech Recognition

  • ASR Transformer

Image Generation

  • Image Transformer
  • DCGAN
  • GAN

Table 1: State-of-the-art collection of models pretrained with TPUs available on GitHub

The best way to play with the repository is to clone it on Google Cloud Console and use the environment available at https://github.com/tensorflow/tpu/blob/master/README.md.

You should be able to browse what is shown in Figure 10. If you click the OPEN IN GOOGLE CLOUD SHELL button, the system will clone the Git repo into your Cloud Shell and then open the shell (see Figure 11). From there, you can play with a nice Google Cloud TPU demo for training a ResNet-50 on MNIST with a TPU Flock, that is, a Compute Engine VM and Cloud TPU pair (see Figure 12). I will leave this training demo for you to explore, if you are interested:

Figure 10: State-of-the-art collection of models pretrained with TPUs available on GitHub

Figure 11: Google Cloud Shell with tpu git repo cloned on your behalf

Figure 12: Google Cloud TPU demo for training a ResNet-50 on MNIST with a TPU Flock

Using TensorFlow 2.1 and nightly build

As of November 2019, you can get full TPU support only with the latest TensorFlow 2.x nightly build. If you use the Google Cloud Console (https://console.cloud.google.com/), you can get the latest nightly build: just go to Compute Engine | TPUs | CREATE TPU NODE; the version selector has a "nightly-2.x" option. Martin Görner has a nice demo at http://bit.ly/keras-tpu-tf21 (see Figure 13), which classifies images of flowers:

Figure 13: Martin Görner on Twitter on Full Keras/TPU support

Note that both regular Keras training with model.fit() and distributed custom training loops are supported. You can refer to http://bit.ly/keras-tpu-tf21. Let's look at the most important parts of the code related to TPUs. First of all, the imports:

import re
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt
print("Tensorflow version " + tf.__version__)

Then the detection of the TPUs, and the selection of TPU strategy:

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    tpu = None
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()
print("REPLICAS: ", strategy.num_replicas_in_sync)

Then the model is built and compiled within the scope of the appropriate strategy:

with strategy.scope():
    model = create_model()    
    model.compile(optimizer=tf.keras.optimizers.SGD(nesterov=True, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
    model.summary()
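
create_model() is defined in Martin Görner's notebook and is not reproduced here. As a purely illustrative placeholder (the 192×192×3 input size and the five flower classes are assumptions of this sketch, not the notebook's actual architecture), it could be something like:

def create_model():
    # Illustrative placeholder only: the real notebook builds a larger
    # model for the flower classification task.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu',
                               input_shape=(192, 192, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(5, activation='softmax')])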

In short, it is extremely simple to use TPUs with the upcoming TensorFlow 2.1, and if you want to experiment immediately you can use the TensorFlow 2.x nightly build. Martin reports typical run times for his specific model:

  • GPU (V100): 15s per epoch
  • TPU v3-8 (8 cores): 5s per epoch
  • TPU pod v2-32 (32 cores): 2s per epoch

Summary

TPUs are very special ASIC chips developed at Google for executing neural network mathematical operations in an ultra-fast manner. The core of the computation is a systolic multiplier that computes multiple dot products (row * column) in parallel, thus accelerating the computation of basic deep learning operations. Think of a TPU as a special-purpose coprocessor for deep learning, focused on matrix or tensor operations. Google has announced three generations of TPUs so far, plus an additional Edge TPU for IoT. Cloud TPU v1 is a PCIe-based specialized co-processor delivering 92 tera-operations per second with 8-bit arithmetic, for inference only. Cloud TPU v2 achieves 180 TeraFLOPS, and it supports both training and inference. Cloud TPU v2 pods, released in alpha in 2018, can achieve 11.5 PetaFLOPS. Cloud TPU v3 achieves 420 TeraFLOPS, with both training and inference support. Cloud TPU v3 pods can deliver more than 100 PetaFLOPS of computing power. That's a world-class supercomputer for tensor operations!

References

  1. Moore's law https://en.wikipedia.org/wiki/Moore%27s_law.
  2. Forty-three ways of systolic matrix multiplication, I.Ž. Milovanović et al., International Journal of Computer Mathematics, 87(6):1264-1276, May 2010.
  3. In-Datacenter Performance Analysis of a Tensor Processing Unit, Norman P. Jouppi et al., 44th International Symposium on Computer Architecture (ISCA), June 2017.
  4. Google TPU v2 performance https://storage.googleapis.com/nexttpu/index.html.
  5. MLPerf site https://mlperf.org/.
  6. Collection of models pretrained with TPU g.co/cloudtpu.