TensorFlow and Cloud – Deep Learning with TensorFlow 2 and Keras – Second Edition


TensorFlow and Cloud

AI algorithms require extensive computing resources. With the availability of a large number of cloud platforms offering their services at competitive prices, cloud computing offers a cost-effective solution. In this chapter, we will talk about three main cloud platform providers that occupy the majority of the market share: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. Moreover, once you have trained your model on cloud, you can use TensorFlow Extended (TFX) to move your model to production. The chapter will cover:

  • Creating and using virtual machines on cloud
  • Creating and training directly on Jupyter Notebook on cloud
  • Deploying the model on cloud
  • Using TFX for production
  • TensorFlow Enterprise

Deep learning on cloud

There was a time when working in the field of deep learning meant shelling out thousands of dollars for the infrastructure required to train your models. Not anymore! Today, a large number of public cloud service providers offer affordable cloud computing services. Training your Deep Learning (DL) model on cloud offers various advantages:

  • Affordability: Most cloud service providers offer a range of subscription options; you can choose from monthly subscriptions to pay-as-you-use options. Most also offer free credit for new users.
  • Flexibility: You are no longer bound to a physical location; you can log in to the cloud from any physical location and continue your work.
  • Scalability: As your need grows, you can scale your cloud resources as simply as requesting a quota increase or a change in subscription model.
  • Hassle-free: Unlike your personal system, where everything from choice of hardware to the installation of software dependencies is your responsibility, the cloud services offer ready-made solutions in the form of premade system images. The images come installed with all the packages you might need for training your deep learning model.
  • Language support: All services support a variety of computer languages. You can write your code in your favorite language.
  • API for deployment: Most cloud services also allow you to embed your deep learning model directly into the applications and on the web.

Depending on the services offered, the cloud platform can be classified as:

  • Infrastructure as a Service (IaaS): In this case, the service provider provides only the physical infrastructure, such as virtual machines and data storage centers.
  • Platform as a Service (PaaS): Here, the service provider provides a runtime environment, both hardware and software, for the development and deployment of applications. For example, web servers and data centers.
  • Software as a Service (SaaS): Here, the service provider provides a software application as a service, for example Microsoft Office 365 or the interactive Jupyter notebooks available on cloud.

Before delving into details about how to use different cloud services, let us go through some popular cloud service providers and their offerings. We will be considering Microsoft Azure, AWS, and Google Cloud platforms. All three of them provide facilities to build, deploy, and manage applications; additionally, they can provide services over the World Wide Web.

Microsoft Azure

Microsoft Azure provides IaaS, PaaS, and SaaS services. The Azure platform provides a myriad of services: virtual machines, networking, storage, and even IoT solutions. To access these services, you need to open an account with Azure, for which you will require an email address. Go to https://azure.microsoft.com/en-in/ to open your account. The Azure platform also provides integration with GitHub, so if you already have a GitHub account you can use it to log in. Once you have successfully created the account and have logged in, you will see the following dashboard:

Figure 1: The Microsoft Azure dashboard

For new users, Azure offers a $200 credit, and a range of free services. For paid services, you can choose from a range (called "subscriptions" on the Azure platform) of monthly payment plans or "pay-as-you-go" options. You will need to provide your credit card information to access paid services. Some of the popular services offered by Azure are:

  • Virtual Machines: You can create your own machines on the net, with up to 128 virtual CPUs (vCPUs) and up to 6 TB of memory. There is a wide range of virtual machine series offered by Azure. Since the topic of this book is deep learning, we will limit ourselves to N-series virtual machines offered by Azure, which are sufficient for our needs. In a later section, Virtual machines on cloud, you will learn the features of N-series and how to deploy one. To see the complete range of virtual machines offered please refer to: https://azure.microsoft.com/en-in/pricing/details/virtual-machines/series/.
  • Functions: Provides a serverless architecture; you need not worry about hardware or networking. You just deploy your code, and Functions takes care of the rest.
  • Storage Services: Azure provides Blob storage to store any type of data in the cloud. The data stored on Blob can be used for content distribution, backup, and big data analytics.
  • IoT Hub: Provides a central communication service to communicate between your IoT devices and code. Using IoT Hub, you can connect virtually any device to the cloud.
  • Azure DevOps: Provides an integrated set of features that allow you to collaborate with a team. You can create a work plan, work together on code, develop and deploy applications, and implement continuous integration and deployment.

Amazon Web Services (AWS)

In 2006, Amazon started offering its infrastructure on cloud to businesses, under the name Amazon Web Services. AWS offers a wide range of global cloud-based products. They include compute instances, storage services, databases, analytics, networking, mobile and developer tools, IoT, management tools, security, and enterprise applications. These services are available on-demand with pay-as-you-go pricing options or monthly subscriptions. There are over 140 AWS services offering data warehousing, directories, deployment tools, and content delivery, to name a few.

Before using AWS, you need to open an account. If you have an existing account, you can log in with it; otherwise, visit http://aws.amazon.com and click on Create an AWS account to create a new account, as seen in the following screenshot:

Figure 2: The "Create an AWS Account" tab.

The account can be created for free, and many of the services are available under the "free basic plan". To learn details about the free offerings you can visit: http://aws.amazon.com/free. Even though you may choose a free account and free services, the portal requires you to give credit/debit card details for verification purposes. Once you log in, you are led to a management console. Following is a screenshot of the management console:

You can learn about all the services offered by AWS using this link: https://docs.aws.amazon.com/index.html?nc2=h_ql_doc_do. Let us now go through some of the important AWS services that we as deep learning engineers/researchers can use:

  • Elastic Compute Cloud (EC2): Provides virtual computers. You can configure the hardware and software according to your infrastructural needs. You have an option to choose from CPU, GPU, storage, networking, and disk image configurations. We will talk about how to create an EC2 instance for deep learning in the next section.
  • Lambda: The serverless compute service offered by Amazon. It lets you run code without provisioning or managing servers. You only need to pay for the compute time you consume – there is no charge when your code is not running. It allows one to run code for virtually any type of application or backend service, with zero administration requirements.
  • Elastic Beanstalk: Provides quick and efficient services for deployment, monitoring, and scaling of your application.
  • AWS IoT: Allows you to connect and manage devices in the cloud.
  • SageMaker: A platform for developing and deploying machine learning models. With its prebuilt ML models, it allows you to train and deploy ML algorithms with ease. Later in this chapter we will learn how to use the integrated Jupyter Notebook of SageMaker to train our model on cloud.
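To make the Lambda model from the list above concrete, here is a minimal sketch of a Python handler. The `lambda_handler(event, context)` signature is the usual entry point for Python functions on Lambda; the "features" field and the averaging step are hypothetical stand-ins for a real request payload and real model inference.

```python
import json


def lambda_handler(event, context):
    """Entry point that AWS Lambda calls once per invocation.

    `event` carries the request payload; `context` carries runtime
    information and is unused in this sketch.
    """
    # "features" is a hypothetical field name chosen for this example.
    features = event.get("features", [])
    # Stand-in for model inference: a real handler would load a trained
    # model once, outside the handler, and call it here.
    score = sum(features) / len(features) if features else 0.0
    return {
        "statusCode": 200,
        "body": json.dumps({"score": score}),
    }


if __name__ == "__main__":
    # Local smoke test; on Lambda the service supplies event and context.
    print(lambda_handler({"features": [1.0, 2.0, 3.0]}, None))
```

Because the handler is an ordinary function, it can be exercised locally with a plain dictionary before it is deployed.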

Google Cloud Platform (GCP)

From computing infrastructure to software management, GCP provides a suite of cloud computing services. A complete list of all the services offered by GCP is available here: https://cloud.google.com/docs/. Google Cloud offers the same infrastructure that it uses for its end-user products like Gmail, Google Search, and YouTube. Besides CPUs and GPUs, GCP also offers a choice of TPUs (Chapter 16, Tensor Processing Unit).

GCP allows you to open an account for free – you just need to register using an email address (or phone) and card (debit/credit) details. It offers a $300 credit to new users, which is valid for 12 months and can be used across its products. Once you log in to the Google console you can access all its services. Following is a screenshot of my Google console:

Figure 3: The console of the Google Cloud Platform

Like Azure and AWS, GCP also offers a plethora of services. Some of the services of interest to a deep learning scientist and engineer are (as defined in the latest Google GCP documentation):

  • Compute Engine (https://cloud.google.com/compute/docs/): Compute Engine lets you create and run virtual machines on Google infrastructure. Compute Engine offers scale, performance, and value that allows you to easily launch large compute clusters on Google's infrastructure. There are no upfront investments and you can run thousands of virtual CPUs on a system that has been designed to be fast, and to offer strong consistency of performance.
  • Deep Learning Containers (https://cloud.google.com/ai-platform/deep-learning-containers/docs/): AI Platform Deep Learning Containers provides you with performance optimized, consistent environments to help you prototype and implement workflows quickly. Deep Learning Containers images come with the latest machine learning data science frameworks, libraries, and tools preinstalled.
  • App Engine (https://cloud.google.com/appengine/docs/): App Engine is a fully managed, serverless platform for developing and hosting web applications at scale. You can choose from several popular languages, libraries, and frameworks to develop your apps, then let App Engine take care of provisioning servers and scaling your app instances based on demand.
  • Cloud Functions (https://cloud.google.com/functions/docs/concepts/overview): Google Cloud Functions is a serverless execution environment for building and connecting cloud services. With Cloud Functions you write simple, single-purpose functions that are attached to events emitted from your cloud infrastructure and services. Your function is triggered when an event being watched is fired. Your code executes in a fully managed environment. There is no need to provision any infrastructure or worry about managing any servers.

    Cloud Functions can be written using JavaScript, Python 3, or Go runtimes on Google Cloud Platform. You can take your function and run it in any standard Node.js (Node.js 6, 8 or 10), Python 3 (Python 3.7), or Go (Go 1.11) environment, which makes both portability and local testing a breeze.

  • Cloud IoT Core (https://cloud.google.com/iot/docs/): Google Cloud Internet of Things (IoT) Core is a fully managed service for securely connecting and managing IoT devices, from a few to millions. Ingest data from connected devices and build rich applications that integrate with the other big data services of Google Cloud Platform.
  • Cloud AutoML (https://cloud.google.com/automl/docs/): Cloud AutoML makes the power of machine learning available to you even if you have limited knowledge of machine learning. You can use AutoML to build on Google's machine learning capabilities to create your own custom machine learning models that are tailored to your business needs, and then integrate those models into your applications and web sites.
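The single-purpose functions described under Cloud Functions above can be sketched in Python as follows. In the Python runtime, an HTTP-triggered function receives a Flask-style request object; the function name `hello_http` and the "name" query parameter are illustrative choices, and the `SimpleNamespace` object is only a local stand-in for the real request.

```python
from types import SimpleNamespace


def hello_http(request):
    """HTTP-triggered function; `request` is a Flask-style request object."""
    # request.args holds the URL query parameters.
    name = request.args.get("name", "World")
    return f"Hello, {name}!"


# Local check with a stand-in object that only mimics `request.args`.
fake_request = SimpleNamespace(args={"name": "TensorFlow"})
print(hello_http(fake_request))
```

Keeping the function free of global state like this is what makes local testing straightforward.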

Having covered GCP, let's move on to another cloud service: IBM Cloud.

IBM Cloud

With about 190 cloud services, IBM allows one to create an account for free with a $200 credit (no cards required). You can open an account by giving your email and some additional details: https://cloud.ibm.com/registration. The best part of IBM cloud is that it provides access to Watson Studio, where one can leverage the Watson API and use its pretrained models to build and deploy applications. It also offers Watson Machine Learning, which allows you to build deep learning models from scratch.

Now that we've covered some cloud service providers, let's take a look at the virtual machines that we are able to utilize on these clouds.

Virtual machines on cloud

As the name suggests, virtual machines (VMs) are not real systems. Instead, they are a computer file, called an image, which emulates the behavior of an actual computer. Thus, we can create a virtual computer within a computer. It runs on your existing OS, almost like any other program, providing you the same experience as you would have on a physical system with the same configuration (albeit with some latency).

Each virtual machine has its own virtual hardware, including CPUs, GPUs, memory, hard drives, network interfaces, and other devices. The cloud service providers allow you to create a virtual machine on their physical hardware using VM services. This section will cover how to create a virtual machine on the three cloud service providers, and features offered by them.

EC2 on Amazon

To create a virtual machine on Amazon EC2 you will need to launch an Amazon EC2 instance by clicking on the Launch Instance button available in the EC2 dashboard, as shown in the following screenshot:

Figure 4: Screenshot of the EC2 dashboard

After clicking Launch Instance, you can create your virtual machine in two simple steps:

  1. Choose an Amazon Machine Image (AMI): Amazon offers a variety of prebuilt AMIs for Deep Learning (https://aws.amazon.com/machine-learning/amis/). The Conda AMIs (on AWS Linux, Ubuntu, and Windows OS) provide prebuilt Conda virtual environments for various Deep Learning frameworks including TensorFlow. The Base AMIs (on AWS Linux and Ubuntu) have various versions of CUDA preinstalled, and the user needs to enable the appropriate CUDA version and install the framework of choice.

    As of November 2019, the existing AMIs in Amazon Marketplace do not support TensorFlow 2.x.

  2. Choose the Instance type: Amazon offers a wide range of instance types, from general purpose computing to accelerated computing. For deep learning we require instances with GPUs. P3, P2, G4, G3, and G2 instances have GPU support (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/accelerated-computing-instances.html), so for DL projects you should select one of these. Please note that AWS sets instance limits on these; by default, the limit for all accelerated compute instances is 0, so you will first need to request an increase in the instance limit. Also remember that not every instance type is available in every region, so go through the documentation to find out which regions offer your required instance.

Now unless you want to do advanced network and security settings, your machine is ready to launch. Just review your selections and launch it. Amazon EC2 allows you to communicate with your virtual machine through the command line via SSH or using a web browser.
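The two choices above (an AMI and a GPU instance type) can also be expressed programmatically. The following is a rough sketch that only assembles the parameters; the key names follow the `run_instances` call of the boto3 EC2 client, while the AMI ID, key pair name, and helper functions are hypothetical placeholders. The actual API call is left as a comment because it needs boto3, AWS credentials, and a sufficient instance limit.

```python
# GPU-backed instance families named in the text above.
GPU_FAMILIES = ("p3", "p2", "g4", "g3", "g2")


def is_gpu_instance(instance_type):
    """Return True if the type (for example "p3.2xlarge") is in a GPU family."""
    family = instance_type.split(".")[0].lower()
    return family.startswith(GPU_FAMILIES)


def build_launch_request(ami_id, instance_type, key_name):
    """Assemble keyword arguments for launching a single GPU instance."""
    if not is_gpu_instance(instance_type):
        raise ValueError(f"{instance_type} is not a GPU instance type")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "MinCount": 1,
        "MaxCount": 1,
    }


# "ami-EXAMPLE" and "my-key-pair" are placeholders, not real resources.
request = build_launch_request("ami-EXAMPLE", "p3.2xlarge", "my-key-pair")
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ec2").run_instances(**request)
```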

An alternative to Amazon EC2 is Compute Instance, available on GCP.

Compute Instance on GCP

To access Compute instance, go to the Google Cloud Console and select Compute Engine, and you will reach the dashboard where you can select the configuration you want for your virtual machine. Following is a screenshot of the Compute Engine dashboard. Select Create or Import (if you already have a saved VM configuration) to create a new virtual machine instance:

Figure 5: The Compute Engine dashboard

Alternatively, you can choose a complete configuration from the marketplace, which will launch the environment with the corresponding (minimum) infrastructure. You then just need to deploy the instance. Each instance has a different monthly price, depending upon the compute resources it requires.

GCP Compute Engine offers two options for CPU families: either the Intel Skylake platform (also called N1; this series allows GPUs) or the Intel Cascade Lake platform. With your machine you have an option to add GPUs. At the time of writing this book, GCP offered four different GPUs (and TPUs; for more on TPUs refer to Chapter 16, Tensor Processing Unit):

  • Nvidia Tesla K80
  • Nvidia Tesla P4
  • Nvidia Tesla T4
  • Nvidia Tesla V100

Virtual machine on Microsoft Azure

For deep learning and prediction applications Azure provides machines with GPU capabilities. These are called N-series machines.

According to the Microsoft Azure site (https://azure.microsoft.com/en-in/pricing/details/virtual-machines/series/) there are three different N-series offerings, each aimed at specific workloads:

  • NC series: It focuses on high-performance computing and machine learning applications. The latest version – NCsv3 – features Nvidia's Tesla V100 GPU.
  • ND series: It focuses on training and inference scenarios for deep learning. It uses the Nvidia Tesla P40 GPUs. The latest version – NDv2 – features the Nvidia Tesla V100 GPUs.
  • NV series: This supports powerful remote visualization applications and other graphics-intensive workloads backed by the Nvidia Tesla M60 GPU.

All of them also offer optional InfiniBand interconnect to enable scale-up performance. To create a virtual machine on Azure you need to follow three basic steps:

  1. Log in to Azure portal
  2. Select Virtual machines as a resource and then select Create Virtual Machine
  3. Choose the configuration you require and launch it

On Azure, too, you need to request a quota increase for certain compute instances.

Jupyter Notebooks on cloud

During development and testing of the model, many in the machine learning community find Jupyter Notebooks handy; they provide an integrated environment to run code and view the results. They are very useful when you are collaborating or want to discuss code with a client. With LaTeX support, many researchers are even shifting to presenting their research papers on Jupyter, and hence it makes sense to have a Jupyter Notebook environment on cloud.

You just share the link and the other person can view it and run it, without any of the hassle of OS environment and software dependencies. In this section we will cover the Jupyter Notebook environments made available by three of the technological giants: Google, Microsoft, and Amazon.


SageMaker

Amazon SageMaker is a fully managed machine learning service. You can use it to easily and quickly build and train machine learning models. The trained models can then be directly deployed into a production-ready hosted environment. SageMaker provides an integrated Jupyter notebook instance; this allows for easy access to data sources and provides a convenient coding platform for exploration and analysis, thus removing any need to manage servers.

An additional feature provided by SageMaker is the availability of optimized common machine learning algorithms. This allows users to run code efficiently, even when the dataset being used is extremely large. It offers flexible distributed training options that you can tailor according to your specific workflow. The trained model can later be deployed into a scalable and secure environment, with only a single click from the Amazon SageMaker console. Both training and hosting are billed according to the number of minutes used. There are no minimum fees and no upfront commitments. You can follow the Amazon documentation on how to setup SageMaker using this link: https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html.

In order to load data and deploy your model, you will need to use SageMaker modules and functions. A good place to start will be this tutorial: https://www.bmc.com/blogs/amazon-sagemaker/. As you may gather from the tutorial, Amazon SageMaker is not free. Even experimenting on it to write code for this book required us to spend precious dollars. However, it offers ease of deployment.

Google Colaboratory

Google, along with the Jupyter development team, launched Google Colaboratory in 2014. Since then, Colaboratory has grown in function and utility. Today, it supports GPU and TPU hardware acceleration, and it supports Python (versions 2.7 and 3.6). Colab is integrated with Google Drive, so your notebooks are saved on your drive and you can also read data from your own drive (you will need to authorize the notebook first).

The best part of Google Colaboratory is that it is completely free. You can run your code continuously for 12 hours on it. To be able to work with Colaboratory, you need an account with Google. Your normal Gmail account will also work.

When you log in to Colaboratory at https://colab.research.google.com/notebooks/intro.ipynb#recent=true, you have the option to open an existing Jupyter Notebook from your drive or from GitHub. You can also upload an existing notebook from your computer or create a new notebook. For beginners, it also contains some example notebooks, as seen in the following screenshot:

Figure 6: A screenshot of the Colaboratory

You can create a new notebook as well, by clicking on the blue link in the bottom-right corner (in the preceding screenshot). This will create a Python 3 notebook.

Jupyter plans to phase out Python 2 support by January 2020. Google too shall phase out the Python 2 support from Colaboratory after that.

To choose the hardware accelerator, you need to go to Edit | Notebook Settings and select the required hardware accelerator (None/GPU/TPU; for more information on TPUs, refer to Chapter 16, Tensor Processing Unit). The Notebook environment comes installed with most useful Python packages (TensorFlow, NumPy, Matplotlib, Pandas, and so on).

If you need to install a specific version, or a module that is not part of the default Colaboratory environment, you can use pip install or apt-get install. For example, executing the following command in a Colab notebook cell installs TensorFlow GPU version 2.0:

!pip install tensorflow-gpu==2.0.0

At the time of writing this book, the default version of TensorFlow was 1.15, with a message that it would soon shift to TensorFlow 2.0. The NumPy version was 1.17.3, Matplotlib 3.1.1, and Pandas 0.25.3.
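Since the preinstalled versions change over time, it can help to check them from inside the notebook. The sketch below uses the standard library's `importlib.metadata`, which assumes Python 3.8 or later; the helper name `installed_version` is illustrative, and the package names are the distribution names used by pip.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a pip package, or a marker string."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Report the versions of the packages mentioned in the text.
for package in ("tensorflow", "numpy", "matplotlib", "pandas"):
    print(f"{package}: {installed_version(package)}")
```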

If you want to know the hardware details of the environment where Colaboratory notebooks run, you can get them using the cat command:

For processor:

!cat /proc/cpuinfo

For memory:

!cat /proc/meminfo

To run these starting commands and get the version information for your region, you can use the following Colaboratory Notebook:


Just like the standard Jupyter notebook, you can run the Unix command line commands directly in the Notebook prefixed by an exclamation mark "!".
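The information printed by `cat /proc/meminfo` can also be read from Python when you want to use it in code. The following is a small sketch of a parser for that file's "Field: value kB" layout; the function name and the sample numbers are made up for illustration and do not describe any real Colaboratory machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: leading integer value}.

    Most fields are reported in kB; a few (such as huge page counts)
    are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Field: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


# Made-up sample in the same layout as /proc/meminfo.
SAMPLE = """MemTotal:       13335276 kB
MemFree:         9871232 kB
MemAvailable:   12345678 kB
"""

info = parse_meminfo(SAMPLE)
print(f"Total memory: {info['MemTotal'] / 1024**2:.1f} GiB")

# On a Linux machine the real file can be parsed the same way:
#   with open("/proc/meminfo") as f:
#       info = parse_meminfo(f.read())
```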

You can also mount your Google Drive and access the files you have saved in your drive. To do this, you will use:

from google.colab import drive
drive.mount('/content/drive')

This will generate a link. After clicking the link you will get an authorization code, and entering the authorization code will give the notebook access to your drive. You can check the content of your drive using !ls "/content/drive/My Drive" and access any of the folders in it by specifying the path.

The Colab interface is very similar to Jupyter, so now you are all set to run your machine learning experiments on Colaboratory. One disadvantage of Colaboratory is that it does not work well in presentation mode. Fortunately, for that we have Azure Notebooks.

Microsoft Azure Notebooks

Microsoft offers Azure Notebooks, a free service for anyone to develop and run code in their web browser using Jupyter. It supports Python 2, Python 3, R, and F#, along with their popular packages. It is a general code authoring, executing, and sharing platform. According to Microsoft documentation, one can use Notebooks in diverse scenarios, such as giving an online webinar, giving a PowerPoint-like presentation with executable code in slides, or learning a new model. The service is free. However, to stop abuse, Microsoft has put limitations in place; at present there is a 4 GB memory limit per user, and a 1 GB data limit.

Azure Notebooks is a thriving place, with many existing and exciting notebooks shared by the DL community. You can access them here: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks.

Like Colaboratory, Azure Notebooks have most packages preinstalled, and if you require, you can install new packages via !pip install. You can run any Unix command line command with a prefixing exclamation mark. To be able to use Azure Notebooks, you will need to open an account. You can use your existing Microsoft account, or create a new one. It even offers an option to create a Child Account to encourage young people to learn programming, and these accounts have parental control.

Some packages may not yet be available in Azure Notebooks.

You can access data directly through the notebook interface using upload/download commands. You can even download data from a URL using !wget url_address.
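If you prefer to stay in Python rather than shell out to wget, the standard library can fetch a file as well. This is a minimal sketch using `urllib.request.urlretrieve`; the `download` helper is a hypothetical name, and the demonstration uses a local file:// URL so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def download(url, destination):
    """Fetch `url` and save it to `destination`; return the saved path."""
    path, _headers = urllib.request.urlretrieve(url, destination)
    return Path(path)


# Offline demonstration: "download" a local file through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("x,y\n1,2\n")
    saved = download(source.as_uri(), Path(tmp) / "copy.csv")
    print(saved.read_text())
```

In a notebook, the same helper would be called with an http or https address in place of the file:// URL.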

Now that we've looked at the cloud services and the VMs that can help us to perform our training, let's look at how we can move into the production stage, using TensorFlow Extended.

TensorFlow Extended for production

TFX is an end-to-end platform for deploying machine learning pipelines. A part of the TensorFlow ecosystem, it provides a configuration framework and shared libraries so as to integrate the common components needed to define, launch, and monitor software based on ML models. TFX includes many of the requirements for production software deployments and best practices, viz: scalability, consistency, testability, safety and security, and so on.

It starts with ingesting your data, followed by data validation, feature engineering, training, and serving. Google has created libraries for each major phase of the pipeline, and there are frameworks for a wide range of deployment targets. TFX implements a series of ML pipeline components. All of this is made possible by creating horizontal layers for things like pipeline storage, configuration, and orchestration. These layers are very important for managing and optimizing the pipelines and the applications that you run on them.

You will need to install it first. TensorFlow Extended can be installed using the pip command:

pip install tfx

In the following section we will cover the fundamentals of TFX, its architecture, and the various libraries available within it.

TFX Pipelines

The TFX pipeline consists of a sequence of components that implement an ML pipeline, specifically ensuring the scalability and high performance of the underlying ML task. It includes modeling, training, inference, and deployment to web or mobile targets. A TFX pipeline includes several components, with each component consisting of three main elements: the Driver, the Executor, and the Publisher. The driver queries the metadata store and supplies the resultant metadata to the executor. The executor is the one performing all the processing. The publisher accepts the results of the executor and saves them in metadata. As an ML software developer, you will need to write the code that runs in the executor, depending upon the component class you are working with.

In a TFX pipeline, a unit of data, called an artifact, is passed between components. Normally a component has one input artifact and one output artifact. Every artifact has an associated metadata that defines its type and properties. The artifact type defines the ontology of artifacts in the entire TFX system, while the artifact property specifies the ontology specific to an artifact type. Users have the option to extend the ontology globally or locally.
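The driver, executor, and publisher roles, and the way an artifact passes through them, can be illustrated with a toy sketch in plain Python. This is not the TFX API: `MetadataStore`, `run_component`, and `statistics_executor` are hypothetical names, and the dictionary-based artifacts only mimic the idea of typed artifacts with properties.

```python
class MetadataStore:
    """In-memory stand-in for a metadata store (illustrative only)."""

    def __init__(self):
        self._artifacts = {}

    def get(self, name):
        return self._artifacts[name]

    def put(self, name, artifact):
        self._artifacts[name] = artifact


def run_component(store, input_name, output_name, executor):
    """Run one component as three steps: driver, executor, publisher."""
    # Driver: query the metadata store for the input artifact.
    input_artifact = store.get(input_name)
    # Executor: perform the actual processing.
    output_artifact = executor(input_artifact)
    # Publisher: save the executor's result back into the metadata store.
    store.put(output_name, output_artifact)
    return output_artifact


def statistics_executor(examples):
    """Toy executor that turns an Examples artifact into a Statistics artifact."""
    values = examples["values"]
    return {
        "type": "Statistics",
        "count": len(values),
        "mean": sum(values) / len(values),
    }


store = MetadataStore()
store.put("examples", {"type": "Examples", "values": [2.0, 4.0, 6.0]})
stats = run_component(store, "examples", "statistics", statistics_executor)
print(stats)
```

Only the executor function changes from one component to the next in this sketch, which mirrors the point that the executor is where your own code runs.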

TFX pipeline components

The following diagram shows the flow of data between different TFX components:

Flow of data between TFX components

All the images in the TFX section have been adapted from the TensorFlow Extended official guide: https://www.tensorflow.org/tfx/guide.

To begin with we have ExampleGen, which ingests the input data, and can also split the input dataset. The data then flows to StatisticsGen, which calculates the statistics of the dataset. Then comes SchemaGen, which examines the statistics and creates a data schema; then an ExampleValidator, which looks for anomalies and missing values in the data; and Transform, which performs feature engineering in the dataset. The transformed dataset is then fed to the Trainer, which trains the model. The performance of the model is evaluated using Evaluator and ModelValidator. Finally, if all is well, the Pusher deploys the model on the serving infrastructure.
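One way to reason about this flow is to treat the components as a dependency graph and ask for a valid execution order. The sketch below uses the standard library's `graphlib` (Python 3.9 or later). The exact upstream edges are an assumption read from the paragraph above, not taken from TFX itself.

```python
from graphlib import TopologicalSorter

# Each component maps to the components whose output it consumes.
# The edges are an assumption based on the description in the text.
UPSTREAM = {
    "ExampleGen": set(),
    "StatisticsGen": {"ExampleGen"},
    "SchemaGen": {"StatisticsGen"},
    "ExampleValidator": {"StatisticsGen", "SchemaGen"},
    "Transform": {"ExampleGen", "SchemaGen"},
    "Trainer": {"Transform", "SchemaGen"},
    "Evaluator": {"ExampleGen", "Trainer"},
    "ModelValidator": {"ExampleGen", "Trainer"},
    "Pusher": {"Trainer", "Evaluator", "ModelValidator"},
}

# static_order() yields every component after all of its upstream components.
order = list(TopologicalSorter(UPSTREAM).static_order())
print(" -> ".join(order))
```

Several orders are valid (for example, ExampleValidator and Transform are independent of each other), which is why an orchestrator is free to run such components in parallel.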

TFX libraries

TFX provides several Python packages that are used to create pipeline components. Quoting from the TensorFlow Extended User Guide (https://www.tensorflow.org/tfx/guide):

These packages are the libraries which you will use to create the components of your pipelines so that your code can focus on the unique aspects of your pipeline.

Different libraries included in TFX are:

  • TensorFlow Data Validation (TFDV) is a library for analyzing and validating machine learning data
  • TensorFlow Transform (TFT) is a library for preprocessing data with TensorFlow
  • TensorFlow is used for training models with TFX
  • TensorFlow Model Analysis (TFMA) is a library for evaluating TensorFlow models
  • TensorFlow Metadata (TFMD) provides standard representations for metadata that are useful when training machine learning models with TensorFlow
  • ML Metadata (MLMD) is a library for recording and retrieving metadata associated with ML developers and data scientists' workflows

The following diagram demonstrates the relationship between TFX libraries and pipeline components:

Figure 7: Relationships between TFX libraries and pipeline components, visualized

TFX uses the open source Apache Beam to implement data-parallel pipelines. Optionally TFX allows Apache Airflow and Kubeflow for easy configuration, operation, monitoring, and maintenance of the ML pipeline. Once the model is developed and trained, using TFX you can deploy it to one or more deployment target(s) where it will receive inference requests. TFX supports deployment to three classes of deployment targets: TensorFlow Serving (works with REST or gRPC interface), TensorFlow.js (for browser applications), and TensorFlow Lite (for native mobile and IoT applications). Trained models that have been exported as SavedModels can be deployed to any or all of these deployment targets.
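As an illustration of the REST interface mentioned above, the following sketch builds a prediction request for TensorFlow Serving without sending it. The `/v1/models/<name>:predict` path, the "instances" request key, and port 8501 follow TensorFlow Serving's REST conventions; the host, the model name `my_model`, and the input values are hypothetical.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, port=8501):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "my_model" and the input row are placeholders for a real served model.
request = build_predict_request("localhost", "my_model", [[1.0, 2.0, 3.0, 4.0]])
print(request.full_url)

# With a model server running, the request could be sent with:
#   with urllib.request.urlopen(request) as response:
#       predictions = json.load(response)["predictions"]
```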

TensorFlow Enterprise

TensorFlow Enterprise is the latest offering from Google that provides enterprise-grade support, cloud-scale performance, and managed services. TensorFlow Enterprise has been launched as a beta version. Its aim is to accelerate software development and ensure the reliability of launched AI applications. It is fully integrated with Google Cloud and its services, and introduces some improvements in the way TensorFlow Datasets reads data from Cloud Storage. TensorFlow Enterprise also introduces the BigQuery reader, which, as the name implies, allows the user to read data directly from BigQuery.

In ML tasks, speed is critical, and one of the major bottlenecks is the speed at which data is accessed for the training process. TensorFlow Enterprise provides optimized performance and easy access to data sources, making it extremely efficient on GCP.


Summary

In this chapter we explored different cloud service providers that can provide the computing power necessary to train, evaluate, and deploy your deep learning models. We started by first understanding the types of cloud computing services available today. The chapter explored the Amazon, Google, and Microsoft IaaS services for creating a virtual machine. The different infrastructure options available in each were discussed. Next, we moved to SaaS services, specifically Jupyter Notebook on cloud. The chapter covered Amazon SageMaker, Google Colaboratory, and Azure Notebooks. Just training a model is not sufficient; eventually we want to deploy it in a scalable manner. Thus, we delved into TensorFlow Extended, which allows users to develop and deploy ML models in a scalable, safe, and secure manner. Lastly, we introduced TensorFlow Enterprise, the latest offering in the TensorFlow ecosystem, and briefly discussed its features.


References

  1. To get a complete list of virtual machine types offered by Microsoft Azure: https://azure.microsoft.com/en-in/pricing/details/virtual-machines/series/
  2. A good tutorial on Amazon SageMaker: https://www.bmc.com/blogs/amazon-sagemaker/
  3. https://colab.research.google.com/notebooks/intro.ipynb#recent=true
  4. A collection of interesting Azure Notebooks: https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks
  5. Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden Technical Debt in Machine Learning Systems. In Advances in neural information processing systems, pp. 2503-2511. 2015
  6. TensorFlow Extended tutorials: https://www.tensorflow.org/tfx/tutorials
  7. Baylor, Denis, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal et al. Tfx: A TensorFlow-Based Production-Scale Machine Learning Platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1387-1395. ACM, 2017.
  8. A nice comparison between Google Colab and Azure Notebooks: https://dev.to/arpitgogia/azure-notebooks-vs-google-colab-from-a-novices-perspective-3ijo
  9. TensorFlow Enterprise: https://cloud.google.com/blog/products/ai-machine-learning/introducing-tensorflow-enterprise-supported-scalable-and-seamless-tensorflow-in-the-cloud