Willem de Beijer and Daan Kolkman
This tutorial will take you through the steps for creating a data science virtual machine on Azure. This tutorial is part of our Cloud Computing for Data Science series.
1. Creating an Azure account
The easiest way to set up an Azure account is to go to http://signup.azure.com. If you have an existing Outlook, Hotmail or GitHub account, you can use it to log in to your Azure account. A regular account comes with plenty of free services, but you will need to provide your credit card details before you can get started. You will automatically be billed once you exceed the free limits.
If you do not own a credit card and are a higher education student, you can create an Azure student account at https://azure.microsoft.com/en-us/free/students/. You will need to sign up with a .edu email address for this to be accepted. A student account provides $100 in free credits that you can use for any service on the platform.
2. Setting up the VM
Once you have signed up for Azure, you will be able to access the portal which looks like this:
Search for “VM” in the top bar and click the “Virtual machines” option:
You will now get an overview of your Virtual Machines. Click “+ Add” to create a new Virtual Machine and you will be shown the configuration console:
This console allows you to fully customize your VM. However, Azure also offers the option to start from an existing image. This means that you can use a standard data science VM without having to go through the painful process of installing all the required software yourself. To do so, click the blue “Create VM from Azure Marketplace” link at the top. Click the “AI + Machine Learning” option in the left pane and your screen should look similar to this:
For this tutorial we will pick the “Data Science Virtual Machine for Linux (Ubuntu)” option since it’s versatile and useful in many data science applications. When you click that option you will be shown the specs of the pre-installed software.
Note: the similar Windows option might be tempting, but it comes with fewer data science tools pre-installed and is therefore not recommended.
Click the blue “Create” button to create an instance of this VM and you will be redirected to the configuration page, this time with the image we want selected.
The following details need to be entered on this page:
- Resource group – A resource group is like a folder for a project. This is useful when you have different projects and want to see the billing details for each of them. For now, just use the standard “ResourceGroup”.
- Virtual machine name – This is as straightforward as it sounds.
- Image – This should be the Data Science image we have just selected.
- Size – Here you can set what hardware you want your VM to run on. For this tutorial a small size will suffice, but you might need more for bigger projects.
- Authentication – This will be described in more detail below.
All others options can be left at their default.
To be able to access our VM we need to create an SSH key. This process is slightly different for Windows and Mac users.
On Mac, open a Terminal window and run the following command:
ssh-keygen -t rsa -b 2048
The prompt will now ask for a file in which to store the key. Press the ENTER key to use the default location.
Now you will be asked for a passphrase, which is similar to using a password anywhere else.
When you are done with these steps your prompt should look like this:
You have now created a public-private key pair for SSH access. Print your public key with the following command (assuming you accepted the default location):
cat ~/.ssh/id_rsa.pub
This gives a string starting with “ssh-rsa” followed by a long run of random-looking characters. Copy this key (without the username at the end) and paste it into the “SSH public key” field on the Azure form.
The file on your computer contains the private key matching this public key. When you connect to your VM, it can verify your identity by checking that you own the private key belonging to the public key you provided.
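If you script this setup often, the interactive prompts can be skipped. A minimal sketch, assuming the key path /tmp/azure_vm_key (an example path, not a requirement) and an empty passphrase for illustration only:

```shell
# Non-interactive equivalent of the steps above: -f picks the output file,
# -N sets the passphrase (empty here for illustration; use a real one).
ssh-keygen -t rsa -b 2048 -f /tmp/azure_vm_key -N ""
# The public key to paste into the Azure form:
cat /tmp/azure_vm_key.pub
```

Note that if you store the key somewhere other than the default location, you will need to point ssh at it with -i /tmp/azure_vm_key when connecting.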
The easiest way to manage SSH connections on Windows is with a tool called PuTTY, which can be downloaded at https://www.chiark.greenend.org.uk/~sgtatham/putty/. Once you’ve finished installing, open the app called “PuTTYgen”.
Make sure that in the “Parameters” box an RSA key with 2048 bits is selected, then click “Generate”. If you want additional security, enter a passphrase in the corresponding fields. Proceed to save both the public and private keys.
At the top you should see a string starting with “ssh-rsa” followed by a long run of random-looking characters. Copy this key (without the username at the end, if there is one) and paste it into the “SSH public key” field on the Azure form.
Now click the blue “Review + create” button on the bottom of the page to create your VM.
You will be redirected to a confirmation page. Click “Create” on the bottom and you will be taken to your VM overview. Deployment might take a few minutes.
Congratulations, your VM is now up and running! On the confirmation screen click “Go to resource”
You will now be shown an overview page where you can see all relevant information about your VM, including its usage in the past x hours.
3. Using your VM
Now that our VM is ready to go, let’s use it to run a Jupyter Notebook. To access your VM using SSH, click the “Connect” button at the top of the overview page. This will show the SSH command with your username and IP address filled in.
On Mac, copy this command and run it in a Terminal window. Enter “yes” when asked if you want to connect and then type your passphrase. If your SSH connection is successful, the Terminal window should look like the one below:
Run a Jupyter Notebook
To start a Jupyter Notebook on your machine, execute this command in the Terminal:
jupyter notebook --no-browser
Then open a new (local) Terminal window and execute:
ssh -NfL 9999:localhost:8888 <username>@<public-ip>
Replace <username> and <public-ip> with your own username and the VM’s public IP address (the same ones used to SSH into the VM). The -L flag forwards local port 9999 to port 8888 on the VM (where Jupyter listens), -N skips running a remote command, and -f puts the connection in the background.
On Windows, click the “Connect” button at the top of the overview page, copy the IP address shown there, and open a new PuTTY window. Paste the IP address under “Host Name (or IP Address)”.
In the left pane go to “Connection” -> “SSH” -> “Auth” and under “Private key file for authentication” click “Browse” and select the private key you saved earlier.
Next, in the left pane go to “Connection” -> “SSH” -> “Tunnels”. In the “Source port” field enter “9999” and in the “Destination” field enter “localhost:8888”, then click “Add”.
Go back to “Session” in the left pane, enter a name under “Saved Sessions” and press “Save” to store this session. Next time we want to connect to the VM, we can load this session without having to enter all connection details again. Click “Open” on the bottom of the window to establish the connection.
When prompted whether you trust this host, choose “Yes”.
Now enter the username you chose when setting up the VM and enter your passphrase if you chose to use one. You are now connected to your VM and the result should look somewhat like this:
Run a Jupyter Notebook
To start a Jupyter Notebook on your machine, execute this command in the PuTTY terminal we just opened:
jupyter notebook --no-browser
Go to your browser and enter “localhost:9999” in the address bar. You will be shown the following screen:
At the top, in the “Password or token” field, enter the token shown in the terminal window in which you started Jupyter on your VM. (If you need it again, running jupyter notebook list on the VM prints the notebook URL including the token.)
You will be shown a Jupyter environment that is very similar to a Jupyter environment when you run it locally. The “notebooks” folder contains a lot of useful example notebooks on how to do certain things in Azure.
4. Using your own datasets
There are several ways to get your datasets or other files onto your VM. One easy option is Azure Blob storage. Blob storage acts like any other cloud storage, and you can store files of any format in it. Since the data in Blob storage is not tied to a VM, you can easily use your files with other Azure services as well. For example, when you create a new VM, you do not need to move the files to keep using them.
To create a Blob go to the Azure console and search for the service “Storage accounts”. Click the “+ Add” button to create a new storage account. The setup is fairly easy, and most options can be left at their default.
Once you filled out the form, click “Review + create”. When the deployment of your storage account is done you will be shown the following console:
In the left pane click “Blobs” followed by clicking “+ Container” to create a new container. Give your container a name and click “OK”. Your new container should now be visible in the console.
Click your container and click the “Upload” button on top to upload files in a similar way as you would in other cloud services. For this tutorial, I uploaded the MNIST dataset in CSV format.
The advantage of saving data to a Blob is that it is not attached to any specific VM or other machine. This means that if you want to launch a different VM or use the data elsewhere, it will be easy to transfer. Note that this is also an easy way to store large datasets that you don’t want to keep on your laptop.
Now execute the following code in a Jupyter Notebook.
import pandas as pd
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name="willemdb", account_key="...")
blob_service.get_blob_to_path("test-container", "mnist_784.csv", "mnist_784.csv")
df = pd.read_csv("mnist_784.csv")
First, we import BlockBlobService, which comes pre-installed on the Azure data science VM. Then we connect to our Blob using an account name and a key. These can be found in the Azure Storage accounts console under “Access keys” in the left pane. Note that the key is like a password to your data and should be kept private.
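Because the key grants full access to the storage account, it is better not to hard-code it in a notebook. A minimal sketch of reading it from environment variables instead; the variable names AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY are our own choice here, not something the SDK requires:

```python
import os

# Hypothetical variable names; export them in the shell that launches
# Jupyter so the secret never appears in the notebook file itself.
account_name = os.environ.get("AZURE_STORAGE_ACCOUNT", "willemdb")
account_key = os.environ.get("AZURE_STORAGE_KEY")

# Fail early with a clear message if the key is missing.
if account_key is None:
    print("AZURE_STORAGE_KEY is not set")
```

You can then pass account_name and account_key to BlockBlobService exactly as in the snippet above, without the secret ever being saved in the notebook.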
Then we copy a file from Blob to the VM using the get_blob_to_path method which takes 3 arguments:
- Container name – The name of our container
- Blob name – This is the name of the file we want to copy
- Local file name – This is the name the file will get on the VM. In this case we want to keep it the same as on the Blob, so the second and third argument of the function are the same.
The file is now copied to the VM and can be used in the same way as we would on our local machine!
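As a quick sanity check on the loaded data, the usual next step is to split pixels from labels. A sketch using a tiny synthetic frame with the same layout as mnist_784.csv (columns pixel1..pixel784 plus a class column; the real file has 70,000 rows) — swap in the real df from the snippet above:

```python
import numpy as np
import pandas as pd

# Tiny stand-in with the same column layout as mnist_784.csv: 784 pixel
# columns plus a "class" label column, here with only 5 synthetic rows.
rng = np.random.default_rng(0)
cols = [f"pixel{i}" for i in range(1, 785)] + ["class"]
data = np.hstack([rng.integers(0, 256, size=(5, 784)),
                  rng.integers(0, 10, size=(5, 1))])
df = pd.DataFrame(data, columns=cols)

# Split features and labels, as you would for the real dataset.
X = df.drop(columns=["class"]).to_numpy()
y = df["class"].to_numpy()
print(X.shape, y.shape)  # (5, 784) (5,)
```

With the real file the only difference is the row count, so the same two lines give you arrays ready for scikit-learn or any other library on the VM.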