Data science on Kaggle


Willem de Beijer and Daan Kolkman

This tutorial will take you through the steps of using a Kaggle kernel for data science. This tutorial is part of our Cloud Computing for Data Science series.

1. About Kaggle

Kaggle is a well-known data science platform where individuals and teams compete in data science challenges. While competitions are Kaggle’s main draw, it also offers a free service called the Kaggle Kernel. A kernel does not give you a virtual machine the way the other tutorials in this series do, but it does give you free computing resources for your private projects. While more limited than a VM, this option is quick and easy to set up.

2. Setup

Go over to https://www.kaggle.com and create a free account if you don’t already have one. In the top menu click on “Kernels” and then “New Kernel”. You can run Python scripts as well as Jupyter Notebooks on a kernel; for now, click “Notebook”.

While the look might be slightly different from the Jupyter Notebooks you are used to, it functions exactly the same. You can upload a dataset by clicking “+ Add Data” at the top and choose to keep this data private, so other Kagglers won’t be able to access it.
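
Once added, your dataset is mounted read-only under /kaggle/input. A minimal sketch of loading it, where the dataset and file names are placeholders for your own:

```python
import pandas as pd

# Datasets added via "+ Add Data" appear read-only under /kaggle/input;
# "my-dataset" and "data.csv" are placeholder names, substitute your own.
df = pd.read_csv("/kaggle/input/my-dataset/data.csv")
print(df.shape)
```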

3. What you get

Packages

The most used packages in data science are pre-installed on the kernel. If you find yourself needing more, go to “Settings” in the right pane and click “Install…” next to “Packages”.
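
You can also install packages from inside the notebook itself. A minimal sketch, assuming internet access is enabled for the kernel and using an example package name:

```python
# Installs a package into the running kernel; "lifelines" is only an
# example, substitute whatever you need. Requires internet access to be
# enabled in the kernel settings.
!pip install lifelines

import lifelines
print(lifelines.__version__)
```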

Computing resources

Kaggle gives you 2 options for kernel resources:

CPU-type kernel (Default): 4 CPU cores, 16 GB RAM, no GPU
GPU-type kernel: 2 CPU cores, 13 GB RAM, Tesla P100 GPU

Both kernel types come with 9 hours of execution time, 5 GB of auto-saved disk space (/kaggle/working), and 16 GB of temporary/scratchpad disk space.

To switch from a CPU to a GPU kernel, toggle the “GPU” switch on the “Settings” tab of the right pane.
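
After toggling the switch, you can check that the GPU is actually visible from your code. A quick sketch, assuming you use TensorFlow:

```python
import tensorflow as tf

# On a GPU kernel this prints one entry for the Tesla P100;
# on a CPU kernel the list is empty.
print(tf.config.list_physical_devices("GPU"))
```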

Parallelizing

A single user can have up to 4 GPU kernels running in parallel without losing speed. If more kernels are running in parallel, your performance will be throttled.

Execution time

Kaggle kernels come with 9 hours of execution time. This means that from the moment you start a kernel, you have 9 hours before it times out. At that point temporary data (such as variables) is lost, but nothing prevents you from saving your results and restarting the kernel for another 9 hours of execution time. Please note that your kernel will stop executing if you close the browser it’s connected to.
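
To keep a long run from being wasted by the timeout, write intermediate results to /kaggle/working as you go. A minimal sketch with a Keras checkpoint callback; the model, data, and filename are placeholders:

```python
import numpy as np
from tensorflow import keras

# A tiny placeholder model and dataset, just to show the pattern.
model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")
x, y = np.random.rand(64, 4), np.random.rand(64, 1)

# /kaggle/working is the auto-saved disk, so the file survives a restart;
# anything kept only in memory is lost when the kernel times out.
checkpoint = keras.callbacks.ModelCheckpoint("/kaggle/working/model.h5")
model.fit(x, y, epochs=3, callbacks=[checkpoint], verbose=0)
```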

Collaborate

You can collaborate whilst keeping your data and notebooks private from others on Kaggle. On the Kernels page, click the kernel you want to share. In the top right, click the “Access” button to provide access to specific users.

4. When to use

Kaggle kernels can be a useful addition to the data scientist’s toolbox and should be seen as an extension of your laptop. They offer less flexibility and scalability than a virtual machine but are easier to set up. This section describes a few use cases where Kaggle is especially useful.

Deep learning on a MacBook or non-Nvidia laptop

Most deep learning libraries can use GPU acceleration for (much) faster training. However, this acceleration normally relies on CUDA, a framework that is only available on Nvidia GPUs. MacBooks come with AMD GPUs and therefore can’t use GPU acceleration in most deep learning libraries. Many ultrabooks come with Intel integrated graphics, which cannot use GPU acceleration either. Even if you’ve got an Nvidia graphics card, the Nvidia Tesla P100 offered by Kaggle is likely to perform a lot better than your laptop.

I trained a simple deep neural net using TensorFlow with a Keras wrapper on the MNIST dataset and got a ~6.9x speedup on the Kaggle GPU kernel versus my 2017 13” MacBook Pro. That’s a whopping 85% decrease in training time for the exact same neural net!
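
For reference, a minimal sketch of a comparable experiment. The exact architecture used above isn’t specified, so this small convolutional net is my own choice; Keras picks up the GPU automatically when one is available:

```python
from tensorflow import keras

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

# A small convolutional net, trained identically on CPU and GPU kernels.
model = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```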

A laptop with less than 16 GB of RAM

If your laptop has less than 16 GB of RAM and you are working with a large dataset, you will run into problems. A dataset that does not fit in memory will considerably slow down your workflow, so a Kaggle kernel with 16 GB of memory will serve you better in this case.
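
A quick way to find out whether a dataset fits is to check its in-memory footprint, or to process it in chunks. A sketch with pandas; the file path is a placeholder:

```python
import pandas as pd

# Report how much RAM a loaded DataFrame actually occupies.
df = pd.read_csv("/kaggle/input/my-dataset/data.csv")  # placeholder path
print(f"{df.memory_usage(deep=True).sum() / 1e9:.2f} GB in memory")

# Alternatively, stream a file that is too large for RAM in chunks.
for chunk in pd.read_csv("/kaggle/input/my-dataset/data.csv",
                         chunksize=1_000_000):
    ...  # aggregate or filter each chunk here
```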

When you want to try several large models

As mentioned before, you can have 4 Kaggle kernels running in parallel without compromising on performance. If you train a different model on each of them, you effectively get the speed of using 4 different laptops at the same time.

5. How to use

You can use a Kaggle kernel in much the same way as you would use a Jupyter Notebook on your local device. In the top left under “File” you can choose “Upload notebook” to easily upload a notebook from your local machine to Kaggle. When you’re done for the day, you can use the “Download notebook” option to get your notebook back onto your own laptop.
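
If you prefer to script this, the official Kaggle API can pull and push kernels from the command line. A sketch, assuming the CLI is installed, an API token is configured in ~/.kaggle/kaggle.json, and "username/my-kernel" is a placeholder for your own kernel slug:

```python
# Run these in a notebook cell (drop the "!" in a plain terminal).
!pip install kaggle

# Pull the notebook plus its kernel-metadata.json (-m), edit locally,
# then push the folder back up as a new version.
!kaggle kernels pull username/my-kernel -p ./local_copy -m
!kaggle kernels push -p ./local_copy
```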

Committing 

The “Commit” option in the top right is similar to committing in Git. It stores your work and allows for versioning, with one slight difference: once you commit, Kaggle runs your full notebook from start to finish, and you can view the result when you access this version later. Normally a kernel stops when you close the browser, but a commit run keeps going! So if you want to train several large models, you can just commit them and call it a day. Note that these committed notebooks will run for at most 6 hours before they time out.
