Retrieving Data with APIs: An Intro
This tutorial teaches you how to retrieve data from APIs with Python. We will cover retrieving real-time railway data with the NS API, weather data from Weerlive.nl, and Twitter data.
In this tutorial we give an overview of how to use APIs (Application Programming Interfaces) to retrieve data. An API is a set of protocols and routines for building and interacting with software applications. APIs are hosted on web servers and provide an effective and quick way to retrieve data that changes frequently. Through an API one can retrieve real-time data as well as historical data. An interesting business application is combining real-time data with historical data to predict demand for products.
Imagine you have a bakery close to a train station in the Netherlands. Historical data from the Dutch Railways, NS (Nederlandse Spoorwegen), could be combined with historical weather data obtained from the KNMI weather API to build a forecasting model. Current data would then allow you to forecast customer demand on a particular day.
We start this tutorial with a simple API: the Open Movie Database (OMDb) API. Then we take a look at how to get information from both the NS and KNMI Weer APIs. To close, we check out how to pull data from Twitter.
Some background on JSON files
The image below is the output of an OMDb API request.
Notice that JSON consists of key-value pairs, like a Python dictionary. That is why a Python dictionary is a natural choice when loading JSON into Python. The keys in JSON are always strings enclosed in double quotation marks. The values can be strings, numbers, booleans, null, arrays, or objects. A value that is itself an object gives you nested JSON. We can see in the JSON above that all keys are strings between quotation marks. Most of the values are strings, but notice that `Ratings` is a list of dictionaries.
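To make the mapping concrete, here is a small sketch that parses a hand-written JSON string shaped like an OMDb response and inspects the resulting Python types. The source and value inside `Ratings` are placeholders I made up for illustration, not real API output.

```python
import json

# Hand-written sample shaped like an OMDb response; the rating is a placeholder.
raw = """
{
  "Title": "The Queen's Gambit",
  "Type": "series",
  "Ratings": [
    {"Source": "Example source", "Value": "placeholder"}
  ]
}
"""

record = json.loads(raw)  # JSON object -> Python dict

print(type(record).__name__)             # the top-level object becomes a dict
print(type(record["Ratings"]).__name__)  # a JSON array becomes a list
print(record["Ratings"][0]["Source"])    # a nested object becomes a nested dict
```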
Python's json module has two main functions:
- dumps: Takes a Python object and converts it to a JSON string
- loads: Takes a JSON string and converts it to a Python object
We use the first to save an object and the second to load it. As an example, I’ll use some information about a series this time: “The Queen’s Gambit”.
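A minimal sketch of the saving step, assuming a hand-written record with only a few fields (the rating is a placeholder, and the file name is my own choice). It converts the dictionary to a string with dumps and writes that string to disk.

```python
import json
import tempfile
from pathlib import Path

# Hand-written record; the rating entry is a placeholder, not real data.
series = {
    "Title": "The Queen's Gambit",
    "Type": "series",
    "Ratings": [{"Source": "Example source", "Value": "placeholder"}],
}

text = json.dumps(series, indent=2)  # Python dict -> JSON string

# Write the string to disk; a temporary directory keeps the sketch tidy.
path = Path(tempfile.mkdtemp()) / "queens_gambit.json"
path.write_text(text, encoding="utf-8")

print(text)
```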
Let’s recover the content of the JSON file we just saved.
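Loading could look like the sketch below. So that the snippet runs on its own, it first recreates a small file under a hypothetical name and then reads it back with loads.

```python
import json
import tempfile
from pathlib import Path


def load_json(path):
    """Read a JSON file and return the corresponding Python object."""
    with open(path, encoding="utf-8") as fh:
        return json.loads(fh.read())


# Recreate a small file so this snippet is self-contained (hypothetical name).
path = Path(tempfile.mkdtemp()) / "queens_gambit.json"
sample = {"Title": "The Queen's Gambit", "Type": "series"}
path.write_text(json.dumps(sample), encoding="utf-8")

series = load_json(path)
print(series["Title"])  # The Queen's Gambit
```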
Using APIs to Retrieve Information from the Web
As commented previously, an API is a set of protocols and routines for building and interacting with software applications which allows two software programs to communicate with each other.
For instance, if you want to stream current weather information by writing some Python code, you would use a weather API such as the KNMI streaming Weer API. On the other hand, if you want to automate pulling and processing data from the Dutch Railways (NS), you could use the NS API.
Using APIs has become normal practice. Marketing companies and social scientists use APIs from Twitter, Facebook, and Instagram, for example. Many other companies and organizations have APIs. Rapid API is a good place to find out which APIs are available.
Now that we know a bit about JSON, including how to save and load JSON files, it is time to use APIs and Python to automate data retrieval.
Let’s start with the OMDb API.
To get information about The Queen’s Gambit series, which we saw partially in the previous section, I used the following URL to make a request:
http://www.omdbapi.com/?apikey="+keys.omdb_key+"&t=The Queen's Gambit
First, notice that in place of my key I used keys.omdb_key. To protect my API keys, I listed them in a Python script, keys.py, and added it to my .gitignore file. It is good practice to do this in order to protect your keys, especially when sharing your work on GitHub.
For the OMDb API the request URL is http://www.omdbapi.com/?apikey=[yourkey]&. The ? marks the start of the query string, i.e., the part where we specify parameters as key=value pairs separated by &. For OMDb the first parameter must be your apikey; the search parameters, such as t for the title, follow after the &.
Notice that the URL can be constructed using simple string manipulation. However, for more complex requests, using the structure presented above can make the task easier and less error-prone.
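As a hedged sketch of the structured approach, the parameters can be kept in a dictionary and encoded for us. This version uses only the standard library (urllib) so it runs without extra packages; the requests package accepts the same dictionary through its params argument. The function names and the "YOUR_KEY" placeholder are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "http://www.omdbapi.com/"


def build_omdb_url(api_key, title):
    """Return the request URL with key and title as encoded query parameters."""
    params = {"apikey": api_key, "t": title}
    return OMDB_URL + "?" + urlencode(params)


def fetch_omdb(api_key, title, timeout=10):
    """Send the GET request; return (HTTP status code, parsed JSON as a dict)."""
    with urlopen(build_omdb_url(api_key, title), timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


# "YOUR_KEY" is a placeholder; building the URL does not send any request.
print(build_omdb_url("YOUR_KEY", "The Queen's Gambit"))
```

With a valid key, calling fetch_omdb(keys.omdb_key, "The Queen's Gambit") should return the status code together with the response as a dictionary.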
As a result of the code above we can see that we had a successful response (code 200) which is delivered in the form of a dictionary.
Now it is time to explore the NS API, the API of the Nederlandse Spoorwegen, i.e., the Dutch Railways. With this API we can extract current information, while historical data can be retrieved from sites like rijdendetreinen.nl and NDOV Loket.
In addition, GoTrain is a server application for receiving, processing and distributing real-time data about train services in the Netherlands. It is designed to continuously receive data streams offered as open data by the Dutch Railways (NS).
A list of stations was supposed to be available on the NS website at https://www.ns.nl/en/travel-information/ns-api/documentation-station-list.html. However, the link was not active when this tutorial was written.
Wikipedia provides a list of station codes, and a list containing the UIC codes is available for download at NDOV Loket. A UIC code is an identifier for a railway station in Europe, CIS countries, China, Mongolia, North Africa and the Middle East. You can find a list in the repository associated with this article here.
In order to use the NS API, you need to register so you can get an API key. Go to the starter guide to register and access some other information about the NS API.
There are different APIs available, and before using one you need to subscribe to it. The image below shows the APIs I’ve subscribed to.
In this example we use the NS-App subscription to get information about stations. On the Reisinformatie API page you can see all available operations and example code in different languages.
Notice that for the NS API we have:
- Request URL: Depends on the operation, for Get Stations: https://gateway.apiportal.ns.nl/reisinformatie-api/api/v2/stations
- Request headers: The NS API key, i.e., Ocp-Apim-Subscription-Key
- Request parameters: These are our query strings, and they vary according to the API you are using.
Then, notice that in the code used we add headers to our request.
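A minimal sketch of such a request, using the URL and header name listed above. It relies on the standard library only, and "YOUR_NS_KEY" and the function names are placeholders of mine. The key travels in a request header instead of in the query string.

```python
import json
from urllib.request import Request, urlopen

NS_STATIONS_URL = "https://gateway.apiportal.ns.nl/reisinformatie-api/api/v2/stations"


def build_stations_request(ns_key):
    """Create a GET request that carries the subscription key as a header."""
    return Request(NS_STATIONS_URL, headers={"Ocp-Apim-Subscription-Key": ns_key})


def fetch_stations(ns_key, timeout=10):
    """Send the request and return the parsed JSON response."""
    with urlopen(build_stations_request(ns_key), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder key; nothing is sent until fetch_stations is called.
request = build_stations_request("YOUR_NS_KEY")
print(request.get_method(), request.full_url)
```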
For Get Stations we don’t have any specific parameters. However, we can filter the result afterwards to keep only Dutch train stations, for example.
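The filtering step could look like the sketch below. The response layout used here (a "payload" list whose items have "code", "land" and "namen" fields) is an assumption on my part, and the sample data is hand-written, so check the field names against the actual response before relying on them.

```python
# Hand-written sample; the layout is an assumption, not verified API output.
sample_response = {
    "payload": [
        {"code": "ASD", "land": "NL", "namen": {"lang": "Amsterdam Centraal"}},
        {"code": "XX1", "land": "D", "namen": {"lang": "Example station abroad"}},
        {"code": "UT", "land": "NL", "namen": {"lang": "Utrecht Centraal"}},
    ]
}


def dutch_stations(response, country="NL"):
    """Keep only stations in the given country and flatten the long name."""
    return [
        {"code": s["code"], "name": s["namen"]["lang"], "country": s["land"]}
        for s in response.get("payload", [])
        if s.get("land") == country
    ]


rows = dutch_stations(sample_response)
for row in rows:
    print(row["code"], row["name"])
```

Because rows is a list of flat dictionaries, it can be passed directly to pandas.DataFrame(rows) to get a table.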
Below is a dataframe with partial information about Dutch train stations, after filtering the information obtained from the API.
There are many weather APIs available. In the Netherlands we can use, for instance:
- For private and study use: https://weerlive.nl/delen.php
- For commercial use: https://meteoserver.nl/
Using the weerlive.nl API, one can get current weather data from the KNMI (Koninklijk Nederlands Meteorologisch Instituut), i.e., the Royal Netherlands Meteorological Institute, which is the Dutch national weather forecasting service. The API is free, with a maximum of 300 data requests per day.
Meteoserver has different APIs that can be used for free up to a limit of 500 requests per month. Historical data can be obtained for 60 euros per month.
As with the NS API, for both weather APIs you need to register in order to obtain an API key.
Current Weather using Weerlive.nl
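A hedged sketch of a current-weather request. The endpoint path and the parameter names key and locatie are assumptions based on my recollection of the Weerlive documentation, so verify them on weerlive.nl before use. "YOUR_KEY" and the function names are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and parameter names; check weerlive.nl for the current ones.
WEERLIVE_URL = "https://weerlive.nl/api/json-data-10min.php"


def build_weerlive_url(api_key, location):
    """Return the request URL for the current weather at a location."""
    return WEERLIVE_URL + "?" + urlencode({"key": api_key, "locatie": location})


def fetch_current_weather(api_key, location, timeout=10):
    """Send the GET request and return the parsed JSON response."""
    with urlopen(build_weerlive_url(api_key, location), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder key; building the URL does not send any request.
print(build_weerlive_url("YOUR_KEY", "Amsterdam"))
```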
In response to your request, you get 48 variables related to the weather in the chosen location. The image below shows the first variables, which include temperature, some information about the wind, visibility, and the times of sunrise and sunset.
Now it is time to explore the Twitter API. You will notice that it differs somewhat from the ones we have explored so far. For example, the previous APIs use only API keys, while the Twitter API uses keys and access tokens.
We will also need a package to help with the authentication process, i.e., an authentication handler such as Tweepy or python-twitter. The JSON obtained is also more nested and complex. There are many different fields, including the tweet text, user, language, time of the tweet, location, etc.
To gain access to the Twitter API, you need a Twitter account, so create one if you don’t already have one. Then log into Twitter Apps and apply. After that you just need to agree to some terms and conditions to obtain your keys and access tokens. These are the authentication credentials that will allow you to access the Twitter API.
There are different Twitter APIs, for instance the REST (Representational State Transfer) API, which reads and writes Twitter data, and the Streaming API. The public streaming API streams the public data flowing through Twitter. In this tutorial, we use the REST API to collect tweets from users as well as tweets obtained by using queries.
Summarizing, in this section we:
- Collect tweets from the user timeline (GetUserTimeline)
- Collect tweets using queries (GetSearch)
- Select which information from the data retrieved will be kept
- Save collected data in .csv file
Of the Python packages mentioned previously, I’m using python-twitter, a Python wrapper around the Twitter API.
The Twitter API and authentication
Again, it is good practice to keep your credentials in a script listed in your .gitignore file, as mentioned before for the other API keys. Mine are kept in private_twitter_credentials.py, which is why I import it to access my credentials in the code below.
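A minimal sketch of the authentication step with python-twitter. Passing the four credentials in a dictionary, the helper names, and the placeholder values are my own assumptions; twitter.Api and its four keyword arguments come from the python-twitter package.

```python
REQUIRED = ("consumer_key", "consumer_secret", "access_token_key", "access_token_secret")


def missing_credentials(credentials):
    """Return the names of required credentials that are absent or empty."""
    return [name for name in REQUIRED if not credentials.get(name)]


def build_api(credentials):
    """Create an authenticated python-twitter client from a credentials dict."""
    missing = missing_credentials(credentials)
    if missing:
        raise ValueError("Missing Twitter credentials: " + ", ".join(missing))
    import twitter  # python-twitter; imported here so the check above runs without it
    return twitter.Api(**{name: credentials[name] for name in REQUIRED})


# Placeholder values; in practice they would come from private_twitter_credentials.py.
placeholder = {"consumer_key": "xxx", "consumer_secret": "xxx"}
print(missing_credentials(placeholder))  # ['access_token_key', 'access_token_secret']
```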
Access user timeline Tweets
First, we apply the GetUserTimeline method to the api object we’ve just created to access a user’s timeline Tweets. For example, let’s access MKB datalab’s timeline Tweets.
To access a user’s timeline Tweets we need the account’s Twitter handle, in this case MKB datalab’s: @jadsmkbdatalab. We pass the handle without the @ as the screen_name argument.
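As a sketch, the call could be wrapped like this. The helper name fetch_timeline and the choice of returned fields are mine; api is assumed to be an authenticated python-twitter client.

```python
def fetch_timeline(api, handle, count=200):
    """Return (id, created_at, text) tuples for a user's most recent Tweets."""
    screen_name = handle.lstrip("@")  # screen_name must not include the "@"
    statuses = api.GetUserTimeline(screen_name=screen_name, count=count)
    return [(s.id, s.created_at, s.text) for s in statuses]


# Usage, assuming `api` is an authenticated twitter.Api instance:
# tweets = fetch_timeline(api, "@jadsmkbdatalab")
```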
Access Tweets using queries
Second, we obtain Tweets based on a query, i.e., we apply the GetSearch method to api.
Because of the lockdown, many sectors of the economy are suffering. During the last weeks there were many reactions connected, for example, with the catering sector, since restaurants, bars and similar businesses were closed again. So, let’s say we want to search for Tweets that mention horeca and COVID-19 in order to assess the impact of the latest news on this topic. In this case you use the GetSearch method with your query as the argument.
The easiest way to get the query right is to go to Twitter’s Advanced Search and type in what you want to know. Then use as your raw_query the part of the search URL after the “?”, removing the `&src=type` portion.
Let’s try it out.
The URL I get is:
Therefore, I use raw_query = "q=covid-19%2C%20horeca".
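The trimming of the search URL can also be done in code. In the sketch below the example URL is illustrative only: the src value is my guess at what the search page appends, and the helper names are hypothetical. GetSearch and its raw_query argument come from python-twitter.

```python
from urllib.parse import urlsplit


def to_raw_query(search_url):
    """Keep the query string of a search URL, dropping any src=... parameter."""
    query = urlsplit(search_url).query
    kept = [part for part in query.split("&") if part and not part.startswith("src=")]
    return "&".join(kept)


def search_tweets(api, search_url):
    """Run the search through python-twitter's GetSearch."""
    return api.GetSearch(raw_query=to_raw_query(search_url))


# Illustrative URL; the src value is an assumption.
example_url = "https://twitter.com/search?q=covid-19%2C%20horeca&src=typed_query"
print(to_raw_query(example_url))  # q=covid-19%2C%20horeca
```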
The results from GetSearch are limited to 7 days. In the last 7 days we got 15 Tweets that mentioned horeca and covid-19.
In addition, the Twitter API limits us to 200 Tweets at a time when retrieving from a user timeline, and to 100 Tweets from search. This limit is set through the count parameter of both methods, GetUserTimeline and GetSearch.
Another constraint that we need to deal with is that the Twitter API is rate limited, meaning Twitter puts restrictions on how much data you can take at a time. More details are available in Twitter’s documentation on rate limits.
Because I want to retrieve many more than 200 Tweets (if possible), I’ll write a class TweetMiner which contains two methods:
- mine_user_tweets, which mines a user’s Tweets using GetUserTimeline.
- search_tweets, which mines Tweets using GetSearch.
Notice that in this class I’m also selecting which information from the Tweets I want to keep, in the form of a dictionary. I’m collecting almost everything. For the purpose I have in mind now I probably won’t use all of it, but this is what I chose to keep. Feel free to adapt it.
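The sketch below shows one way such a class could be structured. It is not the original implementation: the paging strategy (walking back with max_id), the selected fields, and the helper names are my assumptions. It relies on python-twitter's documented behaviour that raw_query overrides other search parameters, which is why count and max_id are appended to the query string itself.

```python
class TweetMiner:
    """Page through Twitter results with a python-twitter client (sketch)."""

    def __init__(self, api, result_limit=200):
        self.api = api                    # authenticated twitter.Api instance
        self.result_limit = result_limit  # passed as `count` to each request

    @staticmethod
    def _to_dict(status):
        """Keep a subset of Tweet fields; adapt the selection as needed."""
        return {
            "tweet_id": status.id,
            "created_at": status.created_at,
            "text": status.text,
            "lang": status.lang,
            "retweet_count": status.retweet_count,
            "favorite_count": status.favorite_count,
            "handle": status.user.screen_name,
        }

    def _collect(self, fetch_page, max_pages):
        """Call fetch_page(max_id) repeatedly, walking back to older Tweets."""
        tweets, max_id = [], None
        for _ in range(max_pages):
            statuses = fetch_page(max_id)
            if not statuses:
                break
            tweets.extend(self._to_dict(s) for s in statuses)
            max_id = min(s.id for s in statuses) - 1  # next page: strictly older
        return tweets

    def mine_user_tweets(self, user, max_pages=5):
        """Collect up to max_pages pages of a user's timeline."""
        def fetch_page(max_id):
            return self.api.GetUserTimeline(
                screen_name=user, count=self.result_limit, max_id=max_id
            )
        return self._collect(fetch_page, max_pages)

    def search_tweets(self, raw_query, max_pages=5):
        """Collect up to max_pages pages of search results for raw_query."""
        count = min(self.result_limit, 100)  # search returns at most 100 per call

        def fetch_page(max_id):
            query = f"{raw_query}&count={count}"
            if max_id is not None:
                query += f"&max_id={max_id}"
            return self.api.GetSearch(raw_query=query)
        return self._collect(fetch_page, max_pages)
```

Each page costs one request, so max_pages also bounds how much of the rate limit a single call can consume.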
Next is a function that takes the list of dictionaries obtained by applying my TweetMiner, organizes it a bit, and saves the result in a .csv file.
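A minimal sketch of such a function, using the standard csv module. The "tweet_id" key, the de-duplication step, and the function name are assumptions of mine about how the dictionaries are shaped and tidied.

```python
import csv


def save_tweets_csv(tweets, path):
    """Write a list of Tweet dictionaries to a CSV file; return the row count."""
    if not tweets:
        return 0
    # Drop duplicates that paging may have produced, then sort by Tweet id.
    unique = {t["tweet_id"]: t for t in tweets}
    rows = [unique[key] for key in sorted(unique)]
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```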
Now that we have a class to retrieve Tweets and a function to save the result in .csv let’s collect data.
Our goal is to collect as much data as possible so we can have more confidence in our analysis. However, when using Twitter API, we need to consider some limitations.
First, it is difficult to control how far into the past we can go when retrieving user timeline data (using TweetMiner.mine_user_tweets()). So, we will mainly adjust the parameters result_limit (i.e., count) and max_pages.
Second, when searching with the API it is only possible to access Tweets from the last 7 days, as mentioned previously. So unfortunately, when performing queries, we will not be able to go back months.
In the GitHub repository there are many examples using the class and function above. For example, in one of the examples we retrieved Tweets about vaccinatiepaspoort (the EU vaccine passport) and COVID-19, which are hot topics.
APIs are important tools for building apps, gathering information for forecasting models, and getting to know your customers better, to name just a few of the interesting ways of using them.
Using APIs has become so common that more and more organizations offer them.
In this tutorial we briefly explored some APIs so you can get a feeling for how to use them. Go on and explore these APIs further or try some others. Check out Rapid API for more APIs.
To go deeper into the Twitter APIs you can try, for example, the nice tutorials provided by Twitter here. This article shows some ways data scientists use Twitter.
Thank you for reading!
Comments and questions are always welcome, feel free to get in touch. All code presented here can be found at: