Basics on Web Scraping

As William Edwards Deming said, “In God we trust; all others bring data.” So, bring us data.

When data is not available through datasets or APIs, web scraping may be our last resort. It allows us to retrieve and parse data stored on web pages across the Internet. It not only lets us retrieve data we don’t have, but also gives us the opportunity to acquire additional data that might give our model that extra boost. Therefore, obtaining data through web scraping is a valuable skill for any data scientist.


From a business point of view, web scraping helps us make informed business decisions. It provides an opportunity to:

  • Know our competitors: their prices and services;
  • Know our customers better: their behavior, their needs, and what they think of our product(s)/service(s);
  • Stay well informed about partners;
  • Gather public opinion about a company in general, as well as about its products/services and similar ones;
  • Obtain contact and other information about potential clients via social media and forums, so meaningful resources can be directed towards this group of possible customers.

and the list goes on…

Web scraping can also be very helpful for public/governmental organizations. It can help gather information from the websites of different cities within a region about an important subject such as health, security, or the environment. This data, which is sometimes not easily collected across city agencies, might be published by them online. This gives an opportunity to collect and analyze the data in order to extract insights beneficial to society.

In addition, data obtained via web scraping can be used for personal purposes and for fun! For instance, it can help you find a new home, a new recipe, material for your hobby, or information about your favorite subject, artist, movie, or music… again, imagination is the limit.

Then after scraping your data, it is time to analyze and manipulate it using tools such as pandas and NumPy.

Here, to illustrate the use of web scraping, we’ve chosen a subject that will probably please everybody (or most of you): movies and music! On top of that, we have an opportunity to pay our respects to the first James Bond, Sir Thomas Sean Connery, who left us on October 31, 2020.

Our goal is to collect information about movies of the James Bond franchise and their theme songs. For this, the following steps are taken:

📽️ Extract information about all the James Bond movies from a table at List_of_James_Bond_films.

🎶 Extract information about all the James Bond theme songs from a table at Lijst_van_titelsongs_uit_de_James_Bondfilms (“Yes, a Dutch site, because the structure of the table was much easier. It does not need to be difficult to be good, right? 😉”).

🎶 Scrape lyrics of the theme songs.

To accomplish this we need a basic knowledge of HTML. This means its tree structure, and that tags define the branches where the information we are looking for lives. Furthermore, we make use of two Python libraries:

  • requests which allows us to get the webpage we want; and
  • Beautiful Soup that parses the content of the webpage and allows us to extract tags from an HTML document.

So, let’s start!

📽️Web Scraping Information about James Bond’s Movies

Step 1: Inspecting website

An important first step when web scraping is inspection. Every time we scrape a website, we need to have an idea of its structure and where to find what we need.

For this, no matter which browser we use, we can access its code by right-clicking and choosing View Page Source (Firefox, Chrome, and Microsoft Edge). If you need details of a specific element, right-click on it and choose Inspect Element (Firefox) or Inspect (Chrome and Microsoft Edge) instead.

Web pages use HyperText Markup Language (HTML) which is a markup language with its own syntax and rules. When a web browser like Chrome or Firefox downloads a web page, it reads the HTML to determine how to render and display it to you.

HTML consists of tags. Anything in between the opening and closing of a tag is the content of that tag.

Some of the elements often encountered in a web page are:

<head> : Contains metadata useful to the web browser which is rendering the page but which is invisible to the user.

<body> : Contains the content of an HTML document with which the user interacts. Every page has only one body.

<div>: Section of the body.

<p>: Delimits paragraphs.

<a> : Creates a hyperlink to web pages, files, email addresses, locations in the same page, or anything else a URL can address.

For more definitions of elements check any of these links: dev_mozilla or w3s.

While inspecting the website’s source code you will notice that some tags contain attributes, which provide special instructions for the contents of that tag. A specific HTML attribute name is followed by an equal sign and the value passed to that attribute within that tag.

For example:

<div id="contentSub"></div>

You can see below part of the source code of the web page we will explore first.

sample of source code of https://en.wikipedia.org/wiki/List_of_James_Bond_films

Step 2: Access Content of Website

After inspecting the web page we need to:

1. Access website using requests.

2. Parse content with Beautiful Soup so we can extract what we need within tags.

In other words, we apply the steps shown in the following function:
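A minimal version of this function, here called parse_website to match the calls below, could look like this (a sketch; error handling is kept to the essentials):

```python
import requests
from bs4 import BeautifulSoup

def parse_website(url):
    """Request a web page and return its content parsed by Beautiful Soup."""
    # 1. Access the website using requests
    response = requests.get(url)
    response.raise_for_status()
    # 2. Parse the content with Beautiful Soup so we can extract tags later
    return BeautifulSoup(response.text, 'html.parser')
```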

The first information we extract is about James Bond’s movies. To parse the web page containing this info we use the above function:

main_url = "https://en.wikipedia.org/wiki/List_of_James_Bond_films"
parser = parse_website(main_url)

The parser is a Beautiful Soup object, which represents the document as a nested data structure; for this page it looks like this:

partial image of the parsed web page.

We know the structure of the web page source code and we have it parsed. Hence, we are ready to extract the information we need.

Extracting Information from Website

This part will depend on the structure of the website source code and on what information you need from it.

Before going to our target (table with information about James Bond’s films) let’s see how we can access some text of the website.

As we saw, the function parse_website returns a parser that will allow us to access the content. For example, we can get the title of the website using .title and then .text to get the string of it, i.e.,

# access title of the web page
title = parser.title
# obtain text between tags
title = title.text
title

Output: 'List of James Bond films - Wikipedia'

In the body of the HTML document you can find paragraphs, identified by the tag p. To find only the first paragraph use find; to find all paragraphs use find_all. The latter returns a list, so we can access individual items using an index.

So, if we want to extract the text of all paragraphs in this webpage, we use the following code:

# find all paragraphs within the body of html
list_paragraphs = parser.body.find_all('p')
# extract the string within it
list_paragraphs = [p.text for p in list_paragraphs]
text_films = ' '.join(list_paragraphs).strip()
# First 2000 characters
print(text_films[:2000])

Output:
James Bond is a fictional character created by the novelist Ian Fleming in 1953. Bond is a British secret agent working for MI6 who also answers to his codename, ”007“.  He has been portrayed on film by the actors Sean Connery, David Niven, George Lazenby, Roger Moore, Timothy Dalton, Pierce Brosnan and Daniel Craig, in twenty-seven productions. All the films but two were made by Eon Productions. Eon now holds the full adaptation rights to all of Fleming's Bond novels.[1][2]
 In 1961 the producers Albert R. Broccoli and Harry Saltzman joined forces to purchase the filming rights to Fleming's novels.[3] They founded the production company Eon Productions and, with financial backing by United Artists, began working on Dr. No, which was directed by Terence Young and featured Connery as Bond.[4] Following Dr. No's release in 1962, Broccoli and Saltzman created the holding company Danjaq to ensure future productions in the James Bond film series.[5] The series currently encompasses twenty-four films, with the most recent, Spectre, released in October 2015. With a combined gross of nearly $7 billion to date, the films produced by Eon constitute the fourth-highest-grossing film series, behind the Marvel Cinematic Universe, Star Wars and Wizarding World films.[6] Accounting for the effects of inflation, the Bond films have amassed over $14 billion at current prices.[a] The films have won five Academy Awards: for Sound Effects (now Sound Editing) in Goldfinger (at the 37th Awards), to John Stears for Special Visual Effects in Thunderball (at the 38th Awards), to Per Hallberg and Karen Baker Landers for Sound Editing, to Adele and Paul Epworth for Original Song in Skyfall (at the 85th Awards) and to Sam Smith and Jimmy Napes for Original Song in Spectre (at the 88th Awards). Additionally, several of the songs produced for the films have been nominated for Academy Awards for Original Song, including Paul McCartney's "Live and Let Die", Carly Simon's "Nobody Does It Better" and

Extracting info from Table

We just got the title of the webpage and some text, but what we really want is the information about all movies of the James Bond franchise. That is in the first table of the website.

The table information can be found under the tag tbody. If you use find_all you will notice that there are 6 tables on this website. However, as said, what we need is in the first one.

The goal is to build a data frame, so from the table I’ll extract information for the header (names of the columns/features) and the data (values for each feature).

When accessing tag tbody we see the following:

Detail of the source code containing information to be used for the data frame header.

Notice that the information we need for the header is under the tag th with attribute scope='col'. However, we still need to do a bit more to obtain the complete names for Box Office and Budget because of the structure of the table (see code below).
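A sketch of this step, demonstrated on a small inline stand-in for the Wikipedia header rows (the group names and sub-headers here are simplified; the real table needs the same logic applied to its first tbody):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the two header rows of the Wikipedia table:
# the grouped columns (Box office, Budget) span two sub-columns each
sample = """
<table><tbody>
<tr><th scope="col">Title</th><th scope="col">Year</th>
    <th scope="col">Box office</th><th scope="col">Budget</th></tr>
<tr><th scope="col">Actual $</th><th scope="col">Adjusted $</th>
    <th scope="col">Actual $</th><th scope="col">Adjusted $</th></tr>
</tbody></table>
"""
demo_parser = BeautifulSoup(sample, 'html.parser')
headers = [th.text.strip() for th in demo_parser.tbody.find_all('th', scope='col')]
# Combine each grouped column name with its sub-header to get complete names
groups = ['Box office', 'Box office', 'Budget', 'Budget']
list_columns = headers[:2] + [f'{g} ({s})' for g, s in zip(groups, headers[4:])]
```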

Now that we have the name of features to be used to build our data frame, let’s find the values for each feature.

Detail of the source code showing information that is used as data for our data frame.

If we continue checking the content within tbody we will notice that the titles of the films are found under the tag th with attribute scope='row', while the rest of the information is found under td within the same rows. Thus, the code used to obtain data about the movies is:
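The following sketch demonstrates the idea on a small inline stand-in for two data rows (on the real page the same calls are made on the first tbody of the parsed Wikipedia page):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for two data rows of the films table
sample = """
<table><tbody>
<tr><th scope="row">Dr. No</th><td>1962</td><td>Connery</td></tr>
<tr><th scope="row">Goldfinger</th><td>1964</td><td>Connery</td></tr>
</tbody></table>
"""
demo_parser = BeautifulSoup(sample, 'html.parser')
# Film titles live in <th scope="row">, the remaining values in <td>
titles = [th.text.strip() for th in demo_parser.tbody.find_all('th', scope='row')]
data = [[td.text.strip() for td in tr.find_all('td')]
        for tr in demo_parser.tbody.find_all('tr')]
```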

We now know how to get the header and the data of our movies’ data frame. The code below puts it all together, creates a dictionary, and uses it to generate df_films.
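A sketch of what such a function could look like (the real Wikipedia table also needs the Box office / Budget sub-header handling shown earlier, which is omitted here for brevity):

```python
import pandas as pd
from bs4 import BeautifulSoup

def create_films_dataframe(parser):
    """Sketch: build a films data frame from the first table of the parsed page."""
    table = parser.find_all('tbody')[0]
    # Column names come from <th scope="col">
    columns = [th.text.strip() for th in table.find_all('th', scope='col')]
    rows = []
    for tr in table.find_all('tr'):
        title_cell = tr.find('th', scope='row')
        if title_cell is None:
            continue  # header row, no film title
        rows.append([title_cell.text.strip()]
                    + [td.text.strip() for td in tr.find_all('td')])
    return pd.DataFrame(rows, columns=columns)
```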

df_films = create_films_dataframe(parser)
df_films.head()
First five rows of df_films.

🎶 Web Scraping Information about James Bond’s Theme Songs

For this task I’ve chosen the Dutch Wikipedia website because the structure of the table is simpler than the English Wikipedia website. This will make it a bit easier to extract the information we want. In addition, the information there is mostly in English.

Let’s start by using again our function parse_website and inspect the result so we know how to get what we are looking for.

Again, the information we are looking for is in the first table.

Obtaining the header for our new data frame (df_songs) is much easier this time, only 2 lines of code:

# Name of columns 
list_columns = parser.tbody.find_all('th')
list_columns = [item.text.strip() for item in list_columns]
list_columns

output:
['Titelsong', 'Artiest', 'Film', 'Jaar', 'Componist']

However, this information is in Dutch and needs to be translated. Of course, we could simply type the translated list by hand, but it is a good example of how retrieving information depends on the structure of the website, and of how important the inspection step is.

# list translated
list_columns = ['Theme Song', 'Performer', 'Film Title', 'Year', 'Composer']

We now have the names of our 5 columns. Next, we will build the content of our table. For this table we observe that all information is under the tag td (see the following image).

Partial body of the source code of the table with James Bond’s theme songs information.

<td> is an HTML element that defines a cell of a table that contains data. As we can see above, every group of 5 cells contains, respectively, Theme Song, Performer, Film Title, Year, and Composer. Let’s use this to build our data frame with all theme songs of the James Bond film series.
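A sketch of such a function, grouping the flat list of cells five at a time (the column list is passed in, as in the call below):

```python
import pandas as pd
from bs4 import BeautifulSoup

def create_songs_dataframe(list_columns, parser):
    """Sketch: every group of len(list_columns) <td> cells is one song."""
    cells = [td.text.strip() for td in parser.tbody.find_all('td')]
    n = len(list_columns)
    # Split the flat list of cells into rows of n values each
    rows = [cells[i:i + n] for i in range(0, len(cells), n)]
    return pd.DataFrame(rows, columns=list_columns)
```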

main_url = "https://nl.wikipedia.org/wiki/Lijst_van_titelsongs_uit_de_James_Bondfilms"
parser = parse_website(main_url)
list_columns = ['Theme Song', 'Performer', 'Film Title', 'Year', 'Composer']
df_songs = create_songs_dataframe(list_columns, parser)
df_songs
First version of df_songs.

Pretty good, right? As far as scraping is concerned our job is done, but as data scientists we need to do our best to have clean data and the most complete and correct information. No garbage in, garbage out for this tutorial! So, there are just a few little things we need to fix.

To start with, the first movie of the James Bond franchise, Dr. No, has two themes, but we only have information about the performer of the first one. In addition, formally, Monty Norman is the composer of both the James Bond Theme and Kingston Calypso.

Next, in some items we find o.l.v., which in Dutch means onder leiding van and can be translated as led by.

Finally, the Year of the most recent film should be 2021, as in the films table. The film was supposed to be released in 2020 but, due to COVID, it was postponed to 2021. The code below fixes all these issues.
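A sketch of what such a clean-up function could look like (the exact string fixes in the original may differ; column names are the ones defined above):

```python
import pandas as pd

def update_info_songs(df_songs):
    """Sketch of the clean-up step for df_songs."""
    df = df_songs.copy()
    # Translate the Dutch abbreviation o.l.v. ("onder leiding van") to "led by"
    for col in ['Performer', 'Composer']:
        df[col] = df[col].str.replace('o.l.v.', 'led by', regex=False)
    # Monty Norman is formally the composer of both Dr. No themes
    df.loc[df['Film Title'] == 'Dr. No', 'Composer'] = 'Monty Norman'
    # The most recent film was postponed to 2021 due to COVID
    df.loc[df['Film Title'] == 'No Time to Die', 'Year'] = '2021'
    return df
```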

# Fixing all the problems observed 
df_songs = update_info_songs(df_songs)

Now it is time to merge everything. After verifying that the Film Title column is indeed the same for both df_films and df_songs, we combine them.

df_films_songs = df_films.merge(df_songs, on = ['Film Title', 'Year'])
First five rows of df_films_songs

Now that you have all the information combined, you are able to answer some questions. Details about this part can be checked in the notebook on GitHub. Here we will focus on the web scraping part.

Which actor played James Bond the most times?

As we can see, Roger Moore played agent 007 the most times, followed by Sean Connery. If Daniel Craig goes on for 2 more movies he will beat Roger Moore.

What was the Box Office and Budget of the James Bond franchise movies?

Actual Box Office and Budget of Bond’s movies.

It seems pretty profitable, right? However, taking Box Office as a proxy for profit can be misleading; the following video shows how the film industry actually makes money.

Is there any performer that performed songs more than once?

Shirley Bassey was the only one so far, performing 3 theme songs.

Theme songs performed by Shirley Bassey. The only singer that performed more than one James Bond’s theme song.

After retrieving some curiosities about the James Bond franchise, let’s go back to web scraping. Now it is time to retrieve data from hyperlinks within a webpage.

🎶 Web Scraping Lyrics: How to Access Information within Hyperlinks

To show how to scrape web pages within a webpage let’s obtain lyrics of James Bond’s theme songs.

Here we will build a data frame with song titles, performers, and lyrics.

The website where I have found most of the lyrics was https://www.stlyrics.com/b/bestofbondjamesbond.htm.

At the beginning we saw that hyperlinks are associated with tags <a>. Inspecting the web page, you will notice that the address of the hyperlink is pointed by the attribute href. So the code below does the job of retrieving all hyperlinks within main_url by finding all tags a and retrieving the contents of href.
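A sketch of that step, demonstrated here on a tiny inline snippet standing in for the parsed page (on the real site, the parsed page would come from parse_website(main_url)):

```python
from bs4 import BeautifulSoup

# Tiny inline stand-in for the parsed lyrics page
demo_page = BeautifulSoup(
    '<body><a href="/b/bestofbondjamesbond/goldfinger.htm">Goldfinger</a>'
    '<a href="/about.htm">About</a></body>', 'html.parser')
# Find every <a> tag and retrieve the contents of its href attribute
list_links = [a.get('href') for a in demo_page.find_all('a') if a.get('href')]
```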

The code above retrieves all hyperlinks within the webpage (109 when I ran it). A closer look shows how we can filter the result to keep only the links related to the lyrics.

# filtering list and keeping items that has 'bestofbondjamesbond'
list_links = [link for link in list_links if 'bestofbondjamesbond' in link]

This filtering gave us 23 hyperlinks to visit and extract lyrics.

Usually, the structure of each of these links is the same. Again, a good inspection of the source code shows us that the lyrics text is located within the tag <div class="highlight">. The code below can then be used to retrieve all lyrics of James Bond’s theme songs within www.stlyrics.com and build a data frame that we can merge with df_songs.
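A sketch of what such a function could look like. The lyrics location (<div class="highlight">) comes from the inspection above; the way the link key is derived from the file name is an assumption for illustration:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def create_dataframe_links_lyrics(complete_urls):
    """Sketch: visit each lyrics page and collect a link key plus lyrics."""
    records = []
    for url in complete_urls:
        page = BeautifulSoup(requests.get(url).text, 'html.parser')
        div = page.find('div', class_='highlight')
        lyrics = div.text.strip() if div else ''
        # e.g. '.../bestofbondjamesbond/goldfinger.htm' -> 'goldfinger'
        key = url.rsplit('/', 1)[-1].replace('.htm', '')
        records.append({'links': key, 'lyrics': lyrics})
    return pd.DataFrame(records)
```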

To be able to merge this new data frame (df_links) with df_songs I need a common column. This column, which I’ll call links, consists of the titles of the songs in lowercase and without spaces. I’ll create the same column in df_songs so we can merge it with df_links and create df_lyrics, which in the end will contain all information about the theme songs of the James Bond franchise, including lyrics.
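Building that key is a one-liner with pandas string methods, demonstrated here on a miniature stand-in (in the article the same line is applied to the real df_songs):

```python
import pandas as pd

# Miniature stand-in for df_songs, just to show the key construction
demo_songs = pd.DataFrame({'Theme Song': ['Goldfinger', 'Nobody Does It Better']})
# Lowercase the titles and remove spaces to form the common 'links' column
demo_songs['links'] = (demo_songs['Theme Song']
                       .str.lower()
                       .str.replace(' ', '', regex=False))
```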

# create a list with complete address 
complete_urls = ["https://www.stlyrics.com"+link for link in list_links]

# create dataframe df_links
df_links = create_dataframe_links_lyrics(complete_urls)
df_links.head()
First five rows df_links.

To be able to merge df_songs with df_films we kept the two theme songs of Dr. No in the same row. Now we need to split them before completing the songs data with lyrics. The code below copies df_songs into what will be our final data frame (df_lyrics), makes all necessary adjustments, and merges the result with df_links.
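A sketch of what such a function could look like. The ' / ' separator between the two themes in one cell is an assumption about how the data was stored; adapt it to your actual data:

```python
import pandas as pd

def create_df_lyrics(df_songs, df_links):
    """Sketch: split double-theme rows, build the 'links' key, merge lyrics."""
    df = df_songs.copy()
    # Split cells holding two themes (assumed separated by ' / ') into lists,
    # then give each theme its own row
    df['Theme Song'] = df['Theme Song'].str.split(' / ')
    df = df.explode('Theme Song').reset_index(drop=True)
    # Same key construction as in df_links: lowercase, no spaces
    df['links'] = df['Theme Song'].str.lower().str.replace(' ', '', regex=False)
    return df.merge(df_links, on='links', how='left')
```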

df_lyrics = create_df_lyrics(df_songs, df_links)

Unfortunately, the website has most of the lyrics, but not all. The songs there only go up to 2008, and the lyrics of Kingston Calypso (a.k.a. Three Blind Mice) from the first movie are also missing.

So, if you check this first version of df_lyrics you will notice four NaN and one empty cell. The empty cell for the James Bond Theme is expected since this is an instrumental song.

The other four are lyrics of the Kingston Calypso and from the movies after 2008.

We also notice that the last row has NaN for Theme Song, Performer, Film Title, Year, and Composer. But we know the title of the song from the link: We Have All the Time in the World. A search on Google shows that this James Bond theme, performed by Louis Armstrong, was the second theme of On Her Majesty’s Secret Service (1969) and was composed by Hal David and John Barry. In addition, it says that the other theme, On Her Majesty’s Secret Service, is instrumental. Therefore, there is a mistake on the website. In fact, when checking the lyrics, they are from 1985, from a band called Orchestral Manoeuvres in the Dark, and the song is called Secret.

Therefore, to make things right we:

  1. Remove the lyrics from the theme song On Her Majesty’s Secret Service
  2. Add the missing information for the second (non-instrumental) theme of On Her Majesty’s Secret Service, i.e., We Have All the Time in the World
  3. Add lyrics to:
    • Kingston Calypso a.k.a Three Blind Mice (1962)
    • Skyfall (2012)
    • Writing’s On The Wall (2015)
    • No Time to Die (2021)

Taking care of points 1 and 2:

# remove lyrics of "On Her Majesty's Secret Service"
# (using .loc to avoid pandas' chained-assignment pitfall)
df_lyrics.loc[df_lyrics['Theme Song'] == "On Her Majesty's Secret Service", 'lyrics'] = ''
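Point 2 can be handled in a similar way, demonstrated here on a miniature stand-in for df_lyrics (the 'links' key value is an assumption based on how the column was built; the metadata comes from the Google search above):

```python
import pandas as pd
import numpy as np

# Miniature stand-in for df_lyrics: the last row only has the link key
demo_lyrics = pd.DataFrame({
    'Theme Song': [np.nan], 'Performer': [np.nan], 'Film Title': [np.nan],
    'Year': [np.nan], 'Composer': [np.nan],
    'links': ['wehaveallthetimeintheworld'], 'lyrics': ['...']})

# Fill in the metadata found via the Google search (point 2)
mask = demo_lyrics['links'] == 'wehaveallthetimeintheworld'
demo_lyrics.loc[mask, ['Theme Song', 'Performer', 'Film Title', 'Year',
                       'Composer']] = ['We Have All the Time in the World',
                                       'Louis Armstrong',
                                       "On Her Majesty's Secret Service",
                                       '1969', 'Hal David and John Barry']
```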

Now for point 3 we need to do a bit more web scraping. Three out of the four lyrics can be found on the same website (https://www.songteksten.nl/). Let’s start with Kingston Calypso, a.k.a. Three Blind Mice, which is found at https://www.flashlyrics.com/lyrics/monty-norman/kingston-calypso-75.

After inspecting the HTML source code of the Kingston Calypso page we come up with the following code:
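A sketch of that code. The CSS class used below is an assumption about flashlyrics.com’s layout at the time of writing; verify it by inspecting the page yourself:

```python
import requests
from bs4 import BeautifulSoup

def extract_lyrics(html, css_class='main-panel-content'):
    """Pull the text of the first <div> with the given class (hypothetical
    default class name; adjust after inspecting the real page)."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', class_=css_class)
    return div.get_text('\n').strip() if div else ''

# Usage (requires network access):
# url = 'https://www.flashlyrics.com/lyrics/monty-norman/kingston-calypso-75'
# kingston_lyrics = extract_lyrics(requests.get(url).text)
```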

Now it is time for the last 3 missing lyrics, all retrieved from https://songteksten.net/.

When inspecting any of the 3 hyperlinks above you will notice that the text of the lyrics sits between line breaks, i.e., <br> tags. This is something that we didn’t come across in the previous examples. This link points out a nice solution using childGenerator from Beautiful Soup.

We combined this solution with some filtering in a list comprehension and voilà!
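The idea can be sketched as follows, on an illustrative fragment (the div id here is hypothetical, not the real songteksten.net structure):

```python
from bs4 import BeautifulSoup, NavigableString

# Illustrative lyrics fragment where lines are separated by <br> tags
html = '<div id="lyrics">Line one<br/>Line two<br/>Line three</div>'
div = BeautifulSoup(html, 'html.parser').find('div', id='lyrics')
# childGenerator yields the div's direct children; keep only the text
# nodes between the <br> tags and drop empty strings
lines = [child.strip() for child in div.childGenerator()
         if isinstance(child, NavigableString) and child.strip()]
lyrics = '\n'.join(lines)
```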

And that’s how we complete our df_lyrics.

df_lyrics

Beautiful! We have our complete data frame with all information about the theme songs of the James Bond franchise.

Here’s the whole playlist at Spotify 🎧.

Conclusions

  • Through some different examples we showed how to scrape web pages in order to extract the data we need.
  • With some knowledge of HTML and with the help of Python packages requests and Beautiful Soup, we are able to retrieve information from the Internet using web scraping.
  • Inspecting the source code of the webpage is a very important part of the web scraping process. Each web site has its own structure and a good observation shows us which steps are needed to retrieve the information we need.
  • Some list comprehensions and Python string methods are also handy in the process of retrieving information by web scraping.
  • As a data scientist doing web scraping, also remember to ask yourself whether the data you have retrieved makes sense. We can obtain a lot of good things from the Internet, but also misleading or incomplete information, just like we faced here.
  • Keep in mind responsible AI, which includes factors such as copyright, privacy, and confidentiality. In particular, when considering the use of web scraping, remember that the fact that data is available on the Internet does not mean you are free to use it. Sometimes there are no restrictions; sometimes the restrictions depend on how the data will be used (e.g. it may be used for educational purposes but not for business purposes); or perhaps it may not be used at all. Therefore, check which is the case before web scraping. In fact, this is still a grey area and much discussion is going on concerning Ethics and AI. Examples of some cases involving web scraping are discussed in this article.

Now it is your turn. How about getting your hands dirty performing some web scraping?

Are there some interesting subjects you are passionate about and want to know more of? Or maybe you’d like to answer some business questions?

If you don’t know yet which web sites might contain the data you need, start by web searching. Then choose some of them and apply what was introduced here.

No better way to learn than by doing! Good Luck!

Thank you for reading!

Comments and questions are always welcome, feel free to get in touch. All code presented here can be found at:

https://github.com/MKB-Datalab/basics_web_scraping
