Scraping the Web with Julia 🏄
May 29, 2020 · 2572 words

For many problems/tasks, even if you already have access to a rich data set, additional data sources can augment and add context to your analysis. And given that it is 2020 [1], the wonderful WWW is likely where you will find the most interesting data. As a data scientist, feeling confident programmatically searching and downloading new sources of data can go a long way and hopefully will start to feel like a minor superpower…

DISCLAIMER: Always be responsible, ethical, and polite when web scraping. If it feels like you are doing something questionable, you should probably stop.

Downloading vs. Parsing

I like to conceptualize a web scraping task in two distinct phases: downloading and parsing. In the downloading phase we are really just concerned with having a semi-automated, programmatic way to acquire raw data (whether that be HTML, JSON, plain text, etc.). In the downloading phase we should treat data as a binary blob: just some bytes we need to get from a server somewhere to our computer.

Once we have this raw data however, it is in the parsing phase that we add meaning to it by imposing some structure (and human semantics, i.e. column 3 corresponds to the number of total COVID tests). Now even if there is some inevitable abstraction leakage that happens (and might be necessary) between these two phases, you should still think of the web scraping process as these two distinct tasks.

Abstracting each phase not only makes things much easier to debug/troubleshoot but also makes the code more extensible if you want to download additional sources (but parse every source identically).
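To make this separation concrete, here is a minimal sketch of how a scraper could be organized as one small function per phase (the function names, arguments, and the trivial newline-splitting "parser" are all just placeholders for illustration):

# phase 1: download the raw bytes to disk and do nothing else with them
function download_raw(url, path)
    download(url, path)  # Base `download`, as used previously
    return path
end

# phase 2: read the raw blob back in and impose structure on it
function parse_raw(path)
    raw = read(path, String)
    # real parsing (CSV, JSON, HTML, ...) would live here, completely
    # independent of how the bytes were acquired
    return split(raw, '\n')
end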

HTTP Requests

We already saw before how to use the download function in Julia Base to download a single file from a url to our local machine. And thankfully for us, the url we needed was predictable and the files well formatted (CSV and JSON). Since we were working with a well-structured API (the COVID Tracking Project), the site was designed to facilitate data dissemination. Other times, however, the host/owner of the data you want might not provide it in such a convenient, machine-readable form (or might not intend for it to be downloaded programmatically at all).

And in these more difficult situations, knowing how to scrape a data source on the Internet can be invaluable.

While download() can actually get us pretty far (it just calls out to the OS's curl, wget, or fetch), the abstraction Julia provides over these utilities hides all of their options and only gives you the ability to specify a url. The Julia package ecosystem never disappoints though 🙌

The HTTP.jl package is a fairly well-worn library that lets us use Julia to make (and receive) HTTP requests. Without getting into the nitty gritty of internet protocols, all you really need to know for now is that HTTP requests are what your web browser sends to a remote server when you want to view a web site [2]. In the parlance of our downloading vs. parsing section, the web browser does double duty:

  1. It first downloads the raw HTML text data from the server (this is the HTTP request)
  2. Once it receives the HTML text, the web browser software (i.e. Firefox, Chrome, Safari, etc.) parses the HTML text and converts it into a graphical display to show you.

We will use HTTP.jl to do step 1 above, and later we will programmatically traverse the HTML text (with EzXML.jl) to accomplish step 2.

HTTP.jl

As an example, let's say we want to cross-check the official CDC case numbers with the NYT's data. While the NYT data is structured in a GitHub repository, the CDC website linked above simply displays the total case numbers in an HTML table. While this is optimized for human consumption (i.e. someone visiting the CDC site in their web browser), it is a little cumbersome for computer consumption…

The first step in programmatically getting the CDC case numbers [3] is to get the raw HTML of the web page. As we will hopefully start to get accustomed to, let's activate our environment and install any new packages:

julia> ] # enter Pkg REPL
(@v1.4) pkg> activate . # activate our environment
(data-dailies) pkg> add HTTP, HttpCommon

GET vs. POST

The HTTP.jl interface should feel reminiscent of the download() function we used before:

using HTTP
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
response = HTTP.request("GET", url)

There are a few small, but important, differences however. Instead of just giving the HTTP.request() function a url, we also specify an HTTP verb. There are a lot of intricacies to HTTP methods, but the two main methods you (as a web scraper) will use are GET and POST. A GET basically (as the name implies) requests [4] some data from the server and a POST sends [5] data to a server.

The download method we used before behaves quite similarly to typing an address in your browser and hitting enter (it makes a GET request). Already you might be starting to notice the limits of download…
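As a quick illustration, a POST with HTTP.jl just takes headers and a body in addition to the verb. The endpoint and form data below are only stand-ins for illustration:

# send some form-encoded data to a (hypothetical) endpoint with a POST request
r = HTTP.request("POST", "https://httpbin.org/post",
                 ["Content-Type" => "application/x-www-form-urlencoded"],
                 "state=CA&date=2020-06-16")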

Responses

What if the url endpoint we are scraping expects input (like an API)? Or what if you are interested in more than just the HTML body of the response (like the status code or headers)?

This is exactly what HTTP.jl exposes for us, but what it gives us in flexibility it trades for convenience (it is a little bit more low level than download() [6]).

# print HTTP status code
println(response.status)

# inspect the response headers
response.headers
200
8-element Array{Pair{SubString{String},SubString{String}},1}:
                "Content-Type" => "text/html"
                         "SRV" => "3"
 "Access-Control-Allow-Origin" => "*"
             "X-UA-Compatible" => "IE=edge"
                        "Date" => "Tue, 16 Jun 2020 06:12:29 GMT"
           "Transfer-Encoding" => "chunked"
                  "Connection" => "keep-alive, Transfer-Encoding"
   "Strict-Transport-Security" => "max-age=31536000 ; includeSubDomains ; preload"

From the status code of 200 we can see that the HTTP response was returned successfully, and the headers provide additional context on the response [7].
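Since the headers are just a vector of key => value pairs, one convenient way to look up a single header is to collect them into a Dict (a small sketch; note that this would drop any repeated header names):

# turn the header pairs into a Dict for convenient lookup
header_dict = Dict(response.headers)
header_dict["Content-Type"]  # => "text/html"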

# peek at the first 5 lines of the HTML response
split(String(response.body), '\n')[1:5]
5-element Array{SubString{String},1}:
 "\r"
 "<!DOCTYPE html>\r"
 "<html lang=\"en-us\" class=\"theme-cyan\" >\r"
 "<head>\r"
 "\t\r"

The raw HTML text returned from a programmatic request like this can often be quite messy [8] and here we will just glimpse at the first few lines. One quirk of the HTTP response is that it behaves like any other file IO in Julia and the input stream gets "exhausted" once you read it.

# body is empty if we try to re-read the same response....
response.body
0-element Array{UInt8,1}

So if you want to repeatedly read/traverse the body [9] you should first read it into a variable.

# HTTP Helper functions by JuliaWeb
using HttpCommon

url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
r = HTTP.request("GET", url)

# read the body into a String
status, headers, body = r.status, r.headers, String(r.body)

# escape HTML so `this` webpage doesn't format it
show("$(escapeHTML(body)[1:22])....")
r.body
"\r\n<!DOCTYPE html>...."
0-element Array{UInt8,1}

That covers many of the mechanics of downloading content programmatically but it is not too useful to us (or anyone for that matter) in its raw form.

HTML Primer

Most of us know HTML as a somewhat removed concept [10] since web browsers take care of all the messy work of formatting, styling, and displaying HTML in a pleasing graphical form. But since we are interested in programmatically extracting data from a web page, we have to perform the tasks typically relegated to the web browser.

Additionally, since the information we want is usually nested deep in the raw HTML text, we need to traverse the hierarchical HTML structure to find it. Without getting into the gory details of the HTML specification, for our parsing purposes there are just 2 key points to note:

  1. HTML is SGML (an XML-like markup format)
  2. HTML is hierarchical

Point 1 is why many XML libraries (like EzXML.jl) can be used with HTML. Point 2 is how we can (somewhat) efficiently find the elements we are interested in without needing to traverse the entire web page.

HTML Tags

All HTML documents are composed of HTML elements, which themselves are delimited by HTML tags. Most (but not all) HTML tags you will encounter have an opening tag, some content, and a closing tag. When web scraping you are often trying to get some content, but occasionally you might need to extract information from the tag itself (like a hyperlink url).

All of the key=value pairs that are within the tag itself (between the < >) are called attributes. Some relevant examples for scraping are:

Attribute | Tag | Role         | Example
class     | any | CSS selector | <div class="note"></div>
id        | any | CSS selector | <div id="two" class="note"></div>
href      | <a> | specify url  | <a href="https://dailies.directory">Blog</a>
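As a small preview of the EzXML.jl package we will use later to parse HTML, pulling an attribute (like an href) out of a parsed element can look something like this (the snippet is just the example from the table above):

using EzXML

snippet = """<a href="https://dailies.directory">Blog</a>"""

# parsehtml wraps the fragment in a full document; find the <a> element with XPath
a = findfirst("//a", root(parsehtml(snippet)))

a["href"]       # the url stored in the attribute => "https://dailies.directory"
nodecontent(a)  # the human readable content => "Blog"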

CSS Selectors

While there are many ways to specify which element of the page you want, some are much nicer than others (but as with all things you usually trade convenience for power). CSS is a style sheet language that allows developers to specify how HTML content should be presented.

Since most HTML you will encounter will be styled with CSS, and since CSS uses selectors to specify which elements should be styled how, CSS selectors are often the most natural and straightforward way to traverse an HTML document. While CSS selectors can be combined to represent their own very complicated match rules (like regular expressions), the types of selectors can be grouped into the following:

Selector    | Example
tag name    | h1 { color: yellow; }
class       | .note { color: yellow; }
id          | #two { color: yellow; }
attribute   | a[href="https://dailies.directory"] { color: yellow; }
combinators | .note > span { color: yellow; }

While Julia unfortunately doesn't have the most mature HTML/XML parsing packages [11] (and it isn't really a core use case for the language), its parallel computing support is much more friendly and performant than that of other scripting languages (that might have better HTML parsers…).

HTML as XML

If we were properly building a scraper to run as a script on a recurring basis, we would probably download the raw HTML (and store it) and then read it in and parse it separately (remember downloading vs. parsing). For the sake of this tutorial however, I will just pass the HTTP response directly into a parsing package.
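In a more production-minded version, that split might be as simple as writing the response body to disk and reading it back in a separate step (the filename here is just illustrative):

# persist the raw HTML so parsing can be re-run without re-downloading
write("cdc_cases.html", body)

# later (or in a separate script/phase) read it back in for parsing
raw = read("cdc_cases.html", String)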

I usually prefer to parse/traverse HTML using CSS selectors since (I think) they map a little more naturally to the structure of HTML, but if you need to programmatically traverse an entire document [12], XPath is a bit more powerful/flexible than CSS selectors.

Also for Julia, the XML packages like EzXML.jl and LightXML.jl seem to be more active than Cascadia.jl (the only CSS selector library). And in general XML libraries are likely to be more predictable since XPath (and XML) is stricter than HTML and CSS selectors.
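That said, the CSS selector route is still available in Julia; a rough sketch of how the same kind of lookup might look with Gumbo.jl and Cascadia.jl (neither is used in the rest of this post, and the exact API may differ between versions):

using Gumbo, Cascadia

# parse the raw HTML with Gumbo and query it with a CSS class selector
gumbo_doc = parsehtml(body)
counts = eachmatch(Selector(".cases-header .count"), gumbo_doc.root)

# nodeText collects the text content of a node and its descendants
map(nodeText, counts)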

We will be parsing the CDC web page that we downloaded earlier.

using EzXML

# use `readhtml(filename)` if reading from a file
doc = parsehtml(body)
doc
EzXML.Document(EzXML.Node(Ptr{EzXML._Node} @0x00007fd86830e1d0, EzXML.Node(#= circular reference @-1 =#)))

We can see here that we now have an EzXML.Document (basically a parsed HTML document represented as a Julia struct) that we can either walk node by node or query directly with XPath expressions.

In this case, since we are only interested in the cases and deaths from the CDC page, we will use XPath to get as close as possible to the relevant HTML element using its class or id. Shown in the margin is the CDC page with the web inspector open (right-click on the element). Usually once I find the element with the content I want, I start at that element and identify the closest (uniquely identifiable) parent element upstream. In this case, since we are trying to get the total cases, new cases, total deaths, and new deaths, the closest uniquely identifiable element is likely <section class="cases-header"> (since there may be other elements on the page with callout classes).

using HttpCommon

html = root(doc)
xpath = "//section[@class=\"cases-header\"]"
header = findfirst(xpath, html)
<section class="cases-header">
    <div class="cases-callouts">
        <div class="callouts-container">
            <div class="callout">
                Total Cases
                <span class="count">2,085,769</span>
                <span class="new-cases">21,957 New Cases*</span>
            </div>
            <div class="callout">
                Total Deaths
                <span class="count">115,644</span>
                <span class="new-cases">373 New Deaths*</span>
            </div>
        </div>
        <footer>
            <ul>
                <li>
                    *Compared to yesterday's data
                </li>
                <li>
                    <a>About the Data</a>   
                </li>
            </ul>
        </footer>

    </div>
    ...
</section>

While the way we printed it out is a little messy, it does indeed look like we got the right element containing the information we need. Now that we have isolated the relevant elements, we can get a little more specific in how we traverse them:

# initialize empty dictionary to store content
data = Dict()

# convenience function to parse strings with commas
parse_number(x) = parse(Int, replace(x, "," => ""))

# get the nested <div>s that hold the case and death numbers
callouts = findfirst("div/div", header)

# extract the first <div> that corresponds to the cases
cases = firstelement(callouts)

# extract the inner <span> elements that contain the numbers
total, new = map(nodecontent, findall("span", cases))

data["total_cases"] = parse_number(total)

# use a regex to pull out just the new cases number
data["new_cases"] = parse_number(match(r"([\d,]+)", new)[1])
Dict{Any,Any} with 2 entries:
  "new_cases" => 21957
  "total_cases" => 2085769

Since the deaths callout has the same structure, I won't show the full code for it here, but it will be nearly identical to what is above:

deaths = lastelement(callouts)
...
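Spelled out, a minimal sketch of that (mirroring the cases code above) might look like:

# the deaths callout is the last <div class="callout"> under the container
deaths = lastelement(callouts)
total, new = map(nodecontent, findall("span", deaths))

data["total_deaths"] = parse_number(total)

# again use a regex to pull out just the new deaths number
data["new_deaths"] = parse_number(match(r"([\d,]+)", new)[1])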

References

CC0
To the extent possible under law, Jonathan Dinu has waived all copyright and related or neighboring rights to Scraping the Web with Julia 🏄. This work is published from: United States.


  1. I think 🧐

  2. A more comprehensive treatment of HTTP, the web, and HTML would probably be good but unfortunately we will have to wait until another day.

  3. say, if we wanted to automatically perform this check or update a dashboard every day without having to visit the web page in our browser and update our code manually…

  4. what happens when you type a url in your browser address bar

  5. what happens when you enter data in a form on a web page

  6. like a manual vs. automatic car…

  7. the Content-Type letting us know how to parse the body, and the Date letting us know when the response was returned.

  8. browsers (thankfully) hide so, so many details from us.

  9. the body is only "exhausted" if you read from it or use a method that does (like coercing it into a String).

  10. even though we interact with it on a daily basis (you are doing it right now 😱)

  11. parsing HTML correctly (and quickly) is more difficult than it may seem…

  12. instead of, say, just extracting the text of a single tag
