Getting started with Julia for Data Science
May 26, 2020 · 3467 words

Setting up a Julia environment

This guide is based on my personal preferences. See the margin notes (and references) for alternative approaches to setting up an environment.

The official Julia distributions make the installation process reasonably seamless across a wide variety of platforms (compared to other scientific computing environments). What I walk through below is essentially a TL;DR of the official channels (plus or minus some convenience).

The following installation methods are most ergonomic if most of your development happens in a command line REPL (i.e. julia>).

For a more full-service installation method, JuliaPro is the (slightly corporate) way to go.

Mac OSX

brew install --cask julia

I prefer using Homebrew (vs. the Julia binaries) since it makes it fairly easy to quickly update Julia to newer versions right from the command line without needing to download a new image file (and delete + move + symlink the binary each time).

Windows

Similar to Mac OSX, I present the package manager installation of Julia here (for the same reasons 👆 as above). If you don’t already have Chocolatey you will need to get it first.

choco install julia --confirm
I haven’t tested this Windows installation of the Julia package recently as I don’t have easy access to a Windows OS.

Chocolatey is the Homebrew equivalent for Windows.

Linux

While every Linux distribution has historically had its own package manager, snapcraft looks like a potentially cross-distribution solution. The apt commands below are Debian/Ubuntu specific and have been tested with Ubuntu 18.04 LTS, but the snap commands (and everything after) should be distribution agnostic. You can find a slightly outdated Julia on the snapcraft store.

While the snapcraft method may be more convenient, if you want the most recent Julia distribution you should download the official binaries from https://julialang.org.

These commands should also work in most server/cloud environments (e.g. AWS, Azure, Google Cloud, etc.).

hyphaebeast@linux:~$ sudo apt update; sudo apt install snapd
hyphaebeast@linux:~$ sudo snap install julia --classic
# NOTE: as of writing this installs Julia 1.0.4 (released May 2019)

hyphaebeast@linux:~$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.4 (2019-05-16)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia>

Vagrant/VirtualBox/Docker

vagrant init hashicorp/bionic64
vagrant up
vagrant ssh

These commands get an Ubuntu VM set up using Vagrant on your local machine. Once you vagrant ssh into the VM, follow the instructions 👆 for installing Julia on Linux.

Regardless of which installation method you chose, as long as you can start the Julia REPL in a terminal everything should be gravy…..

Editors and IDEs

Editors and IDEs are the more personal development choices to make (and often there is little objective reason to choose one over the other). I personally use Visual Studio Code for most of my other development and there is a decent Julia extension for it.

Some folks are very partial to a notebook-like experience. For that go with IJulia for Jupyter.

The most full-featured, RStudio-like (but most Julia-specific) IDE is Juno [1].

Downloading Files

While we won’t get to doing any data processing or statistics with the data (that’s for another day), this section is all about the various ways to get data into Julia. And by association… the various data formats you might encounter in the wild.

First things first, let’s actually get the files. You can download files with a web browser if you want, use an existing file on your computer, or maybe even use a built-in dataset in some Julia package. Since we will eventually want to programmatically download files (and possibly automate this process), I wanted to see how to do this purely in Julia. Thankfully Julia Base has a convenient function to do exactly this.

url = "https://covidtracking.com/api/v1/states/current.csv"
mkpath("data")  # make sure the target folder exists before downloading
download(url, "data/covid-current.csv")
In Julia, single quotes (') represent a single character while double quotes (") correspond to strings.
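
The difference between the two kinds of quotes is easy to check at the REPL. This is just a small sketch using Base Julia; the variable names are my own.

```julia
# Single quotes build a Char, double quotes build a String.
c = 'a'
s = "a"

println(typeof(c))   # Char
println(typeof(s))   # String

# Indexing into a String hands back a Char.
first_char = s[1]
println(first_char == c)   # true

# A Char and a one-letter String are different types and do not compare equal.
println(c == s)            # false
```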

Now this just downloads a file from a url to a given location on your computer ("data/covid-current.csv"). We still need to read and parse the file.
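
If you end up automating the download, one option is to skip the request when the file is already on disk. Below is a minimal sketch of that idea. The helper name `download_if_missing` and its `fetch` keyword are hypothetical (they are not part of Base), and the demo swaps in a fake fetch function with a placeholder URL so it runs without network access.

```julia
# Hypothetical helper: fetch `url` into `path` unless the file already exists.
# `fetch` defaults to Base's `download`; it is a keyword so the logic can be tried offline.
function download_if_missing(url, path; fetch = download)
    isfile(path) && return path          # reuse the copy we already have
    dir = dirname(path)
    isempty(dir) || mkpath(dir)          # make sure the target folder exists
    fetch(url, path)
    return path
end

# Offline demo: a stand-in for `download` that counts how often it is called.
calls = Ref(0)
fake_fetch(url, path) = (calls[] += 1; write(path, "date,state\n"); path)

target = joinpath(mktempdir(), "data", "covid-current.csv")
download_if_missing("https://example.invalid/current.csv", target; fetch = fake_fetch)
download_if_missing("https://example.invalid/current.csv", target; fetch = fake_fetch)

println(calls[])   # the second call found the file and skipped the fetch
```

With the real data you would drop the `fetch` keyword and pass the COVID Tracking Project url from above.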

If we want a quick and dirty method to just inspect the file, we can use the Julia REPL’s shell mode. If you type a ; in the REPL, you should notice the julia> prompt turns into a shell> prompt. And in the shell> prompt we can run any command we could on the command line.

julia> ; # break into a command line shell
shell> head -n 2 data/covid-current.csv

Now this is a little unreadable since we haven’t done any parsing or formatting yet (that’s what our Julia packages are for after all) but at least we can peek at the header names and verify that the columns are indeed comma-separated [2].
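
The same peek can be done without leaving Julia. Here is a rough sketch using only Base functions; `peek_lines` is a hypothetical helper name of mine, and the demo writes a tiny stand-in file (three columns and two rows borrowed from the real data) so it runs on its own.

```julia
# Hypothetical helper: return the first `n` lines of a file as strings.
function peek_lines(path, n = 2)
    open(path) do io
        collect(Iterators.take(eachline(io), n))
    end
end

# Small stand-in file so the sketch does not depend on the download.
demo_path = tempname()
write(demo_path, "date,state,positive\n20200615,AK,664\n20200615,AL,26272\n")

lines = peek_lines(demo_path, 2)
header = split(lines[1], ',')   # split the header row on commas

println(header)
println(lines[2])
```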

CSV

There are a myriad of possible file types/formats out there (and an equally multiplicitous set of Julia packages to handle them). But while there are many possibilities for the types of files you can encounter, chances are the file you want to work with is one of a few common types. Possibly the most common of these is the humble DSV (delimiter-separated values).

The most popular DSV of course uses a comma (i.e. CSV). In Julia, the CSV.jl package can support any type of delimiter, the default being a comma.

Before we can use CSV.jl we have to add the package (optionally in our environment):

julia> ] # enter Pkg REPL
(@v1.4) pkg> activate . # create an environment
(data-dailies) pkg> add CSV
   Updating registry at `~/.julia/registries/General`
   ...
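
The Pkg REPL commands also have function equivalents, which is handy in scripts. This is a sketch under a couple of assumptions: it activates a throwaway temporary directory instead of the current folder, and it leaves the `Pkg.add` call commented out because that step needs network access.

```julia
using Pkg

# Activate (and implicitly create) an environment in a scratch directory.
# The temporary directory is an assumption to keep the sketch side-effect free;
# in a project you would pass "." as in the REPL session.
env_dir = mktempdir()
Pkg.activate(env_dir)

# Pkg.add("CSV")   # same as `pkg> add CSV`; needs network access

# The active project file now points inside the chosen directory.
project_file = Base.active_project()
println(project_file)
```
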
using CSV, DataFrames
data = CSV.read(joinpath("data", "covid-current.csv"), DataFrame; delim=',')
print(first(data, 5))
5×39 DataFrames.DataFrame
│ Row │ date     │ state  │ positive │ negative │ pending │ hospitalizedCurrently │ hospitalizedCumulative │ inIcuCurrently │ inIcuCumulative │ onVentilatorCurrently │ onVentilatorCumulative │ recovered │ dataQualityGrade │ lastUpdateEt    │ dateModified         │ checkTimeEt │ death │ hospitalized │ dateChecked          │ totalTestsViral │ positiveTestsViral │ negativeTestsViral │ positiveCasesViral │ fips  │ positiveIncrease │ negativeIncrease │ total  │ totalTestResults │ totalTestResultsIncrease │ posNeg │ deathIncrease │ hospitalizedIncrease │ hash                                     │ commercialScore │ negativeRegularScore │ negativeScore │ positiveScore │ score │ grade   │
│     │ Int64    │ String │ Int64    │ Int64⍰   │ Int64⍰  │ Union{Missing, Int64} │ Union{Missing, Int64}  │ Int64⍰         │ Int64⍰          │ Union{Missing, Int64} │ Union{Missing, Int64}  │ Int64⍰    │ String           │ String          │ String               │ String      │ Int64 │ Int64⍰       │ String               │ Int64⍰          │ Int64⍰             │ Int64⍰             │ Int64⍰             │ Int64 │ Int64            │ Int64            │ Int64  │ Int64            │ Int64                    │ Int64  │ Int64         │ Int64                │ String                                   │ Int64           │ Int64                │ Int64         │ Int64         │ Int64 │ Missing │
├─────┼──────────┼────────┼──────────┼──────────┼─────────┼───────────────────────┼────────────────────────┼────────────────┼─────────────────┼───────────────────────┼────────────────────────┼───────────┼──────────────────┼─────────────────┼──────────────────────┼─────────────┼───────┼──────────────┼──────────────────────┼─────────────────┼────────────────────┼────────────────────┼────────────────────┼───────┼──────────────────┼──────────────────┼────────┼──────────────────┼──────────────────────────┼────────┼───────────────┼──────────────────────┼──────────────────────────────────────────┼─────────────────┼──────────────────────┼───────────────┼───────────────┼───────┼─────────┤
│ 1   │ 20200615 │ AK     │ 664      │ 73773    │ missing │ 21                    │ missing                │ missing        │ missing         │ 3                     │ missing                │ 417       │ A                │ 6/15/2020 00:00 │ 2020-06-15T00:00:00Z │ 06/14 20:00 │ 12    │ missing      │ 2020-06-15T00:00:00Z │ 74437           │ missing            │ missing            │ missing            │ 2     │ 3                │ 967              │ 74437  │ 74437            │ 970                      │ 74437  │ 0             │ 0                    │ 6b08035ecccc3d7c158bd1ebff8a325714b92a03 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 2   │ 20200615 │ AL     │ 26272    │ 276402   │ missing │ 546                   │ 2259                   │ missing        │ 676             │ missing               │ 395                    │ 13508     │ B                │ 6/15/2020 11:00 │ 2020-06-15T11:00:00Z │ 06/15 07:00 │ 774   │ 2259         │ 2020-06-15T11:00:00Z │ missing         │ missing            │ missing            │ 25892              │ 1     │ 657              │ 4562             │ 302674 │ 302674           │ 5219                     │ 302674 │ 1             │ 4                    │ 5a2e19c16964661d50217b2850723b167c2255b8 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 3   │ 20200615 │ AR     │ 12917    │ 191221   │ missing │ 206                   │ 1003                   │ missing        │ missing         │ 45                    │ 163                    │ 8352      │ B                │ 6/15/2020 00:00 │ 2020-06-15T00:00:00Z │ 06/14 20:00 │ 182   │ 1003         │ 2020-06-15T00:00:00Z │ missing         │ missing            │ missing            │ 12917              │ 5     │ 416              │ 6733             │ 204138 │ 204138           │ 7149                     │ 204138 │ 3             │ 5                    │ b09ef2ffe500407eb08f0c06674367d8b4eb4d37 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 4   │ 20200615 │ AS     │ 0        │ 174      │ missing │ missing               │ missing                │ missing        │ missing         │ missing               │ missing                │ missing   │ C                │ 6/1/2020 00:00  │ 2020-06-01T00:00:00Z │ 05/31 20:00 │ 0     │ missing      │ 2020-06-01T00:00:00Z │ missing         │ missing            │ missing            │ missing            │ 60    │ 0                │ 0                │ 174    │ 174              │ 0                        │ 174    │ 0             │ 0                    │ 9fbf373597d3b20bdf748d1567bfb5daf0a1ade9 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 5   │ 20200615 │ AZ     │ 36705    │ 308552   │ missing │ 1449                  │ 3750                   │ 464            │ missing         │ 307                   │ missing                │ 6462      │ A+               │ 6/15/2020 00:00 │ 2020-06-15T00:00:00Z │ 06/14 20:00 │ 1194  │ 3750         │ 2020-06-15T00:00:00Z │ 344929          │ missing            │ missing            │ 36377              │ 4     │ 1014             │ 6198             │ 345257 │ 345257           │ 7212                     │ 345257 │ 8             │ 24                   │ 63be08bad71d36297e4a1aaeb41b67b953a669ad │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │

CSV.read hands back a DataFrame, which behaves pretty similarly to dataframe libraries in other languages (like R’s data.frame or Python’s pandas). We won’t get too much into the specifics of Julia’s DataFrames here, but the TL;DR is that they behave like a table (or two-dimensional matrix) with row and column indices.
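
To make the delimiter option and the table-like behavior concrete, here is a minimal sketch. It assumes a recent CSV.jl release (where CSV.read takes the sink type, here DataFrame, as its second argument). Instead of the downloaded file it reads a tiny pipe-delimited string that I put together from a few values in the output above.

```julia
using CSV, DataFrames

# A tiny pipe-delimited table held in memory (values copied from the output above).
raw_text = """
state|positive|death
AK|664|12
AL|26272|774
"""

# `delim` tells CSV.jl which character separates the fields.
df = CSV.read(IOBuffer(raw_text), DataFrame; delim='|')

println(size(df))            # (rows, columns)
println(df[2, :positive])    # row 2 of the column named `positive`
```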

CSV.jl does its best to infer the type of each column, but you might notice some columns typed Int64⍰ (and other types with a trailing ⍰). The marker is shorthand for a column that allows missing values, i.e. Union{Missing, Int64}. For columns with missing values (and no user-supplied type), the library tries to guess the most appropriate type [3].
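
The same widening happens with plain Julia vectors, which makes it easy to poke at. A small Base-only sketch (the numbers are a few `positive` values from the table, with a gap added by me for illustration):

```julia
# A vector with a gap: Julia widens the element type to allow `missing`.
positives = [664, missing, 12917]
println(eltype(positives))     # Union{Missing, Int64} on a 64-bit machine

# Arithmetic propagates `missing`, so skip the gaps explicitly.
total = sum(skipmissing(positives))
println(total)

# Mixing integers and decimals promotes the whole vector to Float64.
mixed = [1, 2.5]
println(eltype(mixed))         # Float64
```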

JSON

A nicety of the COVID Tracking Project is that the same data is published in both CSV and JSON formats. We can use the same download function as before and just use the JSON url instead.

url = "https://covidtracking.com/api/v1/states/current.json"
download(url,  "data/covid-current.json")

And analogous to CSV.jl we have the JSON.jl package. The only difference here is that the JSON.jl package parses JSON files into an Array of Dicts:

julia> ]
(@v1.4) pkg> activate . # load previously created environment
(data-dailies) pkg> add JSON
using JSON
data = JSON.parsefile("data/covid-current.json")
print("The loaded JSON is of type: $(typeof(data))")
data[1]
The loaded JSON is of type: Array{Any,1}
Dict{String,Any} with 39 entries:
  "negativeIncrease" => 967
  "totalTestResultsIncrease" => 970
  "negativeScore" => 0
  "inIcuCumulative" => nothing
  "negativeTestsViral" => nothing
  "checkTimeEt" => "06/14 20:00"
...
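
Working with an Array of Dicts mostly comes down to comprehensions and key lookups. The sketch below uses a hand-written stand-in for the parsed result (two records, three keys each, values copied from the data above) so that it runs without JSON.jl or the download.

```julia
# Hand-written stand-in for the parsed JSON: a Vector of Dicts.
# JSON `null` shows up in Julia as `nothing`.
records = [
    Dict{String,Any}("state" => "AK", "positive" => 664,   "inIcuCumulative" => nothing),
    Dict{String,Any}("state" => "AL", "positive" => 26272, "inIcuCumulative" => 676),
]

# Pull one field out of every record.
states = [r["state"] for r in records]

# Swap `nothing` for `missing`, which is what the tabular tooling expects for gaps.
icu = [something(r["inIcuCumulative"], missing) for r in records]

println(states)
println(icu)
```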

We only saw how to deal with CSV and JSON files, but the JuliaIO GitHub organization has a (seemingly) exhaustive list of packages for various file formats. I also list some common packages in the extras.

A Data what?

R made them cool, pandas brought them to Python, and Julia has DataFrames.jl.

DataFrames are perhaps the most ubiquitous data structure in data science, mainly because at their core they simply represent tabular data [4]. And while they may seem like a matrix or multidimensional array on the surface, it is what is inside that makes them really special.

Like a vector/matrix [5] in Julia, DataFrames support mutable, constant-time row and column operations (e.g. get me column 2 of row 12), except that the columns in a DataFrame usually have names. So instead of column 2 of row 12 you can say get me the total_cases of row 12 (unfortunately DataFrames.jl doesn’t support row indices like in pandas).

But besides clarity, DataFrame’s lookup mechanism is actually much more powerful since the row/column indices don’t have to be sequential (as in [1, 2, 3, etc.]) and function like a dictionary of dictionaries with key lookups.

DataFrame data types

Because DataFrames can hold heterogeneous types [6] we can group on certain columns and compute on others (like get the mean). A DataFrame is actually one of the most flexible data structures (while maintaining modest performance). Because of this it is very well suited to data manipulation and exploratory data analysis tasks. Or any time when you may not know the exact type/structure of your data a priori.

tupl = ("a", 1, [3,4])
arr = ["a", 1, [3,4]]

println("$(typeof(tupl))")
println("$(typeof(arr))")
Tuple{String,Int64,Array{Int64,1}}
Array{Any,1}

In the above code, notice that while an Array can hold multiple types, the Array itself can only have a single type (in this case Any). This is the most specific type that can describe all the elements. But for the Tuple, the individual elements retain their own types (like a DataFrame). But unlike a DataFrame, a Tuple is immutable, for better or worse.
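
The mutability difference is quick to verify. A short Base-only sketch, reusing the two containers from above:

```julia
arr  = ["a", 1, [3, 4]]
tupl = ("a", 1, [3, 4])

# Arrays are mutable: elements can be replaced in place.
arr[2] = 99

# Tuples are not: assigning to a slot raises a MethodError.
tuple_is_immutable = try
    tupl[2] = 99
    false
catch err
    err isa MethodError
end

println(arr)
println(tuple_is_immutable)
```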

julia> ] # enter Pkg REPL
(@v1.4) pkg> activate .
(data-dailies) pkg> add DataFrames
using DataFrames

df = DataFrame(letter = 'a':'d',
               num = 1:4,
               other = [(1,3), 20, [4,5,6], "Jon"])
print(df)
4×3 DataFrames.DataFrame
│ Row │ letter │ num   │ other     │
│     │ Char   │ Int64 │ Any       │
├─────┼────────┼───────┼───────────┤
│ 1   │ 'a'    │ 1     │ (1, 3)    │
│ 2   │ 'b'    │ 2     │ 20        │
│ 3   │ 'c'    │ 3     │ [4, 5, 6] │
│ 4   │ 'd'    │ 4     │ Jon       │

As you can see above, each column of a DataFrame functions similarly to a single Array w.r.t. types, but from one column to the next they can have very different types. So in this sense it almost functions like a Tuple of Arrays….

letter = ['a', 'b', 'c', 'd']
num = [1, 2, 3, 4]
other = [(1,3), 20, [4,5,6], "Jon"]

typeof(tuple(letter, num, other))
Tuple{Array{Char,1},Array{Int64,1},Array{Any,1}}
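
Having per-column types is what makes the "group on one column, compute on another" idea from the start of this section work. Here is a minimal sketch, assuming a recent DataFrames.jl release (for the `combine`/`groupby` pair syntax) and using five rows I copied by hand from the COVID table rather than the full file.

```julia
using DataFrames, Statistics

df = DataFrame(state    = ["AK", "AL", "AR", "AS", "AZ"],
               grade    = ["A", "B", "B", "C", "A+"],
               positive = [664, 26272, 12917, 0, 36705])

# Group on the String column, then average the Int column within each group.
by_grade = combine(groupby(df, :grade), :positive => mean => :mean_positive)

println(by_grade)
```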

Indexing

In addition to being a very flexible data structure, DataFrames also allow us to index into them in very flexible ways.

Julia has a convenient built-in pipe operator (|>) that allows you to chain function calls as well as the compose function (∘).
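
A quick Base-only sketch of the two operators side by side (the numbers and names are arbitrary picks of mine):

```julia
nums = [3, 4, 9]

# Pipe: feed the result of each step into the next function, left to right.
piped = nums |> sum |> sqrt

# Compose: build one new function that applies `sum` first, then `sqrt`.
root_of_sum = sqrt ∘ sum
composed = root_of_sum(nums)

println(piped)      # 4.0
println(composed)   # 4.0
```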

using CSV, DataFrames

# load in the COVID data set from last week
data = CSV.read(joinpath("data", "covid-current.csv"), DataFrame; delim=',')

# inspect first 5 rows
first(data, 5) |> println
println()

# inspect last 5 rows
last(data, 5)
5×35 DataFrames.DataFrame
│ Row │ date     │ state  │ positive │ negative │ pending │ hospitalizedCurrently │ hospitalizedCumulative │ inIcuCurrently │ inIcuCumulative │ onVentilatorCurrently │ onVentilatorCumulative │ recovered │ dataQualityGrade │ lastUpdateEt   │ dateModified         │ checkTimeEt │ death │ hospitalized │ dateChecked          │ fips  │ positiveIncrease │ negativeIncrease │ total  │ totalTestResults │ totalTestResultsIncrease │ posNeg │ deathIncrease │ hospitalizedIncrease │ hash                                     │ commercialScore │ negativeRegularScore │ negativeScore │ positiveScore │ score │ grade   │
│     │ Int64    │ String │ Int64    │ Int64⍰   │ Int64⍰  │ Union{Missing, Int64} │ Union{Missing, Int64}  │ Int64⍰         │ Int64⍰          │ Union{Missing, Int64} │ Union{Missing, Int64}  │ Int64⍰    │ String           │ String         │ String               │ String      │ Int64 │ Int64⍰       │ String               │ Int64 │ Int64            │ Int64            │ Int64  │ Int64            │ Int64                    │ Int64  │ Int64         │ Int64                │ String                                   │ Int64           │ Int64                │ Int64         │ Int64         │ Int64 │ Missing │
├─────┼──────────┼────────┼──────────┼──────────┼─────────┼───────────────────────┼────────────────────────┼────────────────┼─────────────────┼───────────────────────┼────────────────────────┼───────────┼──────────────────┼────────────────┼──────────────────────┼─────────────┼───────┼──────────────┼──────────────────────┼───────┼──────────────────┼──────────────────┼────────┼──────────────────┼──────────────────────────┼────────┼───────────────┼──────────────────────┼──────────────────────────────────────────┼─────────────────┼──────────────────────┼───────────────┼───────────────┼───────┼─────────┤
│ 1   │ 20200607 │ AK     │ 544      │ 64360    │ missing │ 7                     │ missing                │ missing        │ missing         │ 1                     │ missing                │ 382       │ B                │ 6/7/2020 00:00 │ 2020-06-07T00:00:00Z │ 06/06 20:00 │ 10    │ missing      │ 2020-06-07T00:00:00Z │ 2     │ 8                │ 995              │ 64904  │ 64904            │ 1003                     │ 64904  │ 0             │ -48                  │ 62adbd451838656b7df7519e830d6439be0b5877 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 2   │ 20200607 │ AL     │ 20500    │ 239066   │ missing │ missing               │ 2022                   │ missing        │ 615             │ missing               │ 364                    │ 11395     │ B                │ 6/7/2020 11:00 │ 2020-06-07T11:00:00Z │ 06/07 07:00 │ 692   │ 2022         │ 2020-06-07T11:00:00Z │ 1     │ 457              │ 13465            │ 259566 │ 259566           │ 13922                    │ 259566 │ 3             │ 29                   │ 9040674078ce6afca363f8e95943845a032ab5d6 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 3   │ 20200607 │ AR     │ 9426     │ 150847   │ missing │ 145                   │ 844                    │ missing        │ missing         │ 35                    │ 143                    │ 6424      │ A                │ 6/7/2020 16:10 │ 2020-06-07T16:10:00Z │ 06/07 12:10 │ 154   │ 844          │ 2020-06-07T16:10:00Z │ 5     │ 325              │ 3191             │ 160273 │ 160273           │ 3516                     │ 160273 │ 0             │ 6                    │ ef23d4d3f9e232bb5f58a59d79a27d2cb0797e2a │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 4   │ 20200607 │ AS     │ 0        │ 174      │ missing │ missing               │ missing                │ missing        │ missing         │ missing               │ missing                │ missing   │ C                │ 6/1/2020 00:00 │ 2020-06-01T00:00:00Z │ 05/31 20:00 │ 0     │ missing      │ 2020-06-01T00:00:00Z │ 60    │ 0                │ 0                │ 174    │ 174              │ 0                        │ 174    │ 0             │ 0                    │ 893135d0d7a9340a91aca139f4e3bb289f418f71 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 5   │ 20200607 │ AZ     │ 26889    │ 254732   │ missing │ 1252                  │ 3352                   │ 392            │ missing         │ 248                   │ missing                │ 5517      │ A+               │ 6/7/2020 00:00 │ 2020-06-07T00:00:00Z │ 06/06 20:00 │ 1044  │ 3352         │ 2020-06-07T00:00:00Z │ 4     │ 1438             │ 8537             │ 281621 │ 281621           │ 9975                     │ 281621 │ 2             │ 32                   │ 505a05efa5a9b912644a7ad16b2ab6f37330806b │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │

Here we are loading in the COVID data set from the COVID Tracking Project that we downloaded before. And if you remember, the columns in that data set have many different types.

If we want to index specific ranges of columns AND rows we can do that too!

# for rows [3, 17, 20] show me columns [1,2,3]
data[[3, 17, 20], 1:3]

# we can also get columns by their name
data[1:5, [:recovered, :death]]

# or print out all the columns... 👉
names(data)
👇
35-element Array{Symbol,1}:
 :date
 :state
 :positive
 :negative
 :pending
 :hospitalizedCurrently
 :hospitalizedCumulative
 :inIcuCurrently
 :inIcuCumulative
 :onVentilatorCurrently
 :onVentilatorCumulative
 :recovered
 :dataQualityGrade
 :lastUpdateEt
 :dateModified
 :checkTimeEt
 :death
 :hospitalized
 :dateChecked
 :fips
 :positiveIncrease
 :negativeIncrease
 :total
 :totalTestResults
 :totalTestResultsIncrease
 :posNeg
 :deathIncrease
 :hospitalizedIncrease
 :hash
 :commercialScore
 :negativeRegularScore
 :negativeScore
 :positiveScore
 :score
 :grade
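
If you want to try the row/column indexing without the full data set, here is a small self-contained sketch. It assumes DataFrames.jl is installed and uses a five-row table I typed in from the values above.

```julia
using DataFrames

df = DataFrame(state    = ["AK", "AL", "AR", "AS", "AZ"],
               positive = [664, 26272, 12917, 0, 36705],
               death    = [12, 774, 182, 0, 1194])

rows_and_range = df[[1, 3, 5], 1:2]          # rows 1, 3, 5 and columns 1 through 2
by_name        = df[1:2, [:state, :death]]   # first two rows, columns picked by name
one_cell       = df[3, :death]               # a single value: row 3 of `death`
whole_column   = df.positive                 # a whole column as a Vector

println(rows_and_range)
println(one_cell)
```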

One nicety that DataFrames.jl has over pandas is built-in conditional logic for column selections:

# select columns based on regular expression
first(data[!, r"(pos|neg)"], 6)

# invert column selection with case-insensitive regex
first(data[!, Not(r"(score|pos|neg|grade|notes|total|date)"i)], 6)
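
To see exactly which columns a pattern picks up, it helps to try it on a table small enough to read at a glance. A sketch, again assuming DataFrames.jl is installed and using a shortened pattern plus a few hand-copied columns:

```julia
using DataFrames

df = DataFrame(state    = ["AK", "AL"],
               positive = [664, 26272],
               negative = [73773, 276402],
               posNeg   = [74437, 302674],
               death    = [12, 774])

# Keep every column whose name contains "pos" or "neg".
matched = df[!, r"(pos|neg)"]

# Keep everything else; the trailing `i` makes the pattern case-insensitive.
rest = df[!, Not(r"(pos|neg)"i)]

println(names(matched))
println(names(rest))
```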

And finally we can also index based on specific values of certain columns:

# which states have had more than 50,000 positive cases
data[data.positive .> 50000, [:state, :positive]]
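
What makes this work is that the comparison is broadcast over the column and produces one true/false per row, which is then used as the row index. A small sketch on five hand-copied rows; the 20,000 threshold is my own choice so that the tiny sample has some matches.

```julia
using DataFrames

df = DataFrame(state    = ["AK", "AL", "AR", "AS", "AZ"],
               positive = [664, 26272, 12917, 0, 36705])

# Broadcasting `>` with the dot gives one true/false per row.
mask = df.positive .> 20000
println(mask)

# Rows where the mask is true, restricted to two columns.
big = df[mask, [:state, :positive]]
println(big)
```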

While we have only scratched the surface of the power and potential [7] of DataFrames.jl, this should be enough to get us started slicing and dicing our data. As we progress through more interesting analyses I will be sure to call out the additional functionality of DataFrames.jl relevant to the problem at hand.

References and Extras

CC0
To the extent possible under law, Jonathan Dinu has waived all copyright and related or neighboring rights to Getting started with Julia for Data Science. This work is published from: United States.


  1. This is what comes bundled with JuliaPro.

  2. You never can trust the file extensions of files you download off the Internet….

  3. Also, since a non-decimal numeric value can be represented either by an Int or a Float, Julia assumes the most specific type (i.e. all Ints can be represented by Floats but not all Floats can be represented as an Int).

  4. And it is hard to throw a data science stone without hitting tables…

  5. Vector and Matrix are really just aliases for a one- and two-dimensional Array.

  6. Each column can have its own type.

  7. And we haven’t even looked at the rich ecosystem of querying frameworks for it…
