Data Visualization with Julia 📊
Jun 06, 2020 · 1178 words

A tale of two packages that is a parable of two perspectives…

A short (maybe misinformed) history

This is my very opinionated perspective of the history of Julia visualization development… as an outsider and onlooker who hasn't actually contributed to either of these packages.

Data visualization and plotting in Julia have had a bit of a mottled history[1]. Unlike the scientific Python community, which had an early, publication-quality incumbent in matplotlib, Julia visualization development has been a bit more ad hoc and grassroots.

In the early days (I believe), the first libraries were simply wrappers around existing libraries[2] since, on the whole, the Julia ecosystem was very young. In this pre-Cambrian-explosion world PyPlot.jl ruled and, being an interface to matplotlib, it could do everything the Pythonistas could do. However, as things progressed and other essential statistical/mathematical packages matured, space opened up to build out the visualization side of the package ecosystem.

But the ease of creating and distributing packages in Julia, combined with the rapidly growing and enthusiastic Julia community, led to a multiplicity of new libraries. Many of these were created to fill a scientific visualization niche[3], and they still do. But with this sort of proliferation there was naturally some overlap for common, general visualization tasks, and it was confusing and overwhelming for a newcomer to know which package was best suited to your goals (if you didn't fall into one of those existing niches[3]).

In this post we will look at (what seem to be) the two frontrunners for general data visualization and plotting. Each package has a unique approach and I will highlight things to consider in general when you are considering data visualization. We will do this by recreating a simple enough example that is easy to understand quickly yet complex enough to actually highlight the differences between these two libraries.

julia> ] # enter Pkg REPL
(@v1.4) pkg> activate .
(data-dailies) pkg> add Plots Gadfly CSV HTTP DataFrames RollingFunctions

using CSV, HTTP, DataFrames, Dates

url = "https://covidtracking.com/api/v1/states/ca/daily.csv"

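# download the raw CSV bytes
res = HTTP.request("GET", url).body
columns = [:date, :totalTestResultsIncrease]
fmt = "yyyymmdd"
t = Dict(:date=>Date)

# CSV.jl 0.8+ requires an explicit sink type (here DataFrame);
# sort orders rows by every column in turn, so by date first
data = sort(CSV.read(res, DataFrame; dateformat=fmt, select=columns, types=t))
# `head` was removed from DataFrames.jl; `first(df, n)` replaces it
first(data, 6)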
6×2 DataFrames.DataFrame
│ Row │ date       │ totalTestResultsIncrease │
│     │ Dates.Date │ Int64                    │
├─────┼────────────┼──────────────────────────┤
│ 1   │ 2020-03-04 │ 0                        │
│ 2   │ 2020-03-05 │ 0                        │
│ 3   │ 2020-03-06 │ 7                        │
│ 4   │ 2020-03-07 │ 9                        │
│ 5   │ 2020-03-08 │ 19                       │
│ 6   │ 2020-03-09 │ 254                      │

Plots.jl

Plots.jl was developed as a common interface to many of the existing Julia visualization packages, and it provides a huge convenience for end users. Instead of needing to rewrite a visualization for every backend you want to render to, you can develop and fine-tune a graphic once using the Plots.jl API and optionally export it to any of the supported environments[4].

But what you gain in flexibility of output, you necessarily have to give up in expressivity of API. Since Plots.jl needs to interface to a variety of backends, the API is something of the average of all the backend APIs. And most scientific visualization APIs can trace their lineage back to the MATLAB plotting tradition (which is also true of the matplotlib API).

While not intrinsically good or bad, know that this level of abstraction is really designed for rapidly creating scientific visualizations of tabular data (and often numeric data from experiments). As such, it should feel very easy and natural to use if you have… mostly numeric tabular data that you want to visualize with conventional plots[5].
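As a rough sketch of the backend idea described above: the plot is described once with the backend-independent API, and the backend is chosen by a separate call. The toy data, labels, and title below are made up for illustration, and the alternative backends are left commented out because each one needs its own package installed.

```julia
using Plots

# Toy data (hypothetical; not the California testing data used in this post).
x = 1:10
y = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

# Select a backend. GR is the default backend of Plots.jl.
gr()

# Describe the plot once with the backend-independent API.
p = plot(x, y; seriestype=:sticks, label="toy series", title="Same call, any backend")

# Other backends are selected the same way, but each requires its own
# package to be installed first, so they are only listed here:
# plotlyjs()      # interactive HTML/JS output (requires PlotlyJS.jl)
# unicodeplots()  # text output in the terminal (requires UnicodePlots.jl)
# pgfplotsx()     # LaTeX output (requires PGFPlotsX.jl)

# Report which backend is currently active.
println(backend_name())
```

Because the `plot` call itself does not mention a backend, re-running it after selecting a different backend should produce the same chart in a different output environment, within the limits of what that backend supports.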

using Plots, RollingFunctions

# plot daily test increase as bars/sticks
Plots.plot(data.date,
    data.totalTestResultsIncrease,
    seriestype=:sticks,
    label="Test Increase",
    title = "California Total Testing Capacity",
    lw = 2)

# compute the 7-day average
window = 7
average = rollmean(data.totalTestResultsIncrease, window)

# to add another series we mutate the existing plot
Plots.plot!(data.date,
    cat(zeros(window - 1), average, dims=1),
    label="7-day Average",
    lw=3)

Plots.jl bar plot with average of California testing
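The zero-padding in the example above is needed because a rolling mean only yields a value once a full window is available, so the result is `window - 1` entries shorter than the input. Here is a minimal sketch of that idea in plain Julia, without RollingFunctions.jl; `trailing_mean` and `pad_left` are hypothetical helper names introduced only for illustration, and the sample values are the six rows printed earlier.

```julia
using Statistics

# Trailing moving average: one value per full window, so the result has
# length(xs) - window + 1 entries.
function trailing_mean(xs::AbstractVector{<:Real}, window::Integer)
    1 <= window <= length(xs) || throw(ArgumentError("window must be in 1:length(xs)"))
    return [mean(@view xs[i-window+1:i]) for i in window:length(xs)]
end

# Left-pad the averages to length n so they line up with the original dates.
function pad_left(avg::AbstractVector{<:Real}, n::Integer; padvalue::Float64=0.0)
    return vcat(fill(padvalue, n - length(avg)), float.(avg))
end

daily = [0, 0, 7, 9, 19, 254]   # first six daily increases from the table above
window = 3                      # shorter than 7 so the tiny sample has full windows

avg = trailing_mean(daily, window)
padded = pad_left(avg, length(daily))
```

Padding with zeros draws the average at zero for the first `window - 1` days. Passing `padvalue=NaN` instead is one way to mark those days as missing rather than zero, though how a gap is rendered depends on the plotting backend.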

Plots.jl natively operates on Arrays but it has a large ecosystem that extends the base Plots.jl package for statistical plotting, machine learning, and domain specific visualizations (among others).

Gadfly (and the Grammar of Graphics)

While the Plots.jl API is a bit higher level than Gadfly.jl's, it is much less expressive. I like to conceptualize Plots.jl as a plotting framework that enables customization (in the convention-over-configuration sense), whereas Gadfly.jl is a library that provides you with visualization building blocks[6]. This distinction can be further analogized to the general difference between libraries and frameworks.

using Gadfly

labels = ["Test Increase", "7-day Average"]
colors = ["deepskyblue", "tomato"]

# Gadfly can work with DataFrames directly
p = Gadfly.plot(data,
    layer(
        x=:date,
        y=:totalTestResultsIncrease,
        Geom.hair,
        Theme(line_width=1.5pt)
    ),
    layer(
        x=:date,
        y=:totalTestResultsIncrease,
        Geom.line,
        Stat.smooth(method=:loess, smoothing=.15),
        Theme(
            default_color=colors[2], line_width=2pt
        )
    ),
    Guide.xlabel("Date"),
    Guide.ylabel(labels[1]),
    Guide.title("California Total Testing Capacity"),
    Guide.manual_color_key("", labels, colors),
    Theme(background_color="white")
)

Gadfly bar plot with average of California testing

While a LOESS smooth is not identical to the 7-day moving average used in the plot above, a smoothing parameter of 0.15 approximates a 7-day moving average in this example, and LOESS is built into Gadfly.jl (so there is no need for RollingFunctions.jl).

As you can see by comparing the Plots.jl and Gadfly.jl examples, even though the Plots.jl API on the whole is a bit more succinct, a composition of Gadfly.jl geometries is much more flexible than the Plots.jl series types.

Comparison

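Package     | Use if you want                       | Weakness                                          | Most similar
------------|---------------------------------------|---------------------------------------------------|-------------
Plots.jl    | Multiple backends                     | High level but inflexible API                     | matplotlib
Gadfly.jl   | A declarative Grammar of Graphics API | No built-in interactivity                         | ggplot2
VegaLite.jl | Web based interactivity               | Not designed for non-web environments (like PDFs) | Altair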

If you only need to make fairly common charts, Plots.jl is probably the way to go since it is easier and quicker for common plots, and it is convenient to have the flexibility of the various backends. If you are creating a more ad hoc visualization, however, Gadfly.jl is more expressive.

References and Extras

CC0
To the extent possible under law, Jonathan Dinu has waived all copyright and related or neighboring rights to Data Visualization with Julia 📊. This work is published from: United States.


  1. A great example of this is the list of supported backends of Plots.jl (and even the need for Plots.jl in the first place).

  2. Writing a visualization library is far from a trivial task…

  3. Things like creating performant 3D visualizations, publication-quality LaTeX figures, or interactive scientific GUIs.

  4. A GUI for rapid development, HTML/JS for interactive web plots, PDF for publications, etc.

  5. Or, as Hadley Wickham would say: named graphics.

  6. This type of componentization applied to visualizations has its roots in the Grammar of Graphics and, more recently, in Hadley Wickham's ggplot2.
