Speed Test


Overview

In this application, we compare the speed of five different data-processing tools in R. Pressing the "Run" button on the "Speed Test" page initiates data-processing by the different tools. Processing speed is presented using a custom-built JavaScript-based "Speedometer" widget.

Data-processing is performed ten times for each tool. The duration for each iteration of the processing pipeline is stored. The speedometer presents "Iterations per second" in three different ways:

  • A thick white line presents speed as averaged across all iterations (iterations performed / duration so far);
  • A thin red line presents speed for the most recent iteration (1 / duration of the latest iteration);
  • The speeds for each iteration are presented as fixed red dots.
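As a minimal sketch of those two calculations (the variable names here are illustrative, not the application's):

    # Durations of the iterations completed so far, in seconds
    durations <- c(0.42, 0.38, 0.45)

    # Thick white line: average speed (iterations performed / duration so far)
    average_speed <- length(durations) / sum(durations)

    # Thin red line: speed of the most recent iteration
    latest_speed <- 1 / durations[length(durations)]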

The tools and processing approaches in use are:

  • dplyr acting on an in-memory data-frame;
  • the same dplyr processing code acting on a SQLite database via dbplyr;
  • the same dplyr code acting on a parquet file via the arrow package;
  • data.table acting on an in-memory data-table;
  • a hand-written SQL query against the SQLite database, using the DBI package.
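To give a flavour of how one set of dplyr verbs can be pointed at different backends, here is a hedged sketch; the connection details, object names (journeys_df, journeys.sqlite, journeys.parquet) and the analyse() helper are illustrative, not the application's actual code:

    library(dplyr)

    # One set of dplyr verbs, written once
    analyse <- function(src) {
      src |>
        filter(pickup %in% c("LGA", "JFK")) |>
        group_by(pickup) |>
        summarise(mean_fare = mean(fare, na.rm = TRUE))
    }

    # ...applied to an in-memory data frame
    analyse(journeys_df)

    # ...to a SQLite table, the dplyr code being translated to SQL by dbplyr
    con <- DBI::dbConnect(RSQLite::SQLite(), "journeys.sqlite")
    analyse(tbl(con, "journeys")) |> collect()

    # ...and to a parquet file, evaluated by arrow
    analyse(arrow::open_dataset("journeys.parquet")) |> collect()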

Data preprocessing and storage

The dataset used is a subset of the taxi-journey dataset from nyc.gov. We downloaded the Yellow Taxi data for 2024 and filtered it to keep journeys that began and ended at one of the three airports in the original dataset (Newark, LaGuardia or JFK).

That filtered dataset was stored in three separate formats: a tibble, a parquet dataset and a SQLite database. These were uploaded as 'pins' to our Posit Connect server for use within the application. The SQLite database was created locally and uploaded as a file; the other objects were uploaded using pins::pin_write(). For the parquet dataset this resulted in a single file, so no benefit could be gained from file-level partitioning.
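
A hedged sketch of the upload step; the board, pin names and file paths are illustrative, and pins::pin_upload() is our assumption for how the database file was transferred:

    library(pins)

    board <- board_connect()

    # R objects are written as pins in the relevant format
    pin_write(board, airport_journeys, name = "airport-journeys", type = "rds")
    pin_write(board, airport_journeys, name = "airport-journeys-parquet", type = "parquet")

    # The SQLite database, created locally, is uploaded as a file (assumption)
    pin_upload(board, "airport-journeys.sqlite", name = "airport-journeys-db")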

Data processing in the application

Each of the five tools performs the following analysis on the 178k-row, 22-column dataset:

  • Filter the dataset to keep journeys from La-Guardia to JFK (or vice versa)
  • Group by the month of the year and by pick-up location
  • Calculate the mean 'fare' and mean 'tip' amount
  • Arrange the data by month of the year and pick-up location.

The resulting data (24 rows, 6 columns) was converted to a data-frame for consistency.
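
In dplyr terms the pipeline looks roughly like this; the column and location names are guesses at the schema of the filtered dataset, not necessarily those used in the app:

    library(dplyr)

    result <- journeys |>
      # Keep LaGuardia <-> JFK journeys, in either direction
      filter((pickup == "LGA" & dropoff == "JFK") |
             (pickup == "JFK" & dropoff == "LGA")) |>
      group_by(month, pickup) |>
      summarise(mean_fare = mean(fare_amount, na.rm = TRUE),
                mean_tip  = mean(tip_amount, na.rm = TRUE),
                .groups = "drop") |>
      arrange(month, pickup) |>
      # Convert to a plain data frame for consistency across tools
      as.data.frame()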

Data import

So that data import and conversion are not included in the measured processing speeds, the datasets are ingested or created before any processing takes place. When the application starts, the tibble and parquet datasets are read in from the pin board, a connection to the SQLite database is opened and a data.table is created from the tibble; the time these steps take does not contribute to the speed comparisons.
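
A hedged sketch of that start-up code, assuming the illustrative pin names from above:

    library(pins)

    board <- board_connect()

    # Read the tibble from the pin board and derive the data.table from it
    journeys_tbl <- pin_read(board, "airport-journeys")
    journeys_dt  <- data.table::as.data.table(journeys_tbl)

    # Open (rather than read into memory) the parquet dataset via arrow
    parquet_path <- pin_download(board, "airport-journeys-parquet")
    journeys_pq  <- arrow::open_dataset(parquet_path)

    # Open a connection to the downloaded SQLite database
    db_path <- pin_download(board, "airport-journeys-db")
    con <- DBI::dbConnect(RSQLite::SQLite(), db_path)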

Visualisation

Data is passed from R to the browser using Shiny's sendCustomMessage() function and read on the front end using the corresponding JavaScript function, Shiny.addCustomMessageHandler().
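
On the R side this amounts to something like the following sketch; the message name "speed_update" and the payload are illustrative:

    library(shiny)

    server <- function(input, output, session) {
      observeEvent(input$run, {
        t0 <- Sys.time()
        # ... run one iteration of the processing pipeline here ...
        elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))

        # Send the timing to the browser, where it is picked up by
        # Shiny.addCustomMessageHandler("speed_update", function(msg) {...})
        session$sendCustomMessage("speed_update", list(duration = elapsed))
      })
    }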

Results for a given tool are then aggregated and visualised in the custom-built, animated "Speedometer" widget, created using the d3 JavaScript library.