feat: Revision pass.
jheer committed Feb 25, 2024
1 parent 1ab2347 commit c2a44ab
Showing 9 changed files with 56 additions and 39 deletions.
16 changes: 9 additions & 7 deletions docs/data-loading.md
@@ -8,14 +8,14 @@ header: |

# Data Loading with DuckDB

This page provides guidance for using DuckDB in Observable Framework data loaders, and then deploying them using GitHub Actions.
Looking under the hood, this page provides guidance for using DuckDB in Framework data loaders and deploying it within GitHub Actions.

## Using DuckDB in Data Loaders

The [NYC Taxi Rides](nyc-taxi-rides) and [Gaia Star Catalog](gaia-star-catalog) examples use [data loaders](https://observablehq.com/framework/loaders) to perform data preparation, generating pre-projected data and writing it to a Parquet file.

The shell script below loads taxi data using the command line interface to DuckDB.
The `duckdb` executable must be on your environment path... but more on that below!
The [shell script below](https://github.com/uwdata/mosaic-framework-example/blob/main/docs/data/nyc-taxi.parquet.sh) loads taxi data using the command line interface to DuckDB.
The `duckdb` executable must be on your environment path... we'll come back to that!

```sh
duckdb :memory: << EOF
@@ -31,9 +31,11 @@ FROM 'https://uwdata.github.io/mosaic-datasets/data/nyc-rides-2010.parquet';
-- Write output parquet file
COPY (SELECT
(HOUR(datetime) + MINUTE(datetime)/60) AS time,
ST_X(pick)::INTEGER AS px, ST_Y(pick)::INTEGER AS py,
ST_X(drop)::INTEGER AS dx, ST_Y(drop)::INTEGER AS dy
HOUR(datetime) + MINUTE(datetime) / 60 AS time,
ST_X(pick)::INTEGER AS px, -- extract pickup x-coord
ST_Y(pick)::INTEGER AS py, -- extract pickup y-coord
ST_X(drop)::INTEGER AS dx, -- extract dropoff x-coord
ST_Y(drop)::INTEGER AS dy -- extract dropoff y-coord
FROM rides) TO 'trips.parquet' WITH (FORMAT PARQUET);
EOF

@@ -73,4 +75,4 @@ steps:
rm duckdb_cli-linux-amd64.zip
```
We perform this step before site build steps, ensuring `duckdb` is installed and ready.
We perform installation before the site build steps, ensuring `duckdb` is ready to go.
8 changes: 5 additions & 3 deletions docs/data/nyc-taxi.parquet.sh
@@ -15,9 +15,11 @@ FROM 'https://uwdata.github.io/mosaic-datasets/data/nyc-rides-2010.parquet';
-- Write output parquet file
COPY (SELECT
(HOUR(datetime) + MINUTE(datetime)/60) AS time,
ST_X(pick)::INTEGER AS px, ST_Y(pick)::INTEGER AS py,
ST_X(drop)::INTEGER AS dx, ST_Y(drop)::INTEGER AS dy
HOUR(datetime) + MINUTE(datetime) / 60 AS time,
ST_X(pick)::INTEGER AS px, -- extract pickup x-coord
ST_Y(pick)::INTEGER AS py, -- extract pickup y-coord
ST_X(drop)::INTEGER AS dx, -- extract dropoff x-coord
ST_Y(drop)::INTEGER AS dy -- extract dropoff y-coord
FROM rides) TO 'trips.parquet' WITH (FORMAT PARQUET);
EOF

14 changes: 7 additions & 7 deletions docs/flight-delays.md
@@ -152,11 +152,11 @@ vg.vconcat(

We can see right away that flights are more likely to be delayed if they leave later in the day. Delays may accrue as a single plane flies from airport to airport.

As the number of flight records in a hexbin varies across multiple orders of magnitude, we default to using a logarithmic scale. _Try adjusting the color scale menu to see the effects of different choices._
The number of records in a hexbin varies from 0 to over 2,000, spanning multiple orders of magnitude. To see these orders more clearly, we default to a logarithmic color scale. _Try adjusting the color scale menu to see the effects of different choices._
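
To make the effect of a logarithmic scale concrete, here is a minimal sketch that maps bin counts to a [0, 1] color position both linearly and logarithmically. The function names and the maximum count of 2,000 are illustrative assumptions, and Mosaic's own scale handling (for example, how it treats empty bins) may differ.

```js
// Map a bin count to [0, 1] on a linear scale.
function linearScale(count, max) {
  return count / max;
}

// Map a bin count to [0, 1] on a logarithmic scale.
// Counts below 1 are clamped to 1, since log(0) is undefined (an assumption of this sketch).
function logScale(count, max) {
  return Math.log(Math.max(count, 1)) / Math.log(max);
}

const maxCount = 2000; // assumed maximum bin count, for illustration only
for (const count of [2, 20, 200, 2000]) {
  console.log(
    count,
    linearScale(count, maxCount).toFixed(3),
    logScale(count, maxCount).toFixed(3)
  );
}
```

With these assumptions, a bin holding 20 records sits at 0.01 on the linear scale but at roughly 0.39 on the logarithmic one, so sparse bins stay distinguishable from empty ones.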

## Density Rasters
## Density Heatmaps

For finer-grained detail, we can instead bin down to the level of individual pixels.
For finer-grained detail, we can bin all the way down to the level of individual pixels.
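
As a rough illustration of pixel-level binning, the sketch below counts points per cell of a `width` × `height` grid. It runs in plain JavaScript over an in-memory array purely for clarity; `rasterBin` and its parameters are hypothetical names, and this is not how the `raster` mark is implemented.

```js
// Count points per pixel in a width x height grid.
// `points` is an array of [x, y]; `xDomain` and `yDomain` are [min, max] extents.
function rasterBin(points, width, height, xDomain, yDomain) {
  const grid = new Uint32Array(width * height);
  const [x0, x1] = xDomain;
  const [y0, y1] = yDomain;
  for (const [x, y] of points) {
    // Skip points that fall outside the plotted domain.
    if (x < x0 || x > x1 || y < y0 || y > y1) continue;
    // Scale to a pixel index; the domain maximum is clamped into the last pixel.
    const i = Math.min(width - 1, Math.floor(((x - x0) / (x1 - x0)) * width));
    const j = Math.min(height - 1, Math.floor(((y - y0) / (y1 - y0)) * height));
    grid[j * width + i] += 1;
  }
  return grid;
}

const grid = rasterBin(
  [[0, 0], [0.1, 0.1], [9.9, 9.9], [10, 10], [11, 5]],
  2, 2, [0, 10], [0, 10]
);
console.log(Array.from(grid));
```

Each grid cell then becomes one pixel whose color encodes its count.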

```js
const $filter = vg.Selection.crossfilter(); // interval ranges
@@ -167,7 +167,7 @@ vg.hconcat(
vg.plot(
vg.raster(
vg.from("flights", { filterBy: $filter }),
{ x: "time", y: "delay", fill: "density" }
{ x: "time", y: "delay", fill: "density", imageRendering: "pixelated" }
),
vg.intervalX({ as: $filter, brush: {fill: "none", stroke: "#888"} }),
vg.colorScheme("blues"),
@@ -183,7 +183,7 @@ vg.hconcat(
vg.plot(
vg.raster(
vg.from("flights", { filterBy: $filter }),
{ x: "distance", y: "delay", fill: "density" }
{ x: "distance", y: "delay", fill: "density", imageRendering: "pixelated" }
),
vg.intervalX({ as: $filter, brush: {fill: "none", stroke: "#888"} }),
vg.colorScheme("blues"),
@@ -199,5 +199,5 @@ vg.hconcat(
```

The result is a raster, or heatmap, view.
We can now see some striping, which reveals that data values are truncated to a limited precision.
As before, we can also use interactive selections to cross-filter the charts.
We now see some striping, which reveals that data values are truncated to a limited precision.
As before, we can use interactive selections to cross-filter the charts.
5 changes: 4 additions & 1 deletion docs/gaia-star-catalog.md
@@ -19,7 +19,7 @@ const vg = vgplot(vg => [ vg.loadParquet("gaia", url(gaia)) ]);

Here we visualize a 5M star sample.
A raster sky map reveals our Milky Way galaxy.
Select high parallax stars in the histogram to reveal a [Hertzsprung-Russell diagram](https://en.wikipedia.org/wiki/Hertzsprung%E2%80%93Russell_diagram) in the plot of stellar color vs. magnitude on the right.
Select higher parallax (≥ 6) stars in the histogram to reveal a [Hertzsprung-Russell diagram](https://en.wikipedia.org/wiki/Hertzsprung%E2%80%93Russell_diagram) in the plot of stellar color vs. magnitude on the right.

```js
const $brush = vg.Selection.crossfilter();
@@ -56,6 +56,7 @@ vg.hconcat(
),
vg.intervalX({as: $brush}),
vg.xDomain(vg.Fixed),
vg.xTicks(5),
vg.yScale("sqrt"),
vg.yGrid(true),
vg.width(280),
@@ -69,6 +70,7 @@ vg.hconcat(
),
vg.intervalX({as: $brush}),
vg.xDomain(vg.Fixed),
vg.xTicks(5),
vg.yScale("sqrt"),
vg.yGrid(true),
vg.width(280),
@@ -87,6 +89,7 @@ vg.hconcat(
vg.xyDomain(vg.Fixed),
vg.colorScale("sqrt"),
vg.colorScheme("viridis"),
vg.xTicks(5),
vg.yReverse(true),
vg.width(320),
vg.height(500),
18 changes: 12 additions & 6 deletions docs/index.md
@@ -15,17 +15,19 @@ const weather = await FileAttachment("data/seattle-weather.parquet").url();
const vg = vgplot(vg => [ vg.loadParquet("weather", url(weather)) ]);
```

This site shares examples of integrating Mosaic and DuckDB into Observable Framework. The examples demonstrate:
[Mosaic](https://uwdata.github.io/mosaic) is a system for linking data visualizations, tables, and input widgets, all leveraging a database ([DuckDB](https://duckdb.org/)) for scalable processing. With Mosaic, you can interactively visualize and explore millions and even billions of data points.

This site shows how to publish Mosaic and DuckDB-powered interactive dashboards and data-driven articles using [Observable Framework](https://observablehq.com/framework/). The examples illustrate:

- Visualization and real-time interaction with massive data sets
- Using Mosaic and DuckDB-WASM within Framework pages
- Using DuckDB in a data loader and in GitHub Actions

All source markup and code are available at <https://github.com/uwdata/mosaic-framework-example>.
All source markup and code are available at <https://github.com/uwdata/mosaic-framework-example>. Or, use the source links at the top of each page!

[Mosaic](https://uwdata.github.io/mosaic) is a system for linking data visualizations, tables, and input widgets, all leveraging a database ([DuckDB](https://duckdb.org/)) for scalable processing. With Mosaic, you can interactively visualize and explore millions and even billions of data points.
## Example: Seattle Weather

Here is a simple example, an interactive dashboard of weather in Seattle:
Our first example is an interactive dashboard of Seattle’s weather, including temperatures, precipitation, and the type of weather. Drag on the scatter plot to see the proportion of days that have sun, fog, drizzle, rain, or snow.

```js
const $click = vg.Selection.single();
@@ -57,7 +59,8 @@ vg.vconcat(
vg.colorRange($colors),
vg.rDomain(vg.Fixed),
vg.rRange([2, 10]),
vg.width(680),
vg.marginLeft(45),
vg.width(660),
vg.height(300)
)
),
@@ -77,11 +80,14 @@ vg.vconcat(
vg.yLabel(null),
vg.colorDomain($domain),
vg.colorRange($colors),
vg.width(680)
vg.marginLeft(45),
vg.width(660)
)
)
```

The examples linked below involve much larger datasets and a variety of visualization types.

## Example Articles

- [Flight Delays](flight-delays) - examine over 200,000 flight records
14 changes: 7 additions & 7 deletions docs/mosaic-duckdb-wasm.md
@@ -8,8 +8,8 @@ header: |

# Using Mosaic & DuckDB-WASM

This page describes how to set up Mosaic and DuckDB-WASM to "play nice" with Observable's reactive runtime.
Unlike standard JavaScript, Observable will happily run JavaScript "out-of-order".
We need to set up Mosaic and DuckDB-WASM to "play nice" with Observable's reactive runtime.
Unlike standard JavaScript, the Observable runtime will happily run JavaScript "out-of-order".
Observable uses dependencies among code blocks, rather than the order within the file, to determine what to run and when to run it.
This reactivity can cause problems for code that depends on "side effects" that are not tracked by Observable's runtime.
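
To illustrate dependency-driven evaluation, here is a toy sketch (not Observable's actual runtime) in which each "cell" declares its inputs and is run only after those inputs have values, regardless of where it appears in the list. The `runCells` function and the cell shape are hypothetical.

```js
// Toy model: each cell names its inputs and defines a value from them.
function runCells(cells) {
  const byName = new Map(cells.map((c) => [c.name, c]));
  const values = new Map();
  const order = []; // records the order in which cells actually ran

  function evaluate(name, visiting) {
    if (values.has(name)) return values.get(name);
    if (visiting.has(name)) throw new Error(`circular dependency at ${name}`);
    const cell = byName.get(name);
    if (!cell) throw new Error(`unknown cell ${name}`);
    visiting.add(name);
    // Resolve every input first, then run this cell.
    const args = cell.inputs.map((dep) => evaluate(dep, visiting));
    visiting.delete(name);
    const value = cell.define(...args);
    values.set(name, value);
    order.push(name);
    return value;
  }

  for (const cell of cells) evaluate(cell.name, new Set());
  return { values, order };
}

// "total" is listed first but runs last, because it depends on "a" and "b".
const { values, order } = runCells([
  { name: "total", inputs: ["a", "b"], define: (a, b) => a + b },
  { name: "a", inputs: [], define: () => 1 },
  { name: "b", inputs: ["a"], define: (a) => a + 1 },
]);
console.log(order, values.get("total"));
```

A side effect such as loading a table into a database is invisible to this kind of scheduler unless it is expressed as a value that other cells depend on.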

@@ -26,11 +26,11 @@ const vg = vgplot(vg => [ vg.loadParquet("flights", url(flights)) ]);
We first import a custom `vgplot` initialization method that configures Mosaic, loads data into DuckDB, and returns the vgplot API. We also import a custom `url` method which we will later use to prepare URLs that will be loaded by DuckDB.

Next, we reference the data files we plan to load.
As Observable Framework needs to track which files are used, we must use its `FileAttachment` mechanism.
However, we don't actually want to load the file yet, so we instead request a URL.
As Observable Framework needs to track which files are used, we _must_ use its `FileAttachment` mechanism.
However, we don't actually want to load the file yet, so we instead retrieve a corresponding URL.

Finally, we invoke `vgplot(...)` to initialize Mosaic, which returns a (Promise to an) instance of the vgplot API.
This method takes a single function as input, which should return an array of SQL queries to execute upon load.
This method takes a single function as input, which should return an array of SQL queries to execute for client-side data loading.

We use the `url()` helper method to prepare a file URL so that DuckDB can successfully load it.
The url string returned by `FileAttachment(...).url()` is a _relative_ path like `./_file/data/doodads.csv`.
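
One plausible way to turn such a relative path into something a database can fetch is to resolve it against the page's address. The sketch below uses the standard `URL` constructor; `toAbsoluteUrl` is a hypothetical name, and the `url` helper in the example repository may be implemented differently.

```js
// Hypothetical helper: resolve a relative file path against a base URL.
// In a browser, the base would typically be `window.location.href`.
function toAbsoluteUrl(relativePath, base) {
  return new URL(relativePath, base).href;
}

// The base URL here is a made-up example.
console.log(toAbsoluteUrl("./_file/data/doodads.csv", "https://example.com/docs/page"));
```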
@@ -49,7 +49,7 @@ Why the gymnastics?

We want to have access to the API to support data loading, using Mosaic's helper functions to install extensions and load data files.
At the same time, we don't want to assign the _outer_ `vg` variable until data loading is complete, ensuring downstream code that uses the API will not be evaluated by the Observable runtime until DuckDB is ready.
Once `vg` is assigned, the data has been loaded, and we can evaluate API calls for creating [visualizations](https://uwdata.github.io/mosaic/vgplot/),
Once `vg` is assigned, the data has been loaded and we can evaluate downstream API calls for creating [visualizations](https://uwdata.github.io/mosaic/vgplot/),
[inputs](https://uwdata.github.io/mosaic/inputs/),
[params](https://uwdata.github.io/mosaic/core/#params), and
[selections](https://uwdata.github.io/mosaic/core/#selections).
@@ -75,7 +75,7 @@ export async function vgplot(queries) {
We first get a reference to the central coordinator, which manages all queries.
We create a new API context, which we eventually will return.

Next, we configure Mosaic to use DuckDB-WASM.
Next, we configure Mosaic to use DuckDB-WASM as an in-browser database.
The `wasmConnector()` method creates a new database instance in a worker thread.

We then invoke the `queries` callback to get a list of data loading queries.
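
The overall pattern (create the API, ask the caller for loading queries, run them, and only then hand back the API) can be sketched without Mosaic or DuckDB at all. Everything below is a stand-in: `createFakeDatabase`, `createFakeApi`, `init`, and the SQL string are assumptions for illustration, not the library's actual code.

```js
// Stand-in for a database connection; it only records the queries it is asked to run.
function createFakeDatabase() {
  const log = [];
  return { log, exec: async (sql) => { log.push(sql); } };
}

// Stand-in for the API context; only a `loadParquet` query builder is modeled.
function createFakeApi() {
  return {
    loadParquet: (table, file) =>
      `CREATE TABLE IF NOT EXISTS ${table} AS SELECT * FROM read_parquet('${file}')`,
  };
}

// Build the API, run the caller's data loading queries, then return the API.
async function init(db, queries) {
  const api = createFakeApi();
  for (const sql of queries(api)) await db.exec(sql);
  return api; // the promise resolves only after every loading query has run
}

const db = createFakeDatabase();
const ready = init(db, (vg) => [vg.loadParquet("flights", "flights.parquet")]);
ready.then(() => console.log(db.log));
```

Because callers receive the API only once the promise resolves, any code that depends on it is naturally deferred until loading has finished.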
10 changes: 5 additions & 5 deletions docs/nyc-taxi-rides.md
@@ -15,7 +15,7 @@ const vg = vgplot(vg => [ vg.loadParquet("trips", url(trips)) ]);
# NYC Taxi Rides
## Pickup and dropoff points for 1M NYC taxi rides on Jan 1-3, 2010.

Using a data loader, we ingest a remote file into DuckDB and project [_longitude_, _latitude_] coordinates (in the database!) to spatial positions with units of 12 inch feet.
Using a data loader, we ingest a remote file into DuckDB and project [_longitude_, _latitude_] coordinates (in the database!) to spatial positions with units of feet (1 foot = 12 inches).
We then load the prepared data to visualize taxi pickup and dropoff locations, as well as the volume of rides by the time of day.

_Please wait a few seconds for the dataset to load._
@@ -39,7 +39,7 @@ vg.hconcat(
vg.plot(
vg.raster(
vg.from("trips", {filterBy: $filter}),
{x: "px", y: "py"}
{ x: "px", y: "py", imageRendering: "pixelated" }
),
vg.intervalXY({as: $filter}),
vg.text(
@@ -60,7 +60,7 @@ vg.hconcat(
vg.plot(
vg.raster(
vg.from("trips", {filterBy: $filter}),
{x: "dx", y: "dy"}
{ x: "dx", y: "dy", imageRendering: "pixelated" }
),
vg.intervalXY({as: $filter}),
vg.text(
@@ -95,5 +95,5 @@ vg.plot(
)
```

_Select an interval in a plot to filter the maps.
What spatial patterns can you find?_
Select an interval in a plot to filter the maps.
_What spatial patterns can you find?_
6 changes: 3 additions & 3 deletions docs/observable-latency.md
@@ -25,8 +25,8 @@ That said, a lot is going on in the original [custom heatmap component](https://
- Observable Plot and HTML Canvas code are intermixed in non-trivial ways
- Frame-based animation is used to progressively render the graphic, presumably to combat sluggish rendering

Here we re-create this graphic with [Mosaic vgplot](https://uwdata.github.io/mosaic/what-is-mosaic/), using a standalone specification.
We also leverage Mosaic's support for cross-chart linking and scalable filtering.
Here we re-create this graphic with [Mosaic vgplot](https://uwdata.github.io/mosaic/what-is-mosaic/), resulting in a simpler, standalone specification.
We further leverage Mosaic's support for cross-chart linking and scalable filtering for real-time updates.

```js
const $filter = vg.Selection.crossfilter();
@@ -92,7 +92,7 @@ vg.plot(
)
```

_Select bars in the chart of most-requested routes above to filter the heatmap and isolate patterns. Or, select a range in the heatmap to show just the corresponding routes._
_Select bars in the chart of most-requested routes above to filter the heatmap and isolate patterns. Or, select a range in the heatmap to show only corresponding routes._

## Implementation Notes

4 changes: 4 additions & 0 deletions docs/style.css
@@ -24,3 +24,7 @@
#observablehq-header a[target="_blank"]:not(:hover, :focus)::after {
color: var(--theme-foreground-muted);
}

.input label {
margin-right: 0.5em;
}
