Cookbook

Please feel free to add recipes to the Cookbook!

CKAN

get a CSV of datasets/users/groups/orgs in a CKAN instance

ckanapi -r https://demo.ckan.org dump datasets --all | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users --all | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups --all | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv

get a CSV of resources for a given dataset

ckanapi -r https://catalog.data.gov action package_show \
id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
| jq -c '.resources[]' \
| qsv jsonl \
> resources.csv

get the latest version of a CSV resource for wellstar-oil-and-gas-wells1 on https://data.cnra.ca.gov, run stats on it

ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
> wellstar-oil-and-gas-wells.json
cat wellstar-oil-and-gas-wells.json \
| jq -c '.resources[] | select(.name=="CSV") | .url' \
| xargs -L 1 wget -O wellstar.csv
qsv stats --everything wellstar.csv > wellstar-stats.csv

Date Enrichment

add a column with Quarter information calculated from "Created Date"
Ingredients:
- nyc311samp.csv
- getquarter.lua
- date.lua module - can be downloaded from src at /~https://github.com/Tieske/date.

qsv luajit map Quarter -x -f getquarter.lua nyc311samp.csv > result-qtr.csv

Note: Change the column name on line 7 of getquarter.lua to adapt it for your use.
Also, you need to have the date.lua module in the same directory.

Partition 311 files by Quarter. Create the files in the nyc311byqtr directory

qsv partition Quarter nyc311byqtr result-qtr.csv

convert "Created Date" to ISO-8601 format

qsv apply datefmt "Created Date" nyc311samp.csv > result-iso8601.csv

format "Created Date" using the format string "%a %b %e %T %Y %z"

qsv apply datefmt "Created Date" --formatstr "%a %b %e %T %Y %z" nyc311samp.csv > result-datefmt.csv

create a "Created Year" column from "Created Date"

qsv apply datefmt "Created Date" --formatstr "%Y" --new-column "Created Year" nyc311samp.csv \
> result-year.csv

compute Turnaround Time. Store it in a new column named "TAT"
Ingredient:
- turnaroundtime.lua

qsv luajit map TAT -x -f turnaroundtime.lua nyc311samp.csv > result-tat.csv

What is the average turnaround time in Brooklyn for 311 calls?

qsv search --select City --ignore-case brooklyn nyc311samp.csv \
| qsv luajit map TAT -x -f turnaroundtime.lua \
| tee brooklyn-311-details.csv \
| qsv stats --everything > result-brooklyn311-stats.csv

NOTE: The tee command reads from stdin and writes to both stdout and one or more files at the same time. We do this so we can create the brooklyn-311-details.csv file and pipe the same data to the qsv stats command.

Geocoding

using the "Location" column, geocode the nearest city in a new column named "City"

qsv apply geocode Location --new-column City nyc311samp.csv > result-geocoded.csv

Note: The bundled static geocoder uses the reverse-geocoder crate. It uses the geonames cities database, and geocodes to the closest city.

geocode the county in a new column

qsv apply geocode Location --new-column County --formatstr county nyc311samp.csv > result-county.csv

Multi-table join avoiding repeated columns

This example was inspired by having to combine multiple tables exported from another system, which were themselves from multiple database joins. Suppose you have have several tables (table_*.csv) which have the same first 10 columns, and then a varying number of additional columns. The column we want to join on is column 2, and for simplicity assume the rows all match perfectly (otherwise you would explore the left and right join options).

cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
    qsv join 2 combined.csv 1 <(qsv select 2,11- $NEXT) > new.csv
    mv new.csv combined.csv
done

We use a loop to perform multiple joins. Each time we use xsv select to pull out the index (join column 2) and the columns unique to that file (11 onwards), which could also be done with cut -s "," -f 1,11- $NEXT if preferred. The join column becomes column 1 of the intermediate file.

The proposed qsv join --merge option would stop duplication of the join column.

Little versus Big Endian Spreadsheets

As discussed in issue #609, the CPAN module Spreadsheet::WriteExcel does not generate the usual Little Endian XLS file which Excel produces. There may be other tools doing same.

That causes an issue for qsv search among others.

The quick workaround is to pipe through tr -d '\000'.

For Perl scripts the definitive workaround is to use Excel::Writer::XLSX from the same author which is plug in compatible and requires very little modification of the Perl code (search and replace).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cookbook

Cookbook

CKAN

Date Enrichment

Geocoding

Multi-table join avoiding repeated columns

Little versus Big Endian Spreadsheets

Clone this wiki locally