Skip to content
Joel Natividad edited this page Jul 7, 2023 · 13 revisions

Cookbook

Please feel free to add recipes to the Cookbook!

using qsv, ckanapi, jq and xargs.

  • get a CSV of datasets/users/groups/orgs in a CKAN instance
ckanapi -r https://demo.ckan.org dump datasets --all | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users --all | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups --all | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv
  • get a CSV of resources for a given dataset
ckanapi -r https://catalog.data.gov action package_show \
id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
| jq -c '.resources[]' \
| qsv jsonl \
> resources.csv
ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
> wellstar-oil-and-gas-wells.json
cat wellstar-oil-and-gas-wells.json \
| jq -c '.resources[] | select(.name=="CSV") | .url' \
| xargs -L 1 wget -O wellstar.csv
qsv stats --everything wellstar.csv > wellstar-stats.csv

Date Enrichment

qsv luajit map Quarter -x -f getquarter.lua nyc311samp.csv > result-qtr.csv

Note: Change the column name on line 7 of getquarter.lua to adapt it for your use.
Also, you need to have the date.lua module in the same directory.

  • Partition 311 files by Quarter. Create the files in the nyc311byqtr directory
qsv partition Quarter nyc311byqtr result-qtr.csv 
  • convert "Created Date" to ISO-8601 format
qsv apply datefmt "Created Date" nyc311samp.csv > result-iso8601.csv
  • format "Created Date" using the format string "%a %b %e %T %Y %z"
qsv apply datefmt "Created Date" --formatstr "%a %b %e %T %Y %z" nyc311samp.csv > result-datefmt.csv
  • create a "Created Year" column from "Created Date"
qsv apply datefmt "Created Date" --formatstr "%Y" --new-column "Created Year" nyc311samp.csv \
> result-year.csv
  • compute Turnaround Time. Store it in a new column named "TAT"
    Ingredient:
qsv luajit map TAT -x -f turnaroundtime.lua nyc311samp.csv > result-tat.csv
  • What is the average turnaround time in Brooklyn for 311 calls?
qsv search --select City --ignore-case brooklyn nyc311samp.csv \
| qsv luajit map TAT -x -f turnaroundtime.lua \
| tee brooklyn-311-details.csv \
| qsv stats --everything > result-brooklyn311-stats.csv

NOTE: The tee command reads from stdin and writes to both stdout and one or more files at the same time. We do this so we can create the brooklyn-311-details.csv file and pipe the same data to the qsv stats command.

Geocoding

  • using the "Location" column, geocode the nearest city in a new column named "City"
qsv apply geocode Location --new-column City nyc311samp.csv > result-geocoded.csv

Note: The bundled static geocoder uses the reverse-geocoder crate. It uses the geonames cities database, and geocodes to the closest city.

  • geocode the county in a new column
qsv apply geocode Location --new-column County --formatstr county nyc311samp.csv > result-county.csv

Multi-table join avoiding repeated columns

This example was inspired by having to combine multiple tables exported from another system, which were themselves from multiple database joins. Suppose you have have several tables (table_*.csv) which have the same first 10 columns, and then a varying number of additional columns. The column we want to join on is column 2, and for simplicity assume the rows all match perfectly (otherwise you would explore the left and right join options).

cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
    qsv join 2 combined.csv 1 <(qsv select 2,11- $NEXT) > new.csv
    mv new.csv combined.csv
done

We use a loop to perform multiple joins. Each time we use xsv select to pull out the index (join column 2) and the columns unique to that file (11 onwards), which could also be done with cut -s "," -f 1,11- $NEXT if preferred. The join column becomes column 1 of the intermediate file.

The proposed qsv join --merge option would stop duplication of the join column.

Little versus Big Endian Spreadsheets

As discussed in issue #609, the CPAN module Spreadsheet::WriteExcel does not generate the usual Little Endian XLS file which Excel produces. There may be other tools doing same.

That causes an issue for qsv search among others.

The quick workaround is to pipe through tr -d '\000'.

For Perl scripts the definitive workaround is to use Excel::Writer::XLSX from the same author which is plug in compatible and requires very little modification of the Perl code (search and replace).

Clone this wiki locally