-
Notifications
You must be signed in to change notification settings - Fork 74
Cookbook
Please feel free to add recipes to the Cookbook!
using qsv, ckanapi, jq and xargs.
- get a CSV of datasets/users/groups/orgs in a CKAN instance
ckanapi -r https://demo.ckan.org dump datasets --all | qsv jsonl > datasets.csv
ckanapi -r https://demo.ckan.org dump users --all | qsv jsonl > users.csv
ckanapi -r https://demo.ckan.org dump groups --all | qsv jsonl > groups.csv
ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv
- get a CSV of resources for a given dataset
ckanapi -r https://catalog.data.gov action package_show \
id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
| jq -c '.resources[]' \
| qsv jsonl \
> resources.csv
- get the latest version of a CSV resource for
wellstar-oil-and-gas-wells1
on https://data.cnra.ca.gov, run stats on it
ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
> wellstar-oil-and-gas-wells.json
cat wellstar-oil-and-gas-wells.json \
| jq -c '.resources[] | select(.name=="CSV") | .url' \
| xargs -L 1 wget -O wellstar.csv
qsv stats --everything wellstar.csv > wellstar-stats.csv
- add a column with Quarter information calculated from "Created Date"
Ingredients:- nyc311samp.csv
- getquarter.lua
-
date.lua
module - can be downloaded fromsrc
at https://github.com/Tieske/date.
qsv luajit map Quarter -x -f getquarter.lua nyc311samp.csv > result-qtr.csv
Note: Change the column name on line 7 of
getquarter.lua
to adapt it for your use.
Also, you need to have thedate.lua
module in the same directory.
- Partition 311 files by Quarter. Create the files in the
nyc311byqtr
directory
qsv partition Quarter nyc311byqtr result-qtr.csv
- convert "Created Date" to ISO-8601 format
qsv apply datefmt "Created Date" nyc311samp.csv > result-iso8601.csv
- format "Created Date" using the format string "%a %b %e %T %Y %z"
qsv apply datefmt "Created Date" --formatstr "%a %b %e %T %Y %z" nyc311samp.csv > result-datefmt.csv
- create a "Created Year" column from "Created Date"
qsv apply datefmt "Created Date" --formatstr "%Y" --new-column "Created Year" nyc311samp.csv \
> result-year.csv
- compute Turnaround Time. Store it in a new column named "TAT"
Ingredient:
qsv luajit map TAT -x -f turnaroundtime.lua nyc311samp.csv > result-tat.csv
- What is the average turnaround time in Brooklyn for 311 calls?
qsv search --select City --ignore-case brooklyn nyc311samp.csv \
| qsv luajit map TAT -x -f turnaroundtime.lua \
| tee brooklyn-311-details.csv \
| qsv stats --everything > result-brooklyn311-stats.csv
NOTE: The
tee
command reads from stdin and writes to both stdout and one or more files at the same time. We do this so we can create thebrooklyn-311-details.csv
file and pipe the same data to theqsv stats
command.
- using the "Location" column, geocode the nearest city in a new column named "City"
qsv apply geocode Location --new-column City nyc311samp.csv > result-geocoded.csv
Note: The bundled static geocoder uses the
reverse-geocoder
crate. It uses the geonames cities database, and geocodes to the closest city.
- geocode the county in a new column
qsv apply geocode Location --new-column County --formatstr county nyc311samp.csv > result-county.csv
This example was inspired by having to combine multiple tables exported from another system, which were themselves from multiple database joins. Suppose you have have several tables (table_*.csv
) which have the same first 10 columns, and then a varying number of additional columns. The column we want to join on is column 2, and for simplicity assume the rows all match perfectly (otherwise you would explore the left and right join options).
cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
qsv join 2 combined.csv 1 <(qsv select 2,11- $NEXT) > new.csv
mv new.csv combined.csv
done
We use a loop to perform multiple joins. Each time we use xsv select
to pull out the index (join column 2) and the columns unique to that file (11 onwards), which could also be done with cut -s "," -f 1,11- $NEXT
if preferred. The join column becomes column 1 of the intermediate file.
The proposed qsv join --merge
option would stop duplication of the join column.
As discussed in issue #609, the CPAN module Spreadsheet::WriteExcel
does not generate the usual Little Endian XLS file which Excel produces. There may be other tools doing same.
That causes an issue for qsv search
among others.
The quick workaround is to pipe through tr -d '\000'
.
For Perl scripts the definitive workaround is to use Excel::Writer::XLSX
from the same author which is plug in compatible and requires very little modification of the Perl code (search and replace).
CSVs from a 3rd party are unreliable. Over time, the header formatting may change, a header may get added, or removed. The following script will concatenate hundreds of csv files together that share the same basic set of headers, with marginal differences.
e.g.
Input CSV #1
one,two
hello,world
Input CSV #2
ONE,THREE
more,data
Output CSV
one,two,three
hello,world,
more,,data
#!/bin/bash
# bash scripts/fix.sh ./fixed "*.csv"
DIST=$1
TARGETS=$2
mkdir -p $DIST
# Step 1: collect the unique header names by parsing all the csv files
unique_headers_list=$(qsv headers -j --intersect $TARGETS | tr '[:upper:]' '[:lower:]' | sort -u)
unique_headers_comma=$(echo $unique_headers_list | tr ' ' ',')
# Step 2: For each CSV file, add missing headers with default values
FILES=($TARGETS)
for file in "${FILES[@]}"; do
FILENAME=${file##*/}
OUT="$DIST/$FILENAME"
# make headers lowercase
LOWERCASE=$OUT.lowercase
qsv input $file | awk 'NR==1 {print tolower($0)} NR>1 {print}' > $LOWERCASE
mv $LOWERCASE $OUT
# find the missing headers, if any
MISSING_HEADERS=$(echo -e "$(qsv headers -j $OUT)\n$unique_headers_list" | sort | uniq -u | tr '\n' ',' | sed 's/,$//' | sed 's/^,//')
# add missing headers with empty values
FULL=$OUT.full
echo $MISSING_HEADERS | qsv cat columns -o $FULL -p $OUT -
mv $FULL $OUT
# the order of headers cannot be assumed, they must be selected by name
ORDERED=$OUT.ordered
qsv select $unique_headers_comma -o $ORDERED $OUT
mv $ORDERED $OUT
done