Indie Map is a public IndieWeb social graph and dataset. See the docs for details. This doc focuses on how to develop, run, and maintain Indie Map itself.
The individual sites and pages retain their original copyright. The rest of the dataset and this project are placed into the public domain via the CC0 public domain dedication.
The crawler is basically just `xargs wget -r < domains.txt`. Details in `crawl.sh` and `wget.sh`.
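As a rough sketch of the idea only (hypothetical flags; see `crawl.sh` and `wget.sh` for what actually runs):

```sh
# Hypothetical sketch, not the actual crawl.sh/wget.sh.
# Crawl each domain into its own WARC, four domains at a time.
xargs -P 4 -I {} \
    wget --recursive --level=inf --no-parent \
         --warc-file={} --domains={} https://{}/ \
    < domains.txt
```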
To add a site to the dataset, run `crawl/add_site_1.sh`, then `crawl/add_site_2.sh`. They're separate because `add_site_2.sh` runs ~300GB of BigQuery queries, which costs ~$1.50, so if you're adding lots of sites, run `add_site_1.sh` once for each site and then `add_site_2.sh` once for all the sites together.
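For example, assuming each script takes the domain as its argument (check the scripts themselves for their exact usage):

```sh
crawl/add_site_1.sh alice.example.com
crawl/add_site_1.sh bob.example.com
# ...once per site, then just once at the end:
crawl/add_site_2.sh
```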
Run this shell command to fetch 2017.indieweb.org, extract the attendees' web sites, and compare them to `2017/domains_iws.txt` to see if there are any new ones:
```sh
curl https://2017.indieweb.org/ \
  | grep -A2 'class="profile-info"' \
  | grep -o -E 'http[^"]+' \
  | grep -v www.facebook.com \
  | cut -d/ -f3 | sort | uniq \
  | diff ~/src/indie-map/crawl/2017/domains_iws.txt -
```
Run this shell command to fetch a page of IndieWeb wiki users and extract their web site domains:
```sh
curl 'https://indieweb.org/wiki/index.php?title=Special:ListUsers&limit=500' \
  | grep -o -E '"User:[^" ]+' | cut -d: -f2 | grep .
```
...and then follow the next page links, e.g. `&offset=Homefries.org`, `&offset=...`
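If you want to automate the paging, here's a hedged sketch (hypothetical, not part of the repo; the offset value may need URL-escaping for unusual user names):

```sh
# Page through Special:ListUsers by hand: MediaWiki continues
# each page from &offset=<last user seen on the previous page>.
offset=''
while true; do
  users=$(curl -s "https://indieweb.org/wiki/index.php?title=Special:ListUsers&limit=500&offset=$offset" \
          | grep -o -E '"User:[^" ]+' | cut -d: -f2 | grep .)
  [ -z "$users" ] && break
  echo "$users"
  offset=$(echo "$users" | tail -n 1)   # next page starts after this user
done
```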
When I hit an error like this:

```
BigQuery error in load operation: Error processing job
'indie-map:bqjob_r9dcbd6e76f81db0_0000015c29d469ee_1': JSON table encountered too many
errors, giving up. Rows: 57; errors: 1.
Failure details:
- file-00000000: JSON parsing error in row starting at position
3088711: . Only optional fields can be set to NULL. Field: names;
Value: NULL
```
...I deleted the offending null values manually, e.g.:

```sh
# find the rows with null names
grep '"names": \[[^]]*null[^]]*\]' sites.json | cut -d, -f1
# then strip the nulls
sed -i '' 's/"names":\ \[null\]/"names":\ []/' sites.json
sed -i '' 's/"names":\ \[null,\ /"names":\ [/' sites.json
sed -i '' 's/"names":\ \["Jeremy Zilar",\ null,/"names":\ ["Jeremy Zilar",/' sites.json
```
To find the sites that redirected or returned 403s, this csh loop prints each WARC's initial HTTP status lines and `Location` headers:

```sh
foreach f (*.warc.gz)
  echo --
  echo $f
  gzcat $f | head -n 100 | grep -m 2 -A20 -E '^HTTP/1\.[01] (30.|403)' | grep -E '^(HTTP/1|Location:)'
end
```
...then massage the output manually, e.g. with these Emacs regexp replaces:

```
.+\.warc\.gz
--
^\(.+\)\.warc\.gz
.+
[Ll]ocation: https://\1/?.*
\(HTTP.*
\)?--
^\(.+\)\.warc\.gz
.+
[Ll]ocation: /.*
\(HTTP.*
\)?--
^.+.warc.gz
HTTP/1.1 403 Forbidden
--
^.+\.warc\.gz
.+
[Ll]ocation: https?://\([^/]+\)/?.*
\(HTTP.*
\)?--
=>
\1
```
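An alternative that skips the manual massaging would be to pull the same information out of the WARCs with warcio directly. A hypothetical sketch, not what I actually ran:

```python
# Print each WARC's first response if it's a redirect or 403,
# along with its Location header.
import glob

from warcio.archiveiterator import ArchiveIterator

for path in glob.glob('*.warc.gz'):
    with open(path, 'rb') as f:
        for record in ArchiveIterator(f):
            if record.rec_type != 'response':
                continue
            status = record.http_headers.get_statuscode()
            if status.startswith('30') or status == '403':
                print(path, status, record.http_headers.get_header('Location'))
            break  # only look at each site's first response
```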
The Indie Map web site is stored and served on Firebase Hosting. I followed Firebase's setup instructions:

```sh
npm install -g firebase-tools   # or brew install firebase-cli
firebase login
firebase init
firebase deploy
```
I could then see the site serving on indie-map.firebaseapp.com, and I could manage it in the Firebase console. All I had to do then was connect the indiemap.org domain and the www subdomain, and I was all set.
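`firebase init` generates a `firebase.json` config. Mine presumably looks something like this, a guess based on the `www/` directory used elsewhere in this doc, not the actual file:

```json
{
  "hosting": {
    "public": "www",
    "ignore": ["firebase.json", "**/.*", "**/node_modules/**"]
  }
}
```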
Indie Map used to serve from Google Cloud Storage. Here's what I did originally to set that up, and to store HTTP request logs in `gs://indie-map/`:

```sh
gsutil mb gs://www.indiemap.org
gsutil cp www/index.html gs://www.indiemap.org/
gsutil cp www/docs.html gs://www.indiemap.org/
gsutil cp www/404.html gs://www.indiemap.org/
gsutil acl ch -u AllUsers:R gs://www.indiemap.org/index.html
gsutil web set -m index.html -e 404.html gs://www.indiemap.org

# https://cloud.google.com/storage/docs/access-logs
gsutil acl ch -g cloud-storage-analytics@google.com:W gs://indie-map
gsutil logging set on -b gs://indie-map -o logs/ gs://www.indiemap.org/

# https://cloud.google.com/storage/docs/access-control/create-manage-lists#defaultobjects
gsutil defacl ch -u AllUsers:READER gs://www.indiemap.org
```
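To double-check that the access-log config stuck (just a sanity check, not part of the original setup):

```sh
gsutil logging get gs://www.indiemap.org
```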
I ran these commands to copy the initial WARCs, BigQuery input JSON files, and social graph files to GCS and load them into BigQuery:
```sh
gsutil -m cp -L cp.log *.warc.gz gs://indie-map/crawl/
gsutil -m cp -L cp.log *.json.gz gs://indie-map/bigquery/

# can gsutil -m rsync ... as final sync. can also sanity check with:
ls -1 *.{json,warc}.gz | wc
gsutil ls gs://indie-map/{crawl,bigquery}/ | wc
du -c -b *.{json,warc}.gz | wc
gsutil du -sc gs://indie-map/{crawl,bigquery}/ | wc

# https://cloud.google.com/bigquery/bq-command-line-tool
bq load --replace --autodetect --source_format=NEWLINE_DELIMITED_JSON indiemap.pages gs://indie-map/bigquery/\*.json.gz
bq load --replace --autodetect --source_format=NEWLINE_DELIMITED_JSON indiemap.sites sites.json.gz
# then watch job at https://bigquery.cloud.google.com/jobs/indie-map
```
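After the loads finish, the table row counts should roughly match the input line counts. One way to check, a suggestion rather than part of the original workflow:

```sh
bq query --nouse_legacy_sql 'SELECT COUNT(*) FROM indiemap.pages'
bq query --nouse_legacy_sql 'SELECT COUNT(*) FROM indiemap.sites'
```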
To inspect a specific record in a WARC file in Python:

```python
import gzip

from bs4 import BeautifulSoup, UnicodeDammit
import mf2py
from warcio.archiveiterator import ArchiveIterator

# pull out a single record by index, e.g. record 178
with gzip.open('FILE.warc.gz', 'rb') as f:
    for i, record in enumerate(ArchiveIterator(f)):
        if i == 178:
            # UnicodeDammit sniffs the encoding and decodes the body
            body = UnicodeDammit(record.content_stream().read()).unicode_markup
            break

print(record.rec_type)
print(record.rec_headers.get_header('WARC-Target-URI'))
print(body)
# from here, BeautifulSoup(body, 'html.parser') and mf2py.parse(doc=body)
# are handy for digging into the HTML and its microformats
```
I've gotten one request so far to remove a site. Here's what I did.
- Ran `git grep DOMAIN` and removed all matches, notably in `crawl/[YEAR]/domains*.txt`.
- Ran these commands to remove the WARC, JSON, and other files:

  ```sh
  # in the repo root dir
  rm www/DOMAIN.json
  firebase deploy
  gsutil rm gs://indie-map/crawl/DOMAIN.warc.gz
  gsutil rm gs://indie-map/bigquery/DOMAIN.json.gz
  # check that there's nothing left
  gsutil ls 'gs://indie-map/**/*DOMAIN*'
  ```

- Ran these `DELETE` statements in BigQuery:

  ```sql
  DELETE FROM indiemap.sites WHERE domain='DOMAIN'
  DELETE FROM indiemap.pages WHERE domain='DOMAIN'
  ```

- Deleted the domain in Kumu manually. Try going to `https://kumu.io/snarfed/indie-map#indie-map/DOMAIN_WITHOUT_DOTS`; if that doesn't find it, search for the domain. Then click the Trash icon in the right pane.
- Added a note to the docs.
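If removals become common, the command-line steps above (everything except Kumu and the docs note) could be wrapped in a small script. A hypothetical sketch; no such script exists in the repo:

```sh
#!/bin/bash
# remove_site.sh DOMAIN - hypothetical helper, run from the repo root
DOMAIN=$1
rm www/$DOMAIN.json
firebase deploy
gsutil rm gs://indie-map/crawl/$DOMAIN.warc.gz \
          gs://indie-map/bigquery/$DOMAIN.json.gz
bq query --nouse_legacy_sql "DELETE FROM indiemap.sites WHERE domain='$DOMAIN'"
bq query --nouse_legacy_sql "DELETE FROM indiemap.pages WHERE domain='$DOMAIN'"
gsutil ls "gs://indie-map/**/*$DOMAIN*"  # should now match nothing
```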