This file describes the steps taken to crawl the data.gov site in order to create a static version for deployment to Federalist.
In order to convert the current WordPress data.gov site into a static version, it is necessary to:
- Count site pages
- Crawl old site
- Clean crawled pages
- Deploy static site
- Confirm page count
In an attempt to quantify data.gov in terms of pages that need to be crawled and that should be available in a working static copy of the site, an initial crawl to get a count of the available pages is helpful. Below are my attempts at this, but `wget` is a big tool and I'd be happy if someone finds a better way. Go hog wild digging through the `wget` man page.
📢 Note: `wget` will likely get rate-limited unless either a `--wait` is used or your IP is whitelisted to blast away at the poor site. This is true for all following `wget` calls.
A bash one-liner to do that looks something like:
wget -e robots=off -U mozilla --spider -o datagovlog -rpE -np --domains www.data.gov,data.gov -l inf www.data.gov
and line by line is (or just check the Explain Shell):

- `wget`: the command itself
- `--execute robots=off`: executes the command to ignore robots.txt
- `--user-agent mozilla`: sets the user agent to mozilla
- `--spider`: spider mode; checks the existence of pages, downloads to temp only
- `--output-file=<log file>`: creates an output log file
- `--recursive`: recursively searches through html files for new links
- `--page-requisites`: also fetches the assets (images, CSS, etc.) needed to display each page
- `--adjust-extension`: saves files with the appropriate extension
- `--domains <domain>`: limits the crawl to the given domain(s)
- `--no-parent`: does not search upstream of the parent
- `--level=inf`: recursive search depth (infinite)
- `<url>`: the URL to crawl
For reference, on my machine and connection with no `--wait`, the crawl takes about 15 minutes. With `--wait=1`, it takes over an hour to complete.
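For example, the throttled spider crawl is just the same command with `--wait=1` added:

```bash
# The spider crawl from above, throttled to roughly one request per second.
wget -e robots=off -U mozilla --spider -o datagovlog -rpE -np --wait=1 \
  --domains www.data.gov,data.gov -l inf www.data.gov
```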
To summarize the log file by counting `20X` response codes:
grep -B 2 '20*' datagovlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l
and for counting `404` responses:
grep -B 2 '404' datagovlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l
In the case that you'd like to return a list of URLs, remove the last statement (`| wc -l`) on each of these. For specific response codes, replace `20*` with a specific code: `200` or `404`.
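For a quick breakdown of every status code in one go, something like the following should also work (a sketch, assuming the log is named `datagovlog` and contains the usual `HTTP request sent, awaiting response... <code>` lines):

```bash
# Tally each HTTP status code recorded in the wget spider log.
grep -o 'awaiting response\.\.\. [0-9]\{3\}' datagovlog \
  | awk '{print $NF}' \
  | sort | uniq -c | sort -rn
```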
For context, here are the results of a recent run (on October 21, 2021):

| Code | Count |
|---|---|
| 20* | 2,067 |
| 200 | 1,431 |
| 404 | 375 |
| 500 | 2 |
I'm not sure how useful these numbers are on their own, but in theory, once the site is deployed to Federalist, the counts should line up if everything was captured correctly.
The crawl is done using `wget` and largely as described by the link in Bret's original sketch of the story: Linux Journal - Downloading an Entire Website with wget.
The final command as a one-liner to crawl www.data.gov is:
wget -e robots=off -U mozilla --recursive --page-requisites --html-extension --domains www.data.gov,data.gov --no-parent --level=inf --convert-links --restrict-file-names=windows www.data.gov
It is also possible that `--mirror` mode might be the better option.
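For reference, a `--mirror` variant of the command might look like the following (not verified here; `--mirror` is shorthand for `-r -N -l inf --no-remove-listing`, so the main difference is timestamping):

```bash
# Possible --mirror variant of the crawl command above (untested against data.gov).
wget -e robots=off -U mozilla --mirror --page-requisites --html-extension \
  --domains www.data.gov,data.gov --no-parent --convert-links \
  --restrict-file-names=windows www.data.gov
```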
The final output of a successful run:
(many lines like this above)
...
--2021-10-22 09:37:25-- https://www.data.gov/developers/page/38/
Reusing existing connection to www.data.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.data.gov/developers/page/38/index.html’
www.data.gov/developers/page [ <=> ] 34.31K --.-KB/s in 0.001s
2021-10-22 09:37:25 (29.5 MB/s) - ‘www.data.gov/developers/page/38/index.html’ saved [35136]
FINISHED --2021-10-22 09:37:25--
Total wall clock time: 9m 11s
Downloaded: 2283 files, 69M in 14s (5.06 MB/s)
Converting links in www.data.gov/climate/humanhealth/highlights@currentpage=2.html... 68-15
Converting links in www.data.gov/climate/ecosystem-vulnerability-launch/index.html... 58-9
...
(many lines like this)
...
Converting links in www.data.gov/app/themes/roots-nextdatagov/assets/LeafletMap2/Leaflet.defaultextent/dist/leaflet.defaultextent.css@ver=5.5.6.css... 2-0
Converted links in 993 files in 13 seconds.
Oh man, there ends up being a lot of cleaning up to do. While `--convert-links` does a great job, there remain many issues. The most common seem to be 'versioning' in file names, references to those bad file names, and links that were not converted to local references.
The main type of issue has been when a file is saved with a strange extension, for example: `some_page_index.css?ver=2.2.2`. The problem is that Federalist then interprets this example file as being of type `.2` instead of `.css` and doesn't set the MIME type correctly. To fix this, the files need to be renamed with their appropriate extensions, and references to them updated to the new file names.
One way of achieving this (with your `pwd` as the root of the crawl) is:

📢 Test what this will do first by cutting `mv "$fname" "${fname%@*}";` out to do a dry run.
find . -iname "*@ver=*" | while read -r fname; do echo "$fname --> ${fname%@*}"; mv "$fname" "${fname%@*}"; done
Also, to note, this will work only for files that match the `@ver=` syntax. There will likely also be files that match `?ver=`, `%3Fver=`, and perhaps other patterns. All of these will need to be addressed by searching for the matching patterns and changing the above command to rename them; one possible approach is sketched below.
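The following sketch handles several suffix patterns in a single pass; it assumes the three patterns listed are the only ones present and that your `pwd` is still the root of the crawl:

```bash
# Strip several version-suffix patterns from crawled file names.
# Remove the `mv` line first to do a dry run.
for sep in '@ver=' '?ver=' '%3Fver='; do
  find . -type f -name "*${sep}*" | while read -r fname; do
    clean="${fname%%"${sep}"*}"   # drop everything from the separator onward
    echo "$fname --> $clean"
    mv "$fname" "$clean"
  done
done
```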
Due to the renaming above, the links that reference the old versioned file names now point to files that no longer exist, so those links must be updated as well. I'm sure there is some great way of doing this, but I relied upon the VSCode Search & Replace interface and a bunch of regex. This interface is nice because VSCode runs the regex as you work on it, showing matching results and replacements in real time.
Examples:

- `(href=")(.*?)(\?ver=.*?)(")` replacing with `$1$2$4`. This returns about 8k results.
- `(src=')(.*?)(\?ver=.*?)(')` replacing with `$1$2$4`. This returns about 11k results.
To check for any additional patterns or links that may have been missed, searching for `ver=` should return any remaining ones.
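If a non-interactive option is preferred over VSCode, roughly equivalent replacements can be done with GNU `sed` (a sketch, not the approach actually used here, so review the results carefully):

```bash
# Strip ?ver=... query strings from href="..." and src='...' attributes
# in all crawled HTML files.
find . -name "*.html" -print0 | xargs -0 sed -i -E \
  -e 's/(href=")([^"]*)\?ver=[^"]*(")/\1\2\3/g' \
  -e "s/(src=')([^']*)\?ver=[^']*(')/\1\2\3/g"
```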
While `--convert-links` converts links from the original domain to be locally referenced, many links still contain `www.data.gov` and `data.gov`. I'm not quite sure why most are caught but some are not, though it appears to be more common in nested pages. In any case, similarly to the above, VSCode's Search and Replace interface is handy for targeting these. To initially search for them, just use `www.data.gov`.
Ideally, replacing a long URL like this with a local reference will result in links that work correctly for their given page. For example:

`<a href="/ocean/page/ocean-technical">Get Involved</a>`

instead of:

`<a href="https://www.data.gov/ocean/page/ocean-technical">Get Involved</a>`
The real trick, however, is to write a regex that works well for nested links and does not result in links on a page like `http://localhost:8000/energy/energy-apps/` going to `http://localhost:8000/energy/energy-apps/energy/page/energy-maps` instead of the correct `localhost:8000/energy/page/energy-maps`. I never got one working quite correctly, but good luck!
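One way to sidestep the nested-link problem is to rewrite the remaining absolute URLs as root-relative paths rather than page-relative ones, for example with GNU `sed` (a sketch; it assumes the static copy is served from the domain root, which may not hold for a Federalist preview URL that lives under a path prefix):

```bash
# Rewrite any remaining absolute data.gov links as root-relative paths,
# e.g. https://www.data.gov/ocean/... becomes /ocean/...
find . -name "*.html" -print0 | xargs -0 sed -i -E \
  's#https?://(www\.)?data\.gov/#/#g'
```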
The deployment in this case is relatively straightforward. We are using Federalist to host the static site for the time being. The Federalist documentation can be used to set this up, but the current site should also be available as gsa/datagov-website (a list of builds and preview site links are also available).
Once a static version of the site is deployed onto Federalist, confirming that it is a complete copy is necessary.
Thankfully, the steps in Count Pages can be repeated, with the `<url>` being the new Federalist URL and the `<log file>` probably pointing to a new file as well.
Ideally, once crawled and then summarized, the counts should be comparable.
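As a sketch, with `FEDERALIST_URL` standing in for the actual preview address (a placeholder, not the real URL), the re-count might look like:

```bash
# Re-run the spider count against the deployed Federalist copy.
FEDERALIST_URL="https://federalist-preview.example.gov"   # placeholder
wget -e robots=off -U mozilla --spider -o federalistlog -rpE -np \
  --domains "${FEDERALIST_URL#https://}" -l inf "$FEDERALIST_URL"
grep -B 2 '20*' federalistlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l
```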