
Commit 561a702

doc: Fix overloaded extract; polish CDXT examples (#18)
1 parent b786c75 commit 561a702

2 files changed (+36, -18 lines)

Makefile

Lines changed: 5 additions & 5 deletions
@@ -36,13 +36,13 @@ extract:
 	@echo "hint: python -m json.tool extraction.json"
 
 cdx_toolkit:
-	@echo look up this capture in the comoncrawl cdx index
-	#cdxt --cc --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
-	cdxt --limit 1 --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
+	@echo demonstrate that we have this entry in the index
+	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
 	@echo
-	@echo extract the content from the commoncrawl s3 bucket
+	@echo cleanup previous work
 	rm -f TEST-000000.extracted.warc.gz
-	cdxt --limit 1 --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete
+	@echo retrieve the content from the commoncrawl s3 bucket
+	cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete
 	@echo
 	@echo index this new warc
 	cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
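For reference, the `cdxt` CLI used in this target is cdx_toolkit's command-line front end, and the same lookup can be done from Python. The following is a minimal sketch, not part of this commit; the `from_ts`/`to` keyword names are assumed to mirror the CLI's `--from`/`--to` flags rather than taken from this repository.

```python
# Sketch: the "iter" step of the cdx_toolkit target, via the library API.
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')  # query Common Crawl's CDX index

url = 'an.wikipedia.org/wiki/Escopete'
for capture in cdx.iter(url, from_ts='20240518015810', to='20240518015810', limit=1):
    # each capture exposes CDX fields such as status, timestamp, and url
    print(capture['status'], capture['timestamp'], capture['url'])
```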

README.md

Lines changed: 31 additions & 13 deletions
@@ -1,6 +1,6 @@
 # Whirlwind Tour of Common Crawl's Datasets using Python
 
-The Common Crawl corpus contains petabytes of crawl data, including raw web page data, metadata extracts, and text extracts. Common Crawl's data storage is a little complicated, as you might expect for such a large and rich dataset. We make our crawl data available in a variety of formats (WARC, WET, WAT) and we also have two index files of the crawled webpages: CDXJ and columnar.
+The Common Crawl corpus contains petabytes of crawl data, including raw web page data, metadata, and parsed text. Common Crawl's data storage is a little complicated, as you might expect for such a large and rich dataset. We make our crawl data available in a variety of formats (WARC, WET, WAT) and we also have two index files of the crawled webpages: CDXJ and columnar.
 ```mermaid
 flowchart TD
 WEB["WEB"] -- crawler --> cc["Common Crawl"]
@@ -87,15 +87,15 @@ You'll see four records total, with the start of each record marked with the hea
 
 ### WET
 
-WET (WARC Encapsulated Text) files only contain the body text of web pages extracted from the HTML and exclude any HTML code, images, or other media. This makes them useful for text analysis and natural language processing (NLP) tasks.
+WET (WARC Encapsulated Text) files only contain the body text of web pages parsed from the HTML and exclude any HTML code, images, or other media. This makes them useful for text analysis and natural language processing (NLP) tasks.
 
 Open `whirlwind.warc.wet`: this is the WET derived from our original WARC. We can see that it's still in WARC format with two records:
 1) a `warcinfo` record.
-2) a `conversion` record: the extracted text with the HTTP headers removed.
+2) a `conversion` record: the parsed text with HTTP headers removed.
 
 ### WAT
 
-WAT (Web ARChive Timestamp) files contain metadata associated with the crawled web pages (e.g. parsed data from the HTTP response headers, links extracted from HTML pages, server response codes etc.). They are useful for analysis that requires understanding the structure of the web.
+WAT (Web ARChive Timestamp) files contain metadata associated with the crawled web pages (e.g. parsed data from the HTTP response headers, links recovered from HTML pages, server response codes etc.). They are useful for analysis that requires understanding the structure of the web.
 
 Open `whirlwind.warc.wat`: this is the WAT derived from our original WARC. Like the WET file, it's also in WARC format. It contains two records:
 1) a `warcinfo` record.
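To make the WET description above concrete, here is a minimal warcio sketch (not part of this commit) that prints the text held in the `conversion` record of `whirlwind.warc.wet`:

```python
# Sketch: read the plain text stored in a WET file's conversion record.
from warcio.archiveiterator import ArchiveIterator

with open('whirlwind.warc.wet', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'conversion':
            text = record.content_stream().read().decode('utf-8', errors='replace')
            print(text[:500])  # first 500 characters of the page text
```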
@@ -217,9 +217,9 @@ For each of these records, there's one text line in the index - yes, it's a flat
 
 What is the purpose of this funky format? It's done this way because these flat files (300 gigabytes total per crawl) can be sorted on the primary key using any out-of-core sort utility e.g. the standard Linux `sort`, or one of the Hadoop-based out-of-core sort functions.
 
-The JSON blob has enough information to extract individual records: it says which warc file the record is in, and the offset and length of the record. We'll use that in the next section.
+The JSON blob has enough information to cleanly isolate the raw data of a single record: it defines which WARC file the record is in, and the byte offset and length of the record within this file. We'll use that in the next section.
 
-## Task 4: Use the CDXJ index to extract raw content from the local WARC, WET, and WAT
+## Task 4: Use the CDXJ index to extract a subset of raw content from the local WARC, WET, and WAT
 
 Normally, compressed files aren't random access. However, the WARC files use a trick to make this possible, which is that every record needs to be separately compressed. The `gzip` compression utility supports this, but it's rarely used.
 
@@ -350,18 +350,19 @@ The output looks like this:
 <summary>Click to view output</summary>
 
 ```
-look up this capture in the comoncrawl cdx index for CC-MAIN-2024-22, returning only the first match:
-cdxt --limit 1 --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
+demonstrate that we have this entry in the index
+cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
 status 200, timestamp 20240518015810, url https://an.wikipedia.org/wiki/Escopete
 
-extract the content from the commoncrawl s3 bucket
+cleanup previous work
 rm -f TEST-000000.extracted.warc.gz
-cdxt --cc --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete
+retrieve the content from the commoncrawl s3 bucket
+cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 warc an.wikipedia.org/wiki/Escopete
 
 index this new warc
 cdxj-indexer TEST-000000.extracted.warc.gz > TEST-000000.extracted.warc.cdxj
 cat TEST-000000.extracted.warc.cdxj
-org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "379", "filename": "TEST-000000.extracted.warc.gz"}
+org,wikipedia,an)/wiki/escopete 20240518015810 {"url": "https://an.wikipedia.org/wiki/Escopete", "mime": "text/html", "status": "200", "digest": "sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU", "length": "17455", "offset": "406", "filename": "TEST-000000.extracted.warc.gz"}
 
 iterate this new warc
 python ./warcio-iterator.py TEST-000000.extracted.warc.gz
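The CDXJ line printed by `cat` in this output is exactly what Task 4's offset/length trick consumes. Here is a minimal sketch (not part of this commit) of reading that one record back out of the local WARC:

```python
# Sketch: use a CDXJ entry (surtkey, timestamp, JSON blob) to read one
# per-record-gzipped WARC record directly from disk.
import io
import json

from warcio.archiveiterator import ArchiveIterator

line = open('TEST-000000.extracted.warc.cdxj').readline()
surtkey, timestamp, blob = line.split(' ', 2)
info = json.loads(blob)

with open(info['filename'], 'rb') as warc:
    warc.seek(int(info['offset']))
    record_bytes = warc.read(int(info['length']))  # one complete gzip member

for record in ArchiveIterator(io.BytesIO(record_bytes)):
    print(record.rec_type, record.rec_headers.get_header('WARC-Target-URI'))
```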
@@ -372,9 +373,26 @@ python ./warcio-iterator.py TEST-000000.extracted.warc.gz
 
 </details>
 
-We look up the capture using the `cdxt` commands by specifying the exact URL (`an.wikipedia.org/wiki/Escopete`) and the date of its capture (20240518015810). The output is the WARC file `TEST-000000.extracted.warc.gz` which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested. The Makefile target then runs `cdxj-indexer` on this new WARC to make a CDXJ index of it as in Task 3, and finally iterates over the WARC using `warcio-iterator.py` as in Task 2.
+There's a lot going on here so let's unpack it a little.
 
-If you dig into cdx_toolkit's code, you'll find that it is using the offset and length of the WARC record, as returned by the CDX index query, to make a HTTP byte range request to S3 to download the single WARC record we want. It only downloads the response WARC record because our CDX index only has the response records indexed.
+#### Check that the crawl has a record for the page we are interested in
+
+We check for capture results using the `cdxt` command `iter`, specifying the exact URL `an.wikipedia.org/wiki/Escopete` and the timestamp range `--from 20240518015810 --to 20240518015810`. The result of this tells us that the crawl successfuly fetched this page at timestamp `20240518015810`.
+* Captures are named by the surtkey and the time.
+* Instead of `--crawl CC-MAIN-2024-22`, you could pass `--cc` to search across all crawls.
+* You can pass `--limit <N>` to limit the number of results returned - in this case because we have restricted the timestamp range to a single value, we only expect one result.
+* URLs may be specified with wildcards to return even more results: `"an.wikipedia.org/wiki/Escop*"` matches `an.wikipedia.org/wiki/Escopulión` and `an.wikipedia.org/wiki/Escopete`.
+
+#### Retrieve the fetched content as WARC
+
+Next, we use the `cdxt` command `warc` to retrieve the content and save it locally as a new WARC file, again specifying the exact URL, crawl identifier, and timestamp range. This creates the WARC file `TEST-000000.extracted.warc.gz` which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested.
+* If you dig into cdx_toolkit's code, you'll find that it is using the offset and length of the WARC record (as returned by the CDX index query) to make a HTTP byte range request to S3 that isolates and returns just the single record we want from the full file. It only downloads the response WARC record because our CDX index only has the response records indexed.
+* By default `cdxt` avoids overwriting existing files by automatically incrementing the counter in the filename. If you run this again without deleting `TEST-000000.extracted.warc.gz`, the data will be written again to a new file `TEST-000001.extracted.warc.gz`.
+* Limit, timestamp, and crawl index args, as well as URL wildcards, work as for `iter`.
+
+### Indexing the WARC and viewing its contents
+
+Finally, we run `cdxj-indexer` on this new WARC to make a CDXJ index of it as in Task 3, and then iterate over the WARC using `warcio-iterator.py` as in Task 2.
 
 ## Task 7: Find the right part of the columnar index
 
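To illustrate the byte-range fetch described in the bullet above, here is a rough sketch of what cdx_toolkit does under the hood. It is not this repository's code: the WARC path, offset, and length are placeholders standing in for the fields returned by a CDX index query, and `https://data.commoncrawl.org/` is Common Crawl's public HTTP endpoint for the crawl data.

```python
# Sketch: download a single WARC record with an HTTP Range request.
import io

import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholders: a real CDX query result supplies filename, offset, and length.
warc_path = 'crawl-data/CC-MAIN-2024-22/segments/.../warc/CC-MAIN-....warc.gz'
offset, length = 123456789, 17455

resp = requests.get(
    'https://data.commoncrawl.org/' + warc_path,
    headers={'Range': 'bytes={}-{}'.format(offset, offset + length - 1)},
)
resp.raise_for_status()  # expect 206 Partial Content

for record in ArchiveIterator(io.BytesIO(resp.content)):
    print(record.rec_type, record.rec_headers.get_header('WARC-Target-URI'))
```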