Fix occasional bug in iterating over gzipped WARC's with missing headers #1097

dolsysmith · 2021-12-02T13:59:16Z

For at least one collection (0287d41512b3492b801db3256112c103), the Twitter REST exporter throws a UnicodeDecodeError. In this collection, the content-encoding header, which should be set to gzip, was either missing or duplicated with a different value for a certain number of lines in the warc.gz files. In these cases the warcio.WARCIterator class, which warc_iter.py uses to read the WARCs, falls back to a type of reader that does not decode the content properly. In every case tested, the affected content appears to be an empty bytestring.

Solution: in warc_iter.py, wrap the statement line = stream.readline().decode('utf-8') in a try/except block and skip the line if decoding fails.
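The proposed fix could look roughly like the sketch below. It is not the actual warc_iter.py code: the helper name iter_decoded_lines, the logger, and the sample bytes are hypothetical, and the surrounding loop is an assumption about how the stream is consumed. Only the guarded decode call reflects the change described above.

```python
import io
import logging

log = logging.getLogger(__name__)


def iter_decoded_lines(stream):
    """Yield UTF-8 decoded lines from a binary stream.

    Hypothetical helper: lines that are not valid UTF-8 are logged
    and skipped instead of raising UnicodeDecodeError.
    """
    while True:
        raw = stream.readline()
        if not raw:  # an empty bytes object signals end of stream
            break
        try:
            line = raw.decode('utf-8')
        except UnicodeDecodeError:
            # Assumption: the undecodable line carries no usable content,
            # so dropping it is safe.
            log.warning("Skipping line that is not valid UTF-8 (%d bytes)", len(raw))
            continue
        yield line


# Made-up sample: two JSON lines around a line of still-compressed bytes.
sample = io.BytesIO(b'{"id": 1}\n\x1f\x8b\x08\x00\n{"id": 2}\n')
lines = list(iter_decoded_lines(sample))
```

Logging the skipped line rather than dropping it silently keeps a record of how often the problem occurs, which may help confirm that only empty records are affected.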


dolsysmith added the bug and low effort level labels on Dec 2, 2021
