For at least one collection (0287d41512b3492b801db3256112c103), the Twitter REST exporter throws a `UnicodeDecodeError`. In this case, the `content-encoding` header, which should be set to `gzip`, was either missing or duplicated with a different value for a certain number of lines in the `warc.gz` files. The `warcio.WARCIterator` class, which `warc_iter.py` uses to read the WARCs, falls back in these cases to a type of reader that does not allow the content to be decoded properly; in every case tested, that content appears to be an empty bytestring.
Solution: in `warc_iter.py`, wrap the statement `line = stream.readline().decode('utf-8')` in a try/except block and skip the line if decoding fails.
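A minimal sketch of the proposed workaround, assuming the read loop in `warc_iter.py` pulls lines from a binary stream with `readline()`. The helper name `iter_decoded_lines` and the sample data are hypothetical and only illustrate the try/except pattern; the surrounding code in `warc_iter.py` may be structured differently.

```python
import io
import logging

log = logging.getLogger(__name__)


def iter_decoded_lines(stream):
    """Yield UTF-8 decoded lines from a binary stream.

    Hypothetical helper: lines that cannot be decoded are logged and skipped
    instead of aborting the whole export.
    """
    while True:
        raw = stream.readline()
        if not raw:  # readline() returns b'' at end of stream
            break
        try:
            line = raw.decode('utf-8')
        except UnicodeDecodeError:
            log.warning("Skipping line that is not valid UTF-8 (%d bytes)", len(raw))
            continue
        yield line


# Made-up sample: the middle line starts with gzip magic bytes, which are not valid UTF-8.
sample = io.BytesIO(b'{"id": 1}\n\x1f\x8b\x08\x00\n{"id": 2}\n')
decoded = list(iter_decoded_lines(sample))
```

Logging each skipped line, rather than discarding it silently, makes it possible to check afterwards how many lines were dropped for a given collection.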