There is a known bug where warc-extractor.py does not handle Windows paths properly.
Windows places far more restrictions on what counts as a valid path than Linux does. Unfortunately, dealing with every unusual path name that WARC scrapers can pull in is a big job. Thankfully, most of these requests are 404 errors. Until a proper fix is implemented, it is recommended to run the -dump content command along with the http:error:200 filter. If path errors still persist on Windows, also use the -error flag to skip unworkable path names.
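To give a sense of what a proper fix might involve, here is a minimal sketch of cleaning one path component so Windows will accept it. This is not code from warc-extractor.py: the names `sanitize_component` and `sanitize_url_path` and the `_` replacement character are hypothetical, and the rules covered (forbidden characters, trailing dots and spaces, reserved device names) are only the best-known Windows restrictions, not an exhaustive list.

```python
import re

# Characters Windows forbids in file names, plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Device names Windows reserves, with or without an extension.
_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def sanitize_component(name, replacement="_"):
    """Return a version of one path component that Windows should accept.

    Hypothetical helper for illustration only.
    """
    cleaned = _FORBIDDEN.sub(replacement, name)
    # Windows silently strips trailing dots and spaces, so drop them up front.
    cleaned = cleaned.rstrip(" .")
    if not cleaned:
        cleaned = replacement
    # "con.txt" is as unusable as "con", so check the part before the first dot.
    stem = cleaned.split(".", 1)[0]
    if stem.upper() in _RESERVED:
        cleaned = replacement + cleaned
    return cleaned


def sanitize_url_path(url_path):
    """Split a URL path on '/' and sanitize each non-empty component."""
    return [sanitize_component(part) for part in url_path.split("/") if part]
```

Note that replacing characters can make two different URLs collide on the same file name, and very long paths can still fail, so a real fix would need to handle both cases as well.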
Gonna leave a comment here. I've modified the behaviour so that OS-specific save failures (most notably bad paths) are skipped automatically. The dump will leave a message in its output about each skipped file. The -error flag will still split all bad entries off into a new WARC file called errors.warc.
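For anyone curious what "skip automatically" could look like, here is a rough sketch of the general pattern, assuming each entry is a (relative path, bytes) pair. It is not the actual code in warc-extractor.py: `save_entries` and its arguments are hypothetical names, and the message format is made up for illustration.

```python
import os


def save_entries(entries, out_dir):
    """Write each (relative_path, data) pair under out_dir.

    Entries the operating system refuses to save are reported and
    skipped instead of aborting the whole dump. Hypothetical helper.
    """
    skipped = []
    for rel_path, data in entries:
        target = os.path.join(out_dir, rel_path)
        try:
            with open(target, "wb") as handle:
                handle.write(data)
        except OSError as exc:
            # Bad paths surface as OSError subclasses on every platform.
            print(f"Skipping {rel_path!r}: {exc}")
            skipped.append((rel_path, data))
    return skipped
```

The returned list holds the entries that could not be saved, which is the kind of collection that could then be written out separately.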