Paths in Windows. #1

recrm · 2015-01-29T00:01:13Z

There is a known bug where warc-extractor.py does not handle Windows paths properly.

Windows places far more restrictions on what counts as a valid path than Linux does. Unfortunately, dealing with all of the possible crazy path names that WARC scrapers can sometimes pull is a big job. Thankfully, most of these requests are 404 errors. Until a proper fix is implemented, it is recommended to run the -dump content command along with the http:error:200 filter. If path errors still persist on Windows, also use the -error flag to skip unworkable path names.
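To give a rough idea of what "restrictions" means here, the sketch below checks a single path component against the file naming rules Windows documents (forbidden characters, reserved device names, trailing dots or spaces). The function name `windows_component_problem` is hypothetical and is not part of warc-extractor.py; it only illustrates why URL-derived names such as `search?q=1` fail on Windows.

```python
import re

# Characters Windows does not allow in a file name, plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Device names Windows reserves, with or without an extension.
_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def windows_component_problem(name):
    """Return a short reason if `name` is not a usable Windows file name, else None.

    Hypothetical helper for illustration only; it checks one path component,
    not a full path, and ignores path length limits.
    """
    if not name:
        return "empty name"
    if _FORBIDDEN.search(name):
        return "forbidden character"
    if name[-1] in ". ":
        return "trailing dot or space"
    # "con.txt" is treated like the CON device, so compare the part before the first dot.
    if name.split(".")[0].upper() in _RESERVED:
        return "reserved device name"
    return None


if __name__ == "__main__":
    for candidate in ["index.html", "search?q=1", "con.txt", "page."]:
        print(candidate, "->", windows_component_problem(candidate))
```

Names taken straight from URLs often contain `?`, `:` or `*`, which is why query-heavy requests tend to be the ones that fail.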

recrm · 2021-11-24T19:41:24Z

Gonna leave a comment here. I've modified the behaviour so that OS-specific save failures (most notably bad paths) are skipped automatically. The dump will leave a message in its output about each such file. The -error flag will still split all bad entries off into a new WARC file called errors.warc.
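For readers curious what "skip automatically" can look like, here is a minimal sketch of the general pattern: attempt the write, catch `OSError`, report it, and move on. This is not the actual warc-extractor.py code. The function `dump_entries` and its `(relative_path, payload)` input format are assumptions made up for the example.

```python
import os


def dump_entries(entries, out_dir):
    """Write each (relative_path, payload) pair under out_dir.

    Hypothetical sketch: any OS-level failure (bad path, name too long,
    a parent that is actually a file, ...) is reported and skipped
    instead of aborting the whole dump.
    """
    saved, skipped = [], []
    for rel_path, payload in entries:
        target = os.path.join(out_dir, rel_path)
        try:
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as handle:
                handle.write(payload)
        except OSError as exc:
            # Leave a message about the file and carry on with the next one.
            print(f"skipped {rel_path!r}: {exc}")
            skipped.append(rel_path)
        else:
            saved.append(rel_path)
    return saved, skipped
```

The `skipped` list returned here is the sort of information a tool could use to route bad entries somewhere else, in the same spirit as the errors.warc file mentioned above.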

recrm added the bug label Mar 19, 2015
