Paths in windows. #1

Open
recrm opened this issue Jan 29, 2015 · 1 comment

recrm commented Jan 29, 2015

There is a known bug where warc-extractor.py does not handle Windows paths properly.

Windows places far more restrictions on valid path names than Linux does. Unfortunately, handling every unusual path name that WARC scrapers can pull in is a big job. Thankfully, most of these requests are 404 errors. Until a proper fix is implemented, the recommended workaround is to run the -dump content command along with the http:error:200 filter. If path errors still persist on Windows, also use the -error flag to skip unworkable path names.
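For anyone wondering which names trip this up, here is a rough sketch of the usual Windows naming rules for a single path component: reserved characters, reserved device names, and a trailing dot or space. It is illustrative only and is not code from warc-extractor.py; the helper name `is_windows_safe` is made up for this example.

```python
import re

# Characters Windows rejects in a file or directory name, plus control chars.
_RESERVED_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Device names Windows reserves, with or without an extension.
_RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_windows_safe(component):
    """Return True if one path component (not a full path) looks valid on Windows.

    Hypothetical helper for illustration; it does not check path length limits.
    """
    if not component or _RESERVED_CHARS.search(component):
        return False
    # Names ending in a dot or a space are not usable through the normal APIs.
    if component[-1] in ". ":
        return False
    # "con.txt" is still treated as the CON device, so compare the stem only.
    stem = component.split(".", 1)[0].upper()
    return stem not in _RESERVED_NAMES
```

Scraped URLs with query strings (for example `page?id=1`) fail this check because of the `?`, which is a common source of these path errors.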

recrm added the bug label Mar 19, 2015

recrm commented Nov 24, 2021

Gonna leave a comment here. I've modified the behaviour so that OS-specific save failures (most notably bad paths) are skipped automatically. The dump will leave a message in its output about each skipped file. The -error flag will still split all bad entries off into a new WARC file called errors.warc.
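The skip-on-failure idea can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the actual warc-extractor.py code: the function name `save_records`, the `(relative_path, payload)` record shape, and the message format are all hypothetical.

```python
import os


def save_records(records, out_dir):
    """Write each (relative_path, payload_bytes) pair under out_dir.

    Entries the OS refuses to save are reported and skipped instead of
    aborting the whole dump. Returns (saved_paths, skipped_records).
    """
    saved, skipped = [], []
    for rel_path, payload in records:
        target = os.path.join(out_dir, rel_path)
        try:
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as fh:
                fh.write(payload)
        except (OSError, ValueError) as exc:
            # OSError covers bad or over-long paths; ValueError covers
            # embedded NUL bytes, which Python rejects before calling the OS.
            print(f"skipped {rel_path!r}: {exc}")
            skipped.append((rel_path, payload))
        else:
            saved.append(rel_path)
    return saved, skipped
```

The skipped list is what a caller could then write out separately, in the spirit of what the -error flag does with errors.warc.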
