
Option to skip on error #6

Open
DaleTrexel opened this issue Nov 15, 2023 · 1 comment

DaleTrexel commented Nov 15, 2023

This is a great utility! It would be even greater if you offered a flag that lets the extraction skip records that throw an error.

In my case, I have a WARC archive that contains some very long URLs; partway through extraction it throws the following error and stops:

Exception in thread "main" java.nio.file.FileSystemException: test-extract/maps.google.com/index;ll=44.969598%2C-93.247374&spn=0.007658%2C0.03006&ie=UTF8&hl=en_US&z=15&t=roadmap&sll=44.969598%2C-93.247374&sspn=0.007658%2C0.03006&q=414%20Cedar%20Ave%2C%20Minneapolis%2C%20MN%2055454%2C%20USA%20%28Malabari%20Kitchen%20Restaurant%29&output=embed.html: File name too long
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261)
	at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:482)
	at java.base/java.nio.file.Files.newOutputStream(Files.java:227)
	at org.netpreserve.warc2html.Warc2Html.writeTo(Warc2Html.java:227)
	at org.netpreserve.warc2html.Warc2Html.main(Warc2Html.java:70)

I'm OK with the extraction skipping such a record and proceeding to the next, if that's possible, though it would be good to get a summary at the end of which records were skipped. As it stands, I've got most of the site I wanted to pull out of the archive, but I'm missing some of the CSS and JS files used to display it, presumably because they occur later in the archive than the record that fails.

This is running warc2html on macOS (Sonoma), so the filesystem's limit presumably determines what counts as "too long" here; APFS, like ext4, caps each path component at roughly 255 bytes.
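
For what it's worth, a skip-on-error mode along these lines could be quite small. Below is a minimal sketch, assuming the main loop writes one output file per record; the `--skip-errors` flag name, the `SkipOnErrorSketch` class, and `writeRecord()` are hypothetical stand-ins, not the actual warc2html code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of a --skip-errors mode. The flag, the target list,
 * and writeRecord() are illustrative; they are not warc2html internals.
 */
public class SkipOnErrorSketch {

    public static void main(String[] args) throws IOException {
        boolean skipErrors = true;                  // imagined --skip-errors flag
        List<Path> targets = List.of();             // would be derived from the WARC records
        List<String> skipped = new ArrayList<>();

        for (Path target : targets) {
            try {
                writeRecord(target);                // stands in for Warc2Html.writeTo
            } catch (IOException e) {
                if (!skipErrors) {
                    throw e;                        // preserves the current fail-fast behaviour
                }
                skipped.add(target + " (" + e.getMessage() + ")");
            }
        }

        // End-of-run summary, so missing CSS/JS can be tracked down afterwards.
        if (!skipped.isEmpty()) {
            System.err.println("Skipped " + skipped.size() + " record(s):");
            skipped.forEach(s -> System.err.println("  " + s));
        }
    }

    private static void writeRecord(Path target) throws IOException {
        // Placeholder for the per-record write; opening the output stream is
        // where FileSystemException ("File name too long") is thrown.
        Files.newOutputStream(target).close();
    }
}
```

Rethrowing when the flag is off keeps today's behaviour unchanged, and the summary covers the "which records were skipped" part of the request.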

@nextgenthemes

Running on Ubuntu 24.04 installed on ext4, I have the exact same issue. Are you also trying to extract WARC files produced by Heritrix? Were you able to solve this? I only want to crawl a single domain anyway, so why is it saving things from Google? For me, the URL is accounts.google.com/InteractiveLogin;continue=https___www.youtube.com_signin_action_handle_signin... Can the filenames be limited, perhaps in the Heritrix config? Or can domains like Google just be excluded?
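
Independent of Heritrix, one workaround would be for the extractor to shorten any over-long path component before writing, keeping names unique with a short hash suffix. A rough sketch, assuming the common 255-byte-per-component limit of ext4 and APFS; the `FilenameLimiter` class and `shorten()` helper are hypothetical, and `HexFormat` needs Java 17+:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/**
 * Illustrative helper (not part of warc2html): shorten a single path
 * component so it fits a 255-byte-per-component filesystem limit.
 */
public class FilenameLimiter {

    private static final int MAX_BYTES = 255;

    public static String shorten(String name) {
        if (utf8Length(name) <= MAX_BYTES) {
            return name;
        }
        // Preserve a short file extension, if any, so the extracted file
        // keeps a recognisable type.
        String ext = "";
        int dot = name.lastIndexOf('.');
        if (dot > 0 && name.length() - dot <= 10) {
            ext = name.substring(dot);
        }
        // An 8-hex-digit hash keeps distinct long URLs from colliding
        // after truncation.
        String suffix = "-" + sha1Hex(name).substring(0, 8) + ext;
        String prefix = name.substring(0, name.length() - ext.length());
        // Trim the prefix until the whole shortened name fits in MAX_BYTES
        // of UTF-8, so multi-byte characters are handled correctly.
        while (utf8Length(prefix) + utf8Length(suffix) > MAX_BYTES) {
            prefix = prefix.substring(0, prefix.length() - 1);
        }
        return prefix + suffix;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    private static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return HexFormat.of().formatHex(md.digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available on the JDK
        }
    }
}
```

The trade-off is that the on-disk name no longer mirrors the full URL, so any links warc2html rewrites to point at the extracted files would need the same mapping applied.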
