
Option to skip on error #6

Open
DaleTrexel opened this issue Nov 15, 2023 · 1 comment

DaleTrexel commented Nov 15, 2023

This is a great utility! It would be even greater if you offered a flag that lets the extraction skip records that throw an error.

In my case, I have a WARC archive that contains some very long URLs; partway through extraction it throws the following error and stops:

Exception in thread "main" java.nio.file.FileSystemException: test-extract/maps.google.com/index;ll=44.969598%2C-93.247374&spn=0.007658%2C0.03006&ie=UTF8&hl=en_US&z=15&t=roadmap&sll=44.969598%2C-93.247374&sspn=0.007658%2C0.03006&q=414%20Cedar%20Ave%2C%20Minneapolis%2C%20MN%2055454%2C%20USA%20%28Malabari%20Kitchen%20Restaurant%29&output=embed.html: File name too long
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261)
	at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:482)
	at java.base/java.nio.file.Files.newOutputStream(Files.java:227)
	at org.netpreserve.warc2html.Warc2Html.writeTo(Warc2Html.java:227)
	at org.netpreserve.warc2html.Warc2Html.main(Warc2Html.java:70)

I'm OK with the extraction skipping such a record and proceeding to the next, if that's possible, though it would be good to get a summary at the end of which records were skipped. As it stands, I've got most of the site I wanted to pull out of the archive, but I'm missing some of the CSS and JS files used to display it, presumably because they occur later in the archive than the record that fails.

This is running warc2html on macOS (Sonoma), so the filesystem's limit presumably determines what counts as "too long" here; APFS, like ext4, caps each path component at roughly 255 bytes.
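
For what it's worth, a skip-on-error mode along these lines could be quite small. Below is a minimal sketch, assuming the main loop writes one output file per record; the `--skip-errors` flag name, the `SkipOnErrorSketch` class, and `writeRecord()` are hypothetical stand-ins, not the actual warc2html code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of a --skip-errors mode. The flag, the target list,
 * and writeRecord() are illustrative; they are not warc2html internals.
 */
public class SkipOnErrorSketch {

    public static void main(String[] args) throws IOException {
        boolean skipErrors = true;                  // imagined --skip-errors flag
        List<Path> targets = List.of();             // would be derived from the WARC records
        List<String> skipped = new ArrayList<>();

        for (Path target : targets) {
            try {
                writeRecord(target);                // stands in for Warc2Html.writeTo
            } catch (IOException e) {
                if (!skipErrors) {
                    throw e;                        // preserves the current fail-fast behaviour
                }
                skipped.add(target + " (" + e.getMessage() + ")");
            }
        }

        // End-of-run summary, so missing CSS/JS can be tracked down afterwards.
        if (!skipped.isEmpty()) {
            System.err.println("Skipped " + skipped.size() + " record(s):");
            skipped.forEach(s -> System.err.println("  " + s));
        }
    }

    private static void writeRecord(Path target) throws IOException {
        // Placeholder for the per-record write; opening the output stream is
        // where FileSystemException ("File name too long") is thrown.
        Files.newOutputStream(target).close();
    }
}
```

Rethrowing when the flag is off keeps today's behaviour unchanged, and the summary covers the "which records were skipped" part of the request.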

@nextgenthemes

Running on Ubuntu 24.04 installed on ext4, I have the exact same issue. Are you also trying to extract WARC files produced by Heritrix? Were you able to solve this? I only want to crawl a single domain anyway, so why is it saving things from Google? For me, the URL is accounts.google.com/InteractiveLogin;continue=https___www.youtube.com_signin_action_handle_signin... Can the filenames be limited, perhaps in the Heritrix config? Or can domains like Google just be excluded?
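
Independent of Heritrix, one workaround would be for the extractor to shorten any over-long path component before writing, keeping names unique with a short hash suffix. A rough sketch, assuming the common 255-byte-per-component limit of ext4 and APFS; the `FilenameLimiter` class and `shorten()` helper are hypothetical, and `HexFormat` needs Java 17+:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/**
 * Illustrative helper (not part of warc2html): shorten a single path
 * component so it fits a 255-byte-per-component filesystem limit.
 */
public class FilenameLimiter {

    private static final int MAX_BYTES = 255;

    public static String shorten(String name) {
        if (utf8Length(name) <= MAX_BYTES) {
            return name;
        }
        // Preserve a short file extension, if any, so the extracted file
        // keeps a recognisable type.
        String ext = "";
        int dot = name.lastIndexOf('.');
        if (dot > 0 && name.length() - dot <= 10) {
            ext = name.substring(dot);
        }
        // An 8-hex-digit hash keeps distinct long URLs from colliding
        // after truncation.
        String suffix = "-" + sha1Hex(name).substring(0, 8) + ext;
        String prefix = name.substring(0, name.length() - ext.length());
        // Trim the prefix until the whole shortened name fits in MAX_BYTES
        // of UTF-8, so multi-byte characters are handled correctly.
        while (utf8Length(prefix) + utf8Length(suffix) > MAX_BYTES) {
            prefix = prefix.substring(0, prefix.length() - 1);
        }
        return prefix + suffix;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    private static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return HexFormat.of().formatHex(md.digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available on the JDK
        }
    }
}
```

The trade-off is that the on-disk name no longer mirrors the full URL, so any links warc2html rewrites to point at the extracted files would need the same mapping applied.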
