This is a great utility! It would be even greater if you offered a flag that allowed the extraction to skip specific records if they throw an error.
In my case, I have a WARC archive that contains some really long URLs, and during extraction it eventually reaches one of them, throws this error, and stops:
```
Exception in thread "main" java.nio.file.FileSystemException: test-extract/maps.google.com/index;ll=44.969598%2C-93.247374&spn=0.007658%2C0.03006&ie=UTF8&hl=en_US&z=15&t=roadmap&sll=44.969598%2C-93.247374&sspn=0.007658%2C0.03006&q=414%20Cedar%20Ave%2C%20Minneapolis%2C%20MN%2055454%2C%20USA%20%28Malabari%20Kitchen%20Restaurant%29&output=embed.html: File name too long
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261)
	at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:482)
	at java.base/java.nio.file.Files.newOutputStream(Files.java:227)
	at org.netpreserve.warc2html.Warc2Html.writeTo(Warc2Html.java:227)
	at org.netpreserve.warc2html.Warc2Html.main(Warc2Html.java:70)
```
I'm OK with the extraction process skipping this record and proceeding to the next, if that is possible, though it would be good to get a list of which records were skipped at the end. As it is now, I've got most of the site that I wanted to pull out of the archive, but I'm missing some of the CSS and JS files used to display it, presumably because they occur later in the archive.
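In case it helps to sketch the request: below is a minimal example of what a skip-on-error mode could look like, catching the filesystem error per record and printing a summary of skipped records at the end. The class, record list, and hypothetical `--skip-errors` behaviour are made up for illustration and are not warc2html's actual internals.

```java
import java.io.IOException;
import java.nio.file.FileSystemException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch only: write each record, skip the ones the filesystem rejects,
// and print a summary of skipped records at the end of the run.
public class SkipOnErrorSketch {
    public static void main(String[] args) {
        List<String> urls = List.of(
                "example.com/index.html",
                "maps.google.com/" + "x".repeat(300) + ".html"); // name over 255 bytes
        List<String> skipped = new ArrayList<>();
        Path outDir = Path.of("test-extract");
        for (String url : urls) {
            Path target = outDir.resolve(url);
            try {
                Files.createDirectories(target.getParent());
                Files.writeString(target, "<!-- body -->");
            } catch (FileSystemException e) {
                // hypothetical --skip-errors behaviour: note the failure, keep going
                skipped.add(url + " (" + e.getReason() + ")");
            } catch (IOException e) {
                skipped.add(url + " (" + e.getMessage() + ")");
            }
        }
        if (!skipped.isEmpty()) {
            System.err.println("Skipped " + skipped.size() + " record(s):");
            skipped.forEach(s -> System.err.println("  " + s));
        }
    }
}
```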
This is warc2html running on macOS (Sonoma), which may determine how long a filename can be before it is too long in this instance.
Running on Ubuntu 24.04 on an ext4 filesystem, I hit the exact same issue. Are you also trying to extract WARC files produced by Heritrix? Were you able to solve this? I only want to crawl a single domain anyway, so why is it saving things from Google? For me, the offending URL is accounts.google.com/InteractiveLogin;continue=https___www.youtube.com_signin_action_handle_signin... Can the filenames be limited, perhaps in the Heritrix config? Or can domains like google just be excluded?
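Not a proper fix, but one workaround sketch for the filename question: any output name longer than the 255-byte per-component limit (ext4, APFS and HFS+ all enforce roughly that) could be truncated, with a short hash appended so truncated names stay unique. `safeName` below is an invented helper for illustration, not anything that exists in warc2html or Heritrix.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustration: shorten a single path component to fit the 255-byte per-name
// limit, keeping a hash suffix so different long URLs don't collide.
public class SafeNameSketch {
    static String safeName(String name) throws NoSuchAlgorithmException {
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= 255) {
            return name;                              // already short enough
        }
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(bytes);
        String suffix = "-" + HexFormat.of().formatHex(digest, 0, 4);
        // URL-derived names are essentially ASCII, so chars ~ bytes here;
        // keep the readable start of the name and leave room for the suffix.
        int keep = 255 - suffix.length();
        return name.substring(0, Math.min(name.length(), keep)) + suffix;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String longName = "index;ll=44.969598%2C-93.247374" + "&q=414%20Cedar".repeat(50);
        System.out.println(safeName(longName));       // truncated name + 8-char hash
    }
}
```

Excluding third-party hosts in the Heritrix crawl scope would avoid these records entirely, but that only helps future crawls, not extraction from a WARC you already have.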