You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Which replies with return code 100 (i.e. ok) while it could detect that --zim-file parameter is an unsupported file name for local filesystem.
Final error was:
Processing WARC files in /output/.tmpx0ezwt48/collections/crawl-20240414022153206/archive
16 WARC files found
Calling warc2zim with these args: ['--zim-file=Turok Sanctum 04/13/2024_115fb1f0.zim', '--name=turoksanctum.com_115fb1f0', '--publisher=openZIM', '--output', '/output', '--url', 'https://turoksanctum.com/', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmpx0ezwt48/collections/crawl-20240414022153206/archive']
[DEBUG] Confirming output is writable using /output/tmplf6kxht_
[DEBUG] Title: Turok Sanctum – Turok Mods and Maps
[DEBUG] Language: en-US
[DEBUG] Favicon: https://i0.wp.com/turoksanctum.com/wp-content/uploads/2017/04/cropped-1_1428292389.png?fit=32%2C32&ssl=1
[DEBUG] Found WARC record for favicon: https://i0.wp.com/turoksanctum.com/wp-content/uploads/2017/04/cropped-1_1428292389.png?fit=32%2C32&ssl=1
Traceback (most recent call last):
File "/usr/bin/zimit", line 566, in <module>
zimit()
File "/usr/bin/zimit", line 464, in zimit
return warc2zim(warc2zim_args)
File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 113, in main
return converter.run()
File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 265, in run
self.creator = Creator(
File "libzim/libzim.pyx", line 263, in libzim._Creator.__cinit__
OSError: Unable to write ZIM file at /output/Turok Sanctum 04/13/2024_115fb1f0.zim
The text was updated successfully, but these errors were encountered:
One way we can check for invalid filenames is by using a regular expression pattern (like the one found here) to match invalid characters and then return an error before the scraping process starts.
Or we can use a library like pathvalidate to validate filenames. It has useful functions that can validate a filename and return the reason for the error (see validate_filename) or just check the filename and return a boolean value (see is_valid_filename).
What do you think would be the more suitable method to solve this issue?
It is not like we are validating 10s of filename per seconds where it would be an issue.
WDYT about this idea?
Please note that we are already about to check that output folder is really usable with #106, you should probably just enhance this code with a check on ZIM filename.
Sure, that sounds like a good idea, much simpler than using regex or a separate library.
I'll add a try-except block and use touch and unlink to create the file and delete it immediately. If a exception is thrown, I'll log and exit.
When filename is passed, we do not check it is valid before starting the crawler. It would save time.
See https://farm.youzim.it/pipeline/b9b7b43d-01e0-4ae6-851f-13b4042d1d3d/debug
Command:
Which is turn starts:
Which replies with return code 100 (i.e. ok) while it could detect that
--zim-file
parameter is an unsupported file name for local filesystem.Final error was:
The text was updated successfully, but these errors were encountered: