Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scraper should check filename is valid #232

Closed
benoit74 opened this issue Apr 16, 2024 · 3 comments · Fixed by #234
Closed

Scraper should check filename is valid #232

benoit74 opened this issue Apr 16, 2024 · 3 comments · Fixed by #234
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers
Milestone

Comments

@benoit74
Copy link
Collaborator

When filename is passed, we do not check it is valid before starting the crawler. It would save time.

See https://farm.youzim.it/pipeline/b9b7b43d-01e0-4ae6-851f-13b4042d1d3d/debug

Command:

zimit --zim-file=Turok Sanctum 04/13/2024_115fb1f0.zim --lang=English --url=https://turoksanctum.com/ --name=turoksanctum.com_115fb1f0 --userAgentSuffix=Youzim.it+ --sizeLimit=4294967296 --timeLimit=7200 --output=/output --statsFilename=/output/task_progress.json --adminEmail=contact+zimfarm@kiwix.org --keep --publisher=openZIM

Which is turn starts:

warc2zim --zim-file=Turok Sanctum 04/13/2024_115fb1f0.zim --name=turoksanctum.com_115fb1f0 --publisher=openZIM --output /output --url https://turoksanctum.com/

Which replies with return code 100 (i.e. ok) while it could detect that --zim-file parameter is an unsupported file name for local filesystem.

Final error was:

Processing WARC files in /output/.tmpx0ezwt48/collections/crawl-20240414022153206/archive
16 WARC files found
Calling warc2zim with these args: ['--zim-file=Turok Sanctum 04/13/2024_115fb1f0.zim', '--name=turoksanctum.com_115fb1f0', '--publisher=openZIM', '--output', '/output', '--url', 'https://turoksanctum.com/', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmpx0ezwt48/collections/crawl-20240414022153206/archive']
[DEBUG] Confirming output is writable using /output/tmplf6kxht_
[DEBUG] Title: Turok Sanctum – Turok Mods and Maps
[DEBUG] Language: en-US
[DEBUG] Favicon: https://i0.wp.com/turoksanctum.com/wp-content/uploads/2017/04/cropped-1_1428292389.png?fit=32%2C32&ssl=1
[DEBUG] Found WARC record for favicon: https://i0.wp.com/turoksanctum.com/wp-content/uploads/2017/04/cropped-1_1428292389.png?fit=32%2C32&ssl=1
Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 464, in zimit
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 113, in main
    return converter.run()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 265, in run
    self.creator = Creator(
  File "libzim/libzim.pyx", line 263, in libzim._Creator.__cinit__
OSError: Unable to write ZIM file at /output/Turok Sanctum 04/13/2024_115fb1f0.zim
@benoit74 benoit74 added enhancement New feature or request good first issue Good for newcomers labels Apr 16, 2024
@dan-niles
Copy link
Contributor

dan-niles commented Apr 16, 2024

@benoit74 I would like to work on this.

One way we can check for invalid filenames is by using a regular expression pattern (like the one found here) to match invalid characters and then return an error before the scraping process starts.

Or we can use a library like pathvalidate to validate filenames. It has useful functions that can validate a filename and return the reason for the error (see validate_filename) or just check the filename and return a boolean value (see is_valid_filename).

What do you think would be the more suitable method to solve this issue?

@benoit74
Copy link
Collaborator Author

@dan-niles thank you!

What about simply doing a touch on the filename with https://docs.python.org/3/library/pathlib.html#pathlib.Path.touch and immediately removing the created file? This would avoid relying on regexp (always a bit fragile) or adding a library for a simple need.

It is not like we are validating 10s of filename per seconds where it would be an issue.

WDYT about this idea?

Please note that we are already about to check that output folder is really usable with #106, you should probably just enhance this code with a check on ZIM filename.

@dan-niles
Copy link
Contributor

dan-niles commented Apr 17, 2024

Sure, that sounds like a good idea, much simpler than using regex or a separate library.
I'll add a try-except block and use touch and unlink to create the file and delete it immediately. If a exception is thrown, I'll log and exit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants