Some spiders create millions of files in a directory #740

Closed
jpmckinney opened this issue Jun 15, 2021 · 6 comments · Fixed by #1027
Labels
discussion framework Relating to other common functionality

@jpmckinney
Member

This creates performance issues on some filesystems. Caching libraries handle this by generating arbitrary directory trees and filling them in; if we take the same approach, there are existing libraries and code to reuse. Otherwise, we can consider other ways to segment files, which might be specific to the spiders that produce a lot of files.

jpmckinney added the framework (Relating to other common functionality) and discussion labels Jun 15, 2021
@jpmckinney
Member Author

cc @jakubkrafka in case you have any suggestions.

@yolile
Member

yolile commented Jun 8, 2022

Note that this will no longer be an issue for some spiders after #944.

@jpmckinney
Member Author

For clarity, this issue is still relevant to some Digiwhist sources (line-delimited JSON).

@jpmckinney
Member Author

ProZorro actually hit the limit of files per directory (perhaps max_dir_size_kb), causing the (misleading) error message: OSError: [Errno 28] No space left on device

jpmckinney added this to the Priority milestone Sep 20, 2023
@yolile
Member

yolile commented Sep 25, 2023

From a conversation with @jpmckinney:

Creating arbitrary (but predictable) directories is the best approach.

Ruby on Rails’ FileStore does:

hash = Zlib.adler32(fname)
hash, dir_1 = hash.divmod(0x1000)
dir_2 = hash.modulo(0x1000)

It then formats dir_1 and dir_2 with DIR_FORMATTER, which is "%03X" (three uppercase hexadecimal digits, zero-padded), and joins them into the path:

File.join(cache_path, DIR_FORMATTER % dir_1, DIR_FORMATTER % dir_2, fname)

We could follow the same approach and also use adler32 (available in Python as zlib.adler32, which returns an integer), because hashlib's digests are returned only as bytes or hex strings, not integers.

We should implement this as part of the FilesStore extension, and apply it to all existing spiders, so that the output layout is consistent.
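
A minimal Python sketch of the same scheme, for illustration only. The function name `sharded_path` and the `base_dir` argument are hypothetical and not part of the FilesStore extension; the sketch assumes file names are hashed as UTF-8 bytes.

```python
import os
import zlib

# Same format string as in the Rails snippet above: three uppercase hex digits.
DIR_FORMATTER = "%03X"


def sharded_path(base_dir, fname):
    """Return base_dir/XXX/YYY/fname, where XXX and YYY derive from adler32(fname).

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # zlib.adler32 takes bytes and returns an unsigned 32-bit integer.
    checksum = zlib.adler32(fname.encode("utf-8"))
    # The lowest 12 bits select the first-level directory ...
    checksum, dir_1 = divmod(checksum, 0x1000)
    # ... and the next 12 bits select the second-level directory.
    dir_2 = checksum % 0x1000
    return os.path.join(base_dir, DIR_FORMATTER % dir_1, DIR_FORMATTER % dir_2, fname)
```

Each level has at most 0x1000 (4096) subdirectories, and the same file name always maps to the same path, so a file can be located again without scanning the tree.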

@jpmckinney
Member Author

jpmckinney commented Sep 25, 2023

DatabaseStore's yield_items_from_directory probably needs to use os.walk instead of os.scandir, since files would then be nested in subdirectories rather than sitting in a single directory.
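
As a rough sketch of the difference: os.scandir lists only the entries directly inside one directory, whereas os.walk descends into every subdirectory. The helper below is hypothetical (`iter_files` is not the real yield_items_from_directory, whose signature is not shown in this thread) and only illustrates the traversal.

```python
import os


def iter_files(directory):
    """Yield the path of every file under `directory`, at any depth.

    Hypothetical helper for illustration; not the project's actual method.
    """
    for root, dirs, files in os.walk(directory):
        # Sorting dirs in place makes os.walk visit subdirectories in a stable order.
        dirs.sort()
        for name in sorted(files):
            yield os.path.join(root, name)
```

Because this is a generator, paths are produced one directory at a time as the tree is walked, rather than collected into one large list first.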
