Some spiders create millions of files in a directory #740
cc @jakubkrafka if you have any suggestions.
Note that this will no longer be an issue for some spiders after #944.
For clarity, this issue is still relevant to some Digiwhist sources (line-delimited JSON).
ProZorro actually hit the limit of files per directory (perhaps …)
From a conversation with @jpmckinney: creating arbitrary (but predictable) directories is the best approach. Ruby on Rails' FileStore hashes the file name, splits the hash into two numbers, dir_1 and dir_2, and formats each as "%03X" to build two levels of intermediate directories.
We could follow the same approach and also use adler32, as other Python hashing libraries (such as hashlib) return byte or hex strings rather than integers. We should implement this in the FilesStore extension and apply it to all existing spiders, to keep the output consistent.
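A minimal sketch of what this could look like in Python, using `zlib.adler32` (which returns an integer, unlike `hashlib` digests). The function name and layout are illustrative, not the actual FilesStore implementation; each level has at most 4096 (0x1000) entries, formatted as three hex digits like Rails' FileStore:

```python
import os
import zlib


def sharded_path(base_dir, file_name):
    """Return base_dir/XXX/XXX/file_name, where the two intermediate
    directory names are derived from the Adler-32 checksum of the
    file name, so no single directory holds millions of files."""
    checksum = zlib.adler32(file_name.encode())
    # Split the checksum into two numbers in the range 0..0xFFF.
    checksum, dir_1 = divmod(checksum, 0x1000)
    dir_2 = checksum % 0x1000
    # Format each as three uppercase hex digits, e.g. "0A3".
    return os.path.join(base_dir, "%03X" % dir_1, "%03X" % dir_2, file_name)
```

Because the path is derived only from the file name, it is predictable: the same file always lands in the same subdirectory, so readers can locate a file without scanning the tree.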
DatabaseStore's yield_items_from_directory probably needs to use os.walk instead of os.scandir.
This creates performance issues on some filesystems. The approach taken by caching libraries is to generate and fill in arbitrary directory trees; if we take the same approach, there are existing libraries and code to reuse. Otherwise, we can consider other ways to segment files, which might be specific to those spiders that produce a lot of files.