Some spiders create millions of files in a directory #740

jpmckinney · 2021-06-15T16:30:53Z

This creates performance issues on some filesystems. The approach taken by caching libraries is to generate and fill in arbitrary directory trees; if we take the same approach, there are existing libraries and code to reuse. Otherwise, we can consider other ways to segment files, which might be specific to the spiders that create many files.

jpmckinney · 2021-06-15T16:31:06Z

cc @jakubkrafka if you have any suggestions.

yolile · 2022-06-08T00:23:26Z

Note that after #944, this will no longer be an issue for some spiders.

jpmckinney · 2022-06-08T16:42:38Z

For clarity, this issue is still relevant to some Digiwhist sources (line-delimited JSON).

jpmckinney · 2023-09-18T16:11:10Z

ProZorro actually hit the limit of files per directory (perhaps max_dir_size_kb), causing the (misleading) error message: OSError: [Errno 28] No space left on device

yolile · 2023-09-25T20:23:10Z

From a conversation with @jpmckinney:

Creating arbitrary (but predictable) directories is the best approach.

Ruby on Rails’ FileStore does:

hash = Zlib.adler32(fname)
hash, dir_1 = hash.divmod(0x1000)
dir_2 = hash.modulo(0x1000)

It then formats dir_1 and dir_2 with "%03X" (three-digit, zero-padded, uppercase hex) and builds the path:

File.join(cache_path, DIR_FORMATTER % dir_1, DIR_FORMATTER % dir_2, fname)

We could follow the same approach and also use adler32 (available in Python as zlib.adler32, which returns an integer), since other Python hashing libraries, such as hashlib, return only bytes or hex strings, not integers.
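As a rough illustration, the Ruby snippet above could be ported to Python along these lines. This is only a sketch: the function name `sharded_path` and the `root` parameter are hypothetical and not taken from FilesStore, and encoding the filename as UTF-8 before hashing is an assumption.

```python
import os
import zlib

# Same format string as Rails' DIR_FORMATTER: 3 hex digits, zero-padded, uppercase.
DIR_FORMATTER = "%03X"


def sharded_path(root, fname):
    """Return root/<dir_1>/<dir_2>/fname, mirroring the Rails FileStore layout.

    Hypothetical helper for illustration only.
    """
    # zlib.adler32 takes bytes and returns an unsigned 32-bit integer.
    checksum = zlib.adler32(fname.encode("utf-8"))
    # The lowest 12 bits pick the first directory level...
    checksum, dir_1 = divmod(checksum, 0x1000)
    # ...and the next 12 bits pick the second level.
    dir_2 = checksum % 0x1000
    return os.path.join(root, DIR_FORMATTER % dir_1, DIR_FORMATTER % dir_2, fname)
```

Each level has at most 0x1000 (4096) directories, and the same filename always maps to the same path, so the tree is arbitrary but predictable.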

We should implement this in the FilesStore extension and apply it to all existing spiders, so that the output is consistent.

jpmckinney · 2023-09-25T20:52:14Z

DatabaseStore's yield_items_from_directory probably needs to use os.walk instead of os.scandir.
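For context, os.scandir lists only the entries directly inside one directory, whereas os.walk descends into subdirectories, which matters once files are stored in a nested tree. A minimal sketch of a recursive listing follows; `yield_files_from_directory` is a hypothetical helper, not the actual DatabaseStore method, and the sorting is an assumption made only to get a stable order.

```python
import os


def yield_files_from_directory(directory):
    """Yield the path of every file under directory, at any depth.

    Hypothetical helper for illustration only.
    """
    for root, dirs, files in os.walk(directory):
        # Sorting dirs in place makes os.walk visit subdirectories in a stable order.
        dirs.sort()
        for name in sorted(files):
            yield os.path.join(root, name)
```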

jpmckinney added the framework (Relating to other common functionality) and discussion labels Jun 15, 2021

jpmckinney mentioned this issue Sep 18, 2023

Allow JOB_TASKS_PLAN to be configurable per publication open-contracting/data-registry#304

Open

4 tasks

jpmckinney added this to the Priority milestone Sep 20, 2023

yolile mentioned this issue Sep 26, 2023

feat(FilesStore): generate directory trees #1027

Merged

yolile closed this as completed in #1027 Oct 12, 2023
