Speedup loading of tables (from cache) #225

Open
hagenw opened this issue Aug 11, 2022 · 21 comments
Labels
enhancement New feature or request load

Comments

@hagenw (Member) commented Aug 11, 2022

Currently, the following steps are taken when loading tables:

load_to()

  • cache: no cache used
  • _find_tables() returns the list of tables
    that don't have a CSV file in db_root
  • _get_tables() removes PKL files
    for requested tables;
    loads requested tables from backend
    and stores them as CSV files in db_root
  • _save_database() loads each table
    by reading its CSV file;
    stores the table as a CSV file,
    overwriting the existing CSV file;
    stores the table as a PKL file

load_table()

  • cache: cache/name/version
  • checks if both a CSV file and a PKL file
    of the table exist in cache
  • _cached_versions() loads deps for every
    version of the table it finds in cache
  • _get_tables_from_cache()
    (I do not understand completely what's going on,
    but MD5SUMs are calculated to check if a table
    from cache can be used);
    copies the PKL and CSV files
    if the table can be found in cache
  • _get_tables_from_backend() gets the CSV file
    for the table from the backend and loads it;
    stores the table as a PKL file
  • it then loads the table again from the PKL file

load()

  • cache: cache/name/version/flavor

The same as load_table(),
but at the end we update
the indices of the tables.
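To make the CSV+PKL condition above concrete, here is a minimal sketch of such a cache check (hypothetical helper and file naming, not the actual audb internals):

```python
import os


def table_cached(cache_root, name, version, table_id):
    # A cached table is only reused when both its CSV file and
    # its PKL file are present in the cache folder.
    base = os.path.join(cache_root, name, version, f"db.{table_id}")
    return os.path.exists(base + ".csv") and os.path.exists(base + ".pkl")
```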


I would propose the following changes to speed up loading of tables (besides fixing some obvious bugs in load_to(), like storing the table twice as CSV and loading the table twice, or calculating MD5SUMs too often, as described in #226):

  • store each table as CSV and PKL under cache/name/version, where the version is extracted from deps; so basically each table is stored only under the version in which it was added to or modified in the database
  • load_to() can then copy tables from the cache as well
  • for load() we could copy from cache/name/version to the required flavor folder cache/name/version/flavor; maybe even copy only the PKL file
  • load_table() can directly use the new cache as no flavor is needed
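A minimal sketch of the proposed flavor-independent cache location, assuming deps behaves like a mapping from table file to the version in which the table was last added/modified (all names hypothetical):

```python
import os


def cached_table_path(cache_root, name, table_file, deps):
    # deps is assumed to map a table file to the version in which it
    # was added/modified; the table is then cached only once, under
    # that version, independent of any flavor.
    version = deps[table_file]
    return os.path.join(cache_root, name, version, table_file)
```

load(), load_table(), and load_to() could then all resolve a table through the same path, copying it into a flavor folder only where needed.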
@hagenw added the enhancement (New feature or request) label Aug 11, 2022
@hagenw (Member Author) commented Aug 11, 2022

In principle, we could also speed up loading of media files by storing each media file only in the flavor of the version in which it was created/changed, instead of copying the data to every version in the cache.
We only have to add the correct paths when we update the index with the full paths.
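Updating the index with full paths could then look roughly like this; media_version is an assumed mapping from each media file to the version in which it was created/changed (all names hypothetical):

```python
import os


def expand_index(files, media_version, cache_root, name):
    # Point each index entry directly into the per-version cache
    # folder of the version in which the file was created/changed,
    # so media does not have to be copied to every version.
    return [
        os.path.join(cache_root, name, media_version[f], f)
        for f in files
    ]
```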

It's not so nice, but for databases with lots of different versions our current approach is also not that nice.

@frankenjoe (Collaborator) commented Aug 11, 2022

I thought about this at some point, but it will add dependencies that are hard to resolve. E.g. when version 2.0.0 references files from 1.0.0 and then someone deletes 1.0.0 some media will be missing.

@hagenw (Member Author) commented Aug 11, 2022

When media is missing it will be downloaded again.

You also have a problem at the moment if somebody else deletes the cache while you are using the data.

@frankenjoe (Collaborator)

When media is missing it will be downloaded again.

But that means we have to check every time we load a database which files are missing. Very costly.

@hagenw (Member Author) commented Aug 11, 2022

When media is missing it will be downloaded again.

But that means we have to check every time we load a database which files are missing. Very costly.

I would only check whether the version folder is present in the cache. If yes, we assume that nobody deleted data.

At least I don't see why it would be different from the current situation: you can also delete a single file in a flavor. If we checked for all files in that case, it would take the same time.
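The cheap existence check could be as simple as (sketch, names hypothetical):

```python
import os


def version_in_cache(cache_root, name, version):
    # If the version folder exists, trust its contents instead of
    # verifying every single media file, which would be very costly.
    return os.path.isdir(os.path.join(cache_root, name, version))
```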

@frankenjoe (Collaborator)

At least I don't see why it would be different from the current situation: you can also delete a single file in a flavor. If we checked for all files in that case, it would take the same time.

A user is not supposed to delete a single file from a flavor. But deleting whole versions is ok.

Another disadvantage is that the file paths might be different between caches on different machines.

@hagenw (Member Author) commented Aug 11, 2022

But deleting whole versions is ok.

Yes, that's why I said I would just check whether the version folder exists, which should be fine as long as we have fewer than 100,000 versions of the database.

Another disadvantage is that the file paths might be different between caches on different machines.

This only sounds like a problem if you combine different cache folders, which might be possible with the shared cache.
So maybe we have to check how to handle this.

@hagenw (Member Author) commented Aug 11, 2022

But I would first address the table issue described here anyway, as it slows down data publication and is more pressing in my opinion.

@frankenjoe (Collaborator)

I would not care so much about data publication as you only do it once.

@frankenjoe (Collaborator)

Once per version I mean :)

@hagenw (Member Author) commented Aug 11, 2022

Yes and no.

If you have a growing database with lots of versions, you will experience it.
The other case is when testing the publishing script: e.g. you will need to start commenting out audb.load_to() as it takes ages even when more or less no data has changed.

@frankenjoe (Collaborator)

The other case is when testing the publishing script: e.g. you will need to start commenting out audb.load_to() as it takes ages even when more or less no data has changed.

Yes, that's indeed a pain.

@hagenw (Member Author) commented Aug 11, 2022

The problem for tables was mitigated before because we used a shared build folder; now that we have switched to only_metadata=True and dedicated build folders, you have to wait up to 15 minutes even when just fixing a typo in the description of the database.


Not saying that it was better before: there you had to wait 30 minutes until all MD5SUMs for the media files had been calculated ;)

@frankenjoe (Collaborator)

The problem for tables was mitigated before because we used a shared build folder; now that we have switched to only_metadata=True and dedicated build folders, you have to wait up to 15 minutes even when just fixing a typo in the description of the database.

Could it help to work with two folders, e.g. cache/ and build/:

  1. load previous version to cache/
  2. update database and save to build/
  3. publish from build/

@hagenw (Member Author) commented Aug 12, 2022

It would not help, as load_to() would still not share the cache with load(), which means that even if you have loaded a big database with load(), load_to() still cannot copy the tables from the cache and has to store them again.

@frankenjoe (Collaborator)

Would it also not help if you shared cache/ across versions, as we did with the build/ folder before?

@frankenjoe (Collaborator) commented Aug 12, 2022

It would not help when you load it for the first time, but when you publish a new version where you only change the description it should still save time.

@hagenw (Member Author) commented Aug 12, 2022

Loading the second time etc. would help, but why add a second cache folder instead of using the existing one?

In my opinion load_to(), load_table() and load() could all use the same cache for the tables, and for this I would store the tables outside of the flavor folder, under the version in which the table was created/changed. load() can then copy the table from there to the flavor folder, load_table() can load directly from there, and load_to() can also copy from there.

@frankenjoe (Collaborator)

Sure, just wanted to mention a possible workaround until this feature is available.

@hagenw (Member Author) commented Aug 12, 2022

An easy workaround is to start using a shared build folder again (../build) and just delete media files inside it if some were added.

@frankenjoe (Collaborator) commented Aug 12, 2022

load() can then copy the table from there to the flavor folder, load_table() can directly load from there, and load_to() can also copy from there.

Instead of copying we could also create a symbolic link. But probably for tables copying is fast enough. For media, however, symbolic links might offer a solution to share media files across versions without the need to change the file path in the tables.

The question is, of course, whether Windows supports symbolic links.
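Windows does support symbolic links, but creating them usually requires administrator rights or Developer Mode, so a sketch would need a copy fallback (hypothetical helper):

```python
import os
import shutil


def link_or_copy(src, dst):
    # Prefer a symbolic link so media files can be shared across
    # versions; fall back to copying where symlinks are unavailable
    # (on Windows, os.symlink needs admin rights or Developer Mode).
    try:
        os.symlink(os.path.abspath(src), dst)
    except (OSError, NotImplementedError):
        shutil.copy2(src, dst)
```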

Projects
None yet
Development

No branches or pull requests

2 participants