Speedup loading of tables (from cache) #225

Open
hagenw opened this issue Aug 11, 2022 · 21 comments
Labels
enhancement New feature or request load

Comments

@hagenw (Member) commented Aug 11, 2022

Currently, the following steps are taken when loading tables:

load_to()

  • cache: no cache used
  • _find_tables() returns the list of tables
    that don't have a CSV file in db_root
  • _get_tables() removes PKL files
    for requested tables;
    loads requested tables from backend
    and stores them as CSV files in db_root
  • _save_database() loads each table
    by reading its CSV file;
    stores the table as a CSV file,
    overwriting the existing CSV file;
    stores the table as a PKL file

load_table()

  • cache: cache/name/version
  • checks if both a CSV file and a PKL file
    of the table exist in cache
  • _cached_versions() loads deps for every
    version of the table it finds in cache
  • _get_tables_from_cache()
    (I do not understand completely what's going on,
    but MD5SUMs are calculated to check if a table
    from cache can be used);
    copies the PKL and CSV files
    if the table can be found in cache
  • _get_tables_from_backend() gets the CSV file
    for the table from the backend and loads it;
    stores the table as a PKL file
  • it then loads the table again from the PKL file

load()

  • cache: cache/name/version/flavor

The same as load_table(),
but at the end we update
the indices of the tables.
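To make the CSV+PKL condition above concrete, here is a minimal sketch of such a cache check (hypothetical helper and file naming, not the actual audb internals):

```python
import os


def table_cached(cache_root, name, version, table_id):
    # A cached table is only reused when both its CSV file and
    # its PKL file are present in the cache folder.
    base = os.path.join(cache_root, name, version, f"db.{table_id}")
    return os.path.exists(base + ".csv") and os.path.exists(base + ".pkl")
```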


I would propose the following changes to speed up loading of tables (besides fixing some obvious bugs in load_to(), like storing the table twice as CSV and loading the table twice, or calculating MD5SUMs too often, as described in #226):

  • store each table as CSV and PKL under cache/name/version, where the version is extracted from deps; so basically each table is stored only under the version in which it was added to or modified in the database
  • load_to() can then copy tables from the cache as well
  • for load() we could copy from cache/name/version to the required flavor folder cache/name/version/flavor; maybe even copy only the PKL file
  • load_table() can directly use the new cache as no flavor is needed
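A minimal sketch of the proposed flavor-independent cache location, assuming deps behaves like a mapping from table file to the version in which the table was last added/modified (all names hypothetical):

```python
import os


def cached_table_path(cache_root, name, table_file, deps):
    # deps is assumed to map a table file to the version in which it
    # was added/modified; the table is then cached only once, under
    # that version, independent of any flavor.
    version = deps[table_file]
    return os.path.join(cache_root, name, version, table_file)
```

load(), load_table(), and load_to() could then all resolve a table through the same path, copying it into a flavor folder only where needed.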
@hagenw added the enhancement (New feature or request) label Aug 11, 2022
@hagenw (Member Author) commented Aug 11, 2022

In principle, we could also speed up loading of media files by storing each media file only in the flavor of the version in which it was created/changed, instead of copying the data to every version in the cache.
We only have to add the correct paths when we update the index with the full paths.
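Updating the index with full paths could then look roughly like this; media_version is an assumed mapping from each media file to the version in which it was created/changed (all names hypothetical):

```python
import os


def expand_index(files, media_version, cache_root, name):
    # Point each index entry directly into the per-version cache
    # folder of the version in which the file was created/changed,
    # so media does not have to be copied to every version.
    return [
        os.path.join(cache_root, name, media_version[f], f)
        for f in files
    ]
```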

It's not so nice, but for databases with lots of different versions our current approach is also not that nice.

@frankenjoe (Collaborator) commented Aug 11, 2022

I thought about this at some point, but it will add dependencies that are hard to resolve. E.g. when version 2.0.0 references files from 1.0.0 and then someone deletes 1.0.0 some media will be missing.

@hagenw (Member Author) commented Aug 11, 2022

When media is missing it will be downloaded again.

You also have a problem at the moment if somebody else deletes the cache while you are using the data.

@frankenjoe (Collaborator)

When media is missing it will be downloaded again.

But that means we have to check every time we load a database which files are missing. Very costly.

@hagenw (Member Author) commented Aug 11, 2022

When media is missing it will be downloaded again.

But that means we have to check every time we load a database which files are missing. Very costly.

I would only check whether the version folder is present in the cache. If yes, we assume that nobody deleted data.

At least I don't see why it would be different from the current situation: you can also delete a single file in a flavor. If we checked for all files in that case, it would take the same time.
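The cheap existence check could be as simple as (sketch, names hypothetical):

```python
import os


def version_in_cache(cache_root, name, version):
    # If the version folder exists, trust its contents instead of
    # verifying every single media file, which would be very costly.
    return os.path.isdir(os.path.join(cache_root, name, version))
```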

@frankenjoe (Collaborator)

At least I don't see why it would be different from the current situation: you can also delete a single file in a flavor. If we checked for all files in that case, it would take the same time.

A user is not supposed to delete a single file from a flavor. But deleting whole versions is ok.

Another disadvantage is that the file paths might be different between caches on different machines.

@hagenw (Member Author) commented Aug 11, 2022

But deleting whole versions is ok.

Yes, that's why I said I would just check whether the version folder exists, which should be fine as long as we have fewer than 100,000 versions of the database.

Another disadvantage is that the file paths might be different between caches on different machines.

This only sounds like a problem if you combine different cache folders, which might be possible with the shared cache.
So maybe we have to check how to handle this.

@hagenw (Member Author) commented Aug 11, 2022

But I would first address the table issue described here anyway, as it slows down data publication and is more pressing in my opinion.

@frankenjoe (Collaborator)

I would not care so much about data publication as you only do it once.

@frankenjoe (Collaborator)

Once per version I mean :)

@hagenw (Member Author) commented Aug 11, 2022

Yes and no.

If you have a growing database with lots of versions, you will experience it.
The other case is when testing the publishing script: e.g. you will need to start commenting out audb.load_to() as it takes ages even when more or less no data has changed.

@frankenjoe (Collaborator)

The other case is when testing the publishing script: e.g. you will need to start commenting out audb.load_to() as it takes ages even when more or less no data has changed.

Yes, that's indeed a pain.

@hagenw (Member Author) commented Aug 11, 2022

The problem for tables was mitigated before because we used a shared build folder; now that we have switched to only_metadata=True and dedicated build folders, you have to wait up to 15 minutes even when just fixing a typo in the description of the database.


Not saying that it was better before: there you had to wait 30 minutes until all MD5SUMs for the media files had been calculated ;)

@frankenjoe (Collaborator)

The problem for tables was mitigated before because we used a shared build folder; now that we have switched to only_metadata=True and dedicated build folders, you have to wait up to 15 minutes even when just fixing a typo in the description of the database.

Could it help to work with two folders, e.g. cache/ and build/:

  1. load previous version to cache/
  2. update database and save to build/
  3. publish from build/

@hagenw (Member Author) commented Aug 12, 2022

It would not help, as load_to() would still not share the cache with load(), which means that even if you have loaded a big database with load(), load_to() still cannot copy the tables from the cache and has to store them again.

@frankenjoe (Collaborator)

Would it also not help if you shared cache/ across versions, as we did with the build/ folder before?

@frankenjoe (Collaborator) commented Aug 12, 2022

It would not help when you load it for the first time, but when you publish a new version where you only change the description it should still save time.

@hagenw (Member Author) commented Aug 12, 2022

Loading the second time etc. would help, but why add a second cache folder instead of using the existing one?

In my opinion load_to(), load_table() and load() could all use the same cache for the tables, and for this I would store the tables outside of the flavor folder, under the version in which the table was created/changed. load() can then copy the table from there to the flavor folder, load_table() can load directly from there, and load_to() can also copy from there.

@frankenjoe (Collaborator)

Sure, just wanted to mention a possible workaround until this feature is available.

@hagenw (Member Author) commented Aug 12, 2022

An easy workaround is to start using a shared build folder again (../build) and just delete media files inside it if some were added.

@frankenjoe (Collaborator) commented Aug 12, 2022

load() can then copy the table from there to the flavor folder, load_table() can directly load from there, and load_to() can also copy from there.

Instead of copying we could also create a symbolic link. But probably for tables copying is fast enough. For media, however, symbolic links might offer a solution to share media files across versions without the need to change the file path in the tables.

The question is, of course, whether Windows supports symbolic links.
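Windows does support symbolic links, but creating them usually requires administrator rights or Developer Mode, so a sketch would need a copy fallback (hypothetical helper):

```python
import os
import shutil


def link_or_copy(src, dst):
    # Prefer a symbolic link so media files can be shared across
    # versions; fall back to copying where symlinks are unavailable
    # (on Windows, os.symlink needs admin rights or Developer Mode).
    try:
        os.symlink(os.path.abspath(src), dst)
    except (OSError, NotImplementedError):
        shutil.copy2(src, dst)
```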

Projects
None yet
Development

No branches or pull requests

2 participants