Speedup loading of tables (from cache) #225
In principle, we could also speed up loading of media files by storing, in the flavor of the corresponding version in the cache, only the version in which a media file was created or changed, instead of copying the data to every version. It's not so nice, but for databases with lots of different versions our current approach is also not that nice.
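To illustrate the idea with a minimal sketch (the mapping, names, and paths are hypothetical, not audb's API): the dependency information would record, per media file, the version in which the file was created or last changed, and loading would resolve the actual file location from that instead of duplicating the file.

```python
import os

# Hypothetical dependency mapping: media file ->
# version in which it was created or last changed.
deps = {
    "audio/f1.wav": "1.0.0",
    "audio/f2.wav": "1.1.0",
}


def media_path(cache: str, name: str, file: str) -> str:
    # Resolve a media file to the single cached copy kept under
    # the version where it last changed, instead of duplicating
    # it into every version folder.
    return os.path.join(cache, name, deps[file], file)


print(media_path("~/audb", "emodb", "audio/f1.wav"))
# ~/audb/emodb/1.0.0/audio/f1.wav
```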
I thought about this at some point, but it will add dependencies that are hard to resolve, e.g. when version …
When media is missing, it will be downloaded again. But when using the data, you will also run into a problem the moment somebody else deletes the cache.
But that means we would have to check which files are missing every time we load a database. Very costly.
I would only check whether the version folder is present in the cache; if yes, we assume that nobody deleted data. I don't see why this would be different from the current situation: you can also delete a single file from a flavor today. If we checked for all files in that case, it would take the same time.
A user is not supposed to delete a single file from a flavor, but deleting whole versions is OK. Another disadvantage is that file paths might differ between caches on different machines.
Yes, that's why I said I would just check whether the version folder exists, which should be fine as long as we have fewer than 100,000 versions of the database. This sounds like a problem only if you mix different cache folders, which you might be able to do with the shared cache.
But I would anyway first address the table issue described here, as it slows down data publication and is more pressing in my opinion.
I would not care so much about data publication, as you only do it once.
Once per version, I mean :)
Yes and no. If you have a growing database with lots of versions, you will experience it.
Yes, that's indeed a pain.
The problem for tables was mitigated before because we used a shared build folder; now we have switched to use … Not saying that it was better before: there you had to wait 30 minutes until all MD5SUMs for the media files were calculated ;)
Could it help to work with two folders, e.g. …?
It would not help, as …
Also not if you share …?
It would not help when you load it for the first time, but when you publish a new version in which you only change the description, it should still save time.
Loading the second time etc. would help, but why add a second cache folder instead of using the existing one? In my opinion …
Sure, just wanted to mention a possible workaround until this feature is available.
An easy workaround is to start using a shared …
Instead of copying, we could also create a symbolic link. For tables, copying is probably fast enough. For media, however, symbolic links might offer a way to share files across versions without the need to change the file paths in the tables. The question is, of course, whether Windows supports symbolic links?
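For reference, a minimal sketch of the symlink idea (function name and fallback behavior are my own, not audb's code). Windows does support symbolic links, but `os.symlink()` only works there with administrator rights or with developer mode enabled, so a copy fallback is a common pattern:

```python
import os
import shutil


def link_or_copy(src: str, dst: str) -> None:
    # Prefer a symbolic link so media files are shared across
    # versions; fall back to a real copy where symlinks are not
    # available (e.g. Windows without admin rights or developer
    # mode raises OSError).
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    try:
        os.symlink(os.path.abspath(src), dst)
    except OSError:
        shutil.copy2(src, dst)
```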
Currently, the following steps are taken when loading tables:

- `load_to()`
  - `_find_tables()` returns a list of tables that don't have a CSV file in `db_root`
  - `_get_tables()` removes PKL files for requested tables; loads requested tables from the backend and stores them as CSV files in `db_root`
  - `_save_database()` loads each table by reading its CSV file; stores the table as a CSV file by overwriting the existing CSV file; stores the table as a PKL file
- `load_table()` checks if `cache/name/version` of the table exists in cache
  - `_cached_versions()` loads `deps` for every version of the table it finds in cache
  - `_get_tables_from_cache()` (I do not completely understand what's going on, but MD5SUMs are calculated to check if a table from cache can be used; see the sketch after this list) copies PKL and CSV files if the table can be found in cache
  - `_get_tables_from_backend()` gets the CSV file from the backend for the table, loads it from the CSV file, and stores the table as a PKL file
- `load()` uses `cache/name/version/flavor`; the same as `load_table()`, but in the end we update the indices of the tables
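Regarding the MD5SUM check mentioned above, here is a generic sketch of how such a checksum comparison typically looks (illustrative only; the file naming and function names are assumptions, not audb's implementation): a cached table can be reused when its file's MD5SUM matches the checksum recorded in the dependency table.

```python
import hashlib
import os


def md5sum(path: str) -> str:
    # Checksum of a file, read in chunks so large tables
    # don't have to fit into memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_table(cache, name, version, table, expected_md5):
    # Reuse a cached CSV file only if its checksum still
    # matches the one recorded in the dependency table.
    path = os.path.join(cache, name, version, f"db.{table}.csv")
    if os.path.exists(path) and md5sum(path) == expected_md5:
        return path
    return None
```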
I would propose the following changes to speed up loading of the tables (besides fixing some obvious bugs in `load_to()`, like storing the table twice as CSV and loading the table twice, or calculating MD5SUMs too often as described in #226):

- Store tables in the cache under `cache/name/version`, whereas the version is extracted from `deps`, so basically only storing a table under the version in which it was added to or last modified in the database
- `load_to()` can then copy tables from the cache as well
- In `load()` we could copy from the cache `cache/name/version` to the required flavor folder `cache/name/version/flavor`; maybe even only copy the PKL file (see the sketch after this list)
- `load_table()` can directly use the new cache as no flavor is needed
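A minimal sketch of the proposed lookup, assuming a `table_version` value extracted from `deps` (the function name, arguments, and file naming are hypothetical):

```python
import os
import shutil


def resolve_table(cache, name, table, table_version, flavor_dir):
    # table_version is taken from deps, i.e. the version in
    # which the table was added or last changed, so each table
    # is stored only once under cache/name/<table_version>.
    src = os.path.join(cache, name, table_version, f"db.{table}.pkl")
    dst = os.path.join(flavor_dir, f"db.{table}.pkl")
    if not os.path.exists(dst):
        os.makedirs(flavor_dir, exist_ok=True)
        shutil.copy2(src, dst)  # maybe even only the PKL file
    return dst
```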