Import update (i.e. update metadata only) is very slow #827
Recently, an import + update of 4 variables at object level only, for 1M objects, took >20h to complete, while an initial import of the same number of objects with much more metadata (and images) is much faster. The internal mechanism is likely different, and the update version may need to be rethought to make it faster.

Comments
Hello, indeed it's more difficult to update, as there are more steps, but 20h for 1M objects might reveal a problem somewhere else.
Import+update is the recommended way to update metadata for all instruments (because the true metadata should be in the files on the drive). It is indeed long. => it would be interesting to implement if:
There is a massive one currently ongoing: jobid 57626
#830 prevents viewing the log interactively (it's too big), but it's on the server (/home/ecotaxa/ecotaxa/temptask/task057626). Output of top:
A typical update sequence from this job:
I did not do a precise statistical analysis of the time spent in each sequence, but it looks quite flat over the job duration, i.e. no degradation since the job started, which is good. The next step is a timing comparison with similar direct SQL updates on the DB. On my dev DB:
So we get 100× more updates in approximately the same time; clearly a 10× improvement is doable. I'd say, just from observations, that the diff algorithm and the in-memory updates via SQLAlchemy take the majority of the CPU time.
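As an illustration of the gap, here is a minimal timing sketch (not EcoTaxa code; the connection string, table and column names `obj_head`, `objid`, `latitude`, `longitude` are all assumptions) contrasting one UPDATE round-trip per object, which is roughly what a per-row ORM flush produces, with the same statement sent as a single batch:

```python
import time
from sqlalchemy import create_engine, text

# Hypothetical connection string and schema names; adjust to the real DB.
engine = create_engine("postgresql+psycopg2://user:pass@localhost/ecotaxa_dev")

UPD = text("UPDATE obj_head SET latitude=:lat, longitude=:lon WHERE objid=:id")

def time_updates(rows):
    """Compare N single-row UPDATE round-trips with one batched executemany."""
    with engine.begin() as conn:
        t0 = time.perf_counter()
        for r in rows:              # one statement per object, ORM-flush style
            conn.execute(UPD, r)
        t1 = time.perf_counter()
        conn.execute(UPD, rows)     # same statement, sent as a single batch
        t2 = time.perf_counter()
    print(f"per-row: {t1 - t0:.2f}s, batched: {t2 - t1:.2f}s")

# rows would look like [{"id": 1, "lat": 43.2, "lon": 5.37}, ...]
```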
On my other dev DB, the times are more like 2s; obviously a serious benchmark is needed under similar conditions.
OK, after more code examination: the update has been coded as "non-breaking exceptions" to the general import, which means minimal code but minimal performance as well.
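For illustration, the per-row flow probably looks roughly like this (a sketch with hypothetical names, not the actual code): each input row triggers a lookup, a field-by-field diff, and an individual flush.

```python
# Sketch of a per-row "import, with update as the non-breaking exception".
# ObjectHeader and the orig_id column are hypothetical names.
def import_row(session, ObjectHeader, row):
    obj = (session.query(ObjectHeader)
                  .filter_by(orig_id=row["orig_id"])
                  .one_or_none())
    if obj is None:
        session.add(ObjectHeader(**row))      # plain import path
    else:
        for col, new_val in row.items():      # update path: diff each field
            if getattr(obj, col, None) != new_val:
                setattr(obj, col, new_val)    # marks the object as dirty
    session.flush()                           # DB round-trip(s) per row
```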
Optimally, we would group the operations that can be grouped (e.g. read all objects together and write them back the same way), and do all this in a specific update path, as sketched below.

For benchmarking first: a test project cannot be used; a real one of decent size (>10K objects) is needed. The measurement would be done with an update of e.g. coordinates + some free columns. Of course, we need 2 updates to have a repeatable flip-flop case, as a second update with the same data would do nothing.

Some coding is needed already to hack the "optimal" solution described above. The second step would be finalization of the "hacks" + addition of more unit tests in the QA directory. Workload estimate is 8 + 8 hours.
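A sketch of that grouped shape (hypothetical names, including the projid filter): one read for all objects, an in-memory diff, then one batched write per table.

```python
from sqlalchemy import text

def grouped_update(conn, projid, new_rows):
    # 1. Read all current values together, in a single query.
    current = {r.objid: (r.latitude, r.longitude) for r in conn.execute(
        text("SELECT objid, latitude, longitude FROM obj_head "
             "WHERE projid = :p"), {"p": projid})}
    # 2. Diff in memory: keep only the rows whose values actually change.
    changed = [r for r in new_rows
               if current.get(r["id"]) != (r["lat"], r["lon"])]
    # 3. Write all changes back in one batched statement.
    if changed:
        conn.execute(
            text("UPDATE obj_head SET latitude=:lat, longitude=:lon "
                 "WHERE objid=:id"),
            changed)
    return len(changed)
```

For the flip-flop benchmark, two input files A and B with differing coordinates/free columns would be applied alternately (A, B, A, ...), so every run performs real writes instead of the second pass being a no-op.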
De-assigning myself until a decision is made to work on this.