
Import update (i.e. update metadata only) is very slow #827

Open
jiho opened this issue Jun 8, 2022 · 8 comments
Labels
db (Affects the structure of the database) · page-import (Everything related to import functionality) · performance (Speeeeeed!)

Comments

@jiho
Contributor

jiho commented Jun 8, 2022

Recently, an import + update of only 4 variables at object level, for 1M objects, took >20h to complete, while an initial import of the same number of objects with much more metadata (and images) is much faster. The internal mechanism is likely different, and the update path may need to be rethought to make it faster.

@jiho jiho added db Affects the structure of the database performance Speeeeeed! labels Jun 8, 2022
@grololo06
Member

Hello, indeed an update is more involved — there are more steps — but 20h for 1M objects might reveal a problem somewhere else.
Is there a job ID or a log for diagnosing?

@jiho jiho assigned jiho and grololo06 and unassigned jiho Jun 28, 2022
@jiho
Contributor Author

jiho commented Jun 28, 2022

Import + update is the recommended way to update metadata for all instruments (because the authoritative metadata should live in the files on the drive). It is indeed long.
Speeding it up would be an improvement, but not an indispensable one.

=> it would be interesting to implement if:

  • there is an idea to speed it up without being too complicated (e.g. parallel requests to the DB)
  • there is a possibility of a ~10x gain
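To illustrate why a ~10x gain is plausible, here is a minimal sketch — not EcoTaxa code; SQLite stands in for PostgreSQL, and the `obj_field`/`t06`/`n01` names simply mimic the job log — contrasting one UPDATE round-trip per object (the current pattern) with a single batched statement:

```python
# Illustration only: SQLite stands in for PostgreSQL; table/column names
# (obj_field, t06, n01) mimic those seen in the job log.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE obj_field (objfid INTEGER PRIMARY KEY, t06 TEXT, n01 REAL)")
cur.executemany("INSERT INTO obj_field VALUES (?, NULL, 0.5)",
                [(i,) for i in range(10_000)])
conn.commit()

# Today's pattern: one UPDATE statement per object.
for objfid in range(5_000):
    cur.execute("UPDATE obj_field SET t06 = 'True', n01 = 0.98 WHERE objfid = ?",
                (objfid,))
conn.commit()

# Batched pattern: one executemany carries all per-object values in one call.
cur.executemany("UPDATE obj_field SET t06 = 'True', n01 = ? WHERE objfid = ?",
                [(0.98, objfid) for objfid in range(5_000, 10_000)])
conn.commit()

cur.execute("SELECT COUNT(*) FROM obj_field WHERE t06 = 'True'")
print(cur.fetchone()[0])  # 10000
```

On a real PostgreSQL server the gap is larger still, since each per-object statement also pays a network round-trip.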

@jiho
Contributor Author

jiho commented Jun 28, 2022

There is a massive one currently ongoing: jobid 57626

@grololo06
Member

#830 prevents viewing the log interactively (too big), but it's on the server (/home/ecotaxa/ecotaxa/temptask/task057626).

top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                      
22121 www-data  20   0 7096812 3,882g   5788 S  98,3  6,2   1133:23 python3                                                                                      
 2003 postgres  20   0 20,672g 604900 599620 S   2,6  0,9   2:17.39 postgres                                                                                     
15779 salinas   20   0   44480   4232   3276 R   0,7  0,0   0:00.22 top    

A typical update sequence from this job:

2022-06-29 04:29:07,208:DB.helpers.DBWriter:INFO Batch save objects of 0/0/0/0
2022-06-29 04:29:09,663:BO.TSVFile:INFO Updating 'objfid=332333794' using [('t06', "None->'True'"), ('n01', '0.865252->0.9985572695732117')]
2022-06-29 04:29:09,683:BO.TSVFile:INFO Updating 'objfid=332333795' using [('t06', "None->'True'"), ('n01', '1.0->0.9998865127563477')]
2022-06-29 04:29:09,707:BO.TSVFile:INFO Updating 'objfid=332333796' using [('t06', "None->'True'"), ('n01', '0.5142616->0.9972186088562012')]
2022-06-29 04:29:09,729:BO.TSVFile:INFO Updating 'objfid=332333797' using [('t06', "None->'True'"), ('n01', '0.9999999->0.9999988079071045')]
2022-06-29 04:29:09,752:BO.TSVFile:INFO Updating 'objfid=332333798' using [('t06', "None->'True'"), ('n01', '0.8022315->0.9642857313156128')]
2022-06-29 04:29:09,776:BO.TSVFile:INFO Updating 'objfid=332333799' using [('t06', "None->'True'"), ('n01', '0.88742197->0.9996620416641235')]
2022-06-29 04:29:09,798:BO.TSVFile:INFO Updating 'objfid=332333801' using [('t06', "None->'True'"), ('n01', '0.7112805->0.97272127866745')]
2022-06-29 04:29:09,820:BO.TSVFile:INFO Updating 'objfid=332333802' using [('t06', "None->'True'"), ('n01', '0.84401834->0.9835104942321777')]
2022-06-29 04:29:09,846:BO.TSVFile:INFO Updating 'objfid=332333803' using [('t06', "None->'True'"), ('n01', '0.60491717->0.9994377493858337')]
2022-06-29 04:29:09,872:BO.TSVFile:INFO Updating 'objfid=332333804' using [('t06', "None->'True'"), ('n01', '0.7606511->0.9999871253967285')]
2022-06-29 04:29:09,894:BO.TSVFile:INFO Updating 'objfid=332333805' using [('t06', "None->'True'"), ('n01', '0.68993723->0.9978905320167542')]
2022-06-29 04:29:09,917:BO.TSVFile:INFO Updating 'objfid=332333806' using [('t06', "None->'True'"), ('n01', '0.62982476->0.9999721050262451')]
2022-06-29 04:29:09,943:BO.TSVFile:INFO Updating 'objfid=332333807' using [('t06', "None->'True'"), ('n01', '0.89774394->0.6851480603218079')]
2022-06-29 04:29:09,965:BO.TSVFile:INFO Updating 'objfid=332333808' using [('t06', "None->'True'"), ('n01', '0.45348385->0.988058865070343')]
2022-06-29 04:29:09,988:BO.TSVFile:INFO Updating 'objfid=332333809' using [('t06', "None->'True'"), ('n01', '0.89016265->0.9057826399803162')]
2022-06-29 04:29:10,013:BO.TSVFile:INFO Updating 'objfid=332333810' using [('t06', "None->'True'"), ('n01', '0.57306963->0.6248078942298889')]
2022-06-29 04:29:10,038:BO.TSVFile:INFO Updating 'objfid=332333811' using [('t06', "None->'True'"), ('n01', '0.6401201->0.7173159718513489')]
...
2022-06-29 04:29:10,472:BO.TSVFile:INFO Updating 'objfid=332333829' using [('t06', "None->'True'"), ('n01', '0.9634766->0.9904594421386719')]
2022-06-29 04:29:10,494:BO.TSVFile:INFO Updating 'objfid=332333830' using [('t06', "None->'True'"), ('n01', '0.5982595->0.9993420243263245')]
2022-06-29 04:29:10,517:BO.TSVFile:INFO Updating 'objfid=332333831' using [('t06', "None->'True'"), ('n01', '0.9280701->0.981404721736908')]
2022-06-29 04:29:10,543:BO.TSVFile:INFO Updating 'objfid=332333834' using [('t06', "None->'True'"), ('n01', '0.9728491->0.99470055103302')]
2022-06-29 04:29:10,548:BO.TSVFile:INFO Updating 'objid=332333833' using [('classif_id', '342->11762')]
2022-06-29 04:29:10,568:BO.TSVFile:INFO Updating 'objfid=332333833' using [('t06', "None->'True'"), ('n01', '0.62476903->0.9709104299545288')]
2022-06-29 04:29:10,572:BO.TSVFile:INFO Updating 'objid=332333832' using [('classif_id', '342->11762')]
2022-06-29 04:29:10,592:BO.TSVFile:INFO Updating 'objfid=332333832' using [('t06', "None->'True'"), ('n01', '0.48558292->0.9297972917556763')]
2022-06-29 04:29:10,615:BO.TSVFile:INFO Updating 'objfid=332333835' using [('t06', "None->'True'"), ('n01', '0.6353145->0.9741307497024536')]
2022-06-29 04:29:10,637:BO.TSVFile:INFO Updating 'objfid=332333836' using [('t06', "None->'True'"), ('n01', '0.582546->0.999795138835907')]
2022-06-29 04:29:10,662:BO.TSVFile:INFO Updating 'objfid=332333837' using [('t06', "None->'True'"), ('n01', '0.64749044->0.9995095729827881')]
2022-06-29 04:29:10,686:BO.TSVFile:INFO Updating 'objfid=332333838' using [('t06', "None->'True'"), ('n01', '0.93026793->0.9999659061431885')]
2022-06-29 04:29:10,708:BO.TSVFile:INFO Updating 'objfid=332333839' using [('t06', "None->'True'"), ('n01', '0.67226845->0.9999951124191284')]
2022-06-29 04:29:10,735:BO.TSVFile:INFO Updating 'objfid=332333840' using [('t06', "None->'True'"), ('n01', '0.55110526->0.996381402015686')]
2022-06-29 04:29:10,757:BO.TSVFile:INFO Updating 'objfid=332333841' using [('t06', "None->'True'"), ('n01', '0.60473084->0.999871015548706')]
2022-06-29 04:29:10,778:BO.TSVFile:INFO Updating 'objfid=332333843' using [('t06', "None->'True'"), ('n01', '0.9978815->0.9999985694885254')]
2022-06-29 04:29:10,803:BO.TSVFile:INFO Updating 'objfid=332333842' using [('t06', "None->'True'"), ('n01', '0.9818188->0.999953031539917')]
2022-06-29 04:29:10,827:BO.TSVFile:INFO Updating 'objfid=332333844' using [('t06', "None->'True'"), ('n01', '0.43761125->0.996690034866333')]
2022-06-29 04:29:10,850:BO.TSVFile:INFO Updating 'objfid=332333846' using [('t06', "None->'True'"), ('n01', '0.98839927->0.999998927116394')]
2022-06-29 04:29:10,876:BO.TSVFile:INFO Updating 'objfid=332333845' using [('t06', "None->'True'"), ('n01', '0.49382433->0.7769779562950134')]
2022-06-29 04:29:10,899:BO.TSVFile:INFO Updating 'objfid=332333847' using [('t06', "None->'True'"), ('n01', '0.58303624->0.9998514652252197')]
2022-06-29 04:29:10,920:BO.TSVFile:INFO Updating 'objfid=332333848' using [('t06', "None->'True'"), ('n01', '0.9764341->0.997356653213501')]
2022-06-29 04:29:10,945:BO.TSVFile:INFO Updating 'objfid=332333849' using [('t06', "None->'True'"), ('n01', '0.8680194->0.9494059085845947')]
2022-06-29 04:29:10,971:BO.TSVFile:INFO Updating 'objfid=332333850' using [('t06', "None->'True'"), ('n01', '0.8457306->0.9975295662879944')]
2022-06-29 04:29:10,993:BO.TSVFile:INFO Updating 'objfid=332333851' using [('t06', "None->'True'"), ('n01', '0.8954615->0.9944500923156738')]
2022-06-29 04:29:11,017:BO.TSVFile:INFO Updating 'objfid=332333852' using [('t06', "None->'True'"), ('n01', '0.6071773->0.8210511207580566')]
2022-06-29 04:29:11,039:BO.TSVFile:INFO Updating 'objfid=332333853' using [('t06', "None->'True'"), ('n01', '0.6931323->0.9998435974121094')]
2022-06-29 04:29:11,043:BO.TSVFile:INFO Updating 'objid=332333854' using [('classif_id', '342->25828')]
2022-06-29 04:29:11,063:BO.TSVFile:INFO Updating 'objfid=332333854' using [('t06', "None->'True'"), ('n01', '0.6064511->0.5608027577400208')]
2022-06-29 04:29:11,089:BO.TSVFile:INFO Updating 'objfid=332333855' using [('t06', "None->'True'"), ('n01', '0.8928067->0.9987032413482666')]
2022-06-29 04:29:11,111:BO.TSVFile:INFO Updating 'objfid=332333856' using [('t06', "None->'True'"), ('n01', '0.8291849->0.9996920824050903')]
2022-06-29 04:29:11,133:BO.TSVFile:INFO Updating 'objfid=332333858' using [('t06', "None->'True'"), ('n01', '0.99571073->0.9999972581863403')]
2022-06-29 04:29:11,156:BO.TSVFile:INFO Updating 'objfid=332333857' using [('t06', "None->'True'"), ('n01', '0.6518947->0.9989802241325378')]
2022-06-29 04:29:11,182:BO.TSVFile:INFO Updating 'objfid=332333860' using [('t06', "None->'True'"), ('n01', '0.8961648->0.9999814033508301')]
2022-06-29 04:29:11,202:BO.TSVFile:INFO Updating 'objfid=332333859' using [('t06', "None->'True'"), ('n01', '0.6239822->0.94720858335495')]
2022-06-29 04:29:11,225:BO.TSVFile:INFO Updating 'objfid=332333861' using [('t06', "None->'True'"), ('n01', '0.55995965->0.9992514252662659')]
2022-06-29 04:29:11,251:BO.TSVFile:INFO Updating 'objfid=332333863' using [('t06', "None->'True'"), ('n01', '0.5958765->0.9991888403892517')]
2022-06-29 04:29:11,273:BO.TSVFile:INFO Updating 'objfid=332333862' using [('t06', "None->'True'"), ('n01', '0.90727866->0.9999890327453613')]
...
2022-06-29 04:29:11,683:BO.TSVFile:INFO Updating 'objfid=332333881' using [('t06', "None->'True'"), ('n01', '0.6987737->0.9997115731239319')]
2022-06-29 04:29:11,707:BO.TSVFile:INFO Updating 'objfid=332333882' using [('t06', "None->'True'"), ('n01', '0.69469154->0.997368335723877')]
2022-06-29 04:29:11,730:BO.TSVFile:INFO Updating 'objfid=332333883' using [('t06', "None->'True'"), ('n01', '0.8432302->0.9998617172241211')]
2022-06-29 04:29:11,736:BO.TSVFile:INFO Updating 'objid=332333884' using [('classif_id', '92053->11762')]
2022-06-29 04:29:11,756:BO.TSVFile:INFO Updating 'objfid=332333884' using [('t06', "None->'True'"), ('n01', '0.8904505->0.9996005892753601')]
2022-06-29 04:29:11,780:BO.TSVFile:INFO Updating 'objfid=332333885' using [('t06', "None->'True'"), ('n01', '0.5576295->0.9995793700218201')]
2022-06-29 04:29:11,802:BO.TSVFile:INFO Updating 'objfid=332333886' using [('t06', "None->'True'"), ('n01', '0.6991085->0.973975658416748')]
2022-06-29 04:29:11,824:BO.TSVFile:INFO Updating 'objfid=332333888' using [('t06', "None->'True'"), ('n01', '0.9999994->0.9499883651733398')]
2022-06-29 04:29:11,849:BO.TSVFile:INFO Updating 'objfid=332333887' using [('t06', "None->'True'"), ('n01', '0.86853385->0.9989874958992004')]
2022-06-29 04:29:11,874:BO.TSVFile:INFO Updating 'objfid=332333889' using [('t06', "None->'True'"), ('n01', '0.44377014->0.6608572006225586')]
2022-06-29 04:29:11,898:BO.TSVFile:INFO Updating 'objfid=332333890' using [('t06', "None->'True'"), ('n01', '0.8826317->0.999769389629364')]
2022-06-29 04:29:11,924:BO.TSVFile:INFO Updating 'objfid=332333893' using [('t06', "None->'True'"), ('n01', '0.5418574->0.8624780178070068')]
2022-06-29 04:29:11,928:BO.TSVFile:INFO Updating 'objid=332333892' using [('classif_id', '342->11762')]
2022-06-29 04:29:11,950:BO.TSVFile:INFO Updating 'objfid=332333892' using [('t06', "None->'True'"), ('n01', '0.5792397->0.9869982600212097')]
2022-06-29 04:29:11,973:BO.TSVFile:INFO Updating 'objfid=332333891' using [('t06', "None->'True'"), ('n01', '0.91960746->0.9957197308540344')]
2022-06-29 04:29:11,994:BO.TSVFile:INFO Updating 'objfid=332333894' using [('t06', "None->'True'"), ('n01', '0.9999993->0.9999442100524902')]
2022-06-29 04:29:12,016:BO.TSVFile:INFO Updating 'objfid=332333896' using [('t06', "None->'True'"), ('n01', '0.62283224->0.9919296503067017')]
2022-06-29 04:29:12,018:DB.helpers.DBWriter:INFO Batch save objects of 0/0/0/0
2022-06-29 04:29:14,451:BO.TSVFile:INFO Updating 'objfid=332333895' using [('t06', "None->'True'"), ('n01', '1.0->0.9999991655349731')]

@grololo06
Member

grololo06 commented Jun 29, 2022

I did not do a precise statistical analysis of the time spent in each sequence, but it looks quite flat over the job duration, i.e. no degradation since the job started, which is good.
During each sequence, the code reads each impacted object, determines the diff with what's wanted from the TSV, and does an in-memory update. This takes around 2.5 s for 100 objects. Then the DB commit takes around another 2.5 s.
From the logs, we can see that it's sometimes in fact 2 updates per object: one on obj_head and one on obj_field.

Next step is to do a timing comparison with similar direct SQL updates on the DB.

On my dev DB:

ecotaxa4=# create temp table toto as select objid from objects where projid = 3948;
SELECT 1612584
Time: 1008,506 ms (00:01,009)

ecotaxa4=# begin; update obj_field set t06='True', n01='0.98' where objfid in (select objid from toto order by objid limit 10000); commit;
BEGIN
Time: 0,357 ms
UPDATE 10000
Time: 3816,911 ms (00:03,817)
COMMIT
Time: 1308,454 ms (00:01,308)

Next chunk:
ecotaxa4=# delete from toto where objid in (select objid from toto order by objid limit 10000);
DELETE 10000
Time: 240,435 ms


ecotaxa4=# begin; update obj_field set t06='True', n01='0.98' where objfid in (select objid from toto order by objid limit 10000); commit;
BEGIN
Time: 0,354 ms
UPDATE 10000
Time: 4301,617 ms (00:04,302)
COMMIT
Time: 35,538 ms
ecotaxa4=# delete from toto where objid in (select objid from toto order by objid limit 10000);
DELETE 10000
Time: 277,821 ms
ecotaxa4=# begin; update obj_field set t06='True', n01='0.98' where objfid in (select objid from toto order by objid limit 10000); commit;
BEGIN
Time: 0,291 ms
UPDATE 10000
Time: 3062,607 ms (00:03,063)
COMMIT
Time: 251,056 ms
ecotaxa4=# delete from toto where objid in (select objid from toto order by objid limit 10000);
DELETE 10000
Time: 237,067 ms
ecotaxa4=# begin; update obj_field set t06='True', n01='0.98' where objfid in (select objid from toto order by objid limit 10000); commit;
BEGIN
Time: 0,335 ms
UPDATE 10000
Time: 3452,675 ms (00:03,453)
COMMIT
Time: 253,899 ms

So we get 100× more updates in approximately the same time; clearly a ÷10 improvement is doable. I'd say, just from observation, that the diff algorithm and the in-memory updates via SQLAlchemy take the majority of the CPU time.
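For reference, the chunked temp-table pattern from the psql session above can be driven from a small script. A rough sketch (SQLite here so it is self-contained; the real DB is PostgreSQL, and `toto` follows the session's naming):

```python
# Chunked set-based update: consume a temp table of ids, CHUNK rows at a time.
# SQLite stands in for PostgreSQL; the pattern mirrors the psql session above.
import sqlite3

CHUNK = 10_000

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE obj_field (objfid INTEGER PRIMARY KEY, t06 TEXT, n01 REAL)")
cur.executemany("INSERT INTO obj_field VALUES (?, NULL, 0.5)",
                [(i,) for i in range(25_000)])
# Temp table of ids still to process, as in the session ("toto").
cur.execute("CREATE TEMP TABLE toto AS SELECT objfid AS objid FROM obj_field")
conn.commit()

updated = 0
while True:
    cur.execute(
        "UPDATE obj_field SET t06 = 'True', n01 = 0.98 "
        "WHERE objfid IN (SELECT objid FROM toto ORDER BY objid LIMIT ?)",
        (CHUNK,))
    if cur.rowcount == 0:  # temp table exhausted
        break
    updated += cur.rowcount
    # Remove the processed chunk, then commit, exactly as in the session.
    cur.execute(
        "DELETE FROM toto WHERE objid IN (SELECT objid FROM toto ORDER BY objid LIMIT ?)",
        (CHUNK,))
    conn.commit()

print(updated)  # 25000
```

Committing per chunk keeps transactions short, which is what keeps the COMMIT times in the session small after the first one.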

@grololo06
Member

On my other dev DB, the times are rather around 2 s; obviously a serious benchmark under similar conditions is needed.

@grololo06
Member

OK, after more code examination: the update has been coded as "non-breaking exceptions" to the general import path, which means minimal code but minimal performance as well.
The update code:

  • Reads TSV lines (same code as plain import)
  • Reads (SQL SELECT) each object to be updated, in a dedicated query
  • Determines the wanted updates for this object
  • Writes (SQL UPDATE) the updated object
  • Persists the writes (SQL COMMIT) every 100 TSV lines

Optimally, we would group the operations that can be grouped (e.g. read all objects of a chunk together, and write them together as well), and do all this in a dedicated def, rather than via a "branch flag" as now.
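A hypothetical sketch of that grouped approach — one SELECT per chunk of TSV lines, diffs computed in memory, one batched UPDATE, one COMMIT. The function and schema are illustrative, not EcoTaxa's actual API (SQLite again stands in for PostgreSQL):

```python
# Sketch only: grouped read / diff / write for one chunk of TSV lines.
import sqlite3

def apply_chunk(conn, wanted):
    """wanted: {objfid: {"t06": ..., "n01": ...}}, parsed from ~100 TSV lines."""
    cur = conn.cursor()
    ids = list(wanted)
    marks = ",".join("?" * len(ids))
    # 1) One SELECT for the whole chunk, instead of one per object.
    cur.execute(f"SELECT objfid, t06, n01 FROM obj_field WHERE objfid IN ({marks})", ids)
    current = {objfid: {"t06": t06, "n01": n01} for objfid, t06, n01 in cur.fetchall()}
    # 2) Diff in memory; keep only objects whose values actually change.
    changes = [(w["t06"], w["n01"], objfid)
               for objfid, w in wanted.items() if current.get(objfid) != w]
    # 3) One batched UPDATE and one COMMIT for the whole chunk.
    cur.executemany("UPDATE obj_field SET t06 = ?, n01 = ? WHERE objfid = ?", changes)
    conn.commit()
    return len(changes)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj_field (objfid INTEGER PRIMARY KEY, t06 TEXT, n01 REAL)")
conn.executemany("INSERT INTO obj_field VALUES (?, NULL, 0.5)", [(i,) for i in range(3)])
n = apply_chunk(conn, {0: {"t06": "True", "n01": 0.98},
                       1: {"t06": None, "n01": 0.5},    # no-op: already equal
                       2: {"t06": "True", "n01": 0.99}})
print(n)  # 2
```

Skipping no-op rows in step 2 preserves the current behaviour of only writing real diffs, while steps 1 and 3 remove the per-object round-trips.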

For benchmarking, a test project cannot be used; a real one of decent size (> 10K objects) is needed. The measurement would be done with an update of e.g. coordinates + some free columns. Of course we need 2 alternating updates to have a repeatable flip-flop case, as a second update with the same data would do nothing. Some coding is already needed to prototype the "optimal" solution described above.
Then we'll know how much time we can save.

Second step would be finalization of the "hacks" + addition of more unit tests in QA directory.

Workload estimate is 8 + 8 hours.

@grololo06
Member

De-assigning myself until decision is made to work on this.

@grololo06 grololo06 removed their assignment Jan 8, 2023
@jiho jiho added the page-import Everything related to import functionality label Jan 8, 2023