Import update (i.e. update metadata only) is very slow #827
Recently, an import + update of 4 variables at object level only, for 1M objects, took >20h to complete, while an initial import of the same number of objects with much more metadata (and images) is much faster. The internal mechanism is likely different, and the update version may need to be rethought to make it faster.

Comments
Hello, indeed it's more difficult to update, as there are more steps, but 20h for 1M objects might reveal a problem somewhere else.
Import+update is the recommended way to update metadata for all instruments (because the true metadata should be in the files on the drive). It is indeed long. => it would be interesting to implement if:
There is a massive one currently ongoing: jobid 57626
#830 prevents viewing the log interactively (it's too big), but it's on the server (/home/ecotaxa/ecotaxa/temptask/task057626). Output of top:
A typical update sequence from this job:
I did not do a precise statistical analysis of the time spent in each sequence, but it looks quite flat over the job duration, i.e. no degradation since the job started, which is good. The next step is a timing comparison with similar direct SQL updates on the DB. On my dev DB:
So we get 100× more updates in approximately the same time; clearly a 10× improvement is doable. I'd say, just from observations, that the diff algorithm and the in-memory updates via SQLAlchemy take the majority of the CPU time.
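As an illustration of the gap, here is a minimal timing sketch (not EcoTaxa code; the connection string, table and column names `obj_head`, `objid`, `latitude`, `longitude` are all assumptions) contrasting one UPDATE round-trip per object, which is roughly what a per-row ORM flush produces, with the same statement sent as a single batch:

```python
import time
from sqlalchemy import create_engine, text

# Hypothetical connection string and schema names; adjust to the real DB.
engine = create_engine("postgresql+psycopg2://user:pass@localhost/ecotaxa_dev")

UPD = text("UPDATE obj_head SET latitude=:lat, longitude=:lon WHERE objid=:id")

def time_updates(rows):
    """Compare N single-row UPDATE round-trips with one batched executemany."""
    with engine.begin() as conn:
        t0 = time.perf_counter()
        for r in rows:              # one statement per object, ORM-flush style
            conn.execute(UPD, r)
        t1 = time.perf_counter()
        conn.execute(UPD, rows)     # same statement, sent as a single batch
        t2 = time.perf_counter()
    print(f"per-row: {t1 - t0:.2f}s, batched: {t2 - t1:.2f}s")

# rows would look like [{"id": 1, "lat": 43.2, "lon": 5.37}, ...]
```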
On my other dev DB, the times are more like 2s; obviously a serious benchmark is needed under similar conditions.
OK, after more code examination: the update has been coded as "non-breaking exceptions" to the general import, which means minimal code but minimal performance as well.
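For illustration, the per-row flow probably looks roughly like this (a sketch with hypothetical names, not the actual code): each input row triggers a lookup, a field-by-field diff, and an individual flush.

```python
# Sketch of a per-row "import, with update as the non-breaking exception".
# ObjectHeader and the orig_id column are hypothetical names.
def import_row(session, ObjectHeader, row):
    obj = (session.query(ObjectHeader)
                  .filter_by(orig_id=row["orig_id"])
                  .one_or_none())
    if obj is None:
        session.add(ObjectHeader(**row))      # plain import path
    else:
        for col, new_val in row.items():      # update path: diff each field
            if getattr(obj, col, None) != new_val:
                setattr(obj, col, new_val)    # marks the object as dirty
    session.flush()                           # DB round-trip(s) per row
```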
Optimally, we would group the operations that can be grouped (e.g. read all objects together and write them back the same way), and do all this in a specific update path, as sketched below.

For benchmarking first: a test project cannot be used; a real one of decent size (>10K objects) is needed. The measurement would be done with an update of e.g. coordinates + some free columns. Of course, we need 2 updates to have a repeatable flip-flop case, as a second update with the same data would do nothing.

Some coding is needed already to hack the "optimal" solution described above. The second step would be finalization of the "hacks" + addition of more unit tests in the QA directory. Workload estimate is 8 + 8 hours.
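A sketch of that grouped shape (hypothetical names, including the projid filter): one read for all objects, an in-memory diff, then one batched write per table.

```python
from sqlalchemy import text

def grouped_update(conn, projid, new_rows):
    # 1. Read all current values together, in a single query.
    current = {r.objid: (r.latitude, r.longitude) for r in conn.execute(
        text("SELECT objid, latitude, longitude FROM obj_head "
             "WHERE projid = :p"), {"p": projid})}
    # 2. Diff in memory: keep only the rows whose values actually change.
    changed = [r for r in new_rows
               if current.get(r["id"]) != (r["lat"], r["lon"])]
    # 3. Write all changes back in one batched statement.
    if changed:
        conn.execute(
            text("UPDATE obj_head SET latitude=:lat, longitude=:lon "
                 "WHERE objid=:id"),
            changed)
    return len(changed)
```

For the flip-flop benchmark, two input files A and B with differing coordinates/free columns would be applied alternately (A, B, A, ...), so every run performs real writes instead of the second pass being a no-op.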
De-assigning myself until a decision is made to work on this.