
Extended Loading Time for Mainstem Data from CSV #111

Closed
webb-ben opened this issue Jul 11, 2023 · 6 comments

Comments

@webb-ben
Member

webb-ben commented Jul 11, 2023

Description:

Issue Summary:
Loading the mainstem data from a CSV file takes an unusually long time and noticeably slows down database initialization. This issue tracks investigating and optimizing that loading step.

Steps to Reproduce:

  1. Start the application.
  2. Observe the logs during the data loading process.

Expected Behavior:

Loading the mainstem data for the demo should complete in a reasonable time frame, comparable to the other data loading steps.

Actual Behavior:

The loading of the mainstem data from the CSV file is taking a considerable amount of time, as indicated in the following logs:

18:30:32.666 INFO  [liquibase.changelog.ChangeSet]: Custom SQL executed
18:30:32.667 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/grants.sql::grant.select.usage.update.on.web_service_log_web_service_log_id_seq.to.${NLDI_READ_ONLY_USERNAME}::drsteini ran successfully in 4ms
18:30:32.681 INFO  [liquibase.changelog.ChangeSet]: Data deleted from crawler_source
18:30:33.855 INFO  [liquibase.changelog.ChangeSet]: Data loaded from crawler_source.tsv into crawler_source
18:30:33.860 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/update_crawler_source/changeLog.yml::load.nldi_data.update_crawler_source::kkehl ran successfully in 1184ms
18:30:33.893 INFO  [liquibase.changelog.ChangeSet]: Data deleted from mainstem_lookup
18:34:11.939 INFO  [liquibase.changelog.ChangeSet]: Data loaded from /liquibase/mainstem_lookup.csv into mainstem_lookup
18:34:11.948 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/copyMainstemData.yml::load.nldi_data.mainstem_lookup::egrahn ran successfully in 218061ms
18:34:12.025 INFO  [liquibase.changelog.ChangeSet]: Custom SQL executed
18:34:12.026 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/characteristic_data/tables.sql::create.characteristic_data.characteristic_metadata::ayan ran successfully in 47ms
18:34:12.048 INFO  [liquibase.changelog.ChangeSet]: Custom SQL executed

As seen above, the mainstem_lookup load took approximately 218 seconds (218061 ms), while the crawler_source load completed in about 1.2 seconds.
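For context, Liquibase's loadData change type typically loads rows through generated INSERT statements rather than a bulk COPY, which may account for part of the slowdown on a national-scale table. As a rough point of comparison (not how the changeset currently works), a bulk load of the same file could look like the sketch below; it assumes the CSV header matches the columns of nldi_data.mainstem_lookup and reuses the file path from the logs.

-- Hypothetical bulk-load comparison using psql; assumes the CSV header row
-- lines up with the columns of nldi_data.mainstem_lookup.
\timing on
TRUNCATE nldi_data.mainstem_lookup;
\copy nldi_data.mainstem_lookup FROM '/liquibase/mainstem_lookup.csv' WITH (FORMAT csv, HEADER true)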

webb-ben changed the title from "Mainstem lookup" to "Extended Loading Time for Mainstem Data from CSV" on Jul 11, 2023
@dblodgett-usgs
Member

This is because it is downloading and loading the national table. I believe it's grabbing this file: https://code.usgs.gov/wma/nhgf/reference-hydrofabric/-/raw/main/workspace/data/mainstem_lookup.csv.gz?inline=false ??

Would it be helpful to check one in for the demo database there, or should we do it elsewhere?

@webb-ben
Member Author

I believe the file is downloaded as a .gz during the Docker build - so the time spent in this step is just importing mainstem_lookup.csv. What confuses me is that this step isn't indexing any columns in the mainstem_lookup table.
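For what it's worth, a couple of quick Postgres checks along these lines (a sketch, assuming the nldi_data schema) would show how large the loaded table ends up and whether any indexes exist on it:

-- Diagnostics after the load (assumes the nldi_data schema):
-- total size of the loaded table, and any indexes defined on it.
SELECT pg_size_pretty(pg_total_relation_size('nldi_data.mainstem_lookup')) AS table_size;

SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'nldi_data' AND tablename = 'mainstem_lookup';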

Docker Desktop now has a handy memory plotter - this is what I see while starting nldi-db:demo from scratch:
[image: Docker Desktop resource usage plot during startup]
logs:

17:30:59.740 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/grants.sql::grant.select.usage.update.on.web_service_log_web_service_log_id_seq.to.${NLDI_READ_ONLY_USERNAME}::drsteini ran successfully in 5ms
17:30:59.749 INFO  [liquibase.changelog.ChangeSet]: Data deleted from crawler_source
17:31:00.930 INFO  [liquibase.changelog.ChangeSet]: Data loaded from crawler_source.tsv into crawler_source
17:31:00.935 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/update_crawler_source/changeLog.yml::load.nldi_data.update_crawler_source::kkehl ran successfully in 1190ms
17:31:00.972 INFO  [liquibase.changelog.ChangeSet]: Data deleted from mainstem_lookup
17:35:24.511 INFO  [liquibase.changelog.ChangeSet]: Data loaded from /liquibase/mainstem_lookup.csv into mainstem_lookup
17:35:24.521 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/copyMainstemData.yml::load.nldi_data.mainstem_lookup::egrahn ran successfully in 263554ms

This becomes somewhat a question of establishing clear understandings of the different flavors of nldi-db being published (re: #100), because a Yahara-only CSV of mainstems would be a lot smaller - and faster to load - than the full-scale mainstem lookup table.

@webb-ben
Member Author

webb-ben commented Jul 12, 2023

@dblodgett-usgs For the demo database at least, I think a smaller subset of mainstem_lookup.csv makes sense. The demo database is supposed to work out of the box - I think it is fine to only include the subset of comids that are in the Yahara basin.

Based on our conversations about the future controls put on the NLDI Crawler, this would mean the demo database would only index data from the Yahara basin.

This SQL returns 192 rows and could easily be put into an artifact for the demo database Dockerfile:

-- Mainstems whose NHDPlusV2 comid matches a feature already present in the demo database
SELECT DISTINCT m.*
FROM nldi_data.mainstem_lookup m
LEFT JOIN nldi_data.feature f ON m.nhdpv2_comid = f.comid
WHERE f.comid IS NOT NULL;
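One way to produce that artifact (a sketch - the output path is a placeholder, and COPY ... TO writes on the database server, so it needs appropriate privileges; psql's \copy is the client-side equivalent):

-- Hypothetical export of the Yahara-only subset to a CSV artifact.
COPY (
    SELECT DISTINCT m.*
    FROM nldi_data.mainstem_lookup m
    LEFT JOIN nldi_data.feature f ON m.nhdpv2_comid = f.comid
    WHERE f.comid IS NOT NULL
) TO '/tmp/mainstem_lookup_yahara.csv' WITH (FORMAT csv, HEADER true);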

@webb-ben
Member Author

With the smaller mainstem_lookup.csv from ^, the demo database takes under a minute to fully set up.
[image]
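A quick sanity check after startup (assuming the subset loaded as expected) is that the table now only holds the 192 Yahara rows:

-- Should return 192 for the Yahara-only demo subset.
SELECT count(*) FROM nldi_data.mainstem_lookup;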

@dblodgett-usgs
Member

Nice!

@webb-ben
Member Author

I will create artifacts for the release then!
