
Extended Loading Time for Mainstem Data from CSV #111

Closed
webb-ben opened this issue Jul 11, 2023 · 6 comments

Comments

@webb-ben
Member

webb-ben commented Jul 11, 2023

Description:

Issue Summary:
Loading the mainstem data from a CSV file takes an unusually long time and noticeably slows down database initialization. This issue tracks investigating and optimizing that loading step.

Steps to Reproduce:

  1. Start the application.
  2. Observe the logs during the data loading process.

Expected Behavior:

Loading the mainstem data for the demo should complete in a reasonable time frame, comparable to the other data loading steps.

Actual Behavior:

The loading of the mainstem data from the CSV file is taking a considerable amount of time, as indicated in the following logs:

18:30:32.666 INFO  [liquibase.changelog.ChangeSet]: Custom SQL executed
18:30:32.667 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/grants.sql::grant.select.usage.update.on.web_service_log_web_service_log_id_seq.to.${NLDI_READ_ONLY_USERNAME}::drsteini ran successfully in 4ms
18:30:32.681 INFO  [liquibase.changelog.ChangeSet]: Data deleted from crawler_source
18:30:33.855 INFO  [liquibase.changelog.ChangeSet]: Data loaded from crawler_source.tsv into crawler_source
18:30:33.860 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/update_crawler_source/changeLog.yml::load.nldi_data.update_crawler_source::kkehl ran successfully in 1184ms
18:30:33.893 INFO  [liquibase.changelog.ChangeSet]: Data deleted from mainstem_lookup
18:34:11.939 INFO  [liquibase.changelog.ChangeSet]: Data loaded from /liquibase/mainstem_lookup.csv into mainstem_lookup
18:34:11.948 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/copyMainstemData.yml::load.nldi_data.mainstem_lookup::egrahn ran successfully in 218061ms
18:34:12.025 INFO  [liquibase.changelog.ChangeSet]: Custom SQL executed
18:34:12.026 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/characteristic_data/tables.sql::create.characteristic_data.characteristic_metadata::ayan ran successfully in 47ms
18:34:12.048 INFO  [liquibase.changelog.ChangeSet]: Custom SQL executed

As seen above, the mainstem_lookup load took approximately 218 seconds (218061 ms), while the crawler_source load completed in about 1.2 seconds.
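For context, Liquibase's loadData change type typically loads rows through generated INSERT statements rather than a bulk COPY, which may account for part of the slowdown on a national-scale table. As a rough point of comparison (not how the changeset currently works), a bulk load of the same file could look like the sketch below; it assumes the CSV header matches the columns of nldi_data.mainstem_lookup and reuses the file path from the logs.

-- Hypothetical bulk-load comparison using psql; assumes the CSV header row
-- lines up with the columns of nldi_data.mainstem_lookup.
\timing on
TRUNCATE nldi_data.mainstem_lookup;
\copy nldi_data.mainstem_lookup FROM '/liquibase/mainstem_lookup.csv' WITH (FORMAT csv, HEADER true)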

webb-ben changed the title from "Mainstem lookup" to "Extended Loading Time for Mainstem Data from CSV" on Jul 11, 2023
@dblodgett-usgs
Member

This is because it is downloading and loading the national table. I believe it's grabbing this file: https://code.usgs.gov/wma/nhgf/reference-hydrofabric/-/raw/main/workspace/data/mainstem_lookup.csv.gz?inline=false ??

Would it be helpful to check one in for the demo database there, or should we do it elsewhere?

@webb-ben
Member Author

I believe the file is downloaded as a .gz during the Docker build - so the time spent in this step is just importing mainstem_lookup.csv. What confuses me is that this step isn't indexing any columns in the mainstem_lookup table.
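For what it's worth, a couple of quick Postgres checks along these lines (a sketch, assuming the nldi_data schema) would show how large the loaded table ends up and whether any indexes exist on it:

-- Diagnostics after the load (assumes the nldi_data schema):
-- total size of the loaded table, and any indexes defined on it.
SELECT pg_size_pretty(pg_total_relation_size('nldi_data.mainstem_lookup')) AS table_size;

SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'nldi_data' AND tablename = 'mainstem_lookup';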

Docker Desktop now has a handy memory plotter - this is what I see while starting nldi-db:demo from scratch:
[image: Docker Desktop resource usage plot during startup]
logs:

17:30:59.740 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/grants.sql::grant.select.usage.update.on.web_service_log_web_service_log_id_seq.to.${NLDI_READ_ONLY_USERNAME}::drsteini ran successfully in 5ms
17:30:59.749 INFO  [liquibase.changelog.ChangeSet]: Data deleted from crawler_source
17:31:00.930 INFO  [liquibase.changelog.ChangeSet]: Data loaded from crawler_source.tsv into crawler_source
17:31:00.935 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/update_crawler_source/changeLog.yml::load.nldi_data.update_crawler_source::kkehl ran successfully in 1190ms
17:31:00.972 INFO  [liquibase.changelog.ChangeSet]: Data deleted from mainstem_lookup
17:35:24.511 INFO  [liquibase.changelog.ChangeSet]: Data loaded from /liquibase/mainstem_lookup.csv into mainstem_lookup
17:35:24.521 INFO  [liquibase.changelog.ChangeSet]: ChangeSet /liquibase/nldi/nldi_data/copyMainstemData.yml::load.nldi_data.mainstem_lookup::egrahn ran successfully in 263554ms

This becomes somewhat a question of establishing clear understandings of the different flavors of nldi-db being published (re: #100), because a Yahara-only CSV of mainstems would be a lot smaller - and faster to load - than the full-scale mainstem lookup table.

@webb-ben
Member Author

webb-ben commented Jul 12, 2023

@dblodgett-usgs For the demo database at least, I think a smaller subset of mainstem_lookup.csv makes sense. The demo database is supposed to work out of the box - I think it is fine to only include the subset of comids that are in the Yahara basin.

Based on our conversations about the future controls put on the NLDI Crawler, this would mean the demo database would only index data from the Yahara basin.

This SQL returns 192 rows and could easily be put into an artifact for the demo database Dockerfile:

-- Mainstems whose NHDPlusV2 comid matches a feature already present in the demo database
SELECT DISTINCT m.*
FROM nldi_data.mainstem_lookup m
LEFT JOIN nldi_data.feature f ON m.nhdpv2_comid = f.comid
WHERE f.comid IS NOT NULL;
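One way to produce that artifact (a sketch - the output path is a placeholder, and COPY ... TO writes on the database server, so it needs appropriate privileges; psql's \copy is the client-side equivalent):

-- Hypothetical export of the Yahara-only subset to a CSV artifact.
COPY (
    SELECT DISTINCT m.*
    FROM nldi_data.mainstem_lookup m
    LEFT JOIN nldi_data.feature f ON m.nhdpv2_comid = f.comid
    WHERE f.comid IS NOT NULL
) TO '/tmp/mainstem_lookup_yahara.csv' WITH (FORMAT csv, HEADER true);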

@webb-ben
Member Author

With the smaller mainstem_lookup.csv from ^, the demo database takes under a minute to fully set up.
[image]
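A quick sanity check after startup (assuming the subset loaded as expected) is that the table now only holds the 192 Yahara rows:

-- Should return 192 for the Yahara-only demo subset.
SELECT count(*) FROM nldi_data.mainstem_lookup;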

@dblodgett-usgs
Member

Nice!

@webb-ben
Member Author

I will create artifacts for the release then!
