Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow reading of mhydro file #749

Closed
Havrevoll opened this issue Nov 15, 2024 · 4 comments · Fixed by #750
Closed

Very slow reading of mhydro file #749

Havrevoll opened this issue Nov 15, 2024 · 4 comments · Fixed by #750

Comments

@Havrevoll
Copy link
Collaborator

Describe the bug
I have three different mhydro files, of sizes 2-5 MB. When reading them with mikeio.read_pfs, one takes 2 minutes to read, one takes 10 minutes and the last takes 20 minutes to read, and I can't understand why it takes so long. Are there any catchment polygons or other data that are unusually complicated to parse? After the files are read into memory, all reading and changing of values goes quickly. Writing the file again is done in no time.

To Reproduce
Steps to reproduce the behavior:
Load the relevant file, sent by email to JEM and JAN:
mhr_file = mikeio.read_pfs("river_name.mhydro")
I don't think it matters if I use IPython, Jupyter Notebook or plain Python.

Expected behavior
The file would be loaded into memory in some seconds.

System information:

  • Python version 3.12.5
  • MIKE IO version 1.6.3
@ecomodeller
Copy link
Member

ecomodeller commented Nov 18, 2024

Initial profiling has revealed that ~100% of the time is spent in

_COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")
def _split_line_by_comma(self, s: str) -> list[str]:
return self._COMMA_MATCHER.split(s)

@jsmariegaard the name of the private method _split_line_by_comma suggests that it should be simple, but the reason the regex is needed is to avoid splitting on commas inside strings. (regex was introduced in 3bf4a58)

There are some really long lines in the example file, one line contains a line like this:

Shape = 'MULTIPOLYGON(((475130.002457948 6524999.99673699,4751

with 25799 commas inside it, I don't know if these are the ones that takes time, but at least it contains many commas.

@jsmariegaard
Copy link
Member

I wonder if we could have special handling of a MULTIPOLYGON 🤔?

@ecomodeller
Copy link
Member

ecomodeller commented Nov 18, 2024

I wonder if we could have special handling of a MULTIPOLYGON 🤔?

To try this out I removed the 11 lines with MULTIPOLYGON

grep -v MULTIPOLYGON file_name.mhydro > stripped.mhydro

Parsing the original pfs file took ~10 minutes, the stripped file with 8992 lines (11 lines shorter) took 0.3 seconds to read🤯.

@ecomodeller
Copy link
Member

ecomodeller commented Nov 19, 2024

Special handling of MULTIPOLYGON brings the time down to 1.0 s
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants