Very slow reading of mhydro file #749

Havrevoll · 2024-11-15T08:10:15Z

Describe the bug
I have three different mhydro files, 2 to 5 MB each. Reading them with mikeio.read_pfs takes about 2 minutes for the first, 10 minutes for the second and 20 minutes for the third, and I can't see why it is so slow. Do they contain catchment polygons or other data that are unusually complicated to parse? Once a file is in memory, reading and changing values is fast, and writing the file back out takes no time.

To Reproduce
Steps to reproduce the behavior:
Load the relevant file, sent by email to JEM and JAN:
import mikeio
mhr_file = mikeio.read_pfs("river_name.mhydro")
I don't think it matters if I use IPython, Jupyter Notebook or plain Python.

Expected behavior
The file should load into memory within a few seconds.

System information:

Python version 3.12.5
MIKE IO version 1.6.3


ecomodeller · 2024-11-18T14:05:33Z

Initial profiling has revealed that ~100% of the time is spent in

mikeio/mikeio/pfs/_pfsdocument.py

Lines 361 to 364 in 299e509

    
    _COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")

    def _split_line_by_comma(self, s: str) -> list[str]:
        return self._COMMA_MATCHER.split(s)

@jsmariegaard the name of the private method _split_line_by_comma suggests that it should be simple, but the regex is needed to avoid splitting on commas inside quoted strings. (The regex was introduced in 3bf4a58.)
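To illustrate what the regex is for, here is a minimal standalone sketch. The pattern is copied from the snippet above; the module-level names and the sample strings are made up for the illustration and are not part of mikeio.

```python
import re

# Pattern copied verbatim from the snippet above. It matches a comma only when
# the rest of the line contains an even number of quote characters, i.e. when
# the comma is not inside a quoted string.
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")


def split_line_by_comma(s: str) -> list[str]:
    """Standalone copy of the private method, for illustration only."""
    return COMMA_MATCHER.split(s)


# Hypothetical sample lines, not taken from a real pfs file.
plain = split_line_by_comma("1, 2, 3")
quoted = split_line_by_comma("'a,b', 2")

print(plain)
print(quoted)
```

The comma inside `'a,b'` is kept, while the commas outside the quotes split the line.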

The example file contains some very long lines. One of them starts like this:

Shape = 'MULTIPOLYGON(((475130.002457948 6524999.99673699,4751

and has 25799 commas in it. I don't know whether these lines are the ones that take the time, but they certainly contain a lot of commas.
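A plausible explanation, not verified against the actual file: at every comma the lookahead has to scan the rest of the line to count quotes, so a single quoted string with n commas costs roughly n scans of up to n characters each. The sketch below times the same pattern on synthetic lines of growing size. The helper names and the coordinate values are made up, and the sizes are kept small so it finishes quickly.

```python
import re
import time

# Same pattern as in _pfsdocument.py (copied from the snippet above).
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")


def make_quoted_line(n_points: int) -> str:
    """Build a synthetic line with n_points coordinate pairs inside one quoted string."""
    coords = ",".join("475130.002 6524999.997" for _ in range(n_points))
    return f"Shape = 'MULTIPOLYGON((({coords})))'"


def time_split(n_points: int) -> float:
    """Return the seconds spent splitting one synthetic line."""
    line = make_quoted_line(n_points)
    t0 = time.perf_counter()
    parts = COMMA_MATCHER.split(line)
    elapsed = time.perf_counter() - t0
    # Every comma sits inside the quotes, so the line must not be split.
    assert len(parts) == 1
    return elapsed


if __name__ == "__main__":
    # If the cost is quadratic, doubling n should roughly quadruple the time.
    for n in (500, 1000, 2000):
        print(n, f"{time_split(n):.3f} s")
```

If the timings grow about fourfold per doubling, that would be consistent with a line of 25799 commas taking minutes.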

jsmariegaard · 2024-11-18T14:58:53Z

I wonder if we could have special handling of a MULTIPOLYGON 🤔?

ecomodeller · 2024-11-18T17:40:20Z

> I wonder if we could have special handling of a MULTIPOLYGON 🤔?

To try this out, I removed the 11 lines containing MULTIPOLYGON:

grep -v MULTIPOLYGON file_name.mhydro > stripped.mhydro

Parsing the original pfs file took ~10 minutes; the stripped file with 8992 lines (11 lines shorter) took 0.3 seconds to read 🤯.

ecomodeller · 2024-11-19T06:54:39Z

Special handling of MULTIPOLYGON brings the time down to 1.0 s
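For comparison, a more general alternative to keyword-based special handling would be a single-pass splitter that tracks whether it is inside quotes. This is only a sketch and not necessarily what #750 implements. The function name is hypothetical, and it assumes quotes are balanced on each line, in which case it should give the same result as the regex.

```python
import re

# Existing pattern, used here only to cross-check the sketch.
COMMA_MATCHER = re.compile(r",(?=(?:[^\"']*[\"'][^\"']*[\"'])*[^\"']*$)")


def split_outside_quotes(line: str) -> list[str]:
    """Split on commas that are not inside quotes, in one pass over the line.

    Hypothetical helper. Like the regex, it treats ' and " interchangeably.
    Assumes the line contains an even number of quote characters.
    """
    parts: list[str] = []
    start = 0
    in_quotes = False
    for i, ch in enumerate(line):
        if ch in "\"'":
            in_quotes = not in_quotes  # entering or leaving a quoted string
        elif ch == "," and not in_quotes:
            parts.append(line[start:i])
            start = i + 1
    parts.append(line[start:])  # trailing piece after the last split
    return parts


# Made-up sample lines for a quick equivalence check against the regex.
samples = [
    "1, 2, 3",
    "'a,b', 2",
    "Shape = 'MULTIPOLYGON(((1 2,3 4,5 6)))'",
    "x = \"p,q\", 'r,s', 7",
]
for s in samples:
    assert split_outside_quotes(s) == COMMA_MATCHER.split(s)
```

Each character is visited once, so the cost grows linearly with line length regardless of how many commas sit inside a quoted string.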

ecomodeller mentioned this issue Nov 19, 2024

Pfs - handle slow reading of MIKE Hydro #750

Merged

ecomodeller linked a pull request Nov 19, 2024 that will close this issue

Pfs - handle slow reading of MIKE Hydro #750

Merged

ecomodeller closed this as completed in #750 Nov 19, 2024
