
Fix data loss issue in combine_echodata #824

Merged (100 commits) on Oct 7, 2022

Conversation

@b-reyes (Contributor) commented Oct 1, 2022

This PR addresses #822. This is done by first creating a mapping between the uniform chunks and the initial starting chunks. A Dask Lock is then assigned to each of the initial starting chunks so that no two starting chunks are written to the same uniform chunk at the same time. To illustrate the approach, consider the following simplified example:

Say we have three files, each with the variable back_r, containing the following values (these would be the starting chunks):

file 1: back_r = [0,1,2]

file 2: back_r = [3,4,5]

file 3: back_r = [6,7]

We then want to combine all of these back_r variables into back_r_combined = [0,1,2,3,4,5,6,7] with uniform chunk size 2.

Thus, the chunks would be as follows:

chunk 1: [0,1]

chunk 2: [2,3]

chunk 3: [4,5]

chunk 4: [6,7]
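
The mapping from the uniform chunks back to the files that contribute to them can be computed directly from the per-file lengths. Below is a minimal sketch of that idea for this example (illustrative only, not the exact code used in this PR):

```python
import numpy as np

file_lengths = [3, 3, 2]       # lengths of back_r in files 1, 2, 3
uniform_chunk_size = 2

# global start/end index of each file's data within the combined array
starts = np.cumsum([0] + file_lengths[:-1])
ends = starts + np.array(file_lengths)

total_len = sum(file_lengths)
chunk_to_files = {}
for chunk_idx in range(int(np.ceil(total_len / uniform_chunk_size))):
    c_start = chunk_idx * uniform_chunk_size
    c_end = min(c_start + uniform_chunk_size, total_len)
    # a file overlaps this uniform chunk if their index ranges intersect
    chunk_to_files[chunk_idx + 1] = [
        f for f, (s, e) in enumerate(zip(starts, ends), start=1)
        if s < c_end and e > c_start
    ]

print(chunk_to_files)  # {1: [1], 2: [1, 2], 3: [2], 4: [3]}
```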

For all chunks besides chunk 2, we can safely perform the writes in parallel. However, chunk 2 contains data from both file 1 and file 2, so two different processes would attempt to write to chunk 2 at the same time and data corruption would likely occur. To remedy this, we can assign a lock name to each write to a uniform chunk:

data write [0,1]  will be given lock name = "lock1"

data write [2] and [3]  will be given lock name = "lock2"

data write [4,5]  will be given lock name = "lock3"

data write [6,7]  will be given lock name = "lock4"

We then use a Dask Lock with the established lock names, which prevents two writes from touching the same uniform chunk at the same time.
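
To make the locking idea concrete, here is a minimal, self-contained sketch of region writes into a Zarr store guarded by dask.distributed.Lock. It is illustrative only: the dimension name, store path, and helper function are made up, and the actual implementation in this PR differs in its details.

```python
import numpy as np
import xarray as xr
from dask.distributed import Client, Lock

client = Client(processes=False)  # locks are coordinated through the scheduler

# pre-create the combined store with uniform chunks of size 2 (metadata only)
combined = xr.Dataset({"back_r": ("x", np.zeros(8))}).chunk({"x": 2})
combined.to_zarr("combined.zarr", compute=False, mode="w")

def locked_region_write(values, start, stop, lock_name):
    """Write values into combined.zarr at [start:stop] under a named Dask Lock."""
    piece = xr.Dataset({"back_r": ("x", values)})
    # every write that can touch a given uniform chunk uses the same lock
    # name, so those writes are serialized and cannot corrupt that chunk
    with Lock(lock_name):
        piece.to_zarr("combined.zarr", region={"x": slice(start, stop)}, mode="r+")

# file 1 -> [0,1] under lock1, and its spill-over value [2] under lock2
locked_region_write(np.array([0.0, 1.0]), 0, 2, "lock1")
locked_region_write(np.array([2.0]), 2, 3, "lock2")
# file 2 -> [3] shares lock2 with file 1's spill-over, then [4,5] under lock3
locked_region_write(np.array([3.0]), 3, 4, "lock2")
locked_region_write(np.array([4.0, 5.0]), 4, 6, "lock3")
```

In the PR these writes happen as parallel Dask tasks; sharing a lock name is what keeps the two writes into uniform chunk 2 from overlapping.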

b-reyes and others added 30 commits August 26, 2022 17:02
@b-reyes (Contributor, Author) commented Oct 5, 2022

@lsetiawan I agree with you that I was incorrectly using Optional for storage_options; this should always be a required variable with an empty dict as its value. I have hopefully changed all typing to reflect this.

I am slightly puzzled about your changes to the docstrings. Based on your comments, it appears that you do not want typing types in the docstrings, i.e. things like Dict, Optional, etc. If this is the case, can you explain to me why? I don't necessarily disagree with it, but I would like some guidelines for my future documentation of echopype.

@lsetiawan (Member) commented Oct 5, 2022

> I am slightly puzzled about your changes to the docstrings. Based on your comments, it appears that you do not want typing types in the docstrings, i.e. things like Dict, Optional, etc. If this is the case, can you explain to me why? I don't necessarily disagree with it, but I would like some guidelines for my future documentation of echopype.

I do like the straight type hints in there; however, users who are not familiar with type hinting will read it and might be puzzled. Type hints are a relatively new feature in Python, while the primitive types are much more familiar to the majority of people and much more readable. Take, for example, a docstring for a parameter start_count that can take an int or a float. A type hint would be Union[int, float]... however, that's not very intuitive: it's great for an IDE, but not for a new Python user who just wants to use the function based on the docstring. The more appropriate thing would be start_count: int or float or start_count: int | float. Anyway, type hints can get very cryptic real quick, and I don't think the docstring is the right place for them.

@b-reyes (Contributor, Author) commented Oct 5, 2022

> Anyway, type hints can get very cryptic real quick, and I don't think the docstring is the right place for them.

That is a fair point. I will go ahead and fix all docstrings that have them.

To your point about user readability, I think I prefer start_count: int or float over start_count: int | float. Do you mind if I use the format start_count: int or float? For an optional, this would look like start_count: int or None.

@lsetiawan (Member) commented Oct 5, 2022

> Do you mind if I use the format start_count: int or float?

Yeah, I think that's the best option as well.

> For an optional, this would look like start_count: int or None

For this, based on numpydoc convention, it should be int, optional.

See: https://numpydoc.readthedocs.io/en/latest/format.html#parameters

[Screenshot: numpydoc parameter-format example from the linked documentation]
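
For reference, a hypothetical docstring following that convention might look like this (the function and parameter names are made up purely for illustration):

```python
def subset_pings(ds, start_count=0, threshold=None):
    """Select a subset of pings from a dataset (hypothetical example).

    Parameters
    ----------
    ds : xarray.Dataset
        Dataset containing the pings to subset.
    start_count : int or float
        Index (int) or fraction (float) at which to start the subset.
    threshold : float, optional
        Drop pings below this value; if None, keep all pings.
    """
    ...
```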

@b-reyes (Contributor, Author) commented Oct 5, 2022

> For this, based on numpydoc convention, it should be int, optional.

Thank you for that reference! Great, I will go with int, optional.

@lsetiawan (Member) commented

Thanks! Sorry for all of these stylistic changes! I guess it's better cleaning them up now than later 😛

@b-reyes (Contributor, Author) commented Oct 5, 2022

> Thanks! Sorry for all of these stylistic changes! I guess it's better cleaning them up now than later 😛

No worries. I agree, it is better to change them now.

@lsetiawan (Member) left a comment


After the detailed testing found here: https://nbviewer.org/gist/lsetiawan/ebb3faed65e53a3188518d62dbe0968a, I conclude that this version of combine_echodata doesn't have any data loss issue and is able to combine a large amount of data with minimal impact on memory consumption/spikes.

The example notebook that I tested converted 318 EK60 raw files from OOI in 16 min and then combined those files in about 15 min. These times may be due to limitations in my CPU clock speed. The memory consumption on my machine never blew up, and at the end I was able to explore 133 GB of data with ease.

AWESOME WORK @b-reyes! This PR is ready for merging. Please put your testing results for the Hake data in this PR.

@b-reyes (Contributor, Author) commented Oct 7, 2022

@lsetiawan thank you very much for investigating this PR and testing it on a large number of data files! I am glad to hear that memory consumption stayed steady and that the runtime for combining the files was small.

As @lsetiawan mentioned, I have also tested Hake data. Specifically, I tested 107 files within s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60. It took roughly 7 minutes to convert the data and 5 minutes to combine them. The notebook to run this can be found here: https://nbviewer.org/gist/b-reyes/77bbede005f266509618d1430ae8f33e; please be sure to provide a Client to combine_echodata before you run it!
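
For anyone reproducing this, a rough sketch of that workflow is below. The file paths are placeholders and the exact combine_echodata signature may differ between echopype versions, so treat it as illustrative rather than exact:

```python
from dask.distributed import Client
import echopype as ep

client = Client()  # start a local Dask cluster before combining

# convert a handful of raw files (paths are placeholders)
eds = [
    ep.open_raw(raw_file, sonar_model="EK60")
    for raw_file in ["file1.raw", "file2.raw", "file3.raw"]
]

# pass the client so combine_echodata can coordinate its Dask Locks
combined = ep.combine_echodata(
    eds, zarr_path="combined_echodata.zarr", client=client
)
```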
