Fix data loss issue in combine_echodata
#824
Conversation
@lsetiawan I agree with you that I was using that incorrectly. I am slightly puzzled about your changes to the doc strings, though. Based on your comments, it appears that you do not want typing types (i.e. forms from the `typing` module) in the doc strings.
I do like the straight type hints in there; however, users that are not familiar with type hinting will be reading them and might be puzzled. Type hints are a newer feature in Python, while the primitive types are a lot more familiar to the majority of people and much more readable. Take, for example, a docstring with a parameter typed that way.
That is a fair point. I will go ahead and fix all the doc strings that have them. To your point about user readability, I think I like that form too.
Yea, I think that's the best also.
For this, based on numpydoc convention, it should follow the numpydoc parameter style. See: https://numpydoc.readthedocs.io/en/latest/format.html#parameters
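For illustration, a parameter section in that numpydoc style might look like this (the function and its parameters here are hypothetical):

```python
def combine_frequencies(ed_list, channel=None):
    """
    Hypothetical function illustrating the numpydoc parameter style.

    Parameters
    ----------
    ed_list : list of EchoData
        EchoData objects to combine; note the ``list of EchoData`` form
        rather than the typing form ``List[EchoData]``.
    channel : str, optional
        Restrict the operation to a single channel.
    """
```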
Thank you for that reference! Great, I will go with that.
Thanks! Sorry for all of these stylistic changes! I guess it's better to clean them up now than later 😛
No worries. I agree, it is better to change them now.
After detailed testing, documented here: https://nbviewer.org/gist/lsetiawan/ebb3faed65e53a3188518d62dbe0968a, I conclude that this combine_echodata doesn't have any data loss issue and is able to combine a large amount of data with minimal impact on memory consumption/spikes.
The example notebook converted 318 EK60 raw files from OOI in 16 min and then combined those files in about 15 min; these times may be limited by my CPU clock speed. Memory consumption on my machine never blew up, and at the end I was able to explore 133 GB of data with ease.
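For reference, here is a minimal sketch of the convert-then-combine workflow being tested here. The file locations and the exact combine_echodata parameters are assumptions; check the echopype docs for the released API:

```python
from pathlib import Path

import echopype as ep
from dask.distributed import Client

client = Client()  # Dask cluster that performs the parallel zarr writes

raw_files = sorted(Path("raw_data").glob("*.raw"))  # EK60 raw files (assumed location)

# Step 1: convert each raw file to its own zarr store
for raw_file in raw_files:
    ed = ep.open_raw(raw_file, sonar_model="EK60")
    ed.to_zarr(save_path="converted/", overwrite=True)

# Step 2: lazily reopen the converted files and combine them into one store
ed_list = [
    ep.open_converted(Path("converted") / f"{f.stem}.zarr") for f in raw_files
]
combined = ep.combine_echodata(
    ed_list, zarr_path="combined_echodata.zarr", client=client
)
```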
AWESOME WORK @b-reyes! This PR is ready for merging. Please put your testing results for the Hake data in this PR.
@lsetiawan thank you very much for investigating this PR and testing it out on a large number of data files! I am glad to hear that memory consumption stayed steady and that the runtime for combining the files was small. As @lsetiawan mentioned, I have also tested this PR out on Hake data.
This PR addresses #822. This is done by first creating a mapping between the uniform chunks of the combined array and the initial starting chunks contributed by each file. A Dask Lock is then assigned to each write of a starting chunk, so that no two starting chunks are written to the same uniform chunk at the same time. To illustrate the approach, consider the following simplified example:
Say we have three files, each with the variable `back_r`, that contain the following values (these would be the starting chunks; see the sketch below for an assumed assignment). We then want to combine all of these `back_r` variables into `back_r_combined = [0, 1, 2, 3, 4, 5, 6, 7]` with a uniform chunk size of 2. The chunks would then be as follows:
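For concreteness, here is one assignment of values per file (an assumption, but consistent with the discussion of chunk 2 below) and the uniform chunks it produces:

```python
# Starting chunks: one list per file (values assumed for illustration)
file_1_back_r = [0, 1, 2]
file_2_back_r = [3, 4, 5]
file_3_back_r = [6, 7]

# back_r_combined = [0, 1, 2, 3, 4, 5, 6, 7], uniform chunk size 2.
# With chunks numbered 1 through 4:
#   chunk 1 -> [0, 1]  (written only by file 1)
#   chunk 2 -> [2, 3]  (written by file 1 AND file 2)
#   chunk 3 -> [4, 5]  (written only by file 2)
#   chunk 4 -> [6, 7]  (written only by file 3)
```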
For all chunks besides chunk 2, the writes can safely proceed in parallel. However, chunk 2 contains data from both file 1 and file 2, so two different processes would attempt to write to chunk 2 at the same time and data corruption would likely occur. To remedy this, we assign a lock name to each write to a uniform chunk, so that writes touching the same uniform chunk share a lock name. We then use Dask Locks with the established lock names to prevent two processes from writing to the same chunk at the same time (see the sketch below).
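Below is a minimal, self-contained sketch of this locking scheme using dask.distributed.Lock and a zarr store. The file values, the lock-name format, and the helper names are illustrative assumptions, not the exact implementation in zarr_combine.py:

```python
import dask
import numpy as np
import zarr
from dask.distributed import Client, Lock

# Starting chunks: one array per file (values assumed for illustration)
file_arrays = [np.array([0, 1, 2]), np.array([3, 4, 5]), np.array([6, 7])]
UNIFORM_CHUNK = 2  # uniform chunk size of the combined zarr array


@dask.delayed
def write_region(store, values, start, lock_names):
    """Write one file's values into the combined array, holding a lock
    for every uniform chunk that this region touches."""
    combined = zarr.open(store, mode="r+")
    # Acquire locks in a consistent (sorted) order to avoid deadlocks
    locks = [Lock(name) for name in sorted(lock_names)]
    for lock in locks:
        lock.acquire()
    try:
        combined[start : start + len(values)] = values
    finally:
        for lock in locks:
            lock.release()


if __name__ == "__main__":
    client = Client()
    store = "back_r_combined.zarr"
    total = sum(len(arr) for arr in file_arrays)
    zarr.open(store, mode="w", shape=(total,), chunks=(UNIFORM_CHUNK,), dtype="i8")

    tasks, start = [], 0
    for values in file_arrays:
        # One lock name per uniform chunk the region overlaps (numbered from 1
        # to match the chunk numbering above); two regions overlapping the
        # same chunk, e.g. "back_r_chunk_2", end up sharing a lock.
        first = start // UNIFORM_CHUNK + 1
        last = (start + len(values) - 1) // UNIFORM_CHUNK + 1
        names = [f"back_r_chunk_{i}" for i in range(first, last + 1)]
        tasks.append(write_region(store, values, start, names))
        start += len(values)

    dask.compute(*tasks)
    print(zarr.open(store, mode="r")[:])  # [0 1 2 3 4 5 6 7]
```

Only the writes that share a lock name (here, the two writes touching chunk 2) serialize; all other regions still write fully in parallel.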