-
-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a crash caused by malformed data #4850
Conversation
This PR - fixes an abort of the user's program (issue HDFGroupGH-4435) because the malformed data caused a very large size of memory to be allocated. - removes memory leaks causing the free-list code to abort in addition to the reported issue. NF provided helpful code to the fix.
Can we add a reproduction script to the tests? |
@@ -594,8 +594,11 @@ H5C__flush_single_entry(H5F_t *f, H5C_cache_entry_t *entry_ptr, unsigned flags) | |||
else | |||
mem_type = entry_ptr->type->mem_type; | |||
|
|||
if (H5F_block_write(f, mem_type, entry_ptr->addr, entry_ptr->size, entry_ptr->image_ptr) < 0) | |||
if (H5F_block_write(f, mem_type, entry_ptr->addr, entry_ptr->size, entry_ptr->image_ptr) < 0) { | |||
H5C__REMOVE_ENTRY_FROM_SLIST(cache_ptr, entry_ptr, during_flush, FAIL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like something that should be handled by the caller rather than the function itself, as the caller should determine whether or not to remove the entry from the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are likely other things to be considered here, such as:
- the entry may also need to be removed from the collective list when operating in parallel
- the entry may also need to go through the rest of the
destroy
sequence even on write failure, unless it should also be left up to the caller to handle that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also - how is this change to writing cache entries related to the crash on reading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The crash happened during the shutdown process. When some memory was not freed, causing failure because the factory list still has allocated objects, and processing the failure crashed:
H5Eint.c:1464: H5E__push_stack: Assertion `cls_id > 0' failed.
This shows how the writing cache entries got involved:
...
#13: ../../src/H5Fint.c line 1538 in H5F__dest(): problems closing file
major: File accessibility
minor: Unable to release object
#14: ../../src/H5AC.c line 477 in H5AC_dest(): can't destroy cache
major: Object cache
minor: Unable to free object
#15: ../../src/H5C.c line 546 in H5C_dest(): disable slist on flush dest failure failed
major: Object cache
minor: Internal error detected
#16: ../../src/H5C.c line 1115 in H5C_set_slist_enabled(): slist not empty?
major: Object cache
minor: Internal error detected
#17: ../../src/H5C.c line 502 in H5C_dest(): unable to flush cache
major: Object cache
minor: Unable to flush data from cache
#18: ../../src/H5Cint.c line 1052 in H5C__flush_invalidate_cache(): flush invalidate ring failed
major: Object cache
minor: Unable to flush data from cache
#19: ../../src/H5Cint.c line 1349 in H5C__flush_invalidate_ring(): dirty entry flush destroy failed
major: Object cache
minor: Unable to flush data from cache
#20: ../../src/H5Centry.c line 598 in H5C__flush_single_entry(): Can't write image to file
major: Object cache
minor: Unable to flush data from cache
#21: ../../src/H5Fio.c line 219 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#22: ../../src/H5PB.c line 1001 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#23: ../../src/H5Faccum.c line 633 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#24: ../../src/H5FDint.c line 312 in H5FD_write(): addr overflow, addr = 4294967295, size=16, eoa=2848
major: Invalid arguments to routine
minor: Address overflowed
H5Fclose failed
I'd appreciate any suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are likely other things to be considered here, such as:
- the entry may also need to be removed from the collective list when operating in parallel
- the entry may also need to go through the rest of the
destroy
sequence even on write failure, unless it should also be left up to the caller to handle that
I'm looking into these suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my suggestion is to find out what is happening when the read occurs that causes memory to be corrupted (which is almost certainly why the eventual crash occurs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The corrupted data was acquired in H5HL__hdr_deserialize() at:
H5_DECODE_LENGTH_LEN(image, heap->dblk_size, udata->sizeof_size);
H5_DECODE_LENGTH_LEN(image, heap->free_block, udata->sizeof_size);
during the call to H5Gcreate1(). Both heap->dblk_size and heap->free_block are huge. I put in the fix (NF's help) to catch that but during the shutdown process, memory leaks caused subsequent crashes, hence, the need to close those.
Although, at this point, I just realized I missed something in the PR, let me take care of that while you're thinking. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you certain that you've reached the root cause for this failure?
After reviewing the example code in the bug report, it definitely seems like something that should be detected at read time. |
I want to take more time to do this right. |
The bad file will be added to the cve_hdf5 repo. |
This is outdated now, I'll start a fresh PR when I'm ready. |
This PR
@fortnern provided helpful code to the fix.