Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault in flecs_get_flattened_target #965

Closed
Tisten opened this issue Apr 20, 2023 · 7 comments
Closed

Segmentation fault in flecs_get_flattened_target #965

Tisten opened this issue Apr 20, 2023 · 7 comments
Labels
bug Something isn't working needs-repro

Comments

@Tisten
Copy link

Tisten commented Apr 20, 2023

Describe the problem you are trying to solve.
I have a crash in flecs' internals which is deterministic and would like to provide a repro for it without sending a whole game engine and the whole game data. This problem is with stripping down the data into only the things which matters for flecs, like the registered component ids and their sizes, the queries provided by each system, and the created entities and their components and relationships.

I've seen that there is "shapshots" which capture some of this, but I wouldn't get the registered components and the queries for the systems.

Describe the solution you'd like
Advice on how to best gather the data for a reproduction case. Something which makes it easy to create tests which use data which is more involved than what a small unit test normally consist of.

@Tisten Tisten added the enhancement New feature or request label Apr 20, 2023
@SanderMertens
Copy link
Owner

You can try using the FLECS_JOURNAL addon (not enabled by defauit). When journaling is compiled in and tracing is enabled, it will trace most ECS operations, which can help figure out what's happening in order to reproduce the crash.

Other than that, my approach is to strip down the application (remove systems / modules / ..) until the issue no longer reproduces. It can take a while, but is useful, as it helps to prove the issue is in Flecs, as opposed to (for example) random memory corruption.

@Tisten
Copy link
Author

Tisten commented Apr 20, 2023

Thanks, I will try that. The issue happens after using the flatten function, if I skip that then it works fine. Will divide and conquer until it I can provide more info.

@Tisten
Copy link
Author

Tisten commented Apr 21, 2023

Thank you Sander, the journal feature helped me understand a timing problem with our initialization, and revealed that we are getting callbacks to a system where the query isn't matched, so there shouldn't have been any callback and even though the data provided is full of null pointers, those are correct. The root problem wasn't that flecs provided bad data, it was that a callback was triggered even though it shouldn't. I have not been able to write a test for the issue, but I have been able to resolve this issue by handling our initialization in a better way so it doesn't happen.

But that issue which I fixed wasn't the crash in the internals. I'm currently thinking that the crash might be come from that we are adding and removing tags and components on entities which have been flattened, and that this might cause issues? I'm not getting any asserts on it, I'm just getting

Exception thrown: read access violation.
**r** was 0xFFFFFFFFFFFFFFF7.

Here:

static
ecs_entity_t flecs_get_flattened_target(
    ecs_world_t *world,
    EcsTarget *cur,
    ecs_entity_t rel,
    ecs_id_t id)
{
    ecs_id_record_t *idr = flecs_id_record_get(world, id);
    if (!idr) {
        return 0;
    }

    ecs_record_t *r = cur->target;
    ecs_assert(r != NULL, ECS_INTERNAL_ERROR, NULL);

--> ecs_table_t *table = r->table; <--
    if (!table) {
        return 0;
    }

Callstack

flecs_get_flattened_target(ecs_world_t * world, EcsTarget * cur, unsigned __int64 rel, unsigned __int64 id) Line 340	C
flecs_entity_filter_next(ecs_entity_filter_iter_t * it) Line 564	C
ecs_query_populate(ecs_iter_t * it) Line 2319	C
ecs_query_next_instanced(ecs_iter_t * it) Line 2385	C
ecs_query_next(ecs_iter_t * it) Line 2352	C
ecs_run_intern(ecs_world_t * world, ecs_stage_t * stage, unsigned __int64 system, ecs_system_t * system_data, int stage_index, int stage_count, float delta_time, int offset, int limit, void * param) Line 116	C
flecs_run_pipeline(ecs_world_t * world, ecs_pipeline_state_t * pq, float delta_time) Line 584	C
flecs_workers_progress(ecs_world_t * world, ecs_pipeline_state_t * pq, float delta_time) Line 358	C
ecs_progress(ecs_world_t * world, float user_delta_time) Line 687	C

The EcsTarget *cur is broken.

In flecs_entity_filter_next() I see that EcsTarget *ft has {count=0x00000001 target=0x00000180103598e0} and ft_offset is 0x2d.

No entities have been added or removed in the flattened subtree, only tags and components, so if the documentation at https://www.flecs.dev/flecs/group__entity__info.html#gaa15f32a7016cfe84291207e209ba7a1a is true then we are not doing anything which is forbidden.

I will continue to debug a bit more.

@Tisten
Copy link
Author

Tisten commented Apr 21, 2023

I switched to debug instead which enabled asserts in flecs and now got this assert:

    ecs_assert(gen == (index & (0xFFFFFFFFull << 32)),
        ECS_INVALID_PARAMETER, NULL);

gen is 0x0000011b00000000
index is 0x8000011b00000299
So the topmost bit is set. This index is an entity sent to flecs_journal_begin(), called from ecs_delete_with() called from flecs_flatten() as an ecs_pair(rel, entities[i]) where the topmost bit is set by adding the ECS_PAIR bit.

Full callstack

flecs_sparse_strip_generation(unsigned __int64 * index_out) Line 156	C
flecs_sparse_is_alive(const ecs_sparse_t * sparse, unsigned __int64 index) Line 649	C
ecs_is_alive(const ecs_world_t * world, unsigned __int64 entity) Line 3773	C
flecs_path_append(const ecs_world_t * world, unsigned __int64 parent, unsigned __int64 child, const char * sep, const char * prefix, ecs_strbuf_t * buf) Line 27	C
ecs_get_path_w_sep_buf(const ecs_world_t * world, unsigned __int64 parent, unsigned __int64 child, const char * sep, const char * prefix, ecs_strbuf_t * buf) Line 270	C
ecs_get_path_w_sep(const ecs_world_t * world, unsigned __int64 parent, unsigned __int64 child, const char * sep, const char * prefix) Line 287	C
flecs_journal_begin(ecs_world_t * world, ecs_journal_kind_t kind, unsigned __int64 entity, ecs_type_t * add, ecs_type_t * remove) Line 70	C
ecs_delete_with(ecs_world_t * world, unsigned __int64 id) Line 2430	C
flecs_flatten(ecs_world_t * world, unsigned __int64 root, unsigned __int64 pair, int depth, const ecs_flatten_desc_t * desc) Line 3543	C
flecs_flatten(ecs_world_t * world, unsigned __int64 root, unsigned __int64 pair, int depth, const ecs_flatten_desc_t * desc) Line 3483	C
flecs_flatten(ecs_world_t * world, unsigned __int64 root, unsigned __int64 pair, int depth, const ecs_flatten_desc_t * desc) Line 3483	C
flecs_flatten(ecs_world_t * world, unsigned __int64 root, unsigned __int64 pair, int depth, const ecs_flatten_desc_t * desc) Line 3483	C
flecs_flatten(ecs_world_t * world, unsigned __int64 root, unsigned __int64 pair, int depth, const ecs_flatten_desc_t * desc) Line 3483	C
ecs_flatten(ecs_world_t * world, unsigned __int64 pair, const ecs_flatten_desc_t * desc) Line 3586	C

@Tisten
Copy link
Author

Tisten commented Apr 21, 2023

Can the journal be enabled in the tests, and be run in debug? If so, how do I do that?
Edit: I managed to add journaling by hardcoding the define and the log level. Debug was set from command line. Running it takes forever since my logging goes to stdout though, would need to find a way to log to file instead to make it more useful.

Edit2: The api tests in debug with journal active finished in 40 minutes with PASS:2321, FAIL: 82, EMPTY: 1 (api.all) but I have no idea which tests failed since the log scrollback is far to short.

Edit3: Sending stdout to a file meant I only got stderr and that says sparse.c: 157: assert: gen == (index & (0xFFFFFFFFull << 32)) INVALID_PARAMETER 82 times. The log file now provided the info: 3 failures from Hierarchies, 62 from FixedHierarchies, 9 from OnDelete, 4 from Query and 4 from Observer

@SanderMertens
Copy link
Owner

Looks like there are some issues with the journaling addon. I could look into those but it's not high on my priority list since journaling is only used for debugging.

I would like to get to the bottom of the crash in the flattening code. All of the flattening tests are running in CI with address sanitizer enabled and they're passing, so the scenario under which this occurs might be hard to find without more context. I'll run a couple of tests with valgrind and see if I can find anything.

@SanderMertens SanderMertens added bug Something isn't working needs-repro and removed enhancement New feature or request labels Apr 24, 2023
@SanderMertens SanderMertens changed the title Best way to generate a repro case? Segmentation fault in flecs_get_flattened_target Apr 24, 2023
@SanderMertens
Copy link
Owner

Closing this issue as I don't know the scenario to reproduce and valgrind/sanitizer logs didn't show anything for the tests. Feel free to reopen this issue if you find a reproducer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs-repro
Projects
None yet
Development

No branches or pull requests

2 participants