Skip to content

Conversation

@kounelisagis
Copy link
Member

This PR fixes test_schema_evolution_drop_fixed_add_var flaky test where multiple calls to tiledb_timestamp_now_ms() within the same millisecond could cause timestamp collisions, leading to incorrect schema selection during time-travel reads. The fix uses a single base timestamp with consistent offsets and explicitly specifies timestamps via TemporalPolicy for all array reads to ensure deterministic schema loading.

Below is the failure:

-------------------------------------------------------------------------------
C++ API: SchemaEvolution, drop fixed attribute and add back as var-sized
-------------------------------------------------------------------------------
/Users/agis/TileDB_workspace/TileDB/test/src/unit-cppapi-schema-evolution.cc:1163
...............................................................................

/Users/agis/TileDB_workspace/TileDB/test/support/src/vfs_helpers.cc:147: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
  C API: There is no field a

Closes CORE-49


TYPE: IMPROVEMENT
DESC: Fix flaky schema evolution test using explicit timestamp ordering.

std::string buffer;
std::vector<uint64_t> offsets(10);
Array array_r(ctx, array_uri, TILEDB_READ);
TemporalPolicy temporal_latest(TimeTravel, initial_ts + 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TimeTravel represents opening at a single point in time

This is a change in the semantics of opening the array, i.e. from opening all of the fragments to just opening at a single point in time. I am concerned that this changes the intent of the test and risks masking a problem which is underneath. The original test probably wanted to check that the most recently evolved schema is used, right? This no longer does that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. I am aware of this, and I realize now that I should have provided a more detailed explanation in the PR description. I focused on fixing the issue rather than explaining the cause of the test failure.

Before that, I would like to mention that this test is mixing timestamp and non-timestamp operations, which I believe is generally best to avoid. IMO, this alone is a good enough reason to move forward with merging this PR.

The test in this PR is the timestamped version. Of course, there can also be a non-timestamped version.


Explanation:

Let's start with fragment_write_ts, which is only used at the end of the test to read back the original array. I am not sure why it is defined as fragment_write_ts = tiledb_timestamp_now_ms() + 1 instead of simply fragment_write_ts = tiledb_timestamp_now_ms(). The 100 ms sleep right after this ensures that the array will not change during that interval. Otherwise, the first schema evolution might happen at that tiledb_timestamp_now_ms() + 1, and we would end up reading the array after that change at the end of the test.

After at least the initial 100 + 1 ms, we perform the first schema evolution, followed by another 100 ms sleep. Both the first and second schema evolutions use now = tiledb_timestamp_now_ms() + 1;. Considering that two consecutive calls to tiledb_timestamp_now_ms() may return the same timestamp, the 100 ms sleep helps ensure these operations occur at different timestamps. However, there is no sleep after the second schema evolution. This means that the array open can occur before the second schema evolution. I believe this is the reason for the failures. If a 100 ms sleep is added there as well, everything appears to behave correctly - at least in my local tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm personally ok with going ahead with the solution of adding an extra sleep.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay it took me a bit but I see what you are saying now, thank you!

Copy link
Contributor

@bekadavis9 bekadavis9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
iirc, this was essentially the intent with the original test; I forgot TemporalPolicy starts at ts=0 by default rather than now().
Thanks!

Copy link
Member

@ypatia ypatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kounelisagis kounelisagis merged commit 4aad2c8 into main Nov 19, 2025
56 checks passed
@kounelisagis kounelisagis deleted the agis/flaky-schema-evolution-test branch November 19, 2025 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants