-
Notifications
You must be signed in to change notification settings - Fork 199
Fix flaky schema evolution test using explicit timestamp ordering #5689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| std::string buffer; | ||
| std::vector<uint64_t> offsets(10); | ||
| Array array_r(ctx, array_uri, TILEDB_READ); | ||
| TemporalPolicy temporal_latest(TimeTravel, initial_ts + 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TimeTravel represents opening at a single point in time
This is a change in the semantics of opening the array, i.e. from opening all of the fragments to just opening at a single point in time. I am concerned that this changes the intent of the test and risks masking a problem which is underneath. The original test probably wanted to check that the most recently evolved schema is used, right? This no longer does that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. I am aware of this, and I realize now that I should have provided a more detailed explanation in the PR description. I focused on fixing the issue rather than explaining the cause of the test failure.
Before that, I would like to mention that this test is mixing timestamp and non-timestamp operations, which I believe is generally best to avoid. IMO, this alone is a good enough reason to move forward with merging this PR.
The test in this PR is the timestamped version. Of course, there can also be a non-timestamped version.
Explanation:
Let's start with fragment_write_ts, which is only used at the end of the test to read back the original array. I am not sure why it is defined as fragment_write_ts = tiledb_timestamp_now_ms() + 1 instead of simply fragment_write_ts = tiledb_timestamp_now_ms(). The 100 ms sleep right after this ensures that the array will not change during that interval. Otherwise, the first schema evolution might happen at that tiledb_timestamp_now_ms() + 1, and we would end up reading the array after that change at the end of the test.
After at least the initial 100 + 1 ms, we perform the first schema evolution, followed by another 100 ms sleep. Both the first and second schema evolutions use now = tiledb_timestamp_now_ms() + 1;. Considering that two consecutive calls to tiledb_timestamp_now_ms() may return the same timestamp, the 100 ms sleep helps ensure these operations occur at different timestamps. However, there is no sleep after the second schema evolution. This means that the array open can occur before the second schema evolution. I believe this is the reason for the failures. If a 100 ms sleep is added there as well, everything appears to behave correctly - at least in my local tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm personally ok with going ahead with the solution of adding an extra sleep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay it took me a bit but I see what you are saying now, thank you!
bekadavis9
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
iirc, this was essentially the intent with the original test; I forgot TemporalPolicy starts at ts=0 by default rather than now().
Thanks!
ypatia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR fixes
test_schema_evolution_drop_fixed_add_varflaky test where multiple calls totiledb_timestamp_now_ms()within the same millisecond could cause timestamp collisions, leading to incorrect schema selection during time-travel reads. The fix uses a single base timestamp with consistent offsets and explicitly specifies timestamps viaTemporalPolicyfor all array reads to ensure deterministic schema loading.Below is the failure:
Closes CORE-49
TYPE: IMPROVEMENT
DESC: Fix flaky schema evolution test using explicit timestamp ordering.