-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance regression when requesting a timestamp before the earliest version #2076
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main thing to change is the LOAD_ALL_UNDELETED
regression + testing.
Others are some small refactors.
Also please update the PR description to describe better what are the core changes (i.e. the changes to the version map entry) and the behavior changes.
And I think you might need to rebase the PR to get the ASV fixes. I would really like to see the logs from ASV saying we have improved the benchmark from_epoch
@@ -16,6 +16,8 @@ | |||
#include <deque> | |||
#include <vector> | |||
|
|||
#include "async/tasks.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this. Maybe remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we missed this?
@@ -241,7 +244,6 @@ struct VersionMapEntry { | |||
tombstone_all_.reset(); | |||
keys_.clear(); | |||
load_progress_ = LoadProgress{}; | |||
load_strategy_ = LoadStrategy{LoadType::NOT_LOADED, LoadObjective::INCLUDE_DELETED}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice that we remove the load_strategy from the entry. Could you explain in the PR description why it's now obsolete? Thanks!
const std::shared_ptr<VersionMap>& version_map, | ||
const StreamId& id, | ||
size_t number_of_versions, | ||
std::vector<VersionId> tombstones_versions = {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I like this refactor.
Currently this only allows us to specify a version chain where we only ever delete the latest version and can only have one tombstone all. It would be nice to also be able to test version chains like e.g.:
v0 <- v1 <- v2 <- t0 <- v3 <- tall2
To do this we can instead of the number_of_versions
, tombstone_versions
and tombstone_all_version
have one argument:
std::vector<VersionChainOperation> operations
where an VersionChainOperation
is just a struct of an enum {WRITE, TOMBSTONE, TOMBSTONE_ALL}
and a version_id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea, will do it
@@ -604,7 +617,7 @@ TEST(VersionMap, FollowingVersionChain){ | |||
auto store = std::make_shared<InMemoryStore>(); | |||
auto version_map = std::make_shared<VersionMap>(); | |||
StreamId id{"test"}; | |||
write_alternating_deleted_undeleted(store, version_map, id); | |||
write_versions(store, version_map, id, 3, {2}, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we use the write_alternating_deleted_undeleted
in a few places.
To avoid chainging all these usages we can keep the write_alternating_deleted_undeleted
and just define it as one call of the new write_versions
.
I don't have a strong opinion but I think we should do one of:
- delete
write_alternating_deleted_undeleted
- keep
write_alternating_deleted_undeleted
but make it use our newwrite_versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also maybe if we update to the slightly longer write_versions
call it will be useful to keep around the write_two_versions
to not have to write in many places {{Write, 0}, {Write, 1}}
// FROM_TIME UNDELETED_ONLY should no longer be cached even though we used the same request before because the undeleted key it went to got deleted. So it will load the entire version chain | ||
check_loads_versions(LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(10)}, 3, 0); | ||
// We have the full version chain loaded, so has_cached_entry should always return true (even when requesting timestamp before earliest version) | ||
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(-1)})); | ||
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::INCLUDE_DELETED, static_cast<timestamp>(-1)})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe one more assertion to showcase that it even works with e.g. LoadType::ALL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we missed this one?
I mean something like:
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::ALL, LoadObjective::INCLUDE_DELETED}));
check_all_caching({load_all_undeleted_param}, should_load_to_v[i], i != 0); | ||
|
||
// If we have loaded to version 0 all load requests are true | ||
check_all_caching(should_load_to_v[i], {load_all_param, load_all_undeleted_param}, i == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make check_all_caching(should_load_to_v[1], {load_all_undeleted_param}, true)
pass?
It will need the comment about LOAD_ALL
UNDELETED_ONLY
caching changed to pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually even better, with your changes we can remove the separate testing for load_all
and load_all_undeleted
.
Whu don't just place the LOAD_ALL
in should_load_to_v[0]
and LOAD_ALL_UNDELETED
in should_load_to_v[1]
?
@@ -606,14 +608,9 @@ class VersionMapImpl { | |||
return cached_timestamp <= requested_load_strategy.load_from_time_.value(); | |||
} | |||
case LoadType::ALL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't think about this before but currently if we do: {LoadType::ALL, UNDELETED_ONLY}
repeatedly it would never get cached.
That is because if we have the following version chain:
v0 <- v1 <- tall1 <- v2
the load_all_undeleted would reach the tall0 and terminate there.
I think we already have the information we need. I.e. if earliest_loaded_version <= tombstone_all.value()
then we know that we have all undeleted versions.
It would also be good to test this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch! Didn't think about that case
…v/performance_timestamp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more very minor comments. Otherwise all looks good! Thanks! Approved
break; | ||
} | ||
case VersionChainOperation::Type::TOMBSTONE_ALL: { | ||
version_map->delete_all_versions(store, id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Theoretically we can still have a tombstone_all to a version_id different than the last.
You can instead use the write_tombstone_all_internal
+ write_symbol_ref
to do that.
If we decide against that it would be good to document that VersionChainOperation{TOMBSTONE_ALL, 3}
might not do what we expect.
write_symbol_ref(store, key3, std::nullopt, entry->head_.value()); | ||
using Type = VersionChainOperation::Type; | ||
write_versions(store, version_map, id, {{Type::WRITE, 0}, | ||
{Type::TOMBSTONE_ALL}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: These extra whitespaces seem off. Maybe align all the operations the same number of spaces to the right.
Hopefully @poodlewars's formatting will get merged and we can all use the same formatting throughout.
// FROM_TIME UNDELETED_ONLY should no longer be cached even though we used the same request before because the undeleted key it went to got deleted. So it will load the entire version chain | ||
check_loads_versions(LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(10)}, 3, 0); | ||
// We have the full version chain loaded, so has_cached_entry should always return true (even when requesting timestamp before earliest version) | ||
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(-1)})); | ||
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::INCLUDE_DELETED, static_cast<timestamp>(-1)})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we missed this one?
I mean something like:
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::ALL, LoadObjective::INCLUDE_DELETED}));
using Type = VersionChainOperation::Type; | ||
std::vector<VersionChainOperation> version_chain = {{Type::WRITE, 0}, | ||
{Type::WRITE, 1}, | ||
{Type::TOMBSTONE_ALL}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: some more weird whitespaces
@@ -16,6 +16,8 @@ | |||
#include <deque> | |||
#include <vector> | |||
|
|||
#include "async/tasks.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we missed this?
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(1)})); | ||
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(0)})); | ||
} | ||
// Tombstone All should not invalidate cache as it deletes everything so all undeleted versions have been loaded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something I caught when testing a change related to this case #2076 (comment) . Previously writing tombstone_all would cause any subsequent UNDELETED_ONLY lookups in cache to return false. However, I think this was wrong because writing tombstone_all deletes all previous versions so it is impossible for an undeleted version to exist in the version chain and not in the cache. Thus all undeleted versions have been loaded in cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add tests that check_reload
gives the correct result for these lookups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests where the tombstone_all version ID is not the latest version?
Please tidy up the git commit history |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a benchmark to demonstrate the performance improvement?
Is this functionality covered by an end to end (Python) test?
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(1)})); | ||
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(0)})); | ||
} | ||
// Tombstone All should not invalidate cache as it deletes everything so all undeleted versions have been loaded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add tests that check_reload
gives the correct result for these lookups?
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(1)})); | ||
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(0)})); | ||
} | ||
// Tombstone All should not invalidate cache as it deletes everything so all undeleted versions have been loaded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests where the tombstone_all version ID is not the latest version?
Will squash before merging |
timeout = sys.maxsize if use_caching else 0 | ||
lib = basic_store | ||
symbol = "test" | ||
dataframes = [sample_dataframe()] * 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way all dataframes are equal (in fact maybe pointing to the same dataframe object). It would be useful for your tests where you compare the dataframes to make them different.
Instead use something like:
[sample_dataframe() for _ in range(10)]
with pytest.raises(NoSuchVersionException): | ||
lib.read(symbol, as_of=timestamps[i].before) | ||
else: | ||
assert_equal(lib.read(symbol, as_of=timestamps[i].before).data, dataframes[i-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example where the same dataframes bites us. If we request as_of=timestamps[4].before
we should get dataframes[2]
because 3 is deleted.
Reference Issues/PRs
Fixes #1772
What does this implement or fix?
Version Map entry now stores whether the whole chain has been loaded, so we don't need to reload again in certain scenarios
load_strategy_ is removed from version_map_entry and replaced with a flag is_earliest_version_loaded in LoadProgress as we only need to know whether the entire version chain has been loaded previously. Previously we only kept whether the cache has been loaded with an ALL strategy but that missed the cases where we load via DOWN_TO 0 or FROM_TIME that is before earliest
if we haven't loaded the entire version chain, we additionally check if we have loaded all undeleted versions. This is in cases where we have tombstone_all on a certain version and an undeleted version is requested. See Performance regression when requesting a timestamp before the earliest version #2076 (comment)
Performance has improved
Any other comments?
Checklist
Checklist for code changes...