Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression when requesting a timestamp before the earliest version #2076

Merged
merged 14 commits into from
Jan 15, 2025

Conversation

ognyanstoimenov
Copy link
Collaborator

@ognyanstoimenov ognyanstoimenov commented Dec 17, 2024

Reference Issues/PRs

Fixes #1772

What does this implement or fix?

Version Map entry now stores whether the whole chain has been loaded, so we don't need to reload again in certain scenarios

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

Copy link
Collaborator

@IvoDD IvoDD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main thing to change is the LOAD_ALL_UNDELETED regression + testing.

Others are some small refactors.

Also please update the PR description to describe better what are the core changes (i.e. the changes to the version map entry) and the behavior changes.

And I think you might need to rebase the PR to get the ASV fixes. I would really like to see the logs from ASV saying we have improved the benchmark from_epoch

@@ -16,6 +16,8 @@
#include <deque>
#include <vector>

#include "async/tasks.hpp"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this. Maybe remove?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we missed this?

@@ -241,7 +244,6 @@ struct VersionMapEntry {
tombstone_all_.reset();
keys_.clear();
load_progress_ = LoadProgress{};
load_strategy_ = LoadStrategy{LoadType::NOT_LOADED, LoadObjective::INCLUDE_DELETED};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that we remove the load_strategy from the entry. Could you explain in the PR description why it's now obsolete? Thanks!

const std::shared_ptr<VersionMap>& version_map,
const StreamId& id,
size_t number_of_versions,
std::vector<VersionId> tombstones_versions = {},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I like this refactor.

Currently this only allows us to specify a version chain where we only ever delete the latest version and can only have one tombstone all. It would be nice to also be able to test version chains like e.g.:

v0 <- v1 <- v2 <- t0 <- v3 <- tall2

To do this we can instead of the number_of_versions, tombstone_versions and tombstone_all_version have one argument:
std::vector<VersionChainOperation> operations
where an VersionChainOperation is just a struct of an enum {WRITE, TOMBSTONE, TOMBSTONE_ALL} and a version_id.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, will do it

@@ -604,7 +617,7 @@ TEST(VersionMap, FollowingVersionChain){
auto store = std::make_shared<InMemoryStore>();
auto version_map = std::make_shared<VersionMap>();
StreamId id{"test"};
write_alternating_deleted_undeleted(store, version_map, id);
write_versions(store, version_map, id, 3, {2}, 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we use the write_alternating_deleted_undeleted in a few places.

To avoid chainging all these usages we can keep the write_alternating_deleted_undeleted and just define it as one call of the new write_versions.

I don't have a strong opinion but I think we should do one of:

  • delete write_alternating_deleted_undeleted
  • keep write_alternating_deleted_undeleted but make it use our new write_versions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe if we update to the slightly longer write_versions call it will be useful to keep around the write_two_versions to not have to write in many places {{Write, 0}, {Write, 1}}

// FROM_TIME UNDELETED_ONLY should no longer be cached even though we used the same request before because the undeleted key it went to got deleted. So it will load the entire version chain
check_loads_versions(LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(10)}, 3, 0);
// We have the full version chain loaded, so has_cached_entry should always return true (even when requesting timestamp before earliest version)
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(-1)}));
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::INCLUDE_DELETED, static_cast<timestamp>(-1)}));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe one more assertion to showcase that it even works with e.g. LoadType::ALL?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we missed this one?
I mean something like:

EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::ALL, LoadObjective::INCLUDE_DELETED}));

check_all_caching({load_all_undeleted_param}, should_load_to_v[i], i != 0);

// If we have loaded to version 0 all load requests are true
check_all_caching(should_load_to_v[i], {load_all_param, load_all_undeleted_param}, i == 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make check_all_caching(should_load_to_v[1], {load_all_undeleted_param}, true) pass?

It will need the comment about LOAD_ALL UNDELETED_ONLY caching changed to pass.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually even better, with your changes we can remove the separate testing for load_all and load_all_undeleted.

Whu don't just place the LOAD_ALL in should_load_to_v[0]
and LOAD_ALL_UNDELETED in should_load_to_v[1]?

@@ -606,14 +608,9 @@ class VersionMapImpl {
return cached_timestamp <= requested_load_strategy.load_from_time_.value();
}
case LoadType::ALL:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't think about this before but currently if we do: {LoadType::ALL, UNDELETED_ONLY} repeatedly it would never get cached.

That is because if we have the following version chain:

v0 <- v1 <- tall1 <- v2

the load_all_undeleted would reach the tall0 and terminate there.

I think we already have the information we need. I.e. if earliest_loaded_version <= tombstone_all.value() then we know that we have all undeleted versions.

It would also be good to test this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch! Didn't think about that case

Copy link
Collaborator

@IvoDD IvoDD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more very minor comments. Otherwise all looks good! Thanks! Approved

break;
}
case VersionChainOperation::Type::TOMBSTONE_ALL: {
version_map->delete_all_versions(store, id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Theoretically we can still have a tombstone_all to a version_id different than the last.
You can instead use the write_tombstone_all_internal + write_symbol_ref to do that.

If we decide against that it would be good to document that VersionChainOperation{TOMBSTONE_ALL, 3} might not do what we expect.

write_symbol_ref(store, key3, std::nullopt, entry->head_.value());
using Type = VersionChainOperation::Type;
write_versions(store, version_map, id, {{Type::WRITE, 0},
{Type::TOMBSTONE_ALL},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: These extra whitespaces seem off. Maybe align all the operations the same number of spaces to the right.

Hopefully @poodlewars's formatting will get merged and we can all use the same formatting throughout.

// FROM_TIME UNDELETED_ONLY should no longer be cached even though we used the same request before because the undeleted key it went to got deleted. So it will load the entire version chain
check_loads_versions(LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(10)}, 3, 0);
// We have the full version chain loaded, so has_cached_entry should always return true (even when requesting timestamp before earliest version)
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(-1)}));
EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::INCLUDE_DELETED, static_cast<timestamp>(-1)}));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we missed this one?
I mean something like:

EXPECT_TRUE(version_map->has_cached_entry(id, LoadStrategy{LoadType::ALL, LoadObjective::INCLUDE_DELETED}));

using Type = VersionChainOperation::Type;
std::vector<VersionChainOperation> version_chain = {{Type::WRITE, 0},
{Type::WRITE, 1},
{Type::TOMBSTONE_ALL},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: some more weird whitespaces

@@ -16,6 +16,8 @@
#include <deque>
#include <vector>

#include "async/tasks.hpp"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we missed this?

ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(1)}));
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(0)}));
}
// Tombstone All should not invalidate cache as it deletes everything so all undeleted versions have been loaded.
Copy link
Collaborator Author

@ognyanstoimenov ognyanstoimenov Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something I caught when testing a change related to this case #2076 (comment) . Previously writing tombstone_all would cause any subsequent UNDELETED_ONLY lookups in cache to return false. However, I think this was wrong because writing tombstone_all deletes all previous versions so it is impossible for an undeleted version to exist in the version chain and not in the cache. Thus all undeleted versions have been loaded in cache.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add tests that check_reload gives the correct result for these lookups?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests where the tombstone_all version ID is not the latest version?

@poodlewars
Copy link
Collaborator

Please tidy up the git commit history

Copy link
Collaborator

@poodlewars poodlewars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a benchmark to demonstrate the performance improvement?
Is this functionality covered by an end to end (Python) test?

ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(1)}));
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(0)}));
}
// Tombstone All should not invalidate cache as it deletes everything so all undeleted versions have been loaded.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add tests that check_reload gives the correct result for these lookups?

ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(1)}));
ASSERT_FALSE(version_map->has_cached_entry(id, LoadStrategy{LoadType::FROM_TIME, LoadObjective::UNDELETED_ONLY, static_cast<timestamp>(0)}));
}
// Tombstone All should not invalidate cache as it deletes everything so all undeleted versions have been loaded.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests where the tombstone_all version ID is not the latest version?

@ognyanstoimenov
Copy link
Collaborator Author

Please tidy up the git commit history

Will squash before merging

timeout = sys.maxsize if use_caching else 0
lib = basic_store
symbol = "test"
dataframes = [sample_dataframe()] * 10
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way all dataframes are equal (in fact maybe pointing to the same dataframe object). It would be useful for your tests where you compare the dataframes to make them different.

Instead use something like:

[sample_dataframe() for _ in range(10)]

with pytest.raises(NoSuchVersionException):
lib.read(symbol, as_of=timestamps[i].before)
else:
assert_equal(lib.read(symbol, as_of=timestamps[i].before).data, dataframes[i-1])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an example where the same dataframes bites us. If we request as_of=timestamps[4].before we should get dataframes[2] because 3 is deleted.

@ognyanstoimenov ognyanstoimenov merged commit 6410a72 into master Jan 15, 2025
134 of 135 checks passed
@ognyanstoimenov ognyanstoimenov deleted the ostoimenov/performance_timestamp branch January 15, 2025 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance regression when requesting a timestamp before the earliest version
3 participants