Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Captive core fast startup #2994

Merged
merged 5 commits into from
May 13, 2021

Conversation

marta-lokhova
Copy link
Contributor

@marta-lokhova marta-lokhova commented Apr 2, 2021

Implementation based on the discussion in #2960

Resolves #2960
Resolves #2993

Copy link
Contributor

@MonsieurNicolas MonsieurNicolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few suggestions to make the code flow easier to understand

src/herder/test/HerderTests.cpp Show resolved Hide resolved
src/herder/test/HerderTests.cpp Show resolved Hide resolved
src/main/ApplicationImpl.cpp Show resolved Hide resolved
@@ -131,6 +155,63 @@ runWithConfig(Config cfg, optional<CatchupConfiguration> cc)
return 0;
}

int
rebuildInMemoryLedger(Application& app)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks more or less like rebuildLedgerFromBuckets, should be refactored

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(NB: there are bugs in the other one)

auto checkBucket = [&](std::string bucketStr) {
auto bucketHash = hexToBin256(bucketStr);
auto bucket = app.getBucketManager().getBucketByHash(bucketHash);
if (!bucket || (isZero(bucket->getHash()) && !isZero(bucketHash)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why || (isZero(bucket->getHash()) && !isZero(bucketHash))) ?
that doesn't seem to be related to on disk/not on disk (covered by !bucket)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you look at the implementation of getBucketByHash? It creates an empty bucket pointer if it can't find the bucket, so checking !bucket is not enough.

That being said, looking at the last known ledger loading code, we already check for missing buckets there, so I think I don't even need this check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you look at the implementation of getBucketByHash? It creates an empty bucket pointer if it can't find the bucket, so checking !bucket is not enough.

yes I did, when it doesn't find the file, it does return std::shared_ptr<Bucket>(); which is nullptr not an empty bucket.

other places in the code are only checking for nullptr

But yeah maybe we don't care anyways

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, for some reason I was convinced that function uses std::make_shared. Sorry about that. But yeah, I removed this code now.

cc->toLedger() - lcl > RESTORE_STATE_LEDGER_WINDOW)
{
LOG_INFO(DEFAULT_LOG, "Cannot restore the in-memory state, "
"rebuilding the state from scratch");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to not use pass an extra flag to runWithConfig and just reset the app to genesis right here

"In-memory database not found, creating one...");
app = Application::create(clock, cfg, /* newDB */ true);
}
app->start();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code should be in the try block: if we don't find the mini database we should proceed like normal as we're starting up at genesis

optional<CatchupConfiguration>
maybeEnableInMemoryLedgerMode(Config& config, bool inMemory,
uint32_t startAtLedger,
std::string const& startAtHash)
std::string const& startAtHash,
bool persistPartially = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not use default value here

return 0;
});
auto resetParser = [](bool& resetInMemoryState) {
return clara::Opt{resetInMemoryState}["--reset-for-in-memory-mode"](
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use inMemoryParser: it was added so that we use the same option across subcommands

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call new-db --in-memory is confusing and contradicts your other comment about calling the database "an in-memory database". The flag is called "in-memory", however the command sets a persistent state, that the next commands also depend on. I'd like to be as clear to the users as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest reusing the term you're using above: minimal. I.e. stellar-core new-db --minimal-for-in-memory-mode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I was especially concerned by the word "reset" here because this suggests a non-minimal DB can be changed / reset into a minimal one, and I don't think that's possible?)

// aggressively so that we only store a few ledgers worth of data
config.AUTOMATIC_MAINTENANCE_PERIOD = std::chrono::seconds(30);
config.AUTOMATIC_MAINTENANCE_COUNT = 100;
config.DISABLE_XDR_FSYNC = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should not touch DISABLE_XDR_FSYNC here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picked up the latest master, which already addresses this problem but not forcing DISABLE_XDR_FSYNC, so I believe this is resolved now.

@MonsieurNicolas
Copy link
Contributor

@graydon may have better insight than me

@marta-lokhova marta-lokhova force-pushed the captive_core_fast_startup branch from af73b1b to 6446e50 Compare May 6, 2021 00:29
auto resetParser = [](bool& resetInMemoryState) {
return clara::Opt{resetInMemoryState}["--reset-for-in-memory-mode"](
"Reset the special database used only for in-memory mode (see "
"--replay-in-memory flag");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the --replay-in-memory option is itself marked as deprecated; should reference --in-memory here.

}

void
setupMinimalDB(Config const& cfg, uint32_t startAtLedger)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the term "minimal DB" you're using here -- I would repeat / reuse / surface this term to the user. It is clear and can be extended into the phrase "minimal DB for in-memory mode".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Maybe also extend the name of this function so any reader immediately knows it's about in-memory mode: setupMinimalDBForInMemoryMode

uint32_t startAtLedger,
std::string const& startAtHash)
std::string
partialDBForInMemoryMode(Config const& cfg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename "partial" here to "minimal" for consistency.

std::string
partialDBForInMemoryMode(Config const& cfg)
{
return fmt::format("sqlite3://{}/partial.db", cfg.BUCKET_DIR_PATH);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and partial.db here to minimal.db

@@ -957,11 +970,26 @@ int
runNewDB(CommandLineArgs const& args)
{
CommandLine::ConfigOption configOption;
bool resetInMemoryState = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename variable: minimalForInMemoryMode

@@ -130,6 +182,35 @@ runWithConfig(Config cfg, std::optional<CatchupConfiguration> cc)
return 0;
}

bool
rebuildLatestState(Application& app)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename function for specificity: applyBucketsOfLastClosedLedger

auto has = app.getLedgerManager().getLastClosedLedgerHAS();
auto lclHash =
app.getPersistentState().getState(PersistentState::kLastClosedLedger);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a releaseAssertOrThrow here that the ledger tables are all empty (using LedgerTxn::countObjects). We never want this code to run on a nonempty ledger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds like a useful check, but the problem is that calling countObjects is not allowed on non-root LedgerTxn, which is what we use in in-memory mode (that is, the never committing LedgerTxn). There doesn't seem to be any other machinery to verify LedgerTxn's state (we could add some, but I don't think we should at this time). I will add a comment about this though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Marta, I think it's fine as is for now

@@ -392,6 +392,24 @@ ApplicationImpl::~ApplicationImpl()
LOG_INFO(DEFAULT_LOG, "Application destroyed");
}

void
ApplicationImpl::resetDBForInMemoryMode()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worries me / reminds me a bit of a "schema migration", or just .. an alternative path for creating a DB (albeit a minimal one) that does not necessarily arrive at the same place we'd arrive if we reinitialized. I gather you're trying to preserve the overlay state here (but nothing else); would it be possible to do that explicitly? That is:

  1. load all the overlay state you want to preserve into a temporary data structure in memory.
  2. reinitialize the database using the normal code path.
  3. reinsert just the overlay state you wanted to preserve (data which we're a bit lax about anyways)

int
runWithConfig(Config cfg, std::optional<CatchupConfiguration> cc)
bool
canRebuildFromBuckets(uint32_t startAtLedger, uint32_t lcl)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename for specificity: canRestartInMemoryLedgerFromOldBuckets

void
setupMinimalDB(Config const& cfg, uint32_t startAtLedger)
{
VirtualClock clock;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a releaseAssert that we're in in-memory mode.

setupApp(Config& cfg, VirtualClock& clock, bool inMemory,
uint32_t startAtLedger, std::string const& startAtHash)
{
if (inMemory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate flag here seems redundant -- isn't this just cfg.isInMemoryMode()?

auto work = app.getWorkScheduler().scheduleWork<ApplyBucketsWork>(
buckets, has, maxProtocolVersion);

while (app.getClock().crank(true) && !work->isDone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside (not critical): we do this pattern in quite a few places, I wonder if it'd be worth adding a VirtualClock::crankUntilWorkDone(BasicWork&) or something.

, mDeleteEntireBucketDirInDtor(app.getConfig().isInMemoryMode())
, mDeleteEntireBucketDirInDtor(
app.getConfig().isInMemoryMode() &&
!app.getConfig().MODE_STORES_HISTORY_LEDGERHEADERS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment would be nice here explaining why !MODE_STORES_HISTORY_LEDGERHEADERS relates to the deletion of entries in the bucketlist; or perhaps a helper method like bool Config::isInMemoryModeWithoutMinimalDB() const { return isInMemoryMode() && !MODE_STORES_HISTORY_LEDGERHEADERS; } or something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on adding a helper function for clarity

@@ -212,6 +212,9 @@ class Config : public std::enable_shared_from_this<Config>
// fees, and scp history in the database
bool MODE_STORES_HISTORY;

// A config parameter that stores ledger headers in the database
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extend this comment to include the fact that this exists in order to support a sub-mode of in-memory mode, where MODE_STORES_HISTORY is false but MODE_STORES_HISTORY_LEDGERHEADERS remains true. That is: that there is no legal configuration where MODE_STORES_HISTORY_LEDGERHEADERS is false but MODE_STORES_HISTORY is true. Or in yet other words, that only 3 of the 4 states of these 2 flags are legal.

(Also option to discuss here: I know we've gone back and forth on orthogonal flags vs. enumerated modes, but maybe this is as good as any to ask whether to reconsider merging MODE_USES_IN_MEMORY_LEDGER, MODE_STORES_HISTORY and MODE_STORES_HISTORY_LEDGERHEADERS back into a single enum. I'm not sure which of the 8 possible combinations is legal off the top of my head, which is not a great sign.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, I don't know @graydon : we used to have those "uber flags" used in the code base and it was actually very hard to understand which parts of the code had to change when touching things.

What we may want to do in this commit is probably change MODE_STORES_HISTORY to something that doesn't overlap with MODE_STORES_HISTORY_LEDGERHEADERS (right now it's hard to tell how they relate to each other: are they mutually exclusive?!).

So if we rename MODE_STORES_HISTORY to something like MODE_STORES_HISTORY_MISC (and introduce MODE_STORES_HISTORY_LEDGERHEADERS before) we can make it clear that they complement each other.

We can add helper functions to help check if any historical data is enabled (that seems to be the common case), or maybe to enable/disable "persist all historical data" (to flip both at the same time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea to rename MODE_STORES_HISTORY. I ended up implementing modeStoresAllHistory and modeStoresAnyHistory, and that seems to improve clarity quite a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good!

@graydon graydon self-requested a review May 7, 2021 00:21
Copy link
Contributor

@graydon graydon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks promising! Fairly subtle changes though -- I've marked places where I think things could maybe be made a little more obvious for our future selves when we revisit these treacherous paths (as we surely will). Mostly just cosmetic though, I think there's only one structural / logic change around the question of how to most-safely "reset" vs. "reinitialize + reinsert" a minimal DB that can't be reused.

@@ -212,6 +212,9 @@ class Config : public std::enable_shared_from_this<Config>
// fees, and scp history in the database
bool MODE_STORES_HISTORY;

// A config parameter that stores ledger headers in the database
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, I don't know @graydon : we used to have those "uber flags" used in the code base and it was actually very hard to understand which parts of the code had to change when touching things.

What we may want to do in this commit is probably change MODE_STORES_HISTORY to something that doesn't overlap with MODE_STORES_HISTORY_LEDGERHEADERS (right now it's hard to tell how they relate to each other: are they mutually exclusive?!).

So if we rename MODE_STORES_HISTORY to something like MODE_STORES_HISTORY_MISC (and introduce MODE_STORES_HISTORY_LEDGERHEADERS before) we can make it clear that they complement each other.

We can add helper functions to help check if any historical data is enabled (that seems to be the common case), or maybe to enable/disable "persist all historical data" (to flip both at the same time).

, mDeleteEntireBucketDirInDtor(app.getConfig().isInMemoryMode())
, mDeleteEntireBucketDirInDtor(
app.getConfig().isInMemoryMode() &&
!app.getConfig().MODE_STORES_HISTORY_LEDGERHEADERS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on adding a helper function for clarity

@@ -439,10 +439,13 @@ ApplicationImpl::validateAndLogConfig()

if (getHistoryArchiveManager().hasAnyWritableHistoryArchive())
{
if (!mConfig.MODE_STORES_HISTORY)
if (!mConfig.MODE_STORES_HISTORY ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my other comment: what we're looking for here is to check that if publish is enabled we better store all historical data. So a helper function here would help

@@ -75,7 +75,8 @@ CommandHandler::CommandHandler(Application& app) : mApp(app)

mServer->add404(std::bind(&CommandHandler::fileNotFound, this, _1, _2));

if (mApp.getConfig().MODE_STORES_HISTORY)
if (mApp.getConfig().MODE_STORES_HISTORY &&
mApp.getConfig().MODE_STORES_HISTORY_LEDGERHEADERS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be an || as maintenance related endpoints should work regardless of which historical data is in there

src/main/CommandLine.cpp Show resolved Hide resolved
@marta-lokhova marta-lokhova force-pushed the captive_core_fast_startup branch from 6446e50 to dc88ce5 Compare May 11, 2021 16:20
@marta-lokhova
Copy link
Contributor Author

@graydon @MonsieurNicolas thanks for the feedback! I believe I addressed your comments in several new commits (I haven't squashed the changes yet, so that it's easy to review).

Copy link
Contributor

@MonsieurNicolas MonsieurNicolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very close to something we can merge. Good job. I added a couple suggestions/questions related to potentially bad things that we're carrying over from master

throw std::invalid_argument(
"Core is not configured to store history, but "
"some history archives are writable (see "
"MODE_STORES_HISTORY_MISC "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this error message is weird (and got carried over from before):
end users to not have control over this, so the error should hint more about something they can fix.

I think the error message should just be something like

"Core is not configured to store history, but some history archives are writable"

auto has = app.getLedgerManager().getLastClosedLedgerHAS();
auto lclHash =
app.getPersistentState().getState(PersistentState::kLastClosedLedger);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Marta, I think it's fine as is for now


// As the local HAS might have merges in progress, let
// `prepareForPublish` convert it into a "valid historical HAS".
has.prepareForPublish(app);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another "carry over".

Is this has.prepareForPublish(app); really needed?
The current ledger state should never depend on merges in progress (when we close a ledger we have already computed the current ledger) - I think the only thing that this call is going to do is potentially cause a stall when restarting because it will perform merges that it doesn't need to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently needed since ApplyBucketsWork is designed to apply buckets from an untrusted source, so it verifies validity of the HAS. That being said, I think the check could be moved out of ApplyBucketsWork into catchup. I opened #3044 to track this.

{
LOG_WARNING(DEFAULT_LOG,
"Using MANUAL_CLOSE and RUN_STANDALONE "
"together "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this string got chopped-formatted into too many bits. Maybe you need to merge it back into a single line and re-run clang-format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you're right, looks much better now.

@@ -212,6 +212,9 @@ class Config : public std::enable_shared_from_this<Config>
// fees, and scp history in the database
bool MODE_STORES_HISTORY;

// A config parameter that stores ledger headers in the database
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good!

{
ZoneScoped;
std::vector<std::pair<PeerBareAddress, PeerRecord>> result;
std::string sql = "SELECT * FROM peers";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid using SELECT * as it's a footgun, just list the column names that you expect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, updated.

@marta-lokhova marta-lokhova force-pushed the captive_core_fast_startup branch from dc88ce5 to 7cde809 Compare May 12, 2021 21:43
@marta-lokhova
Copy link
Contributor Author

I believe this is ready now @MonsieurNicolas, I've squashed the commits as well.

@MonsieurNicolas
Copy link
Contributor

r+ 7cde809

@MonsieurNicolas
Copy link
Contributor

@latobarita: retry

@MonsieurNicolas
Copy link
Contributor

@latobarita: retry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants