orc: more changes #6714

Merged: @sougou merged 7 commits into vitessio:master from ss-oc5-operator on Sep 17, 2020
Conversation

@sougou (Contributor) commented Sep 13, 2020

Since we're moving iteratively, the individual commits are not necessarily holistic. So, I'm combining everything done so far into one PR:

  • Added support for a "durability" plugin. It lets you specify which servers are eligible to be masters, how many semi-sync acks the master should be configured for, and which servers can ack semi-syncs. The promotion rules now consult this module instead of config variables. This will evolve as we move forward. (See the interface sketch after this list.)
  • Added more failure detections: MasterSemiSyncMustBeSet, MasterSemiSyncMustNotBeSet, ReplicaSemiSyncMustBeSet, ReplicaSemiSyncMustNotBeSet
  • Proactively change the tablet types in vitess for a smoother user experience.
  • LockShard has improved, but not enough. More will come later.
  • Added support for Operator in orchestrator (initial support: planetscale/vitess-operator#130). Made corresponding changes here.
  • Added operator to docker images: this required disabling CGO because of our new dependency on sqlite.
  • Deleted discovery through mysql topology. This is not needed because there is authoritative info coming from the vitess topo.
  • Disabled hostname resolution: this was not working inside kubernetes. Again, the vitess topo already provides authoritative info for hostnames.
  • Removed CLI mode from orchestrator main. The only mode now is http, so that argument is unnecessary.
  • Added a new flag "orc_web_dir" to point orchestrator at the right web directory path in the vitess/lite container.
  • The default values for some configs have been changed, and will likely be removed from the user's control eventually. The hard-coded values are:
    • "BackendDB": "sqlite"
    • "SQLite3DataFile": ":memory:"
    • "RecoverMasterClusterFilters": ["*"]
    • "DelayMasterPromotionIfSQLThreadNotUpToDate": true
    • "MySQLHostnameResolveMethod": "none"
  • Added some additional demo scripts/files in examples/local, and examples/operator.
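
To make the durability idea above concrete, here is a minimal sketch of what such a plugin could look like. The interface, method names, and registry below are invented for illustration; they are not the actual API added in this PR.

package durability

// Durabler is a hypothetical policy interface covering the three
// responsibilities described above: who may be master, how many
// semi-sync acks the master needs, and who may send those acks.
type Durabler interface {
	// IsEligibleMaster reports whether the given server may be promoted.
	IsEligibleMaster(alias string) bool
	// SemiSyncAckers returns how many semi-sync acks the master should
	// be configured to wait for.
	SemiSyncAckers() int
	// CanAckSemiSync reports whether the given replica may ack semi-sync.
	CanAckSemiSync(alias string) bool
}

// registry lets policies register themselves by name, typically from init().
var registry = map[string]Durabler{}

// Register makes a policy available under the given name.
func Register(name string, policy Durabler) {
	registry[name] = policy
}

// semiSyncPolicy is an example policy: every server is promotable, and the
// master waits for one ack, which any replica may provide.
type semiSyncPolicy struct{}

func (semiSyncPolicy) IsEligibleMaster(string) bool { return true }
func (semiSyncPolicy) SemiSyncAckers() int          { return 1 }
func (semiSyncPolicy) CanAckSemiSync(string) bool   { return true }

func init() {
	Register("semi_sync", semiSyncPolicy{})
}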

@shlomi-noach (Contributor) left a comment

  1. Change from :memory: to file::memory:?mode=memory&cache=shared, because :memory: will open a new snapshot of the database for any new sqlite connection. The only reason you didn't notice a problem is that the orchestrator code forces a single connection in the pool; that's because I ran into deadlocks in the past that I couldn't solve. Ideally these can be solved sometime in the future, but then file::memory:?mode=memory&cache=shared must be used or else everything breaks. (See the illustration after this comment.)

  2. Do you still need help with CGO and Docker? I can look into that.

Removed CLI mode from orchestrator main. The only mode now is http, so that argument is unnecessary.

(reiterating an inline comment) CLI is used in integration tests; a bit like vitess' endtoend tests run command-line vtctl. In that sense it's useful to keep. Unless you convert all the integration tests to run orchestrator-client, which is some undertaking.

Disabled hostname resolution: this was not working inside kubernetes. Again, the vitess topo already provides authoritative info for hostnames.

Is it possible, though, that when you SHOW PROCESSLIST or SHOW SLAVE STATUS on some user's MySQL server, the hostname of the master/replicas does not match the way Vitess thinks it should appear? If that happens, orchestrator will not build the topology view correctly. I'm concerned that we are unable to verify that, because who knows what setups different users may have.
I'd suggest restoring hostname resolution to be on the safe side.
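
For illustration, here is a minimal sketch of why the DSN matters, assuming the mattn/go-sqlite3 driver: with a plain ":memory:" DSN, each new connection in the pool opens its own empty database, while the shared-cache form lets every connection in the process see the same one.

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumes the mattn/go-sqlite3 driver
)

func main() {
	// With ":memory:" alone, every pooled connection would get its own,
	// separate in-memory database. The shared-cache DSN below makes all
	// connections in this process share a single database.
	db, err := sql.Open("sqlite3", "file::memory:?mode=memory&cache=shared")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE t (id INT)`); err != nil {
		log.Fatal(err)
	}
	// Any other connection drawn from the pool can now see table t; with a
	// plain ":memory:" DSN, a second connection might see an empty database.
}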

go/vt/orchestrator/config/config.go (outdated; conversation resolved)
@@ -61,7 +61,7 @@ endif
install: build
# binaries
mkdir -p "$${PREFIX}/bin"
cp "$${VTROOT}/bin/"{mysqlctld,vtctld,vtctlclient,vtgate,vttablet,vtworker,vtbackup} "$${PREFIX}/bin/"
cp "$${VTROOT}/bin/"{mysqlctld,orchestrator,vtctld,vtctlclient,vtgate,vttablet,vtworker,vtbackup} "$${PREFIX}/bin/"
@shlomi-noach (Contributor): 👍

@@ -19,9 +19,6 @@
# Use a temporary layer for the build stage.
FROM vitess/bootstrap:mariadb103 AS builder

- # Allows some docker builds to disable CGO
- ARG CGO_ENABLED=0

@shlomi-noach (Contributor): 👍

@@ -48,6 +45,7 @@ ENV MYSQL_FLAVOR MariaDB103

# Copy artifacts from builder layer.
COPY --from=builder --chown=vitess:vitess /vt/install /vt
+ COPY --from=builder --chown=vitess:vitess /vt/src/vitess.io/vitess/web/orchestrator /vt/web/orchestrator
@shlomi-noach (Contributor): 👍

@@ -0,0 +1,57 @@
#!/bin/bash

# Copyright 2019 The Vitess Authors.
@shlomi-noach (Contributor):
Suggested change:
- # Copyright 2019 The Vitess Authors.
+ # Copyright 2020 The Vitess Authors.

if !inst.RegexpMatchPatterns(instance.MasterKey.StringCode(), config.Config.DiscoveryIgnoreMasterHostnameFilters) {
	discoveryQueue.Push(instance.MasterKey)
}
}
@shlomi-noach (Contributor): oh wow 🤞 hope this works 😅

- _, unlock, err := ts.LockShard(context.TODO(), tablet.Keyspace, tablet.Shard, "Orc Recovery")
+ ctx, cancel := context.WithTimeout(context.TODO(), 1*time.Second)
+ defer cancel()
+ _, unlock, err := ts.LockShard(ctx, tablet.Keyspace, tablet.Shard, "Orc Recovery")
@shlomi-noach (Contributor): 👍
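
As a usage note, here is a minimal sketch of the full lock-with-timeout pattern from the diff above. The function name and body are illustrative; only the LockShard call, which returns a locked context, an unlock callback, and an error, follows the diff.

package logic

import (
	"context"
	"time"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo"
)

func lockAndRecover(ts *topo.Server, tablet *topodatapb.Tablet) (err error) {
	// Bound how long we are willing to wait for the shard lock.
	ctx, cancel := context.WithTimeout(context.TODO(), 1*time.Second)
	defer cancel()

	_, unlock, err := ts.LockShard(ctx, tablet.Keyspace, tablet.Shard, "Orc Recovery")
	if err != nil {
		return err
	}
	// unlock releases the lock and records any release error into err.
	defer unlock(&err)

	// ... recovery work happens here, while the shard lock is held ...
	return nil
}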

} else {
- err = tmc.UndoDemoteMaster(context.TODO(), tablet)
+ err = tmc.UndoDemoteMaster(ctx, tablet)
@shlomi-noach (Contributor): ah, nice! I like the Undo!

@@ -1535,9 +1559,10 @@ func getCheckAndRecoverFunction(analysisCode inst.AnalysisCode, analyzedInstance
return electNewMaster, true
case inst.MasterHasMaster:
return fixClusterAndMaster, true
- case inst.MasterIsReadOnly:
+ case inst.MasterIsReadOnly, inst.MasterSemiSyncMustBeSet, inst.MasterSemiSyncMustNotBeSet:
@shlomi-noach (Contributor): 👍

go/vt/orchestrator/logic/topology_recovery.go (conversation resolved)
@sougou (Author) left a comment

I'll answer the main comments separately.

examples/local/scripts/ovttablet-up.sh (conversation resolved)
}

switch {
case helpTopic != "":
	app.HelpCommand(helpTopic)
case len(flag.Args()) == 0 || flag.Arg(0) == "cli":
	app.CliWrapper(*command, *strict, *instance, *destination, *owner, *reason, *duration, *pattern, *clusterAlias, *pool, *hostnameFlag)
@sougou (Author):
We will have to convert. I'll explain in the main answer.

log.Fatale(err)
}
fmt.Println(instanceKey.DisplayString())
}
@sougou (Author):
These were removed because they conflict with the new durability plugin. Even if you set it here, the plugin will undo the change if it notices that the setting doesn't agree with its expectations.
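
A hypothetical sketch (names invented for illustration) of the reconciliation behavior described above: the plugin compares the observed setting against what the policy expects and undoes any drift.

// reconcileSemiSync restores the expected semi-sync setting whenever a
// manual change disagrees with the durability policy.
func reconcileSemiSync(observed, expected bool, apply func(bool) error) error {
	if observed != expected {
		return apply(expected)
	}
	return nil
}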

si.MasterTermStartTime = newMasterTablet.MasterTermStartTime
return nil
})
// Don't proceed if shard record could not be updated.
@sougou (Author):
The comment applies to the error check and nil return. It says that we should not ignore this error. LMK if there is a better way to phrase it.

go/vt/orchestrator/logic/topology_recovery.go (conversation resolved)
@sougou (Author) commented Sep 15, 2020

  1. Change from :memory: to file::memory:?mode=memory&cache=shared, because :memory: will open a new snapshot of the database for any new sqlite connection. The only reason you didn't notice a problem is that the orchestrator code forces a single connection in the pool; that's because I ran into deadlocks in the past that I couldn't solve. Ideally these can be solved sometime in the future, but then file::memory:?mode=memory&cache=shared must be used or else everything breaks.

Done

  2. Do you still need help with CGO and Docker? I can look into that.

CGO works now.

  Removed CLI mode from orchestrator main. The only mode now is http, so that argument is unnecessary.

  (reiterating an inline comment) CLI is used in integration tests; a bit like vitess' endtoend tests run command-line vtctl. In that sense it's useful to keep. Unless you convert all the integration tests to run orchestrator-client, which is some undertaking.

I had to remove the CLI because there was no easy way to change the operator framework to add http as a non-flag argument. So, removing that argument cascaded into these dependencies. I'm not worried about losing the CLI for tests because we've managed to test vitess without command-line versions of the servers. vtctl continues to exist mainly because of inertia; it is deprecated, we just haven't gotten around to deleting it.

  Disabled hostname resolution: this was not working inside kubernetes. Again, the vitess topo already provides authoritative info for hostnames.

  Is it possible, though, that when you SHOW PROCESSLIST or SHOW SLAVE STATUS on some user's MySQL server, the hostname of the master/replicas does not match the way Vitess thinks it should appear? If that happens, orchestrator will not build the topology view correctly. I'm concerned that we are unable to verify that, because who knows what setups different users may have.
  I'd suggest restoring hostname resolution to be on the safe side.

I wasn't too sure about this, which is why the core code is still there; we can resurrect it if we run into issues. But enabling hostname resolution doesn't work in kubernetes: pod names masquerading as hostnames were causing confusion. I'm hoping we won't have to do this, because Vitess has a cloud-friendly way of resolving hostnames. So, blindly trusting the ones provided by vitess should just work. 🤞

@shlomi-noach (Contributor) commented:

  blindly trusting the ones provided by vitess should just work. 🤞

The issue is not that; I'm happy to trust the name resolving provided by vitess, that's fine. But do we absolutely know for sure that if I SHOW SLAVE STATUS, the Master_host shows up in the same way Vitess resolves it? E.g., could I just see an IP address there? Or is this impossible because it's vitess that set up the replication in the first place?

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com> (on all 7 commits)
@sougou (Author) commented Sep 16, 2020

  blindly trusting the ones provided by vitess should just work. 🤞

  The issue is not that; I'm happy to trust the name resolving provided by vitess, that's fine. But do we absolutely know for sure that if I SHOW SLAVE STATUS, the Master_host shows up in the same way Vitess resolves it? E.g., could I just see an IP address there? Or is this impossible because it's vitess that set up the replication in the first place?

This seems to work fine. I just brought up a cluster and looked at the replica info. They show up as IPs in the UI as well. I think it works fine because we also use the same IP when wiring up the replicas with CHANGE MASTER TO.

[Screenshot from 2020-09-15 21-12-14: orchestrator UI listing the replicas by IP]

@sougou merged commit ba71ee6 into vitessio:master on Sep 17, 2020
@sougou deleted the ss-oc5-operator branch on September 20, 2020 at 21:38
@askdba added this to the v8.0 milestone on Oct 6, 2020