-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add capability to create consistent transactions #4310
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach looks good overall. Got some initial comments/questions.
I still need to think about the consequences of stream connection using tx pool.
There is an intent to merge the stream and non-stream protocol. So, we need to make decisions that are compatible with that future. Most likely, it's not material for this change.
go/mysql/flavor_mariadb.go
Outdated
@@ -41,6 +41,10 @@ func (mariadbFlavor) masterGTIDSet(c *Conn) (GTIDSet, error) { | |||
return parseMariadbGTIDSet(qr.Rows[0][0].ToString()) | |||
} | |||
|
|||
func (mariadbFlavor) startSlaveUntilAfter(pos Position) string { | |||
panic("not implemented") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there is an equivalent for mariadb: https://mariadb.com/kb/en/library/start-slave/
c862c97
to
34a916e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it be possible to write integration tests for this feature? For example, how do we know this will work as expected for MariaDB?
Let me know if you need help/guidance for this.
d0f9fba
to
9dcb4b7
Compare
test/lock_tables.py
Outdated
addr = tablet_replica1.rpc_endpoint() | ||
p = urlparse('http://' + addr) | ||
channel = grpc.insecure_channel('%s:%s' % (p.hostname, p.port)) | ||
tablet_manager = TabletManagerStub(channel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe extract this to a method on Tablet
? so you can do just:
tablet_manager = tablet_replica1.connect_tablet_manager()
tablet_manager.LockTables(...)
test/lock_tables.py
Outdated
primary key (id) | ||
) Engine=InnoDB''' | ||
|
||
def test_lock_and_unlock(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also have a test for auto-unlock on timeout? Maybe if you expose that as a flag then set it to something really low like 1 second when you start the tablet and then you sleep for 2 seconds and the tables should be unlocked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good so far. I assume you're still working on the integration tests?
return tsv.serveNewType() | ||
} | ||
|
||
func (tsv *TabletServer) serveNewType() (err error) { | ||
if !tsv.te.isOpen { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is not necessary. The state machine ensures that serveNewType
will not get called without a fullStart
. This is also the reason why this code doesn't need to ensure that other services like qe
are already open.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually put this in here to make cache_invalidation.py
green again. Not sure what is going on there, but if I comment these three lines out, that test starts failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just occurred to me that cache_invalidation did expose a problem with the approach of opening te
in fullStart
. Essentially, we have to close and re-open te
every time serveNewType
gets called. This is because Close
will have to wait for existing transactions to be resolved before we can switch to the new serving type.
In other words, we should go back to the old behavior, except that we immediately te.Open
as soon as it's closed.
@sougou correct. Still working on adding more integration tests. |
1680352
to
0f179a0
Compare
@sougou @deepthi status for this PR is that I still have not solved #4310 (comment) other than that, I think it's ready |
82ae29e
to
5a449d1
Compare
@@ -640,53 +633,6 @@ func TestTabletServerStopWithPrepare(t *testing.T) { | |||
} | |||
} | |||
|
|||
func TestTabletServerMasterToReplica(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you delete these tests (this and the ReplicaToMaster)? Are they failing? If so, we need to investigate why.
These are important because they prove that no errors will be served during state transitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They were removed by mistake. Sorry about that, and thanks for catching it!
The tests are related to the 5a449d1 commit. I don't think that change is correct, and the tests agree :)
The tests need to be looked at. They're failing after a restart also. |
@@ -1896,20 +1926,12 @@ verifyTarget: | |||
|
|||
ok: | |||
tsv.requests.Add(1) | |||
// If it's a begin, we should make the shutdown code | |||
// wait for the call to end before it waits for tx empty. | |||
if isBegin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving this to TxEngine introduces a race:
- Thread 1: tsv.Begin is called, which calls startRequest through execRequest.
- Thread 1:
tsv.mu
is released once startRequest returns. - Thread 2: A call to
setServingType
is received. - Thread 2: goes in: there are no begin requests in the wait group. So, it closes the TxEngine, or changes its state.
- Thread 1: TxEngine.Begin executes on a TxEngine, which adds to the beginWaitGroup. But the state is not the same as what it was when
startRequest
returned, potentially causing unexpected behavior.
@@ -208,6 +208,43 @@ func NewServer(topoServer *topo.Server, alias topodatapb.TabletAlias) *TabletSer | |||
return NewTabletServer(tabletenv.Config, topoServer, alias) | |||
} | |||
|
|||
type TxPoolController interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this interface. We should introduce interfaces only if it's necessary, like if there is a need to replace teCtrl with something else. Otherwise, directly using TxEngine is generally more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I disagree. In my opinion interfaces more clearly delineates how different parts of the code interact and make it more obvious when you add a new dependency across a code base. It is one of the most powerful tools we humans have to tackle complexity by breaking down code in smaller parts so that we can understand each part separately without having to understand everything at the same time.
Sugu, you might be able to keep all of Vitess in your head at the same time. Us mortals are not as fortunate. We have to page different parts in and out.
:-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case, I won't insist if you think the interface helps.
The reason why I dislike interfaces is because of the opposite reason: I can't hold the codebase in my head :).
Here's a classic example: https://github.com/vitessio/vitess/blob/master/go/vt/vtgate/scatter_conn.go#L146. This particular line calls into queryservice. As a challenge, you can try drilling down to figure what code path gets eventually used to call vttablet.
Another similar situation is with TabletStats that report health. I still don't understand how that mechanism works, and this has prevented me from fixing issues in that area. I know the health check mechanism is broken; People keep reporting strange issues. But there's nothing I can do there. Those interface abstractions make it impossible for me to understand what calls what.
Short story long: in this case, it's fairly obvious one way or another.
To make this possible, two things are added: - The capability to lock all tables on a tablet, to momenterily stop updates - Once the database is locked, we can create multiple consistent snapshot transactions that all share the same view of the data - Adds the capability to have replication move forward to a specific point in the transaction log Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
* upstream/master: Skip ACL checks for dual pseudotable Makes sure the schema engine has started up before getting debug information. Support tables that have columns named desc exec_cases: add passing messages OnDup test exec_cases: add failing tests for msgs on dupe planbuilder: support ‘on duplicate key’ for msgs
This reverts commit a57c799. Signed-off-by: Andres Taylor <antaylor@squareup.com>
* upstream/master: helm: release 1.0.5 vreplication: address review comments tlstest_test: Go 1.12 / TLS 1.3 fix vreplication: vstream final tweaks vreplication: tests are done vreplication: handle singleton and mariadb issues vreplication: moved tests into vstreamer dir vreplication: more vstreamer tests vreplication: MariaDB test tweaks and more tests vreplication: tests: statements and DDL add column vreplication: make a singleton srvtopo.Server vreplication: got some basic tests working vreplication: tabletserver inits and other tweaks vreplication: vstreamer, plan and filtering vreplication: streamer planbuilder vreplication: vstreamer Engine Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
* upstream/master: vtctl: run BackupShard on replica, rdonly or spare Expose `glog.MaxSize` as `-log_rotate_max_size` flag inline ok check; while-range instead of bare while test the signal handler reload db-credentials-file upon SIGHUP change ClusterAlias in _vt.local_metadata to be keyspace/shard instead of keyspace.shard Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
* upstream/master: Adds a test to check regressions in parser Adding in SQL commands to prepare an instance to join an existing shard Revert "Support tables that have columns named desc" Use a truly random Destination for DestinationAnyShard Signed-off-by: Andres Taylor <antaylor@squareup.com>
@@ -725,7 +735,10 @@ func TestTabletServerReplicaToMaster(t *testing.T) { | |||
if !reflect.DeepEqual(got, want) { | |||
t.Errorf("Prepared queries: %v, want %v", got, want) | |||
} | |||
tsv.SetServingType(topodatapb.TabletType_REPLICA, true, nil) | |||
tsv.SetServingType(topodatapb.TabletType_MASTER, false, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be REPLICA?
tsv.SetServingType(topodatapb.TabletType_REPLICA, true, nil) | ||
tsv.SetServingType(topodatapb.TabletType_MASTER, false, nil) | ||
if len(tsv.te.preparedPool.conns) != 0 { | ||
t.Fatal("fail") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this to t.Errorf with a more clear error message (and down below).
Strangely, a previous comment disappeared. The test you commented out as obsolete, you can delete that. The test failures look real. One of them is failing to compile, and another one has a data race. |
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Trying to make it clearer what we are testing. Signed-off-by: Andres Taylor <antaylor@squareup.com>
Maybe use #4533 instead? |
To make this possible, two things are added:
transactions that all share the same view of the data
This PR is part of a larger piece of work to allow consistent snapshot to be used when cloning and diffing, instead of pausing replication.