Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add capability to create consistent transactions #4310

Closed
wants to merge 37 commits into from

Conversation

systay
Copy link
Collaborator

@systay systay commented Oct 24, 2018

To make this possible, two things are added:

  • The capability to lock all tables on a tablet, to momenterily stop updates
  • Once the database is locked, we can create multiple consistent snapshot
    transactions that all share the same view of the data

This PR is part of a larger piece of work to allow consistent snapshot to be used when cloning and diffing, instead of pausing replication.

@systay systay changed the title Add capability create consistent transactions Add capability to create consistent transactions Oct 24, 2018
Copy link
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good overall. Got some initial comments/questions.
I still need to think about the consequences of stream connection using tx pool.

There is an intent to merge the stream and non-stream protocol. So, we need to make decisions that are compatible with that future. Most likely, it's not material for this change.

@@ -41,6 +41,10 @@ func (mariadbFlavor) masterGTIDSet(c *Conn) (GTIDSet, error) {
return parseMariadbGTIDSet(qr.Rows[0][0].ToString())
}

func (mariadbFlavor) startSlaveUntilAfter(pos Position) string {
panic("not implemented")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is an equivalent for mariadb: https://mariadb.com/kb/en/library/start-slave/

go/vt/vttablet/tabletserver/tx_pool.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/tabletserver.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/query_executor.go Show resolved Hide resolved
@systay systay force-pushed the lock-tables branch 3 times, most recently from c862c97 to 34a916e Compare November 1, 2018 15:34
Copy link
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be possible to write integration tests for this feature? For example, how do we know this will work as expected for MariaDB?

Let me know if you need help/guidance for this.

go/vt/vttablet/tabletserver/tabletenv/config.go Outdated Show resolved Hide resolved
@systay systay force-pushed the lock-tables branch 3 times, most recently from d0f9fba to 9dcb4b7 Compare November 12, 2018 16:31
addr = tablet_replica1.rpc_endpoint()
p = urlparse('http://' + addr)
channel = grpc.insecure_channel('%s:%s' % (p.hostname, p.port))
tablet_manager = TabletManagerStub(channel)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe extract this to a method on Tablet? so you can do just:

tablet_manager = tablet_replica1.connect_tablet_manager()
tablet_manager.LockTables(...)

primary key (id)
) Engine=InnoDB'''

def test_lock_and_unlock(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also have a test for auto-unlock on timeout? Maybe if you expose that as a flag then set it to something really low like 1 second when you start the tablet and then you sleep for 2 seconds and the tables should be unlocked?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good so far. I assume you're still working on the integration tests?

return tsv.serveNewType()
}

func (tsv *TabletServer) serveNewType() (err error) {
if !tsv.te.isOpen {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is not necessary. The state machine ensures that serveNewType will not get called without a fullStart. This is also the reason why this code doesn't need to ensure that other services like qe are already open.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually put this in here to make cache_invalidation.py green again. Not sure what is going on there, but if I comment these three lines out, that test starts failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just occurred to me that cache_invalidation did expose a problem with the approach of opening te in fullStart. Essentially, we have to close and re-open te every time serveNewType gets called. This is because Close will have to wait for existing transactions to be resolved before we can switch to the new serving type.

In other words, we should go back to the old behavior, except that we immediately te.Open as soon as it's closed.

@systay
Copy link
Collaborator Author

systay commented Nov 17, 2018

@sougou correct. Still working on adding more integration tests.

@systay systay force-pushed the lock-tables branch 3 times, most recently from 1680352 to 0f179a0 Compare November 22, 2018 09:27
@systay
Copy link
Collaborator Author

systay commented Nov 22, 2018

@sougou @deepthi I think the necessary tests are there now. I think we should squash before merging to clean the history - just wanted to keep the commits separate so you can easily track the changes.

@systay
Copy link
Collaborator Author

systay commented Nov 29, 2018

@sougou @deepthi status for this PR is that I still have not solved #4310 (comment)

other than that, I think it's ready

@systay systay force-pushed the lock-tables branch 2 times, most recently from 82ae29e to 5a449d1 Compare December 5, 2018 10:46
@@ -640,53 +633,6 @@ func TestTabletServerStopWithPrepare(t *testing.T) {
}
}

func TestTabletServerMasterToReplica(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you delete these tests (this and the ReplicaToMaster)? Are they failing? If so, we need to investigate why.

These are important because they prove that no errors will be served during state transitions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were removed by mistake. Sorry about that, and thanks for catching it!

The tests are related to the 5a449d1 commit. I don't think that change is correct, and the tests agree :)

@sougou
Copy link
Contributor

sougou commented Dec 18, 2018

The tests need to be looked at. They're failing after a restart also.

@@ -1896,20 +1926,12 @@ verifyTarget:

ok:
tsv.requests.Add(1)
// If it's a begin, we should make the shutdown code
// wait for the call to end before it waits for tx empty.
if isBegin {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this to TxEngine introduces a race:

  • Thread 1: tsv.Begin is called, which calls startRequest through execRequest.
  • Thread 1: tsv.mu is released once startRequest returns.
  • Thread 2: A call to setServingType is received.
  • Thread 2: goes in: there are no begin requests in the wait group. So, it closes the TxEngine, or changes its state.
  • Thread 1: TxEngine.Begin executes on a TxEngine, which adds to the beginWaitGroup. But the state is not the same as what it was when startRequest returned, potentially causing unexpected behavior.

@@ -208,6 +208,43 @@ func NewServer(topoServer *topo.Server, alias topodatapb.TabletAlias) *TabletSer
return NewTabletServer(tabletenv.Config, topoServer, alias)
}

type TxPoolController interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this interface. We should introduce interfaces only if it's necessary, like if there is a need to replace teCtrl with something else. Otherwise, directly using TxEngine is generally more readable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I disagree. In my opinion interfaces more clearly delineates how different parts of the code interact and make it more obvious when you add a new dependency across a code base. It is one of the most powerful tools we humans have to tackle complexity by breaking down code in smaller parts so that we can understand each part separately without having to understand everything at the same time.

Sugu, you might be able to keep all of Vitess in your head at the same time. Us mortals are not as fortunate. We have to page different parts in and out.

:-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, I won't insist if you think the interface helps.

The reason why I dislike interfaces is because of the opposite reason: I can't hold the codebase in my head :).
Here's a classic example: https://github.com/vitessio/vitess/blob/master/go/vt/vtgate/scatter_conn.go#L146. This particular line calls into queryservice. As a challenge, you can try drilling down to figure what code path gets eventually used to call vttablet.
Another similar situation is with TabletStats that report health. I still don't understand how that mechanism works, and this has prevented me from fixing issues in that area. I know the health check mechanism is broken; People keep reporting strange issues. But there's nothing I can do there. Those interface abstractions make it impossible for me to understand what calls what.

Short story long: in this case, it's fairly obvious one way or another.

To make this possible, two things are added:
 - The capability to lock all tables on a tablet, to momenterily stop updates
 - Once the database is locked, we can create multiple consistent snapshot
   transactions that all share the same view of the data
 - Adds the capability to have replication move forward to a specific point
   in the transaction log

Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
* upstream/master:
  Skip ACL checks for dual pseudotable
  Makes sure the schema engine has started up before getting debug information.
  Support tables that have columns named desc
  exec_cases: add passing messages OnDup test
  exec_cases: add failing tests for msgs on dupe
  planbuilder: support ‘on duplicate key’ for msgs
This reverts commit a57c799.

Signed-off-by: Andres Taylor <antaylor@squareup.com>
* upstream/master:
  helm: release 1.0.5
  vreplication: address review comments
  tlstest_test: Go 1.12 / TLS 1.3 fix
  vreplication: vstream final tweaks
  vreplication: tests are done
  vreplication: handle singleton and mariadb issues
  vreplication: moved tests into vstreamer dir
  vreplication: more vstreamer tests
  vreplication: MariaDB test tweaks and more tests
  vreplication: tests: statements and DDL add column
  vreplication: make a singleton srvtopo.Server
  vreplication: got some basic tests working
  vreplication: tabletserver inits and other tweaks
  vreplication: vstreamer, plan and filtering
  vreplication: streamer planbuilder
  vreplication: vstreamer Engine

Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
systay and others added 7 commits January 15, 2019 11:45
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
* upstream/master:
  vtctl: run BackupShard on replica, rdonly or spare
  Expose `glog.MaxSize` as `-log_rotate_max_size` flag
  inline ok check; while-range instead of bare while
  test the signal handler
  reload db-credentials-file upon SIGHUP
  change ClusterAlias in _vt.local_metadata to be keyspace/shard instead of keyspace.shard

Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
* upstream/master:
  Adds a test to check regressions in parser
  Adding in SQL commands to prepare an instance to join an existing shard
  Revert "Support tables that have columns named desc"
  Use a truly random Destination for DestinationAnyShard

Signed-off-by: Andres Taylor <antaylor@squareup.com>
@@ -725,7 +735,10 @@ func TestTabletServerReplicaToMaster(t *testing.T) {
if !reflect.DeepEqual(got, want) {
t.Errorf("Prepared queries: %v, want %v", got, want)
}
tsv.SetServingType(topodatapb.TabletType_REPLICA, true, nil)
tsv.SetServingType(topodatapb.TabletType_MASTER, false, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be REPLICA?

tsv.SetServingType(topodatapb.TabletType_REPLICA, true, nil)
tsv.SetServingType(topodatapb.TabletType_MASTER, false, nil)
if len(tsv.te.preparedPool.conns) != 0 {
t.Fatal("fail")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this to t.Errorf with a more clear error message (and down below).

@sougou
Copy link
Contributor

sougou commented Jan 17, 2019

Strangely, a previous comment disappeared. The test you commented out as obsolete, you can delete that. The test failures look real. One of them is failing to compile, and another one has a data race.

Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Signed-off-by: Andres Taylor <antaylor@squareup.com>
Trying to make it clearer what we are testing.

Signed-off-by: Andres Taylor <antaylor@squareup.com>
@systay
Copy link
Collaborator Author

systay commented Jan 18, 2019

Maybe use #4533 instead?

@systay systay closed this Jan 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants