Skip to content
This repository has been archived by the owner on May 6, 2020. It is now read-only.

implement (re)store of proxy's state #107

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dvoytik
Copy link

@dvoytik dvoytik commented Aug 18, 2017

Introduce the high availability feature of cc-proxy by implementing
store/restore of proxy's state to/from disk. This feature depends
on the ability of shim to reconnect to cc-proxy if connection is lost.

Fixes #4.

Signed-off-by: Dmitry Voytik dmitry.voytik@huawei.com

@sboeuf
Copy link
Contributor

sboeuf commented Aug 18, 2017

@dvoytik sorry but we are kind of rushing those days because we have some milestones to reach. And unfortunately, what you implemented is not in the scope of our P1 issues right now, that's why the slow reviews.
But don't think we don't appreciate your contribution, and I wanted to let you know that it will take some time to review both clearcontainers/shim#54 and #107 because they introduce core changes to the shim and proxy.
BTW, did you try some real/functional use cases where you have both the shim and the proxy patched with your patches, in order to make sure it works well ? Those tests should be added to the github.com/clearcontainers/tests repo, so that we run them on every PR.

@dvoytik
Copy link
Author

dvoytik commented Aug 21, 2017

Hi @sboeuf,
No problem, I understand this. Please take your time, I don't hurry.
I tested manually both patches. It works well except that hyperstart has it's own internal time-out (around 5-7 seconds) and if proxy doesn't recover quick enough then hyperstart can hang. I didn't invest much time in debugging hyperstart because it is being replaced with agent. My plan is to wait until agent is stabilised and check if the same problem happens.
I plan to submit a functional test case to tests repo later, but without merging these patches I'm not sure if it makes sense to do it before.

@dvoytik dvoytik force-pushed the store_restore_state branch 3 times, most recently from abbbdb5 to 7a2f972 Compare August 21, 2017 11:11
@coveralls
Copy link

coveralls commented Aug 21, 2017

Coverage Status

Coverage decreased (-0.3%) to 73.933% when pulling 7a2f972 on dvoytik:store_restore_state into d7a4dd8 on clearcontainers:master.

proxy.go Outdated
@@ -124,6 +125,33 @@ func newClient(proxy *proxy, conn net.Conn) *client {
}
}

func (proxy *proxy) restoreTokens(vm *vm, tokens []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be tokens []Token? Also, what if vm == nil?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately proxy.allocateTokens() returns []string. I'd like to avoid writing a conversion loop from []string to []Token in registerVM() and revert conversion loop in storeVMState().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha.

proxy.go Outdated
return storeStateDir + "vm_" + id + ".json"
}

func storeVMState(vm *vm, tokens []string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, tokens []Token?

proxy.go Outdated
}
}

func delVMAndState(proxy *proxy, vm *vm) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if proxy and vm are nil?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add checks.

proxy.go Outdated
}
}

func restoreVMState(proxy *proxy, containerID string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing checks on params. Also, this is a very long function - I think it must be close the the cyclomatic complexity limit so I'd consider refactoring it into smaller chunks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do.

proxy.go Outdated
@@ -188,6 +216,207 @@ func (proxy *proxy) releaseToken(token Token) (*tokenInfo, error) {
return info, nil
}

var storeStateDir = "/var/lib/clear-containers/proxy/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the top of the file? Its value should also be passed in from Makefile so that at the top it's something like:

// XXX: set by build
var storeStateDir string

Copy link
Author

@dvoytik dvoytik Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will do. Should it be set up to "/var/lib/clear-containers/proxy/" in Makefile?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it should probably be set to $PKGLIBDIR/proxy/ (feel free to steal that chunk of code from the runtime's Makefile ;)

proxy.go Outdated
}

var proxyState proxyStateOnDisk
err = json.Unmarshal(fdata, &proxyState)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider: what happens when we change the proxy state file format? The 3 main approaches are:

  • Don't handle it and just error if the on-disk state doesn't match what the proxy expects.
  • Add a version field to vmStateOnDisk to allow the proxy to determine how to deal with a different format state file.
  • Make the code "sniff" the state file data and see which fields it contains and their types to determine how to handle such state file differences (the "autoconf approach" ;)

Clearly, here, there is only 1 possible on-disk format, but I think it would be worth thinking about this and documenting the strategy for how we plan to handle any changes to vmStateOnDisk.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO adding a version field to the format would be easiest way to handle this kind of issues.

proxy.go Outdated
if proxy.socketPath, err = getSocketPath(); err != nil {
return fmt.Errorf("couldn't get a rigth socket path: %v", err)
if proxy.socketPath, err = getSocketPath(); err != nil {
return fmt.Errorf("couldn't get a rigth socket path: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo that could be fixed now I've spotted it ;-) "rigth" => "right".

Copy link
Author

@dvoytik dvoytik Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do.

proxy.go Outdated
proxy.Lock()
delete(proxy.vms, vm.containerID)
proxy.Unlock()
proxy.storeState()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This take a new lock. I don't know if there is an issue here in that the state could change between the call to Unlock() and the call to storeState()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here proxy.vms hash-map is protected. I regularly run code with -race parameter to see if data races appear.

proxy_test.go Outdated

// clean up a possible state
os.RemoveAll(storeStateDir)
assert.Equal(t, rig.proxy.restoreState(), false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a major problem, but in the runtime for new code we've started creating a assert variable to avoid having to hass the testing.T each time:

assert := assert.New(t)
  :
  :
assert.Equal(rig.proxy.restoreState(), false)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks. Will do.

proxy_test.go Outdated
@@ -900,3 +901,29 @@ func TestHyperstartResponse(t *testing.T) {

rig.Stop()
}

func TestStoreRestore(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some new tests where the restore fails due to an invalid state file (zero size, invalid contents (non-json, pure text (hello world or something ;-)))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

@jodh-intel
Copy link
Contributor

Hi @dvoytik - thanks for raising this! It looks good. I've done a first-pass review and given some suggestions. There are 2 other areas to mention:

  • Tests

    Thanks for adding a new test. However, this is a relatively big change and although you've added a new test, the coverage level has dropped. Ideally, it would be going up ;-)

  • Locking code

    I think the locking calls will need a more thorough review.

@dvoytik
Copy link
Author

dvoytik commented Sep 1, 2017

@jodh-intel, thank you for the review!
It seems like today I won't have enough time to address all comments thoroughly, and during next two weeks I'm on a vacation without access to a proper internet connection. If you don't mind, I'm going to respond to the comments and rework the PR after September, 18th.

@jodh-intel
Copy link
Contributor

Hi @dvoytik - no problem! And enjoy your holiday! 😎

@dvoytik
Copy link
Author

dvoytik commented Sep 1, 2017

@jodh-intel, thanks! 😃

@jodh-intel
Copy link
Contributor

Hi @dvoytik - hope you had a good holiday. Are you still planning to update this branch? Let me know if there is anything I can do to help.

@dvoytik
Copy link
Author

dvoytik commented Oct 2, 2017

@jodh-intel, thanks! Sorry, for delaying on the PR. I'll rework it and address the comments today after my main job. Unfortunately for the time being I can finish this only in my spare time :(

@jodh-intel
Copy link
Contributor

Hi @dvoytik - understood. I'm happy to help with this if you are tied up with other things.

@dvoytik
Copy link
Author

dvoytik commented Oct 2, 2017

HI @jodh-intel, thank you! I feel responsibility to finish this anyway, so I'm going to work on the PR today evening. Sorry again for delaying this.

@dvoytik dvoytik force-pushed the store_restore_state branch 3 times, most recently from d703819 to 3e0c350 Compare October 5, 2017 21:38
@dvoytik
Copy link
Author

dvoytik commented Oct 5, 2017

Hi @jodh-intel, I've update the PR. Could you please take a look at it again?

proxy.go Outdated
@@ -188,6 +216,207 @@ func (proxy *proxy) releaseToken(token Token) (*tokenInfo, error) {
return info, nil
}

var storeStateDir = "/var/lib/clear-containers/proxy/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it should probably be set to $PKGLIBDIR/proxy/ (feel free to steal that chunk of code from the runtime's Makefile ;)

proxy.go Outdated
// returns false if it's a clean start (i.e. no state is stored) or restoring failed
func (proxy *proxy) restoreState() bool {
proxyStateFilePath := storeStateDir + "proxy_state.json"
if _, err := os.Stat(storeStateDir); os.IsNotExist(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - sorry, yes I see now.

proxy.go Outdated
@@ -84,6 +92,20 @@ type proxy struct {
wg sync.WaitGroup
}

// proxyStateOnDisk is used to re(store) proxy state on disk
type proxyStateOnDisk struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty significant feature which I think justifies moving all the state handling code into a state.go.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll do it.

proxy.go Outdated
@@ -124,6 +125,33 @@ func newClient(proxy *proxy, conn net.Conn) *client {
}
}

func (proxy *proxy) restoreTokens(vm *vm, tokens []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha.

proxy.go Outdated
}

for _, token := range tokens {
token, err := vm.AllocateTokenAs(Token(token))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to assert that token != "" here? AllocateTokenAs() will handle it I see, but it's not an expected scenario I think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks!

proxy.go Outdated

func restoreVMState(proxy *proxy, containerID string) {
if proxy == nil {
proxyLog.Errorf("proxy parameter must be not nil")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can you add a check for containerID == "".
  • Can you add the containerID to the log message to aid debugging?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks.

proxy.go Outdated
proxyLog.Warn("Recovering proxy state from: ", proxyStateFilePath)
if proxyState.Version != Version {
proxyLog.Warnf("Stored state version (%s) mismatches proxy"+
" version (%s). Aborting", proxyState.Version, Version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't deal with the scenario where proxyState.Version > Version as that implies the state is from a newer version of the proxy than is running (forced downgrade), but I think we do need to handle the scenario where proxyState.Version <= Version as the running proxy should be able to handle the state of all previous versions.

I'm still not 100% convinced that a version field is the right approach: even if we have one, the code will still need to sanity-check all the state values read off disk to ensure:

  • they are of the correct type.
  • they have a "reasonable" value.

However, it doesn't hurt to include a version.

proxy.go Outdated
proxy.vms[regVM.ContainerID] = vm
proxy.Unlock()

proxyLog.Infof("restoreVMState(containerId=%s,ctlSerial=%s,ioSerial=%s,console=%s)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if you could convert all the new log messages on this PR to use the structured logging approach. For example, this one could become:

proxyLog.WithFields(logrus.Fields{
  "container": regVM.ContainerID,
  "control-channel": regVM.CtlSerial,
  "io-channel": regVM.IoSerial,
  "console": regVM.Console,
}).Info("restoring state")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do.

proxy.go Outdated
}

proxyState := &proxyStateOnDisk{
Version: Version,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the storing the protocol version. We need a separate version to describe the state file version as that might need to change more frequently.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll rework this part.

proxy_test.go Outdated
@@ -929,3 +930,93 @@ func TestGetSocketPath(t *testing.T) {
assert.NotNil(t, err, err)
assert.Equal(t, p, "")
}

func TestStoreRestore(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for all the functions added? (This might necessitate splitting them into smaller functions to make them more easily testable).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll try :)

@jodh-intel
Copy link
Contributor

Hi @dvoytik - thank you again raising this PR. It's a very significant new feature and because of its sensitivity, I'm afraid my pedantic knob is up to 11 😄

A couple of general points not already mentioned in the review:

  • Can you space the code out a little so that there are blank lines between logical blocks / if tests, etc. That will make it a lot easier to read.

  • I like the simplicity of the current approach but I do feel it would be better to have all functions return a bool or an error and allow the higher-levels to determine whether the error should be fatal or not. That also makes writing tests much easier as there is something concrete to check.

  • I think we should document how the proxy handles full and partial failure to restore state. It may be that we want to add a --error-on-failure-to-restore-type option for example for some users. The default should probably be to log as much as possible and continue to startup, but that might not be desirable in some environments.

As I've said before, I'm more than happy to help out with any of this. This is a big feature so if we can find a way to logically sub-divide the work, just ping me so we can work together on this.

@dvoytik
Copy link
Author

dvoytik commented Oct 6, 2017

Hi @jodh-intel, thank you again for reviewing the PR! No problems with a pedantic attitude at all! Actually I value this a lot, especially when it's reasonable.

All your points are valid, I agree. I'll rework the PR with your comments in mind.

If you don't mind I'll try to finish this PR. It's just going to take a little longer because for the time being I can only work on this at evenings.

IMHO, --error-on-failure-to-restore make sense to implement as a separate PR.

@jodh-intel
Copy link
Contributor

Thanks @dvoytik! ;)

@dvoytik dvoytik force-pushed the store_restore_state branch 2 times, most recently from 9673679 to c2a705c Compare October 9, 2017 21:21
@dvoytik
Copy link
Author

dvoytik commented Oct 9, 2017

The PR has been updated.

@jodh-intel
Copy link
Contributor

Thanks @dvoytik - reviewing and testing with clearcontainers/shim#54 ...

state.go Outdated
}

func logContID(containerID string) *logrus.Entry {
return proxyLog.WithField("container", containerID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cute! 😄

}

proxy.Lock()
proxy.tokenToVM[token] = &tokenInfo{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not wholly clear on this. Should the token be removed if findSessionByToken() fails below?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Practically if vm.findSessionByToken(token) fails here then the vm object, which owns tokens, is being released later by delVMAndState, thus tokens are being also released.
But maybe I should release them here with such code:

			vm.Lock()
			for token := range vm.tokenToSession {
				_ = vm.freeTokenUnlocked(token)
				delete(vm.tokenToSession, token)
			}
			vm.Unlock()

Slightly more complicated code but more future-proof?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that looks good. What do you think @sboeuf?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes fine with me.

@jodh-intel
Copy link
Contributor

This is looking very nice.

@sboeuf, @amshinde - could you take a look please?

@sboeuf
Copy link
Contributor

sboeuf commented Oct 10, 2017

@jodh-intel sure I'll take a look

state.go Outdated

// returns false if it's a clean start (i.e. no state is stored) or restoring
// failed
func (proxy *proxy) restoreState() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for clarity this should be called something like restoreAllState() because it is not doing the opposite job to storeState():

  • storeState(): only stores proxy state.
  • restoreState(): restores both proxy state and state of all VMs.

(alternatively, you could split the VM restore code in restoreState() into a separate function I guess).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll rename it.

proxy.go Outdated
logContID(vm.containerID).Errorf(
"couldn't store a VM state: %v", err)
}
if err := proxy.storeState(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing blank line above the if (here and in various other places).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops :) I'll fix these, thanks!

proxy.go Outdated
"couldn't store a VM state: %v", err)
}
if err := proxy.storeState(); err != nil {
proxyLog.Errorf("couldn't store proxy's state: %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put the err on this line as it looks a bit odd down there on its own (same comment for a few other occurrences).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll fix these.

}

proxy.Lock()
proxy.tokenToVM[token] = &tokenInfo{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that looks good. What do you think @sboeuf?

@dvoytik dvoytik force-pushed the store_restore_state branch 2 times, most recently from 97a65e3 to 11d2371 Compare October 11, 2017 21:55
@sboeuf
Copy link
Contributor

sboeuf commented Oct 12, 2017

Sorry I could not review that, but I'll do that tomorrow first thing! And I know I am gonna spend some time since that's a pretty huge PR

Copy link
Contributor

@sboeuf sboeuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dvoytik thanks for this huge PR. I have several comments, but I have also one concern (unless I am missing something). From my understanding, you basically STORE during RegisterVM() and RESTORE during UnregisterVM(). The issue is that you're not covering all cases here. You will miss all the tokens/sessionID created during a pod lifecycle. For instance, after the VM has been started and that we have stored all the info, if we create new containers/processes, new tokens/sessionIDs are gonna be generated by the proxy. Am I missing something or this is something expected ? Cause that means that we are not covering a lot of crashing cases otherwise.
@sameo could you please also take a look since this is a significant PR/feature.

Makefile Outdated
@@ -16,6 +16,7 @@ LOCALSTATEDIR := /var

SOURCES := $(shell find . 2>&1 | grep -E '.*\.(c|h|go)$$')
PROXY_SOCKET := $(LOCALSTATEDIR)/run/clear-containers/proxy.sock
STORE_STATE_DIR := $(LOCALSTATEDIR)/lib/clear-containers/proxy/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$(LOCALSTATEDIR)/run/clear-containers/proxy/ would be more appropriate since we're gonna store some data generated at runtime.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I agree. I'll change this. Thanks.

@@ -53,7 +54,7 @@ all: cc-proxy $(UNIT_FILES)

cc-proxy: $(SOURCES) Makefile
$(QUIET_GOBUILD)go build -i -o $@ -ldflags \
"-X main.DefaultSocketPath=$(PROXY_SOCKET) -X main.Version=$(VERSION)"
"-X main.DefaultSocketPath=$(PROXY_SOCKET) -X main.Version=$(VERSION) -X main.storeStateDir=$(STORE_STATE_DIR)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you could fix the other variables as they don't need to be upper case since they don't need to be exported.

Copy link
Author

@dvoytik dvoytik Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like not to mix other refactoring in one patch if you don't mind. But if you insist, I could write a separate patch and make it as a part of this PR though.

state.go Outdated

// Returns false if it's a clean start (i.e. no state is stored) or restoring
// failed.
func (proxy *proxy) restoreAllState() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make the difference between the three cases: clean start / restoration succeeded / restoration failed. IMO, that's the caller responsibility to decide what to do with the different output combinations.

func (proxy *proxy) restoreAllState() (bool, error) {

Or we could have one function to detect if we are in a clean start case by checking if we have something in .../run/clear-containers/proxy/..., and another function for the actual restoration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, you should not define methods belonging to the proxy object outside of proxy.go.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rework this function.

proxy.go Outdated
@@ -622,12 +632,14 @@ func (proxy *proxy) init() error {
// Force a coredump + full stacktrace on internal error
debug.SetTraceback("crash")

// flags
proxy.enableVMConsole = logrus.GetLevel() == logrus.DebugLevel
if !proxy.restoreAllState() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear enough. See my comment about restoreAllState() function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks. I'll rework this.

state.go Outdated
}

func vmStateFilePath(id string) string {
return storeStateDir + "vm_" + id + ".json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using path/filepath package is usually clearer/cleaner when creating/modifying a path.

filepath.Join(storeStateDir, "vm_" + id + ".json")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the hint! I'll rework this function as you suggested.

state.go Outdated
Tokens []string `json:"tokens"`
}

func logContID(containerID string) *logrus.Entry {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very generic, it should go into log.go or utils.go, but I don't think it should stay here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK to move it to a distinct file, but maybe it'd be better to do this as a separate refactoring PR? If you insist I can rework this as a part of my PR.

state.go Outdated
}

// On success returns nil, otherwise an error string message.
func (proxy *proxy) restoreTokens(vm *vm, tokens []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not define methods belonging to the proxy object outside of proxy.go.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change this.

}

proxy.Lock()
proxy.tokenToVM[token] = &tokenInfo{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes fine with me.

state.go Outdated

storeFile := vmStateFilePath(vm.containerID)

err = ioutil.WriteFile(storeFile, o, proxyStateFilesPerm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the format

if err := ...; err != nil {

Copy link
Author

@dvoytik dvoytik Oct 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, as you wish ;)

state.go Outdated
}

// On success returns nil, otherwise an error string message.
func storeVMState(vm *vm, tokens []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think you should define a storeState() and fetchState() methods both for vm and proxy. That way, it would be more consistent with the overall naming of those new functions that you're introducing.

Copy link
Author

@dvoytik dvoytik Oct 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally the code was in proxy.go but @jodh-intel asked me to move it out to a separate file. If I define the methods on vm and proxy then I have to move them to proxy.go and vm.go back according to what you ask :)

@dvoytik
Copy link
Author

dvoytik commented Oct 13, 2017

Hi @sboeuf, thank you for the review!

From my understanding, you basically STORE during RegisterVM() and RESTORE during UnregisterVM().

No, restoring is done only when proxy starts and detects a stored state on disk. During unregisterVM() we call delVMAndState(), which in turn updates proxy's state and deletes the requested vm's state on disk.

The issue is that you're not covering all cases here. You will miss all the tokens/sessionID created during a pod lifecycle. For instance, after the VM has been started and that we have stored all the info, if we create new containers/processes, new tokens/sessionIDs are gonna be generated by the proxy. Am I missing something or this is something expected ? Cause that means that we are not covering a lot of crashing cases otherwise.

Tokens are stored in a state. Please see vmStateOnDisk.Tokens.

@sboeuf
Copy link
Contributor

sboeuf commented Oct 13, 2017

No, restoring is done only when proxy starts and detects a stored state on disk. During unregisterVM() we call delVMAndState(), which in turn updates proxy's state and deletes the requested vm's state on disk.

Yes sorry I used wrong explanation, but basically you're saving things only at the RegisterVM() time, and you restore when the proxy get restarted and it goes through init().

Tokens are stored in a state. Please see vmStateOnDisk.Tokens.

Yes sure, but you're doing the STORE only when the VM is registered. And according to a full pod lifecycle, you're gonna miss almost everything. I mean that you're only be able to RESTORE the very first configuration that you saved when the VM has been created... Lots of containers and process could be pending at the time the proxy had crashed, and you would miss all of that. I think that you need to store in lot more places in the code. Does that make sense ?

@dvoytik
Copy link
Author

dvoytik commented Oct 13, 2017

Yes sure, but you're doing the STORE only when the VM is registered. And according to a full pod lifecycle, you're gonna miss almost everything. I mean that you're only be able to RESTORE the very first configuration that you saved when the VM has been created... Lots of containers and process could be pending at the time the proxy had crashed, and you would miss all of that. I think that you need to store in lot more places in the code. Does that makes sense ?

Oops, actually you are right! I've just checked again all API methods and totally forgot about attachVM(). Originally in my previous versions it was there and somehow disappeared :) I remember I was testing attaching manually and it worked.
If you can find other places where I've missed storing/updating something please let me know. Thanks!

@dvoytik dvoytik force-pushed the store_restore_state branch from 11d2371 to 885e85b Compare October 15, 2017 19:22
@dvoytik
Copy link
Author

dvoytik commented Oct 15, 2017

@sboeuf, thank you for the review again!
I've updated the PR.

@jodh-intel
Copy link
Contributor

@sboeuf - ping.

@devimc
Copy link

devimc commented Oct 24, 2017

Hi All
I think the best way to avoid restoring proxy's state is a redesign in the way we connect cc-shims and the container, we can hot plug virtio serial ports and associate them directly with cc-shim when a new container/workload is created, cc-proxy will be only used to negotiate with the agent (create, stop, delete containers, etc). That mean 1 serial port to control the container (CTL) and N serial ports to communicate cc-shim and workload (TTY)

Currently:

cc-shim <->  cc-proxy <-> TTY serial port <-> cc-agent <-> workload

With a redesign:

cc-shim <-> TTY N serial port <-> cc-agent <-> workload

in that way containers will not be affected if the cc-proxy dies since part of the logic of the cc-proxy will be moved to cc-agent

@sboeuf
Copy link
Contributor

sboeuf commented Oct 24, 2017

@devimc that could work, sure. But the drawback of such a solution is that we would get a larger attack surface because you would create as much serial ports as we have processes.

@sboeuf
Copy link
Contributor

sboeuf commented Oct 25, 2017

@jodh-intel @dvoytik I'll look into this today

Copy link
Contributor

@sboeuf sboeuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have noticed that you're missing to save all the container IDs (the one you can find into ioSession structure). We make a difference between the containerID in the vm structure which means the podID and the containerID in the ioSession structure which means a containerID. There is one session per container and one VM per pod.
I think that you need to figure out a way to save the containers related to each pod/vm, otherwise we could not be able to perform proper action like killing the container process after the proxy restarted.
Take a look here: https://github.com/clearcontainers/proxy/blob/master/vm.go#L363 and here: https://github.com/clearcontainers/proxy/blob/master/vm.go#L572-L581

state.go Outdated
@@ -0,0 +1,387 @@
// Copyright (c) 2017 Dmitry Voytik <dmitry.voytik@huawei.com>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dvoytik have you made the change here ?

if err := os.Remove(proxyStateFilePath); err != nil {
return fmt.Errorf("couldn't remove file %s: %v",
proxyStateFilePath, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to your logic, if there's no VM, we should not have a state file, I am fine with that, but this means you should return nil after you properly removed the file, right ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thank you. I'll add here return nil.

state.go Outdated
Version: stateFileFormatVersion,
SocketPath: proxy.socketPath,
EnableVMConsole: proxy.enableVMConsole,
ContainerIDs: make([]string, 0, len(proxy.vms)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to allocate dynamically the size of the buffer. Calling into append() will do that for you.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know that. It looks like a premature optimization, but on the other hand does it really hurt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I don't consider this as an optimization since you could directly remove this line.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you insist, I'll remove this line, no problem. My intention was to avoid re-allocations on each append operation.

state.go Outdated
// On success returns nil, otherwise an error string message.
func storeVMState(vm *vm) error {
vm.Lock()
tokens := make([]string, 0, len(vm.tokenToSession))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not need dynamic allocation since you're using append right after.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@dvoytik
Copy link
Author

dvoytik commented Oct 27, 2017

@sboeuf, thank you again for the review.
You are right about ioSession. I'll take a look how to store/restore this state in coming days.

@sboeuf
Copy link
Contributor

sboeuf commented Oct 27, 2017

@dvoytik no problem ! Let me know when you have something ready.

@dvoytik dvoytik force-pushed the store_restore_state branch 4 times, most recently from b932d09 to e7ebeda Compare November 7, 2017 21:03
Introduce the high availability feature of cc-proxy by implementing
store/restore of proxy's state to/from disk. This feature depends
on the ability of shim to reconnect to cc-proxy if connection is lost.

Fixes clearcontainers#4.

Signed-off-by: Dmitry Voytik <dmitry.voytik@huawei.com>
@dvoytik dvoytik force-pushed the store_restore_state branch from e7ebeda to b4df8d5 Compare November 7, 2017 21:03
@dvoytik
Copy link
Author

dvoytik commented Nov 8, 2017

Hi @sboeuf, could you please take a look?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants