Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

FC: Add new vsock connection handshake #706

Merged
merged 2 commits into from
Jan 30, 2020

Conversation

Pennyzct
Copy link
Contributor

Which feature do you think can be improved?
New Firecracker v0.20.0 has changed the host-initiated vsock connection protocol to include a trivial handshake.

The new protocol looks like this:

    - [host] CONNECT <port><LF>
    - [guest/success] OK <assigned_host_port><LF>

See PR firecracker-microvm/firecracker#1472 for detailed info.

@codecov
Copy link

codecov bot commented Dec 27, 2019

Codecov Report

Merging #706 into master will decrease coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #706      +/-   ##
==========================================
- Coverage   60.82%   60.71%   -0.12%     
==========================================
  Files          17       17              
  Lines        2550     2558       +8     
==========================================
+ Hits         1551     1553       +2     
- Misses        850      857       +7     
+ Partials      149      148       -1     

@Pennyzct
Copy link
Contributor Author

/test

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Pennyzct.

lgtm

return nil, err
} else if !strings.Contains(response, "OK") {
conn.Close()
return nil, errors.New("malformed response code")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would make sense to log response here if debug is enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi~ @jodh-intel yep. And since this file is missing logging part, I've added another commit to fulfill it. ;).

@Pennyzct
Copy link
Contributor Author

Pennyzct commented Jan 6, 2020

Hi guys
FC CI is definitely failing here, since this change is included only in v0.20.0. ;).
I've added another PR to do the FC version bump.

@devimc
Copy link

devimc commented Jan 6, 2020

@Pennyzct thanks, but please add Depends-on: github.com/kata-containers/runtime#2379 in your commit message, that way we can see the CI using the latest version of firecracker

@Pennyzct
Copy link
Contributor Author

Pennyzct commented Jan 8, 2020

Hi~ @devimc
FWIT, even if I add the Depends-on: github.com/kata-containers/runtime#2379, if we don't do the runtime/ vendor update to include this commit, FC CI will still fail. ;).

@Pennyzct
Copy link
Contributor Author

Pennyzct commented Jan 8, 2020

/test

@devimc
Copy link

devimc commented Jan 8, 2020

@Pennyzct thanks, now I can see that fc 0.20 is installed

[Hypervisor]
  MachineType = ""
  Version = "firecracker 0.20.0"

unfortunately FC CI is failing

Stderr: docker: Error response from daemon: OCI runtime create failed: Failed to check if grpc server is working: rpc error: code = DeadlineExceeded desc = context deadline exceeded: unknown.

Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Jan 9, 2020
Add changes from kata-containers/agent#706.
THIS IS ONLY FOR DEBUGGING!

Fixes: kata-containers#2378

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Jan 10, 2020
The new release for Firecracker is `v0.20.0`.

Fixes: kata-containers#2378
Depends-on: github.com/kata-containers/agent#706

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Jan 10, 2020
Add changes from kata-containers/agent#706.
THIS IS ONLY FOR DEBUGGING!

Fixes: kata-containers#2378
Depends-on: github.com/kata-containers/agent#706

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Jan 10, 2020
Add changes from kata-containers/agent#706.
THIS IS ONLY FOR DEBUGGING!

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
@Pennyzct Pennyzct force-pushed the vsock_handshake branch 2 times, most recently from adad4ca to 9b10b2e Compare January 10, 2020 05:19
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Jan 10, 2020
Add changes from kata-containers/agent#706.
THIS IS ONLY FOR DEBUGGING!

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
@Pennyzct
Copy link
Contributor Author

Updates:
The debugging part should be only running after already getting right response code, since otherwise we may encounter JSON decode error.

dockerd[16053]: time="2020-01-10T03:34:14.461867307Z" level=error msg="Handler for POST 
/v1.38/containers/7a52caa16c60249ceb09ede8b65a8161a0ad7bfe6c1cc206040456ba9737541b/start 
returned error: OCI runtime create failed: unable to retrieve OCI runtime error (invalid character 'i' in 
literal true (expecting 'r')): /usr/local/bin/kata-runtime did not terminate sucessfully

@devimc
Copy link

devimc commented Jan 10, 2020

@Pennyzct good catch!

@jodh-intel
Copy link
Contributor

/test

@sboeuf
Copy link

sboeuf commented Jan 22, 2020

LGTM

@jcvenegas @likebreath I think the right move here is to make sure you update Kata to support latest CH from master. This way, I think the random failures will go away, as the PR is effectively reverting the 3 patches we've been reverting so far (through our testing).

@devimc I think you need to bump FC to 0.20.0 so that it does not get broken when this PR is merged.

@devimc
Copy link

devimc commented Jan 22, 2020

@sboeuf what do you mean? CI installs firecracker 0.20 and is failing

• Failure [152.692 seconds]
docker kill
/tmp/jenkins/workspace/kata-containers-agent-ubuntu-1804-PR-firecracker/go/src/github.com/kata-containers/tests/integration/docker/kill_test.go:79
  killing container
  /tmp/jenkins/workspace/kata-containers-agent-ubuntu-1804-PR-firecracker/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table.go:92
    with '22'(stopped (tty output)) signal [It]
    /tmp/jenkins/workspace/kata-containers-agent-ubuntu-1804-PR-firecracker/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:43

    Expected error:
        <*errors.errorString | 0xc000057150>: {
            s: "Timeout reached after 120s",
        }
        Timeout reached after 120s
    not to have occurred

    /tmp/jenkins/workspace/kata-containers-agent-ubuntu-1804-PR-firecracker/go/src/github.com/kata-containers/tests/integration/docker/kill_test.go:144

@sboeuf
Copy link

sboeuf commented Jan 22, 2020

@Pennyzct the PR is failing with latest FC 0.20. Are you still trying to make this pass?

response, err := reader.ReadString('\n')
if err != nil {
conn.Close()
return nil, err
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try removing this line, this way the function will return conn, nil and the GRPC code will handle dialing again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, if reading was failing, we should return error instantly. Then here in func commonDialer

	go func() {
		for {
			select {
			case <-cancel:
				// canceled or channel closed
				return
			default:
			}

			conn, err := dialFunc()
			if err == nil {
				// Send conn back iff timer is not fired
				// Otherwise there might be no one left reading it
				if t.Stop() {
					ch <- conn
				} else {
					conn.Close()
				}
				return
			}
		}
	}()

we could re-connect(conn, err := dialFunc()) again.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree, but this quick re-connection performed by commonDialer() tend to get things to fail. Instead, when relying on the backoff strategy from GRPC (which happens when we return conn, nil), things are more stable.
Just to clarify, I'm not saying this is a long term solution, but while we look actively for the root cause, relying on the backoff strategy from GRPC gives us a more stable CI for both Firecracker and Cloud-Hypervisor.

return nil, err
} else if !strings.Contains(response, "OK") {
conn.Close()
return nil, errors.New("malformed response code")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try removing this line, this way the function will return conn, nil and the GRPC code will handle dialing again.

@amshinde
Copy link
Member

@Pennyzct Can you implement @sboeuf suggestions and see if the CI for firecracker passes with those.

jcvenegas pushed a commit to jcvenegas/runtime that referenced this pull request Jan 23, 2020
The new release for Firecracker is `v0.20.0`.

Fixes: kata-containers#2378
Depends-on: github.com/kata-containers/agent#706

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
jcvenegas pushed a commit to jcvenegas/runtime that referenced this pull request Jan 23, 2020
… IS ONLY FOR DEBUGGING!

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
@Pennyzct
Copy link
Contributor Author

Hi~ @sboeuf @amshinde
Sorry for the late response. 😢. I'm still on the Lunar New Year Holiday, so the response could be slow.;).
I'm still working on these patches, it's odd, the all docker-related integration tests have all passed, but failing on ctr. I'll try to reproduce and debug it.

@jcvenegas
Copy link
Member

@Pennyzct ping could you rebase this PR?

New Firecracker v0.20.0 has changed the host-initiated vsock
connection protocol to include a trivial handshake.

The new protocol looks like this:
- [host] CONNECT <port><LF>
- [guest/success] OK <assigned_host_port><LF>

Fixes: kata-containers#705

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
This file is mainly used by kata-runtime and missing logging part.
We add one new "agent-client" log field here, and Plz notice that,
you should turn on the `kata-runtime` debug option to see the output.

Fixes: kata-containers#705
Depends-on: github.com/kata-containers/runtime#2379

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
@Pennyzct
Copy link
Contributor Author

Hi~ guys
Sorry for the delay.😢. I'm still on the Lunar New Year holidays, and the response could be slow.
Thanks for the suggestion from @sboeuf, I've refined the code to use the back-off strategy from grpc. And I've also re-based the code here. ;).

@Pennyzct
Copy link
Contributor Author

/test

Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Jan 30, 2020
Here, We import changes from PR kata-containers/agent#706 to see
if it's really working.

Fixes: kata-containers#2378

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
@teawater teawater merged commit cc95027 into kata-containers:master Jan 30, 2020
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Jan 31, 2020
We need to include changes in PR github.com/kata-containers/agent#706
(kata-containers/agent#706, to use
the new vsock-trivial-handshake scheme implemented in FC v0.20.0.

Fixes: kata-containers#2378

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Feb 5, 2020
We need to include changes in PR github.com/kata-containers/agent#706
(kata-containers/agent#706, to use
the new vsock-trivial-handshake scheme implemented in FC v0.20.0.

Fixes: kata-containers#2378
Depends-on: kata-containers#2431

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Feb 5, 2020
We need to include changes in PR github.com/kata-containers/agent#706
(kata-containers/agent#706, to use
the new vsock-trivial-handshake scheme implemented in FC v0.20.0.

Fixes: kata-containers#2378

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this pull request Feb 7, 2020
We need to include changes in PR github.com/kata-containers/agent#706
(kata-containers/agent#706, to use
the new vsock-trivial-handshake scheme implemented in FC v0.20.0.

Fixes: kata-containers#2378

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
dong-liuliu pushed a commit to dong-liuliu/runtime that referenced this pull request Feb 8, 2020
We need to include changes in PR github.com/kata-containers/agent#706
(kata-containers/agent#706, to use
the new vsock-trivial-handshake scheme implemented in FC v0.20.0.

Fixes: kata-containers#2378

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
evanfoster pushed a commit to evanfoster/runtime that referenced this pull request Mar 9, 2020
We need to include changes in PR github.com/kata-containers/agent#706
(kata-containers/agent#706, to use
the new vsock-trivial-handshake scheme implemented in FC v0.20.0.

Fixes: kata-containers#2378

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants