Add compute agent store for ncproxy reconnect #1097

Merged

Conversation

@katiewasnothere (Contributor) commented Aug 9, 2021

This PR adds a bolt database store for ncproxy that maps container IDs to compute agent addresses, so we can reconnect to all compute agents if/when ncproxy restarts.

Context:
Ncproxy runs as a proxy service. It holds one client connection to the node network agent and, in return, runs a server that the node network agent connects to. It additionally runs a single server for all compute agents and acts as a client to many compute agent servers.

Restart logic:
On restart, ncproxy will create a new client connection to the node network agent. It will then start up a new server for the node network agent. When the node network agent attempts to create a new client to ncproxy, it will be reconnected to this new server. Then ncproxy will create a new server for compute agent connections. When compute agents attempt to create new clients to ncproxy, they will be reconnected to this new server. Lastly ncproxy will attempt to create new clients to the compute agent servers by reading the new computeAgentStore database for the container IDs and corresponding compute agent server addresses.
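
For illustration, a minimal sketch of what such a bolt-backed store could look like (the bucket name, method names, and error handling here are assumptions for the sketch, not necessarily the PR's actual code):

```go
package ncproxy

import (
	"context"

	"github.com/pkg/errors"
	bolt "go.etcd.io/bbolt"
)

// Assumed bucket name for the sketch; the PR's actual schema may differ.
var computeAgentBucket = []byte("computeagent")

// computeAgentStore persists containerID -> compute agent address mappings.
type computeAgentStore struct {
	db *bolt.DB
}

// updateComputeAgent records the compute agent address for a container.
func (s *computeAgentStore) updateComputeAgent(ctx context.Context, containerID, address string) error {
	return s.db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(computeAgentBucket)
		if err != nil {
			return err
		}
		return b.Put([]byte(containerID), []byte(address))
	})
}

// getComputeAgents returns every persisted containerID -> address pair,
// which is what a reconnect pass would iterate over on restart.
func (s *computeAgentStore) getComputeAgents(ctx context.Context) (map[string]string, error) {
	result := make(map[string]string)
	if err := s.db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket(computeAgentBucket)
		if b == nil {
			return errors.New("no compute agent bucket found")
		}
		return b.ForEach(func(k, v []byte) error {
			result[string(k)] = string(v)
			return nil
		})
	}); err != nil {
		return nil, err
	}
	return result, nil
}
```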

Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>

@katiewasnothere requested a review from a team as a code owner on August 9, 2021 at 22:49
@katiewasnothere force-pushed the compute_agent_store branch 3 times, most recently from 920ce94 to dda49a1 on August 9, 2021 at 22:59
Resolved review threads: cmd/ncproxy/server.go, cmd/ncproxy/ncproxy.go (2), cmd/ncproxy/store.go, internal/uvm/types.go, go.mod
@katiewasnothere force-pushed the compute_agent_store branch 2 times, most recently from 2a506df to 00296f9 on August 10, 2021 at 23:00
Resolved review threads: cmd/ncproxy/run.go (3), cmd/ncproxy/buckets.go, cmd/ncproxy/ncproxy.go, cmd/ncproxy/store.go (4), internal/uvm/network.go
if err != nil {
return nil, errors.Wrap(err, "failed to connect to ncproxy service")
}
client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() { conn.Close() }))
Member: How does the underlying conn get cleaned up? I see we will close it when the client is closed here, but then we cast it to a type which doesn't implement io.Closer.

Member: Ping on this.

katiewasnothere (Author): You're correct: reading through the code, it does get leaked. I'll work on a fix for this.

katiewasnothere (Author): I've fixed this in recent iterations.
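
For reference, one way the leak can be addressed (the exact shape of the fix is an assumption here, inferred from the discussion above) is to keep the *ttrpc.Client next to the generated service client so callers can close the underlying connection through it:

```go
// computeAgentClient bundles the generated compute agent service client
// with the raw ttrpc client. Closing raw also closes the underlying conn,
// since the client was created with ttrpc.WithOnClose(func() { conn.Close() }).
type computeAgentClient struct {
	raw *ttrpc.Client
	computeagent.ComputeAgentService
}

func (c *computeAgentClient) Close() error {
	if c.raw == nil {
		return nil
	}
	return c.raw.Close()
}
```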

katiewasnothere (Author): Keeping this open so it can be reviewed and validated as such :)

@kevpar (Member) commented Aug 11, 2021:

It would be helpful to have a comment or commit description explaining how the overall ncproxy reconnect scenario will work.

Resolved review threads: cmd/ncproxy/ncproxy.go (2), cmd/ncproxy/run.go (2)
@katiewasnothere (Author): Made some changes in #1126; I'll wait for that to merge so I can make the relevant changes in this PR.

@katiewasnothere force-pushed the compute_agent_store branch 3 times, most recently from a117708 to 11c3c61 on August 26, 2021 at 00:00
@kevpar changed the title from "Add compute agent store for ncproxy reconnect" to "it" on Sep 9, 2021
@katiewasnothere changed the title from "it" back to "Add compute agent store for ncproxy reconnect" on Sep 9, 2021
@katiewasnothere force-pushed the compute_agent_store branch 5 times, most recently from f76d5a1 to e4152d9 on September 16, 2021 at 00:48
@katiewasnothere (Author): Updated this PR to rebase and add new tests now that #1126 is merged.

Resolved review thread: cmd/ncproxy/store.go
// reconnectComputeAgents creates new compute agent connections from the database of
// active compute agent addresses and adds them to the compute agent client cache.
// This MUST be called before the server starts serving anything so that we can
// ensure the cache is ready when it does.
Contributor: This comment (or some other location) could use additional detail on how ncproxy reconnect is intended to work. Documenting the database schema would be helpful too.

katiewasnothere (Author): What additional information do you think would be helpful to detail how the reconnect flow works?

Contributor: Something like:

reconnectComputeAgents
Ncproxy maintains a cache of active compute agents in order to reestablish connections if the service is restarted. The cache is persisted in a bolt database with the following schema:

On restart, ncproxy will attempt to create new compute agent connections from the database of active compute agent addresses and add them to its compute agent client cache. Reconnect MUST be called before the server is allowed to start serving anything so that we can ensure the cache is ready. Reconnections are performed in parallel to improve service startup performance.

There are a few failure modes for reconnect. If a compute agent entry is stale, the reconnect will fail in the following way: xxxxx. Other failure modes are possible but not expected. In those cases we log the failures but allow the service start to proceed. We chose this approach vs. just failing service start for the following reasons: xxxx.
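
A rough sketch of what a reconnect routine along these lines might look like (it assumes the computeAgentStore and computeAgentCache types from the PR, the containerd log and stdlib sync packages, and a hypothetical dialComputeAgent helper; names are illustrative):

```go
// reconnectComputeAgents dials every address persisted in the store, in
// parallel, and fills the in-memory client cache. Failures are logged but
// do not fail service start, matching the behavior described above.
func reconnectComputeAgents(ctx context.Context, store *computeAgentStore, cache *computeAgentCache) {
	agents, err := store.getComputeAgents(ctx) // map[containerID]address
	if err != nil {
		log.G(ctx).WithError(err).Error("failed to read compute agent store")
		return
	}
	var wg sync.WaitGroup
	for cid, addr := range agents {
		wg.Add(1)
		go func(cid, addr string) {
			defer wg.Done()
			// dialComputeAgent is a hypothetical helper that connects to the
			// compute agent server and wraps the conn in a ttrpc client.
			client, err := dialComputeAgent(ctx, addr)
			if err != nil {
				// Likely a stale entry; log and let startup proceed.
				log.G(ctx).WithError(err).WithField("address", addr).Error("failed to reconnect to compute agent")
				return
			}
			cache.put(cid, client)
		}(cid, addr)
	}
	wg.Wait()
}
```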

katiewasnothere (Author): Added!

Resolved review thread: cmd/ncproxy/server.go
Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
@msscotb (Contributor) left a comment: lgtm

@@ -599,14 +649,29 @@ func (s *grpcService) GetNetworks(ctx context.Context, req *ncproxygrpc.GetNetwo
// TTRPC service exposed for use by the shim.
type ttrpcService struct {
containerIDToComputeAgent *computeAgentCache
agentStore *computeAgentStore
@kevpar (Member) commented Oct 5, 2021: It would be good to add a comment here clarifying what the difference is between a "store" and a "cache". That distinction is a bit confusing currently.

katiewasnothere (Author): done
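
For instance, doc comments along these lines would draw the distinction (the wording is illustrative, not the PR's actual comment):

```go
type ttrpcService struct {
	// containerIDToComputeAgent is an in-memory cache of live compute agent
	// clients, keyed by container ID; it is rebuilt on every service start.
	containerIDToComputeAgent *computeAgentCache
	// agentStore persists containerID -> compute agent address mappings in
	// a bolt database so the cache can be rebuilt after ncproxy restarts.
	agentStore *computeAgentStore
}
```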

@@ -89,6 +107,7 @@ func run(clicontext *cli.Context) error {
var (
configPath = clicontext.GlobalString("config")
logDir = clicontext.GlobalString("log-directory")
dbPath = clicontext.GlobalString("log-directory")
Member: Should be "database-path" instead of "log-directory"?

katiewasnothere (Author): Good catch :) fixed
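
Presumably the fix reads a dedicated flag instead, something like the following (the flag name is assumed from the review comment above):

```go
var (
	configPath = clicontext.GlobalString("config")
	logDir     = clicontext.GlobalString("log-directory")
	dbPath     = clicontext.GlobalString("database-path") // assumed flag name
)
```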

@kevpar (Member) left a comment: Couple of changes, otherwise looking good.

Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
@kevpar (Member) left a comment: LGTM assuming CI is green :)
