This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

/fleet/unit excessive read requests to Etcd #1257

Closed
mwitkow opened this issue Jun 18, 2015 · 4 comments

Comments

Contributor

mwitkow commented Jun 18, 2015

We're running etcd 2.1-alpha1 with a 5-node core cluster. The remaining nodes (hundreds of them) run etcd in proxy mode.

As shown in the graphs in etcd-io/etcd#3001, we're seeing a storm of recursive GETs coming from the local fleet instance on every node in the cluster. We're running about 1200 units on the cluster.

We've captured a TCP dump (available on request) showing that fleet made ~6000 HTTP GET requests in about 120 s, roughly 50 requests per second. We're not running any fleetctl commands on the host, so I assume this is the local fleetd.

[Screenshot: fleet_recursive_requests]

Most of these are GET requests to individual /v2/keys/_coreos.com/fleet/unit/<unit_id> in consistent, recursive mode.

Is there some reason that prevents fleet from doing one big recursive request on /v2/keys/_coreos.com/fleet/unit/ instead of an individual request per unit? Can you point us at the code that does that?
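To make the two request shapes concrete, here is a minimal Go sketch against the etcd v2 keys API. It assumes the /v2/keys/_coreos.com/fleet/unit/ prefix seen in the dump and that one recursive GET on that directory returns every unit. The helpers `unitURL`, `dirURL` and `requestsPerSweep` are hypothetical, not fleet code, and the `consistent=true` parameter only mirrors the mode observed in the dump.

```go
package main

import (
	"fmt"
	"net/url"
)

// Key prefix observed in the TCP dump.
const unitPrefix = "/v2/keys/_coreos.com/fleet/unit/"

// readQuery builds the query string for a consistent, recursive read
// (hypothetical; mirrors the mode seen in the dump).
func readQuery() string {
	q := url.Values{}
	q.Set("consistent", "true")
	q.Set("recursive", "true")
	return q.Encode() // Encode sorts keys: "consistent=true&recursive=true"
}

// unitURL builds the per-unit request: one of these per unit.
func unitURL(host, unitID string) string {
	return host + unitPrefix + url.PathEscape(unitID) + "?" + readQuery()
}

// dirURL builds a single recursive request covering the whole directory.
func dirURL(host string) string {
	return host + unitPrefix + "?" + readQuery()
}

// requestsPerSweep returns how many GETs one full read of all units costs.
func requestsPerSweep(units int, recursiveDir bool) int {
	if recursiveDir {
		return 1
	}
	return units
}

func main() {
	host := "http://127.0.0.1:2379" // illustrative local etcd endpoint
	fmt.Println(unitURL(host, "abc123"))
	fmt.Println(dirURL(host))
	fmt.Println(requestsPerSweep(1200, false), requestsPerSweep(1200, true))
}
```

With ~1200 units, each full read drops from 1200 requests to one, at the cost of a single larger response.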

Contributor Author

mwitkow commented Jun 18, 2015

So the only place that calls /unit/<unit_id> in recursive mode seems to be
registry/job.go:257
func (r *EtcdRegistry) getUnitFromObjectNode(node *etcd.Node) (*job.Unit, error)
which is only called in
registry/job.go:151
func (r *EtcdRegistry) dirToUnit(dir *etcd.Node) (*job.Unit, error)

This is called in two places:
registry/job.go:133
func (r *EtcdRegistry) Unit(name string) (*job.Unit, error)
(which I assume is called for a single unit fetch)
or in
registry/job.go:90
func (r *EtcdRegistry) Units() ([]job.Unit, error)

which matches the symptoms we see in the TCP dump (see the highlighted /job sorted request).
[Screenshot: fleet_recursive_requests1]
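The call chain above amounts to an N+1 read pattern: one listing of the job directory, then one GET per unit inside getUnitFromObjectNode. The Go sketch below models that shape under this assumption, using a hypothetical in-memory `countingStore` in place of etcd so the request count can be observed. None of these types or functions are fleet's.

```go
package main

import "fmt"

// countingStore is a hypothetical stand-in for etcd that counts GETs.
type countingStore struct {
	units map[string]string // unit hash -> unit file contents
	gets  int
}

// listJobs models the one recursive GET on the job directory,
// returning the unit hash each job references.
func (s *countingStore) listJobs() []string {
	s.gets++
	hashes := make([]string, 0, len(s.units))
	for h := range s.units {
		hashes = append(hashes, h)
	}
	return hashes
}

// getUnit models one GET on /unit/<hash>.
func (s *countingStore) getUnit(hash string) string {
	s.gets++
	return s.units[hash]
}

// unitsNPlusOne mirrors the shape Units() -> dirToUnit() -> getUnitFromObjectNode():
// one listing, then one fetch per unit.
func unitsNPlusOne(s *countingStore) []string {
	var out []string
	for _, h := range s.listJobs() {
		out = append(out, s.getUnit(h))
	}
	return out
}

// newStore fills a store with n synthetic units.
func newStore(n int) *countingStore {
	s := &countingStore{units: make(map[string]string, n)}
	for i := 0; i < n; i++ {
		s.units[fmt.Sprintf("hash%d", i)] = fmt.Sprintf("unit %d", i)
	}
	return s
}

func main() {
	s := newStore(1200)
	units := unitsNPlusOne(s)
	fmt.Println(len(units), s.gets) // 1200 1201
}
```

Every call to the listing function costs 1 + N requests, so the load per node grows linearly with the number of units.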

@wuqixuan
Contributor

Currently, the design is that every agent periodically fetches all unit files from etcd. With many units in etcd and many agents, the number of unit HTTP GET requests becomes huge.
I think we need to implement a cache on the agent side: the agent would fetch the latest state from etcd only when a unit changes, and would otherwise not send unit-file HTTP GET requests to etcd.
@bcwaldon @jonboulle @crawford, what do you think? If needed, I can help implement it to reduce the load on the etcd server.
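One way to read the proposed agent-side cache: remember each unit together with the etcd ModifiedIndex it was read at, and refetch only units whose index has moved. This Go sketch assumes the agent has a cheap way to learn each unit's current index (for example from a watch) and uses a hypothetical `fetch` callback in place of the per-unit GET. It illustrates the proposal, not what fleet implemented.

```go
package main

import "fmt"

// cachedUnit pairs unit contents with the etcd ModifiedIndex they were read at.
type cachedUnit struct {
	index    uint64
	contents string
}

// unitCache is a hypothetical agent-side cache of unit files.
type unitCache struct {
	entries map[string]cachedUnit
	fetch   func(name string) string // stands in for one GET to etcd
	fetches int                      // number of times fetch was called
}

func newUnitCache(fetch func(string) string) *unitCache {
	return &unitCache{entries: make(map[string]cachedUnit), fetch: fetch}
}

// Get returns the unit contents, calling fetch only when the cached copy
// is missing or older than currentIndex.
func (c *unitCache) Get(name string, currentIndex uint64) string {
	if e, ok := c.entries[name]; ok && e.index >= currentIndex {
		return e.contents
	}
	c.fetches++
	contents := c.fetch(name)
	c.entries[name] = cachedUnit{index: currentIndex, contents: contents}
	return contents
}

func main() {
	store := map[string]string{"web.service": "v1"}
	c := newUnitCache(func(n string) string { return store[n] })

	c.Get("web.service", 10) // miss: one request
	c.Get("web.service", 10) // hit: no request
	store["web.service"] = "v2"
	fmt.Println(c.Get("web.service", 11), c.fetches) // v2 2
}
```

A real implementation would take the index from the fetch response itself, avoiding a race between learning the index and reading the value.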

Contributor Author

mwitkow commented Jun 29, 2015

#1260
It makes registry.Units() perform a single recursive read covering all units, instead of one read per unit.
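The idea behind #1260 can be sketched as follows: issue one recursive GET on the unit directory and build every unit from the single response tree. The `node` type below is a hypothetical mirror of the etcd v2 JSON response shape (`key`, `value`, `dir`, `nodes`), not the go-etcd type fleet used. The sample response is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"sort"
)

// node mirrors the subset of an etcd v2 keys API JSON node used here.
type node struct {
	Key   string `json:"key"`
	Value string `json:"value"`
	Dir   bool   `json:"dir"`
	Nodes []node `json:"nodes"`
}

type response struct {
	Action string `json:"action"`
	Node   node   `json:"node"`
}

// collectUnits walks a recursive response and records unit ID -> value
// for every leaf key, so all units come from one request.
func collectUnits(n node, out map[string]string) {
	if !n.Dir {
		out[path.Base(n.Key)] = n.Value
		return
	}
	for _, child := range n.Nodes {
		collectUnits(child, out)
	}
}

// unitsFromRecursiveGet decodes one recursive GET body into all units.
func unitsFromRecursiveGet(body []byte) (map[string]string, error) {
	var resp response
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	units := make(map[string]string)
	collectUnits(resp.Node, units)
	return units, nil
}

func main() {
	// Made-up sample response, for illustration only.
	body := []byte(`{"action":"get","node":{"key":"/_coreos.com/fleet/unit","dir":true,"nodes":[
		{"key":"/_coreos.com/fleet/unit/aaa","value":"unit-a"},
		{"key":"/_coreos.com/fleet/unit/bbb","value":"unit-b"}]}}`)
	units, err := unitsFromRecursiveGet(body)
	if err != nil {
		panic(err)
	}
	ids := make([]string, 0, len(units))
	for id := range units {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	fmt.Println(ids) // [aaa bbb]
}
```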

@jonboulle
Contributor

Fixed by #1376. Thanks for the patch!


4 participants