Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding InfluxDB user returns Error: Failed to create User: user not found #5840

Closed
btasker opened this issue Jan 19, 2022 · 4 comments · Fixed by #5917
Closed

Adding InfluxDB user returns Error: Failed to create User: user not found #5840

btasker opened this issue Jan 19, 2022 · 4 comments · Fixed by #5917
Assignees
Milestone

Comments

@btasker
Copy link

btasker commented Jan 19, 2022

This needs to be run against an Enterprise/Cloud1 environment (there need to be multiple nodes)

  1. Create lots of InfluxDB users (the example of this has > 200 users)
  2. Try and create a new user
  3. Chronograf will return the error Error: Failed to create User: user not found
  4. If you refresh the page, the user will be there

Expected behavior:
The user should be added without error

Actual behavior:
The user is created, but an error is returned

The POST to /chronograf/v1/sources/0/users returns

{"code":400,"message":"user not found"}
@btasker
Copy link
Author

btasker commented Jan 19, 2022

My original thinking was that this was a meta-node issue - for some reason I thought that URL was being proxied more or less straight through. Looking at Chronograf's source, though, that's obviously not the case.

Walking through the code, it looks like where we're failing is probably here

        res, err := store.Add(ctx, user)
        if err != nil {
                Error(w, http.StatusBadRequest, err.Error(), s.Logger)
                return
        }

(Incidentally, that conditional section is checked a second time almost immediately, looks like a mistake/double paste.)

Anyway, under the hood that looks to be calling influx::add, that function's quite small, but there's almost certainly potential for a race there

func (c *Client) Add(ctx context.Context, u *chronograf.User) (*chronograf.User, error) {
	_, err := c.Query(ctx, chronograf.Query{
		Command: fmt.Sprintf(`CREATE USER "%s" WITH PASSWORD '%s'`, u.Name, u.Passwd),
	})
	if err != nil {
		return nil, err
	}

It's running a CREATE USER against InfluxDB, which we know completes, so we're unlikely to have failed at this conditional.

But then it moves onto assigning permissions for that user

	for _, p := range u.Permissions {
		if err := c.grantPermission(ctx, u.Name, p); err != nil {
			return nil, err
		}
	}
	return c.Get(ctx, chronograf.UserQuery{Name: &u.Name})
}

If that fails, we know we're going to return the error back up the call chain into the response JSON. Additionally, it doesn't make any sense for a user creation command to return "user not found", but it does make sense for a permissions operation to do so.

Looking in grantPermission

func (c *Client) grantPermission(ctx context.Context, username string, perm chronograf.Permission) error {
        query := ToGrant(username, perm)
        if query == "" {
                return nil
        }

        _, err := c.Query(ctx, chronograf.Query{
                Command: query,
        })
        return err
}

ToGrant just generates an InfluxQL statement, and in any case we can ignore that because we'll return nil if it fails.

So, query, the bit that's interesting here is inital creation of the request object

        req, err := http.NewRequest("POST", u.String(), nil)
        if err != nil {
                return nil, err
        }
        req.Header.Set("Content-Type", "application/json")

We're using net/http to place the upstream request - it should support keep-alive by default (and we configure it against the transport in another method). But, there's no guarantee that that original connection will get reused by our permissions request (it might already have been taken up by another request, etc etc).

I think the answer is that our permissions request is going to another node, and because of the size of the user's table/dataset it hasn't finished replicating across yet - reworking to ensure that the CREATE USER and subsequent permissions calls go over the same connection (and so the same node) would address that - IF I'm right, which is a pretty big if.

@davidby-influx
Copy link

@sranka - Has this been triaged and scheduled for a release? We have a customer suffering from this bug.

@sranka sranka added this to the 1.10 milestone Feb 28, 2022
@sranka
Copy link
Contributor

sranka commented Feb 28, 2022

A solution to investigate would be to execute all statements (create user and grants) in a single InfluxQL query, without worrying about cluster implementation at all.

@sranka
Copy link
Contributor

sranka commented May 11, 2022

@btasker thank you for your investigation, I looked a bit deeper to see that the situation in the code is different.
InfluxDB enterprise uses this code to create users. It communicates directly with the master meta node when creating a user (possibly with a preceding 307 redirect from a follower node), but it may communicate with a follower node when getting the user. The user creation process thus consists of the following steps:

  1. create a user (name and password) using /user endpoint at meta node
  2. set user permissions
    2.1 get user using /user endpoint at meta node
    2.2. add user permissions using /user endpoint at meta node
  3. set user roles
    3.1. for every role, add the user to role using /role endpoint at meta node
  4. Get created user
    4.1. get user using /user endpoint at meta node
    4.2. get user roles from all roles using /role endpoint at meta node

Chronograf UI does not allow to create users with roles and permissions, user not found error can be thus caused when getting the user (parts 2.1 and 4.1). 2.1 can be avoided in the chronograf code (it is useless), but 4.1 is not. user not found is probably caused by a small delay caused by the eventual consistency of the meta cluster. As a solution, the code that creates the user should retry getting the user in case of user not found error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants