Reset / Overwrite invokerId for unique name in zookeeper manually #5024

Merged 5 commits on Nov 13, 2020


bdoyle0182
Contributor

Description

This allows the operator to explicitly set the instanceId for a unique name in zookeeper from the command line when starting the invoker. I added a comment that this option should not be used unless you are sure that the invokerId is not already assigned to any other unique name.

For context, we use the hostname as our invoker unique name. Some of our hosts no longer exist, so the controller backfills its invoker pool with entries for invokers that will never be started again. This throws off the invoker hashing algorithm, since these invokers are always considered down. This change lets us easily move our lexicographically highest invokers into those slots in zookeeper without manually going into zookeeper to do so.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

}
} else {
val newId = overwriteId.getOrElse("")
zkClient.create().orSetData().forPath(myIdPath, BigInt(newId).toByteArray)
Member


If I understand this correctly, there would still be data for obsolete unique names.
Shouldn't we remove it?

For example, let's assume data looks like this:

/invokers/idAssignment/mapping/host1 0   // obsolete
/invokers/idAssignment/mapping/host2 1   // obsolete
/invokers/idAssignment/mapping/host3 2
/invokers/idAssignment/mapping/host4 3

You can update it like this:

/invokers/idAssignment/mapping/host1 0   // obsolete
/invokers/idAssignment/mapping/host2 1   // obsolete
/invokers/idAssignment/mapping/host3 0
/invokers/idAssignment/mapping/host4 1

But there will still be data for host1 and host2.
If such names are used again by any chance, isn't it possible that id conflicts can happen?
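
The conflict described here can be illustrated with a small in-memory model. This is only a sketch: the `Map` stands in for the `/invokers/idAssignment/mapping/<name>` nodes, and `IdConflicts` and `conflictingIds` are hypothetical names that do not appear in this PR.

```scala
object IdConflicts {
  // Hypothetical in-memory stand-in for the zookeeper mapping nodes:
  // unique name -> assigned instance id.
  // Returns every id that is claimed by more than one unique name.
  def conflictingIds(mapping: Map[String, Int]): Map[Int, Set[String]] =
    mapping
      .groupBy { case (_, id) => id }
      .collect { case (id, entries) if entries.size > 1 => id -> entries.keySet }
}
```

Applied to the updated listing above, ids 0 and 1 would each be reported with two owners (host1/host3 and host2/host4), which is the situation that becomes a real conflict if host1 or host2 ever starts again.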

Contributor Author


Yes, that's correct. I added a comment that you should not use this overwrite unless you are sure that no other invoker shares the id.

I would like to make it cleaner to update the invoker host mapping, but ids are currently tracked by a single atomic counter in zookeeper, which complicates this since the counter only goes up. And since invokers are hostname -> id mappings, you would need to fetch all invokers and look at all of their ids to determine the lowest available id (I'm not sure you can do that atomically the way you can with the counter; I'm not super familiar with zookeeper). This is just meant to be a manual operation for corrections when your fleet is out of sync. Do you have any better ideas? I do think it's a problem that you can never really go back and remap invoker hosts to a different instance id.

Member

@style95 style95 Nov 10, 2020


I feel like it's a bit picky.
How about removing the obsolete data from the zookeeper when overwriting the ID?
Maybe we can iterate the mappings and figure out the invoker with the overwritten ID.

I am also fine with the proper comment as this feature would not be used without any manual intervention.

Contributor Author

@bdoyle0182 bdoyle0182 Nov 11, 2020


I just pushed a change that checks all uniqueNames to see whether the instance id already exists before overwriting it for this invoker. This isn't atomic the way the counter is for assigning ids, so it's not perfect, but I think it's more than good enough for something that should only be used as a corrective action on an out-of-sync fleet.

Next up, I would like to improve the dynamic id assigner so that it can handle these gaps and backfill missing ids, with the check of which ids exist done atomically. But that should be saved for another PR. This is good enough for now for us to correct things.

exception case should be covered in a unit test

Member


I'm not quite sure this would fix your problem.
It seems the path is not transient, so even if an obsolete invoker no longer exists, its path and data will remain in zookeeper.
So you would never be able to overwrite with an ID that already exists.

Is that correct?

Contributor Author


Yeah, that's right: the data would have to be gone from zookeeper before the overwrite attempt happens. I guess I could delete the existing mapping once I find it and then overwrite the id.

Contributor Author


Done, it should delete the mapping with the instance id before it overwrites
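
The delete-then-overwrite behaviour described in this thread can be sketched against an in-memory map. This is a simplified model under stated assumptions, not the code from the PR: `OverwriteSketch` and `overwriteId` are hypothetical names, the `Map` stands in for the zookeeper mapping nodes, and the real operation is not atomic, as noted above.

```scala
object OverwriteSketch {
  // Hypothetical in-memory model: unique name -> instance id.
  // Removes any other unique name that currently holds newId,
  // then assigns newId to the given name.
  def overwriteId(mapping: Map[String, Int], name: String, newId: Int): Map[String, Int] = {
    val withoutOldOwner = mapping.filterNot { case (other, id) => id == newId && other != name }
    withoutOldOwner.updated(name, newId)
  }
}
```

With the earlier example, overwriting host3 with id 0 would drop the obsolete host1 entry and leave host2 -> 1, host3 -> 0, host4 -> 3.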

Member

@rabbah rabbah left a comment


generally looking good to me

if (idCounter.trySetCount(current, current.getValue() + 1)) {
current.getValue()
} else {
assignId()
Member


it's "possible" this can recurse forever, should this give up after N trials?

Contributor Author


It appears that way yea, but I'd rather deal with that in a separate PR since it's already existing.
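
For reference, one possible shape of the "give up after N trials" idea is sketched below. It was not part of this PR. The sketch assumes a `java.util.concurrent.atomic.AtomicInteger` as a local stand-in for the shared zookeeper counter, and `BoundedAssign`, `assignId` and `attemptsLeft` are hypothetical names.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.annotation.tailrec

object BoundedAssign {
  // AtomicInteger is a local stand-in for the shared counter (assumption).
  // Tries to claim the current counter value as the new id; returns None
  // once the attempt budget is used up instead of recursing forever.
  @tailrec
  def assignId(counter: AtomicInteger, attemptsLeft: Int): Option[Int] =
    if (attemptsLeft <= 0) None
    else {
      val current = counter.get()
      if (counter.compareAndSet(current, current + 1)) Some(current)
      else assignId(counter, attemptsLeft - 1)
    }
}
```

Returning an `Option` leaves the caller to decide whether exhausting the budget should fail invoker startup or trigger a later retry.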

bdoyle0182 and others added 2 commits November 12, 2020 09:41
…InstanceIdAssigner.scala

Co-authored-by: rodric rabbah <rodric@gmail.com>
…InstanceIdAssigner.scala

Co-authored-by: rodric rabbah <rodric@gmail.com>
@bdoyle0182
Contributor Author

I made a couple of changes to make this better and make id assignment more robust:

  1. As noted in the discussion above, if the invoker id already exists for another name, that mapping is deleted before the id is reassigned to the new name.
  2. You can't overwrite with an invoker id that doesn't fit within the size of the invoker pool. For example, if the invoker pool has size 2, you can't overwrite with id 2, only 0 or 1.
  3. When dynamically assigning a new id, it now does so based on the number of invoker nodes in the zookeeper list rather than the atomic count, so a dynamically assigned id always stays in sync with the true number of invokers after reassignments. This should remain atomic because it still uses the atomic counter as a lock: an invoker that fails to acquire the lock re-reads the invoker count, which by then includes the invoker that held the lock. This matters because it better tracks the true size of your invoker fleet, which overwriting depends on. For example, say we want to move invokers 25-30 to invokers 0-5. After the move, 25-30 no longer exist, but the counter is still at 31, so a new node would get id 31, leaving 25-30 empty forever. With this change, after moving 25-30 to 0-5 it recognizes that there are actually only 25 invokers, so a new node correctly gets id 25.
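
The range check in item 2 and the count-based assignment in item 3 can be worked through with a small model. This is a sketch under assumptions: the pool size is taken to be the number of mapping entries, the `Map` is an in-memory stand-in for the zookeeper nodes, and `PoolSizedIds`, `nextId` and `isValidOverwrite` are hypothetical names.

```scala
object PoolSizedIds {
  // Hypothetical model: unique name -> instance id.
  // The next dynamically assigned id is the number of registered invokers,
  // not the value of an ever-increasing counter.
  def nextId(mapping: Map[String, Int]): Int = mapping.size

  // An overwrite id must fall inside the current pool: 0 until pool size.
  def isValidOverwrite(mapping: Map[String, Int], newId: Int): Boolean =
    newId >= 0 && newId < mapping.size
}
```

In the example from item 3, once the fleet holds ids 0 to 24 the mapping has 25 entries, so `nextId` yields 25 even though the old counter had already reached 31.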

@bdoyle0182 bdoyle0182 force-pushed the reset-zookeeper-invokerid branch 2 times, most recently from fa5691b to 2c1c397 Compare November 12, 2020 22:55
Member

@style95 style95 left a comment


LGTM with a minor nit

@bdoyle0182 bdoyle0182 merged commit 526f011 into apache:master Nov 13, 2020
@bdoyle0182
Contributor Author

Tested this out in one of our regions. Works great
