Brokers connecting to replaced Bookies #190
@merlimat any idea what could be happening? All of our brokers were connecting to bookies that no longer existed, and they kept doing so for days. We had to restart the brokers before they realized those bookies were gone.
I think this is due to the BK client still having the old metadata, pointing to the retired bookies. I know that in the broker, if we have a read error, we close the ledger to make sure we get the updated metadata. I'm not sure at which point it might have gotten stuck in this case. We need to reproduce it in a simple scenario to debug it further.
We had lots of failures and the broker didn't seem to be updating the metadata. If I can find the related logs, I'll attach them.
@merlimat I'm checking the logs. They mostly complain about failing to write to the replaced bookies, which then get quarantined, but eventually the bookies are removed from quarantine and start failing again.
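To make the cycle described above concrete, here is a minimal sketch of a time-based quarantine. It is not the BookKeeper client's implementation: the `QuarantineTracker` class, its method names, the 1800-second window, and port 3181 are all hypothetical and chosen for illustration. It only shows why a dead bookie can come back into rotation if the client still believes it is available once the quarantine window ends.

```python
class QuarantineTracker:
    """Hypothetical model of a time-based quarantine (not BookKeeper code)."""

    def __init__(self, quarantine_seconds):
        self.quarantine_seconds = quarantine_seconds
        self._until = {}  # bookie address -> time at which quarantine ends

    def report_failure(self, bookie, now):
        # A write failure (re)starts the quarantine window for this bookie.
        self._until[bookie] = now + self.quarantine_seconds

    def is_eligible(self, bookie, now):
        # Once the window has elapsed, the bookie is eligible again,
        # whether or not it actually exists anymore.
        return now >= self._until.get(bookie, 0)


tracker = QuarantineTracker(quarantine_seconds=1800)  # illustrative value
dead = "10.64.103.176:3181"  # port is an assumption

tracker.report_failure(dead, now=0)
print(tracker.is_eligible(dead, now=60))    # False: still quarantined
print(tracker.is_eligible(dead, now=1800))  # True: released, will fail again
```

In this model the quarantine only delays retries. Nothing removes the bookie from the candidate set for good, which matches the repeating failures seen in the logs.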
Hmm, that sounds strange. If the bookies are down, they should not be picked for new ledger ensembles. Can you post the logs here? Also, can you verify that the bookies are really not registered in ZK anymore?
I did that when we had the issue, and they were indeed not registered. I'm attaching logs from a single broker that is failing against bookie 10.64.103.176, which did not exist at the time. The logs include only the lines mentioning that bookie.
@merlimat could you take a look at the logs?
I'm not sure exactly what is happening, though it seems to be related to the RackAware policy and the notification it gets when the z-node with the mapping is changed. In particular, at the beginning of the log:
So, first bookie 10.64.103.176 gets removed and then it is immediately added back again. I need to set up a test env to try to reproduce this. In the meantime, I think you were updating the rack-aware mapping z-node every time a bookie was removed from
@sschepens I haven't been able to reproduce this locally so far. Can you try turning on debug logs for these classes?
Also, can you explain again how you update the rack info?
If you are still seeing this issue, it would be useful if you could list the nodes under /ledgers/available and send the contents of /bookies.
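As a rough illustration of what that listing is for, the sketch below compares the children of /ledgers/available with the bookies a broker is still trying to write to. It assumes each registered bookie shows up as a `host:port` child and that any other child (such as a `readonly` container node) can be ignored. The `find_stale_bookies` helper, port 3181, and every address other than 10.64.103.176 are made up for the example. Fetching the children from ZooKeeper is left out.

```python
def find_stale_bookies(available_children, ensemble):
    """Return ensemble members that are not registered as available."""
    # Assumption: registered bookies appear as "host:port" children;
    # children without a colon are not bookie registrations.
    registered = {child for child in available_children if ":" in child}
    return sorted(bookie for bookie in ensemble if bookie not in registered)


# Hypothetical listing of /ledgers/available (addresses are illustrative).
available = ["10.64.103.10:3181", "10.64.103.11:3181", "readonly"]

# Bookies the broker is still writing to, e.g. as seen in its logs.
ensemble = ["10.64.103.10:3181", "10.64.103.176:3181"]

stale = find_stale_bookies(available, ensemble)
print(stale)  # ['10.64.103.176:3181']
```

A non-empty result means the client's view of the cluster disagrees with what is registered in ZooKeeper, which is the symptom reported in this issue.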
@saandrews yes, we're still experiencing this every once in a while.
This might be related to network stabilization in ZooKeeper. Do let us know if this issue is resolved.
@sschepens did this issue ever get resolved, or do you continue to see it?
Closing this for now; please reopen if you see it again.
* Uncompress the payload if it's compressed * Add Pulsar producers with all compression types to unit test
We're finding brokers attempting to connect to bookies that have been replaced and are nowhere to be found in ZooKeeper.