
Brokers connecting to replaced Bookies #190

Closed
sschepens opened this issue Feb 3, 2017 · 14 comments
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@sschepens
Contributor

We're finding brokers attempting to connect to Bookies that have been replaced and are nowhere to be found on Zookeeper.

@sschepens
Contributor Author

@merlimat any idea what could be happening? All of our brokers were connecting to bookies that no longer existed, and continued doing so for days; we had to restart them before they realized those bookies were gone.
Is this an issue with the BookKeeper client or with the brokers themselves?

@merlimat
Contributor

merlimat commented Feb 7, 2017

I think this is due to the BK client still having the old ledger metadata, pointing to the retired bookies. I know that in the broker, if we hit a read error, we close the ledger to make sure we get the updated metadata.

I'm not sure at which point it might have gotten stuck in this case. I need to reproduce it in a simple scenario to debug it further.

@sschepens
Contributor Author

I think this is due to the BK client still having the old ledger metadata, pointing to the retired bookies. I know that in the broker, if we hit a read error, we close the ledger to make sure we get the updated metadata.

We had lots of failures and the broker didn't seem to be updating the metadata. If I can find the related logs, I'll attach them.

@sschepens
Contributor Author

@merlimat I'm checking the logs; they mostly complain about failing to write to the replaced bookies, which then get quarantined, but eventually they are removed from quarantine and start failing again.
No errors seem to come from reads; maybe that's why the metadata doesn't get refreshed?

@merlimat
Contributor

merlimat commented Feb 7, 2017

Hmm, that sounds strange. If the bookies are down, they should not be picked for new ledger ensembles. Could you post the logs here? Also, can you verify that the bookies are really not registered in ZK anymore?

bin/bookkeeper shell listbookies  --readwrite
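For reference, a few related commands to cross-check what is actually registered (a sketch; the ZooKeeper address, the default `/ledgers` chroot, and the CLI locations are assumptions for a default deployment):

```shell
# Bookies the BookKeeper client considers writable / read-only:
bin/bookkeeper shell listbookies --readwrite
bin/bookkeeper shell listbookies --readonly

# Or inspect the registration z-nodes directly with the ZooKeeper CLI;
# a bookie that was decommissioned should no longer appear here:
zkCli.sh -server localhost:2181 ls /ledgers/available
```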

@sschepens
Contributor Author

Also, can you verify that the bookies are really not registered in ZK anymore?

I did that when we had the issue and they were indeed not registered.

Here I attach some logs from a single broker that keeps failing against bookie 10.64.103.176, which did not exist at the time. The logs include only the lines mentioning that bookie.

logs.txt

@sschepens
Contributor Author

@merlimat could you take a look at the logs?

@merlimat
Contributor

So, I'm not sure what exactly is happening, though it seems to be related to the RackAware policy and the notification it receives when the z-node with the mapping is changed.

In particular, at the beginning of the log:

2017-02-03 02:39:04,038 - INFO  - [zk-cache-executor-11-1:ZkBookieRackAffinityMapping@160] - Bookie rack info updated to {us-east-1={10.64.103.28:3181=com.yahoo.pulsar.zookeeper.BookieInfo@583c6a6e, 10.64.102.115:3181=com.yahoo.pulsar.zookeeper.BookieInfo@410cbbe8, 10.64.102.214:3181=com.yahoo.pulsar.zookeeper.BookieInfo@797f9064, 10.64.102.126:3181=com.yahoo.pulsar.zookeeper.BookieInfo@10f00f34, 10.64.103.156:3181=com.yahoo.pulsar.zookeeper.BookieInfo@2ba4685e, 10.64.102.237:3181=com.yahoo.pulsar.zookeeper.BookieInfo@f535539, 10.64.102.145:3181=com.yahoo.pulsar.zookeeper.BookieInfo@a2a0807, 10.64.103.176:3181=com.yahoo.pulsar.zookeeper.BookieInfo@1ab32fd9, 10.64.103.68:3181=com.yahoo.pulsar.zookeeper.BookieInfo@12dd5249, 10.64.103.79:3181=com.yahoo.pulsar.zookeeper.BookieInfo@73227b6, 10.64.102.65:3181=com.yahoo.pulsar.zookeeper.BookieInfo@5d026d67, 10.64.103.171:3181=com.yahoo.pulsar.zookeeper.BookieInfo@5e4c5cf9}}. Notifying rackaware policy.
2017-02-03 02:39:04,039 - INFO  - [zk-cache-executor-11-1:NetworkTopology@463] - Removing a node: /us-east-1e/10.64.103.176:3181
2017-02-03 02:39:04,039 - INFO  - [zk-cache-executor-11-1:NetworkTopology@394] - Adding a new node: /us-east-1e/10.64.103.176:3181

So, first bookie 10.64.103.176 gets removed and then immediately added back again. I need to set up a test env to try to reproduce this.

In the meantime: I think you were updating the rack-aware mapping z-node every time a bookie was removed from /ledgers/available, right? Can you try not touching the mapping and see if it makes any difference?
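To see what the mapping z-node currently contains, it can be read directly from ZooKeeper. This is a sketch: the `/bookies` path and the JSON shape below are assumptions inferred from the `ZkBookieRackAffinityMapping` log line above (group, then bookie address, then rack info), not confirmed from this thread.

```shell
# Dump the rack-affinity mapping z-node (path assumed; adjust for your setup):
zkCli.sh -server localhost:2181 get /bookies

# Assumed shape, inferred from the log line above:
# {
#   "us-east-1": {
#     "10.64.103.176:3181": { "rack": "/us-east-1e" },
#     "10.64.103.28:3181":  { "rack": "..." }
#   }
# }
```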

@merlimat
Contributor

@sschepens I haven't been able to reproduce this locally so far.

Can you try turning on debug logs for these classes?

org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy
org.apache.bookkeeper.net.NetworkTopology
com.yahoo.pulsar.zookeeper.ZkBookieRackAffinityMapping
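If the broker uses the log4j setup shipped with Pulsar (an assumption; adjust the file name and syntax to whatever logging configuration your deployment actually uses), that would look something like:

```properties
# conf/log4j.properties (path assumed)
log4j.logger.org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy=DEBUG
log4j.logger.org.apache.bookkeeper.net.NetworkTopology=DEBUG
log4j.logger.com.yahoo.pulsar.zookeeper.ZkBookieRackAffinityMapping=DEBUG
```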

Also, can you explain again how you update the rack info?

@saandrews
Contributor

If you are still seeing this issue, it would be useful if you could list the nodes under /ledgers/available and send the contents of /bookies.

@sschepens
Contributor Author

sschepens commented Mar 17, 2017

@saandrews yes, we're still experiencing this every once in a while.

@shifty21

shifty21 commented Dec 19, 2017

This might be related to network topology stabilization in ZooKeeper.
Basically, it takes time for the network of bookies registered in ZooKeeper to stabilize. You might want to look at the BookKeeper client property bkc.networkTopologyStabilizePeriodSeconds.
Ref: https://twitter.github.io/distributedlog/html/implementation/storage.html
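As a sketch of what tuning this might look like: the `bkc.` prefix is how DistributedLog exposes BookKeeper client settings, so the underlying BookKeeper client property is `networkTopologyStabilizePeriodSeconds`. Where it goes depends on how your broker passes configuration to its BookKeeper client; the file name and value below are assumptions for illustration.

```properties
# BookKeeper client configuration (file name and value are examples).
# Wait this many seconds before acting on network-topology changes,
# so a bookie flapping in ZK is not immediately re-added; 0 disables it.
networkTopologyStabilizePeriodSeconds=30
```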

Do let us know if this issue is resolved.

sijie pushed a commit to sijie/pulsar that referenced this issue Mar 4, 2018
@ivankelly
Contributor

@sschepens did this issue ever get resolved, or do you continue to see it?

@ivankelly added the triage/week-35 and type/bug labels Aug 30, 2018
@ivankelly
Contributor

Closing this for now; please reopen if you see it again.

hangc0276 pushed a commit to hangc0276/pulsar that referenced this issue May 26, 2021
* Uncompress the payload if it's compressed

* Add Pulsar producers with all compression types to unit test