Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

very quick offerRescinded, brokers not starting #292

Open
ror6ax opened this issue Mar 10, 2017 · 11 comments
Open

very quick offerRescinded, brokers not starting #292

ror6ax opened this issue Mar 10, 2017 · 11 comments

Comments

@ror6ax
Copy link

ror6ax commented Mar 10, 2017

I've started the scheduler via ./kafka-mesos.sh scheduler:

2017-03-10 17:46:29,377 DEBUG     KafkaMesosScheduler] [ACCEPT ]: slave-6#82904 cpus:1.20; mem:38079.00; disk:280854.00; ports:[2384..2389][7000..7003][7199..7200][9001..9001][9042..9043][9160..9161][31000..31091][31093..31209][31211..31323][31325..31388][31390..31779][31781..32000] => {"id":"0","active":true,"cpus":1.0,"mem":2048,"heap":1024,"syslog":false,"constraints":"","options":"","log4jOptions":"","stickiness":{"period":"10m"},"failover":{"delay":"1m","maxDelay":"10m","failures":0},"metrics":{"timestamp":0},"needsRestart":false}
2017-03-10 17:46:32,852 DEBUG     KafkaMesosScheduler] [resourceOffers]
slave-1#82915 mem:22559.00; disk:278550.00; ports:[2384..2389][7002..7003][7200..7200][9001..9001][9043..9043][9161..9161][31000..31099][31101..31286][31288..31462][31464..31488][31490..31631][31635..31733][31735..31898][31900..31972][31974..32000]
2017-03-10 17:46:32,853 DEBUG     KafkaMesosScheduler] [DECLINE]:       slave-1#82915 mem:22559.00; disk:278550.00; ports:[2384..2389][7002..7003][7200..7200][9001..9001][9043..9043][9161..9161][31000..31099][31101..31286][31288..31462][31464..31488][31490..31631][31635..31733][31735..31898][31900..31972][31974..32000] for 3600s because cpus < 1.0
2017-03-10 17:46:39,329 INFO      KafkaMesosScheduler] [offerRescinded] #82904

I can see Mesos having a Kafka framework, but not one of the agents/workers.

Logs are showing ACCEPT -> couple of seconds -> offerRescinded cycle around my cluster.
Any hints as to how do I debug this?

@steveniemitz
Copy link
Contributor

Do you see anything interesting in the mesos agent logs? (slave-6 in this case). It looks like its exactly 10 seconds after the ACCEPT was sent, which seems suspiciously like the offer is timing out. Do you have an offer timeout set to 10 seconds on mesos by any chance?

There should be a log entry directly after that accept saying something like Starting broker 0: launching task ... by offer ...

Can you also check the master logs and see if you see the Launch Tasks message getting there from the framework?

Finally, what version of mesos are you running?

@ror6ax
Copy link
Author

ror6ax commented Mar 13, 2017

In fact, only entry in the logs apart from ACCEPTs and immediate DECLINEs is happening when I re-start the broker. Neither clients nor master sow anything related to Kafka in the logs. I'm currently debugging some problems due to overlay network configuration on the cluster, do you think this might be related?

@ror6ax
Copy link
Author

ror6ax commented Mar 14, 2017

I have switched to Marathon scheduler. Logs are a bit more verbose now.
Broker start fails with

2017-03-14 15:04:09,148 INFO   BrokerLifecycleManager] Broker 0 failed 2, waiting 2m, next start ~ 2017-03-14 15:06:09Z
2017-03-14 15:04:09,713 INFO      KafkaMesosScheduler] [executorLost] executor:#aad1c slave:#22-S3 status:0
2017-03-14 15:05:27,117 INFO                BrokerApi] Handling start for broker 0
2017-03-14 15:05:27,118 INFO   BrokerLifecycleManager] Activating broker 0
2017-03-14 15:05:27,118 INFO             OfferManager] Re-requesting previously suppressed offers.
2017-03-14 15:05:27,124 INFO        BrokerTaskManager] Starting broker 0: launching task kafka-0-8af35576-7318-4cae-902f-49efba5224be by offer some-slave-3#36557
 kafka-0-8af35576-7318-4cae-902f-49efba5224be slave:#22-S0 cpus:1.00; mem:2048.00; ports:[2380..2380] data:{"id":0,"options":{},"syslog":false,"log4jOptions":{},"defaults":{"broker.id":"0","log.dirs":"kafka-logs","port":"2380","zookeeper.connect":"master.mesos:2181","host.name":"some-slave-3","log.retention.bytes":"10737418240"}}
2017-03-14 15:05:27,316 INFO      KafkaMesosScheduler] [executorLost] executor:#90a7d slave:#22-S0 status:256
2017-03-14 15:05:27,317 INFO      KafkaMesosScheduler] [statusUpdate] kafka-0-8af35576-7318-4cae-902f-49efba5224be TASK_FAILED slave:#22-S0 reason:REASON_CONTAINER_LAUNCH_FAILED message:Failed to launch container: Failed to fetch all URIs for container 'a57754aa-62f8-4dd6-9cf1-7a0f9efb5e8d' with exit status: 256
2017-03-14 15:05:27,317 INFO   BrokerLifecycleManager] Broker 0 failed 1, waiting 1m, next start ~ 2017-03-14 15:06:27Z
2017-03-14 15:06:26,688 INFO        BrokerTaskManager] Starting broker 0: launching task kafka-0-1505ef66-a712-404f-95ed-2f97ee60edd3 by offer some-slave-6#36870
 kafka-0-1505ef66-a712-404f-95ed-2f97ee60edd3 slave:#22-S3 cpus:1.00; mem:2048.00; ports:[2384..2384] data:{"id":0,"options":{},"syslog":false,"log4jOptions":{},"defaults":{"broker.id":"0","log.dirs":"kafka-logs","port":"2384","zookeeper.connect":"master.mesos:2181","host.name":"some-slave-6","log.retention.bytes":"10737418240"}}
2017-03-14 15:06:27,672 INFO      KafkaMesosScheduler] [statusUpdate] kafka-0-1505ef66-a712-404f-95ed-2f97ee60edd3 TASK_STARTING slave:#22-S3 reason:REASON_COMMAND_EXECUTOR_FAILED
2017-03-14 15:06:28,299 INFO      KafkaMesosScheduler] [statusUpdate] kafka-0-1505ef66-a712-404f-95ed-2f97ee60edd3 TASK_FAILED slave:#22-S3 reason:REASON_COMMAND_EXECUTOR_FAILED message:java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at ly.stealth.mesos.kafka.executor.MetricCollectorProxy.ly$stealth$mesos$kafka$executor$MetricCollectorProxy$$collect(MetricCollectorProxy.scala:49)
	at ly.stealth.mesos.kafka.executor.MetricCollectorProxy.start(MetricCollectorProxy.scala:45)
	at ly.stealth.mesos.kafka.executor.KafkaServer$Distro$.startCollector(BrokerServer.scala:201)
	at ly.stealth.mesos.kafka.executor.KafkaServer.start(BrokerServer.scala:90)
	at ly.stealth.mesos.kafka.executor.Executor$$anon$1.run(Executor.scala:94)

@steveniemitz
Copy link
Contributor

steveniemitz commented Mar 14, 2017

That's odd that you weren't getting those verbose logs before.

Anyways I suspect the issue is the executor crashing on launch

2017-03-14 15:06:28,299 INFO      KafkaMesosScheduler] [statusUpdate] kafka-0-1505ef66-a712-404f-95ed-2f97ee60edd3 TASK_FAILED slave:#22-S3 reason:REASON_COMMAND_EXECUTOR_FAILED message:java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at ly.stealth.mesos.kafka.executor.MetricCollectorProxy.ly$stealth$mesos$kafka$executor$MetricCollectorProxy$$collect(MetricCollectorProxy.scala:49)
	at ly.stealth.mesos.kafka.executor.MetricCollectorProxy.start(MetricCollectorProxy.scala:45)
	at ly.stealth.mesos.kafka.executor.KafkaServer$Distro$.startCollector(BrokerServer.scala:201)
	at ly.stealth.mesos.kafka.executor.KafkaServer.start(BrokerServer.scala:90)
	at ly.stealth.mesos.kafka.executor.Executor$$anon$1.run(Executor.scala:94)

I think I remember someone reporting seeing this when the version of kafka the framework was built against didn't match the version they were running. Do yours match? The default build is 0.10.2.0 now.

Also, it seems like some of your mesos agents might not be able to hit the scheduler:

2017-03-14 15:05:27,317 INFO      KafkaMesosScheduler] [statusUpdate] kafka-0-8af35576-7318-4cae-902f-49efba5224be TASK_FAILED slave:#22-S0 reason:REASON_CONTAINER_LAUNCH_FAILED message:Failed to launch container: Failed to fetch all URIs for container 'a57754aa-62f8-4dd6-9cf1-7a0f9efb5e8d' with exit status: 256

@ror6ax
Copy link
Author

ror6ax commented Mar 14, 2017

I've substituded --api to Mesos-DNS provided name, got one step closer:

2017-03-14 15:14:45,877 INFO      KafkaMesosScheduler] [statusUpdate] kafka-0-a7a43b22-9088-4b97-9672-8f8f82286b70 TASK_STARTING slave:#22-S3 reason:REASON_COMMAND_EXECUTOR_FAILED
2017-03-14 15:14:46,484 INFO      KafkaMesosScheduler] [statusUpdate] kafka-0-a7a43b22-9088-4b97-9672-8f8f82286b70 TASK_FAILED slave:#22-S3 reason:REASON_COMMAND_EXECUTOR_FAILED message:java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at ly.stealth.mesos.kafka.executor.MetricCollectorProxy.ly$stealth$mesos$kafka$executor$MetricCollectorProxy$$collect(MetricCollectorProxy.scala:49)
	at ly.stealth.mesos.kafka.executor.MetricCollectorProxy.start(MetricCollectorProxy.scala:45)
	at ly.stealth.mesos.kafka.executor.KafkaServer$Distro$.startCollector(BrokerServer.scala:201)
	at ly.stealth.mesos.kafka.executor.KafkaServer.start(BrokerServer.scala:90)
	at ly.stealth.mesos.kafka.executor.Executor$$anon$1.run(Executor.scala:94)
Caused by: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
	at ly.stealth.mesos.kafka.executor.MetricCollectorProxy$MetricsCollector.collectJvmMetrics(MetricCollectorProxy.scala:111)
	at ly.stealth.mesos.kafka.executor.MetricCollectorProxy$MetricsCollector.run(MetricCollectorProxy.scala:194)

Ypu mean if version in docker matches the one i do ./kafka-mesos.sh broker start 0 with?

Both built from same repo based on https://github.com/mesos/kafka 20 minutes apart.
I only modified the dockerfile to use Java 8 since my build machine uses it and there were "major.minor 52.0" build error problems.

@steveniemitz
Copy link
Contributor

steveniemitz commented Mar 14, 2017

Yep, pretty sure this is due to a kafka mismatch. Make sure you're running the scala 2.11 build of kafka and are the version matches what kafka-mesos was built against.

@ror6ax
Copy link
Author

ror6ax commented Mar 14, 2017

gradle.build says scalaVersion = "2.11.8".

Now, I'm not sure how do I verify I run it using same Scala.

 Add mesos repo
RUN \
  apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E56151BF && \
  echo deb http://repos.mesosphere.io/ubuntu trusty main > /etc/apt/sources.list.d/mesosphere.list

# Install deps
RUN apt-get update
RUN apt-get install -qy software-properties-common python-software-properties
RUN apt-get install -qy debconf-utils
RUN \
  add-apt-repository ppa:webupd8team/java -y && \
  apt-get update && \
  apt-get install -qy \
    git vim zip mc curl
RUN echo "oracle-java8-installer shared/accepted-oracle-license-v1-1 select true" | sudo debconf-set-selections
RUN apt-get install -y oracle-java8-installer
RUN apt-get install -qy mesos libc6

# Add kafka-mesos & kafka
COPY .docker/kafka* /opt/kafka-mesos/
COPY ./docker-entrypoint.sh /docker-entrypoint.sh

It only says Java 8 is installed, how do I make sure what I launch it with?

@steveniemitz
Copy link
Contributor

I mean the version of kafka, that needs to be the 2.11 version. What is the file name of the kafka .tgz you're using?

If you're using the docker build script I'd be weary, I don't know if it still works (or ever worked...)

@ror6ax
Copy link
Author

ror6ax commented Mar 14, 2017

It turns out my gradle said Mesos version 0.28.0 while I'm running 1.1.0, so I'm testing that.

kafka-mesos-0.10.1.0-SNAPSHOT-kafka_2.11-0.10.1.1.jar
kafka_2.10-0.8.2.2.tgz

Ahaaa. "build-image.sh" has hardcoded values! I missed that during the build. Let me rebuild and see if it helps.

@ror6ax
Copy link
Author

ror6ax commented Mar 16, 2017

it works. would you be interested in PR?

@steveniemitz
Copy link
Contributor

sorry, didn't see this! Yes please! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants