Framework not declining offers #132
hey @drewrobb sorry for the delayed response. Can you please let me know what the values of these config options are in your setup?
`nimbus.monitor.freq.secs = 10` (this is from the nimbus UI configuration section). Neither is set in my storm.yaml.
Can you please clarify which commit SHA you are running the framework from? The one you linked isn't at HEAD, and there have been some changes that might affect behavior in this area. (I have a lot more that I've written that I'll send when I get this info.)
@drewrobb : hmmm, I'm surprised by the behavior you are reporting. Let me clarify that there are 2 behaviors involved in the lines originally pointed at.
So no offer should be held onto for more than 75 seconds before the framework declines it. Another clarification: which mesos version are you running?
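To make that bound concrete, here is a minimal, self-contained sketch of the rotating-map pattern referenced later in the thread (illustrative only; the class and method names are hypothetical, not storm-mesos's actual implementation). Unused offers are parked in a small ring of buckets, the ring rotates on a periodic timer tied to `nimbus.monitor.freq.secs`, and whatever falls out of the oldest bucket is declined, so the worst-case hold time is bucket count times rotation period.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a rotating offer cache (hypothetical names, not the
// project's actual code). Offers go into the newest bucket; on every rotation
// the oldest bucket is dropped and returned so the caller can decline those
// offers. Maximum hold time = numBuckets x rotation period.
public class RotatingOfferCache<K, V> {
  private final Deque<Map<K, V>> buckets = new ArrayDeque<>();

  public RotatingOfferCache(int numBuckets) {  // expects numBuckets >= 1
    for (int i = 0; i < numBuckets; i++) {
      buckets.addFirst(new HashMap<K, V>());
    }
  }

  // Store a newly received offer in the newest bucket.
  public void put(K offerId, V offer) {
    buckets.peekFirst().put(offerId, offer);
  }

  // Remove an offer that was used to launch tasks or was rescinded.
  public void remove(K offerId) {
    for (Map<K, V> bucket : buckets) {
      bucket.remove(offerId);
    }
  }

  // Called from a periodic timer (e.g. every nimbus.monitor.freq.secs seconds).
  // The expired offers returned here are the ones the scheduler should decline.
  public Map<K, V> rotate() {
    Map<K, V> expired = buckets.removeLast();
    buckets.addFirst(new HashMap<K, V>());
    return expired;
  }
}
```

Under a model like this, how long an offer can be held before being declined depends only on the bucket count and the timer period, which is where a bound like the 75 seconds mentioned above would come from.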
Thanks for clarifying all that. I think the issue is with 2) (although maybe something outside of that exact code and configuration). Total speculation, but say nimbus is killed while it has a pending offer: would it know about that offer when it starts back again? Is there any other place where state about pending offers between nimbus and mesos could get out of sync? I was running mesos 0.27.1 when this was observed, but we have since upgraded to 0.28 and now have `--offer_timeout` set as well.
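To illustrate the speculation above, here is a hedged sketch under assumed structure (the `OfferTrackingScheduler` class and its field are invented, not the project's classes): if pending offers are tracked only in an in-memory map inside the scheduler callbacks, a nimbus restart wipes that map, so an offer the master still considers outstanding can no longer be declined from this code path, and only something like the master-side `--offer_timeout` would reclaim those resources.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.mesos.Protos;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

// Hypothetical sketch of the failure mode being speculated about: pending
// offers live only in this process's memory, so a scheduler restart loses
// track of offers the master may still consider outstanding.
public abstract class OfferTrackingScheduler implements Scheduler {
  private final Map<Protos.OfferID, Protos.Offer> pendingOffers =
      new ConcurrentHashMap<Protos.OfferID, Protos.Offer>();

  @Override
  public void resourceOffers(SchedulerDriver driver, List<Protos.Offer> offers) {
    for (Protos.Offer offer : offers) {
      pendingOffers.put(offer.getId(), offer);  // cached in memory only
    }
  }

  @Override
  public void offerRescinded(SchedulerDriver driver, Protos.OfferID offerId) {
    pendingOffers.remove(offerId);
  }

  @Override
  public void reregistered(SchedulerDriver driver, Protos.MasterInfo masterInfo) {
    // After a restart this map starts out empty: an offer received before the
    // crash is invisible here, so this code path will never decline it, and
    // without a master-side --offer_timeout the resources could stay tied up.
  }
}
```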
Small update: even with `--offer_timeout` set, the resources from an expiring offer still seem to be sent right back to the storm framework in a new offer.
thanks for the update @drewrobb. So if I'm understanding you correctly, you are saying that the framework is declining an offer and then its constituent resources are sent back immediately (in a new offer) to this same framework? You're not talking about the leftover resources from a task that used part of an offer, right? I'm not super familiar with all the vagaries of the mesos resource handling. Can you confirm whether you are using anything like reservations or other non-default resource handling mechanisms? I would think #134 would not prevent this -- it specifically is meant to lower the amount of time an offer takes to get back to the framework, in the case where we declined it due to expiring in the rotating map.
I can't confirm that the framework is declining the offer. The offer may have been timed out by our `--offer_timeout` setting on the mesos master instead.
@drewrobb : hmm, I'm a bit confused about what the issue is then. What are you expecting to happen in this case? Can you confirm some exact timings? I'm assuming you're running some other frameworks in the cluster and the behavior you're seeing is that the offers are largely (or all?) getting hoarded by the storm-mesos framework? Maybe we can jump on a call or something tomorrow. My email is my GitHub handle at gmail.com.
@erikdw I'd be happy to jump on a call, but I did some more debugging and I keep learning more about what might be happening, so I think I should jot down what I have learned so far. First, some more background on the actual problem: we are also running Marathon. Our cluster gets to a state where there is only one slave in the cluster with enough resources to launch some big task on Marathon. I observe that Marathon is stuck trying to launch that task for a very long time (at least an hour). I then look at the mesos UI outstanding offers page and see that storm has an offer for that slave. Marathon doesn't seem to ever get the offer; it is re-offered to Storm immediately. If I stop the storm scheduler, Marathon will quickly be able to launch its task. At this point in my debugging it is clear that the following is happening:
@drewrobb Is your problem the same as …?
@drewrobb Another question: do you use the "constraints" feature as mentioned in https://mesosphere.github.io/marathon/ ?
@dsKarthick that sort of describes my issue, but I have confirmed that it is storm that is hoarding offers in a way that I don't think it is meant to, and in a way that is independent of my Marathon usage. I've verified the following multiple times:
I do run the storm framework on Marathon. Some apps do use constraints.
@drewrobb I see that @erikdw already pointed out the offer filter `refuse_seconds`. But you are observing that the resources are being re-offered immediately. Is the storm framework using any of the offers that you think are being hoarded? If so, like @erikdw mentioned in his earlier comments, the `refuse_seconds` offer filter isn't applicable, and it's less of a surprise that the resources are being re-offered to storm itself. Quoting what you said in the previous comment:
Does this mean your storm tasks are constantly dying and not getting launched, or that the offer is not enough for launching the tasks? If your storm tasks are constantly dying, then it could imply that the resources are being used by the storm framework, and therefore the [offer filter refuse_seconds](https://github.com/apache/mesos/blob/0.25.0/include/mesos/mesos.proto#L1161-L1173) setting is not applicable.
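For reference, declining an offer with a refuse filter against the Mesos Java API looks roughly like the sketch below (a generic illustration, not storm-mesos's actual call site; the `OfferDecliner` class name and the 300-second value are invented). The point being made above is that this filter only applies to resources the framework declines and leaves unused; it does not apply to offers the framework is actively holding or consuming.

```java
import org.apache.mesos.Protos;
import org.apache.mesos.SchedulerDriver;

// Generic illustration of declining an offer with a refuse_seconds filter so
// the master holds the declined resources back from this framework for a
// while instead of re-offering them immediately. Class name and duration are
// arbitrary; storm-mesos's real code path may differ.
public final class OfferDecliner {
  private OfferDecliner() {}

  public static void declineWithBackoff(SchedulerDriver driver, Protos.OfferID offerId) {
    Protos.Filters filters = Protos.Filters.newBuilder()
        .setRefuseSeconds(300)  // ask the master not to re-offer these resources to us for ~5 minutes
        .build();
    driver.declineOffer(offerId, filters);
  }
}
```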
I found some nimbus logs that may be very helpful. See https://gist.github.com/drewrobb/78d0cbea14e2b78cacfa7e3a50f3578c. Here is what I observe in this log:
This raises two questions:
Logs are interesting indeed. They helped me identify a bug: https://github.com/mesos/storm/blob/master/storm/src/main/storm/mesos/schedulers/DefaultScheduler.java#L48. It should be …
I am assuming that your storm-core logs like the ones below are going to a different location.
Let me try to reproduce it.
@drewrobb I couldn't reproduce the problem on Friday. But earlier today, I inadvertently reproduced a problem with the symptom that you described. I realized while talking to @erikdw that I had built the storm-mesos archive wrong (I used `mvn package` and used the resultant jar to deploy, rather than using the one resulting from `bin/build-release.sh`). The moment I re-deployed the correct version, I could see the expected behavior. Do you want to do a hangout session with us (@erikdw and me) tomorrow? We are available between 3PM and 7PM PDT tomorrow.
@dsKarthick : Warriors game is tomorrow, so I'll be unavailable from 5:30pm on. |
@erikdw @dsKarthick today may work, I think I will have some time around 3-4pm |
@drewrobb 3-4pm it is then! Let's do a hangout. Could you email us at {d.s.karthick@, erikdw@}gmail.com?
Thanks for meeting with us today, @drewrobb. Our understanding of this issue is:
@drewrobb : we discovered a corner case in the code prior to #154 which might explain the behavior above. In our own storm-mesos-specific scheduler logic, we had a pathological case:
We have fixed that behavior as part of the (numerous) changes in #154. I've kicked off the build of release 0.1.5 of this project, so assuming it goes smoothly there will be a new version for you to try out within the hour. #154 was a large effort by our team (@dsKarthick, @JessicaLHartog, and myself) and it should unblock much quicker changes that will improve various things in the project, such as finally addressing #50 which relates strongly to this issue. |
@erikdw awesome work by your team. I've been on vacation but am back this week. I'll look to try out release 0.1.5 in our environment this week. |
@drewrobb : can you please confirm whether this behavior is gone with 0.1.5 or 0.1.7? I presume we have fixed this issue now. |
We believe this is resolved so I'm closing this issue. Please reopen if the issue persists. |
We observed in production our Storm framework having outstanding offers persist for much longer than the expected 2 minutes:
storm/src/main/storm/mesos/MesosNimbus.java, lines 179 to 189 at commit 9dc8258
Fortunately, it seems like an easy workaround is to just set the mesos-master option `--offer_timeout` (we didn't have this set prior). However, it seems like whatever underlying cause there might be in the storm scheduler to forget about an outstanding offer might need to be addressed eventually? At least in the meantime, I think it would be good to warn people to use `--offer_timeout`, because that might be easier than trying to figure out how to reproduce and fix this issue?