Intermittent deployment failures on ddtrace v1.12.0 - exit status 155 #2955
Hey Andrew 👋! Thanks for the detailed bug report. Given your mention of the issue showing up first on 1.11.0, which was the first release of dd-trace-rb where we started using signals by default, it definitely seems that's the cause.
Having said the above, your report is definitely a head-scratcher. For context: the new "CPU Profiling 2.0" feature relies on first installing a SIGPROF signal handler, and then periodically sending SIGPROF signals to the Ruby process to take samples. The profiler goes through a lot of checks when it installs the signal handler, and it turns itself off if anything fails. So it should not be possible for the profiler to send a signal to a process that no longer has the handler installed -- which is what the "Profiling timer expired" failures suggest is happening.
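As an aside for readers hitting the same exit status: on Linux, a process that receives a SIGPROF with no handler installed is terminated, the shell reports this as "Profiling timer expired", and the exit status is 128 + 27 (SIGPROF's signal number) = 155, matching the failures above. A minimal Ruby sketch of that mechanism (not dd-trace-rb code):

```ruby
# Fork a child that traps SIGPROF, send it the signal, and inspect how it ended.
# Comment out the Signal.trap line to see the fatal case: the child is killed by
# signal 27, which a shell would report as exit status 128 + 27 = 155.
child = fork do
  Signal.trap("PROF") { } # a no-op handler is enough to survive the signal
  sleep 2
end

sleep 0.5                           # give the child time to install the handler
Process.kill("PROF", child)         # deliver SIGPROF to the child
_pid, status = Process.wait2(child)
puts(status.signaled? ? "killed by signal #{status.termsig}" : "exited normally")
```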
Hi @ivoanjo! Thanks so much for your detailed reply.
Not all the env vars listed here are set, but version and service don't seem important? This is for our web server environment.
I went back to check if this is happening on our staging environment as well. It doesn't appear to have happened on staging, even though we deploy to staging after every pull request is merged and only deploy manually to production about once a day. Given this warning, I'm not super comfortable turning on debug mode in production. But it is strange that it doesn't happen on staging, where we have more deploys occurring. I wonder if there's some other config difference between the two we should sync up.
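For reference, the debug mode being discussed is an ordinary ddtrace setting, so it can be scoped to a single environment; a hedged sketch (assuming a staging Rails environment exists and the standard diagnostics option):

```ruby
# Hedged sketch: keep ddtrace's verbose debug logging limited to staging, where
# noisy logs are acceptable, and leave production logging untouched.
Datadog.configure do |c|
  c.diagnostics.debug = true if Rails.env.staging?
end
```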
Thanks for the answers!
Version and service are both optional yet recommended 😉 Setting the version enables you to easily compare deployments of the service (diff my performance profile vs the previous version). Setting the service allows you to control the name of the service that shows up in profiling.
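For anyone following along, a minimal sketch of setting both (the values shown are placeholders; the DD_SERVICE and DD_VERSION environment variables work as well):

```ruby
# Hedged sketch: tag traces and profiles with a service name and a per-deploy
# version so profiles from different deployments can be compared to each other.
Datadog.configure do |c|
  c.service = "my-web-app"        # placeholder service name
  c.version = ENV["APP_VERSION"]  # placeholder; any per-deploy identifier works
end
```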
Yeah, I do admit it's kinda hidden in the profiler troubleshooting page, but it's great you reached out, since we definitely want to try to fix all of the cases that need this workaround 😄 Thanks for sharing! I went through the list and the only thing that jumped out at me is that you're using "sentry-ruby", which can optionally also do profiling, but it relies on the stackprof gem for that, which also makes use of signals.
This should be fine, I don't think it'll impact profiling (or other features).
I'm definitely curious if you are able to spot any difference between your staging and production environments that may be triggering this on one but not the other. As for the impact of debug mode, completely understandable! Would you instead be up for trying to run a custom branch where I can add some useful, yet non-verbose, extra logging information?
Yeah, we could run a custom branch with extra logging! Again, we'll have to wait for a deploy failure to get info, but it will be worth it. Thank you for all the work you're putting in on this, I really appreciate it.
Great! I'll put something together and share it here in the next few days! In the meantime, if you get the chance to try running with the no signals workaround enabled and see if you still get the issues, that would be great! I'd like to use that experiment to confirm that the issue is indeed coming from signals sent by the profiler.
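For reference, enabling the no signals workaround is a one-line setting; a sketch assuming the option name from the profiler troubleshooting docs (the DD_PROFILING_NO_SIGNALS_WORKAROUND_ENABLED environment variable should be equivalent):

```ruby
# Hedged sketch: run the profiler in "no signals" mode, which avoids sending
# SIGPROF to the process at the cost of somewhat lower-fidelity data.
Datadog.configure do |c|
  c.profiling.advanced.no_signals_workaround_enabled = true
end
```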
Sure, will do!
Aptible support responded just now:
For now I'm leaving the
Hi, updating you on the latest.
Thanks for your help!
Got it, thanks for the update! In that case, I'll prepare a branch that adds the needed logging so we can track this down. I'll add it here once it's ready.
Branch for testing is ready! You can use it by changing your Gemfile entry to point at the debugging branch:

gem 'ddtrace', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/missing-signal-handler'

Every ~10 seconds, it should print one of two messages, indicating whether the SIGPROF signal handler is still installed or has gone missing.
If it detects the signal handler is missing, it will still let the Ruby process terminate immediately, but it will ask Ruby to print out a crash log. It will look something like
Let me know if you're able to test it, and what output you get! 🙏
Hey! Sorry for the long delay here. I have it running on prod and it shows me:
Haven't hit a faulty deploy yet with the debugging on, but we have had deploy failures in the past week.
Thanks for the update! Other than the added logging, nothing else is changed, so the chances of the issue occurring should remain the same. Also, since the branch is based on an older release of dd-trace-rb, let me know if you'd like me to rebase it atop dd-trace-rb 1.14.0.
We got a deployment failure! Do you see anything interesting in here? I can post the whole deploy log in a gist if that's helpful, but I think this is the most important part.
Thanks for the info. I was hoping that rather than getting the "Profiling timer expired" again, we'd get the extra info I've added, but it looks like my suspicion wasn't correct. 🤔 I'll admit I'm a bit puzzled: if the SIGPROF causing the issue is indeed sent by the profiler, how can the profiler observe that the signal handler is in place, and then, when the signal is sent, the process acts as if the signal handler is not in place? Here are a few questions that I hope will reveal some leads to investigate:
# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
max_threads_count = ENV.fetch("RAILS_MAX_THREADS", 5)
min_threads_count = ENV.fetch("RAILS_MIN_THREADS") { max_threads_count }
threads min_threads_count, max_threads_count
# Specifies the `port` that Puma will listen on to receive requests; default is 3000.
#
port ENV.fetch("PORT", 3000)
# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch("RAILS_ENV", "development")
# Specifies the `pidfile` that Puma will use.
pidfile ENV.fetch("PIDFILE", "tmp/pids/server.pid")
# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart
I'm not sure if those are the errors you mean. I've since reverted to the normal ddtrace release for a quieter Rails console :) I guess I don't know whether I'm seeing containers die and get auto-replaced by Aptible; I'm not aware of that behavior. Thanks for all your help @ivoanjo! This has been a real journey.
Hey! Thanks Andrew for your careful replies to all my questions.

Log analysis

I spent some time looking through the full logs, and it seems that there are multiple Ruby processes starting up and using profiling. For instance, it looks like the first one is the
Notice the pid in that log line. This process seems to finish without issues:
And then the web app tries to start:
and then it fails ~88 seconds later:
and then roughly at the same time these show up:
This is weird because of the pid also being 69. Linux usually doesn't reuse pids until they "wrap around" the maximum, so this message would appear to be referring to the rake process. But then rake was marked as completed 90 seconds earlier! And then it shows up again ~22 seconds later:
...which is extra weird. What is this second process that failed with "profiling timer expired"? If it's the original pid 69, how can it fail twice? Earlier on, I mentioned that your setup using s6-overlay should be fine and shouldn't impact this issue, but now I'm less sure 🤔. If you can ask Aptible support about the possibility of getting logs using the pids, that would be great -- it would help in trying to understand which Ruby processes are actually failing, and when.

Configuration

One odd thing I noticed in your configuration is that you mentioned you were running the Datadog agent together with the Ruby processes, but the Ruby processes are not configured to talk to this agent:
Should they be configured to report to localhost instead?
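If they did need to point at a local agent, the usual knobs look something like this (a sketch using the standard agent settings; DD_AGENT_HOST and DD_TRACE_AGENT_PORT are the environment-variable equivalents):

```ruby
# Hedged sketch: tell ddtrace where the Datadog agent is listening. When the
# agent runs alongside the app in the same container, localhost is usually right.
Datadog.configure do |c|
  c.agent.host = "localhost"
  c.agent.port = 8126  # default trace agent port
end
```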
Other notes

Right! We've fixed that since in #3011, but the debug branch did not include this PR yet.

Next steps
I'd like to try enabling the no signals workaround next, with the latest ddtrace (1.14.0), if you can. If it works with the no signals workaround enabled, we'll have confirmation that the rest of the profiler is working fine. If it doesn't, I think that will also provide valuable insights.
Thank you! Very detailed, love the analysis.

Re: Configuration for our Agent URL:

Our DD_* env vars:
Our Datadog initializer:

# frozen_string_literal: true
require "faraday"
require "net/http"
require "redis"
require "rest_client"
unless Rails.env.development? || Rails.env.test?
Datadog.configure do |c|
c.tracing.instrument :faraday, split_by_domain: true
c.tracing.instrument :http, split_by_domain: true
c.tracing.instrument :rails, log_injection: true, service_name: "rails"
c.tracing.instrument :redis
c.tracing.instrument :rest_client
c.tracing.instrument :sidekiq, log_injection: true, service_name: "sidekiq"
c.tracing.instrument :pg, comment_propagation: "full", service_name: "postgres"
c.version = APP_VERSION
end
Datadog::Tracing.before_flush(
# Do not send health check traces to Datadog
Datadog::Tracing::Pipeline::SpanFilter.new { |span| span.name == "rack.request" && span.get_tag("http.url") == "/health" }
)
end

I will:
I'm a bit confused as to why ddtrace would be starting for the rake tasks we run during deploys.
Maybe this isn't the recommended way to start it anymore? I also just got a CI test failure where it tried to connect to Datadog in a spec and failed. Probably unrelated, but I thought I'd mention it, given we're talking about ddtrace running when I'm not expecting it to be. DD_TRACE_TEST_MODE_ENABLED is not set, and should default to false. Test output:
I apologize if that turned into a ramble! Thanks for all your help.
Thanks for clarifying the agent setup question. It shouldn't affect this case, so don't worry about changing it. Also, thank you for sharing your configuration. That actually clarifies the next question:
Adding the Datadog.configure block in your initializer is one place where dd-trace-rb gets set up. But, if something else triggers dd-trace-rb start-up (for instance running the rake task during a deploy), the profiler can end up starting there as well. Apologies for the confusing behavior in this case! My recommendation would be to prefix the rake task to disable the profiler for it, e.g.
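As an illustration only (a hypothetical alternative rather than the exact prefix recommended above, assuming the standard profiling.enabled setting), the profiler can also be kept off for rake runs from inside the initializer:

```ruby
# Hypothetical sketch: skip profiling for processes started via rake, so one-off
# tasks run during deploys never start the profiler's signal machinery.
Datadog.configure do |c|
  c.profiling.enabled = false if File.basename($PROGRAM_NAME) == "rake"
end
```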
This may be related to a change we've made recently, where the library sends some telemetry to the Datadog agent when it starts up. In the latest version of ddtrace we've disabled this in a number of non-production setups (e.g. #3039); if you do see it on 1.14.0, feel free to open a ticket and we can look into fixing our detection :)
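If you did want to silence that telemetry explicitly for test runs, a sketch (setting name assumed from the public dd-trace-rb configuration docs):

```ruby
# Hedged sketch: turn off library telemetry in the test environment so specs
# never try to reach a Datadog agent during CI runs.
Datadog.configure do |c|
  c.telemetry.enabled = false if Rails.env.test?
end
```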
Test failures

Cool! Y'all are on top of it 💯

Takeaways:

I will:
Thank you! Glad to hear we don't need to mess with our agent config for now 😄
We hit the same error again (
Yes, GitHub issue! It's always OK to open a support/Zendesk ticket, but in this case they'd have to escalate to us anyway ;)
So far so good on deploy stability. I'm wondering what you want the next steps to be: whether you want to close this ticket, or whether you have more debugging steps you want to try. I know the no signals workaround is not the preferred long-term solution, and I'd love to be a vanilla Datadog customer without workarounds. But at the same time, I think we're getting what we need, and I could forget about this in a few weeks. Thanks for all your help!
Glad to know the "no signals" workaround is working!
Very fair! To be clear, the "no signals" workaround is a feature of the profiler that will remain for as long as the profiler needs to use signals to work. Our push to avoid its use is about improving the ecosystem and the data we can gather, not because the workaround is going away. (I do have some ideas on how to profile without signals [in a better way than what the no signals workaround does] that I hope to prototype at some point, but no specific plans for now!)
Thanks for the patience so far! I have an idea for what we could try next, but if you're happy with the current configuration and want to move on to more interesting things than helping me muck about in the profiler, that's 100% understandable, and I'll be happy to close the ticket. (Out of curiosity, my idea was to prepare a branch that changes the profiler to use a different signal than SIGPROF. This would allow us to check whether something is specifically interfering with SIGPROF on your setup; also, some signals generate core dumps when they aren't handled, which would be useful for debugging.)
I'm down to keep going! I mostly thought you'd want to close this out. I'm going to close out the active ticket on my side, so my response time may be a bit slower if I'm busy with something else. All that sounds super cool, and I should be able to keep testing on my side. Thanks!
Awesome! Here's a new branch with the experiment I described above:

gem 'ddtrace', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/experiment-use-sigtrap'

And here's what to expect: this branch is built atop v1.14.0 stable + two changes:
If everything goes well, and the issue we were seeing was specific to SIGPROF, then the deploy failures should not show up with this branch. Additionally, the profiler will try to enable core dumps, so if that works you should see a log message such as:
With both together, if the issue still shows up, you may see a core dump get written when the process dies. Also, don't forget to disable the no signals workaround, otherwise none of these changes will do anything ;)
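On the core dumps point: one common way for a process to allow core dumps of itself is to raise its RLIMIT_CORE soft limit, which Ruby exposes directly. A sketch of that general technique (not necessarily exactly what the branch does):

```ruby
# Hedged sketch: raise the soft core-file size limit to the hard limit so the
# kernel is allowed to write a core dump if this process dies from a signal.
soft, hard = Process.getrlimit(:CORE)
Process.setrlimit(:CORE, hard, hard) if soft < hard
puts "RLIMIT_CORE is now #{Process.getrlimit(:CORE).inspect}"
```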
@arich did you by any chance get the opportunity to try it out?
I have not yet! I'm sorry, things have been busy on our side. I'll try to do it in the next two weeks. Thank you for your help!
Hey @ivoanjo, we deployed your fork yesterday. Thanks for your patience with us! We're not seeing any different output in our logs so far; is that expected? I think I was expecting to see some new log messages. I'll wait and see, and let you know if we hit deploy errors! Thanks again for your help.
That's indeed unexpected. The crash handler functionality is independent of the rest of ddtrace and logs through Ruby's own built-in facilities. If it's not showing up, that's definitely curious; let me know if you don't get any of the expected messages, and I can tweak the logging.
Fingers crossed that we'll finally get to the bottom of this! :)
We are getting:
Oh, good! That means we should be in a good state to collect a core dump if the app fails due to the profiler's signal again.
Hi! We got a deploy failure. Deploys have been pretty reliable (few failures); I think this is the second one I've seen. I think this is the relevant part; here's a gist of the full logs if you need it.
Thanks for your help!
Thanks for the reply! So I guess this settles it -- the signal causing the app to fail is definitely coming from the profiler, since now we're seeing the error that an unhandled SIGTRAP causes:
and the exit status also matches SIGTRAP (133 - 128 => 5, which is SIGTRAP's signal number)
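The arithmetic can be double-checked from Ruby, since shells report 128 + signal number when a process is killed by a signal:

```ruby
# SIGTRAP and SIGPROF are signals 5 and 27 on Linux, i.e. shell exit statuses
# 133 and 155 respectively when they go unhandled and terminate the process.
p Signal.list.slice("TRAP", "PROF")  # => {"TRAP"=>5, "PROF"=>27}
```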
Would you be willing to share the core dump that was generated, so I could investigate further? I need to dig deeper to understand what's happening here, where the profiler seems to be sending signals but not handling them properly. I'm definitely at a loss as to what's up; this issue is definitely unique :) If so, you'll probably want to share the core dump privately -- if you drop a note to support at https://docs.datadoghq.com/help/ and tell them you need to send a large file to debug an issue, they'll guide you through the steps to get it across :)
I'm not sure how to get you the core dump! We're deploying with Aptible, a service a lot like Heroku, by running a command-line script to deploy. Would the core dumps be on the local computer of the developer who deployed, or somewhere in Aptible? I'm guessing the latter. I'm pulling the logs from their website. I can open a support ticket with them to try to get additional logs from the deployment failure.
I spent some time looking through the Aptible docs as well. There's "aptible ssh", but it can't access an existing instance -- it always starts a new one -- and since the crash is very rare, it doesn't seem very feasible to catch it there. I'm thinking: since they seem to have enabled the capability to get core dumps at the kernel level (hence the message that core dumps were already enabled), maybe they use such a facility themselves and have some tooling to access it. Thus, it may be worth asking support if they have anything in place for that.
Right, I asked:
They responded:
So it seems like no, but I could push them on whether they have anything in place to capture it.
Ah, that's a shame. We'll need to do it the harder way, sorry! :( So, when a core dump gets generated, there's a setting that tells Linux where to place it and how to name it: the core pattern option. You can probably use the aptible ssh feature to start a container and look at the current setting there. E.g. here's an example from my machine, where I cause a crash to happen to check the core dump as well:
After getting the core file, it's usually helpful to compress it (since it tends to be a few tens/hundreds of megabytes and usually compresses well), and then use some tool to upload it somewhere. If you have a way to upload it somewhere (e.g. S3), that works great! If not, there's an internal support process for Datadog to help out, so let me know if you prefer to use that approach.
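A tiny Ruby check for the core pattern setting mentioned above, runnable from an aptible ssh session (assuming a Linux container):

```ruby
# /proc/sys/kernel/core_pattern holds the path template for core dumps; a leading
# "|" means cores are piped to a handler program instead of written to a file.
puts File.read("/proc/sys/kernel/core_pattern")
```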
Right, but will this be helpful if it's just from a random ssh container? We still have the issue that we can't get the core dump from the deployment container, because it's killed immediately after the deploy fails. Right?
Yes, that's right. I mentioned using aptible ssh since the first step is figuring out where the core dump is getting (erm...) dumped, and I assumed they provide the same configuration on those app instances. After that comes the more annoying part, which is somehow tweaking the startup scripts (s6-overlay?) to automatically go to that folder, zip up the core, and upload it when a crash gets detected. I realize that it's a big ask; so if you prefer to stay with the "no signals" workaround, since we've seen it works fine, that's always an option too -- we plan for that feature to stay.
Gotcha. Yeah, we're going to go back to the "no signals" workaround for now and may revisit this later on. This is our busy season, so I don't see this happening for ~3 months at least, as long as we have a good workaround. I'm not super proficient when it comes to capturing this specific error in s6-overlay and uploading it somewhere, so yeah, that does sound more complex than I'm willing to invest in here. Thank you so much for your help and patience on this long-running issue! Closing.
Thanks for all the info! This is/was definitely a tough one 🙇.
Current behaviour
Hello! Since upgrading from ddtrace v1.10.1 to 1.11.x and 1.12.0, we've been having intermittent deployment failures due to our rails web worker exiting with status 155.
We use Aptible for hosting. Here's the specific error from a failed deploy:
Expected behaviour
Deploys should consistently succeed without intermittent issues. We've had to downgrade back to ddtrace 1.10.1 to stabilize our deploys.
Steps to reproduce
I'm not totally sure how to reproduce this reliably. We do not use the mysql2 or rugged gems. I can provide our full Gemfile.lock but I'll start here and see if we can get somewhere without it. I'm wondering if you've heard similar reports from anyone else?
Our Aptible support led us to believe this was related to the ddtrace upgrade:
Thank you for your help!
Environment