-
Notifications
You must be signed in to change notification settings - Fork 15.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seg fault using Ruby google-protobuf v3.15.8 #8559
Comments
I should note the gRPC failure is happening in |
Thanks for the thorough report. Yes debugging symbols or a repro would help a lot here. |
They are using grpc-1.30.2 https://github.com/grpc/grpc/releases/tag/v1.30.2 . Why not using grpc latest (1.35.x)? |
@stanhu Your bug report is not consistent with the log you provided:
Version: v3.15.8 Also you can add grpc version: 1.30.2. |
That was just a leftover from the template. The information was in the title. |
Is there any chance you could reproduce this using a build of protobuf with symbols (and ideally debug symbols)? If you checkout this repo and do I'm also interested in seeing if we can get symbols into our binary gems, but for now this would be the easiest way of getting a better stack trace. |
@haberman Our team is trying to get this, but it's been tricky to reproduce the issue in CI. We tried doing this: gem uninstall google-protobuf
gem install google-protobuf -v 3.15.8 --platform=ruby -- --with-cflags=-ggdb3 I think that's enough to get debug symbols, but we didn't see anything in the Ruby backtrace:
We did get a core dump. We may have to save the
|
Details from the core dump:
Is |
It seems like that should be more that enough. Any build that does not actively call If you're not getting a good stack trace with that command, it seems that either you're still getting the stripped prebuilt or the local build process is stripping the
The |
Yes, it is. I found that the |
Ok, we have a real stack trace now!
|
More details from the core dump:
If i try to look at the state:
|
More context: the protobuf encoding was attempted by this gRPC call:
I believe the protobuf definition is: message FindCommitsRequest {
Repository repository = 1 [(target_repository)=true];
bytes revision = 2;
int32 limit = 3;
int32 offset = 4;
repeated bytes paths = 5;
bool follow = 6;
bool skip_merges = 7;
bool disable_walk = 8;
google.protobuf.Timestamp after = 9;
google.protobuf.Timestamp before = 10;
// all and revision are mutually exclusive
bool all = 11;
bool first_parent = 12;
bytes author = 13;
enum Order {
NONE = 0;
TOPO = 1;
}
Order order = 14;
GlobalOptions global_options = 15;
bool trailers = 16;
} With the gRPC definition: proto
rpc FindCommits(FindCommitsRequest) returns (stream FindCommitsResponse) {
option (op_type) = {
op: ACCESSOR
};
} Likely this is an encoding issue with |
Thanks for this excellent information! Looking at the code in question, it is relatively straightforward and there is not a lot of room for a bug to live. It seems likely that the bug is not in this code per se, but that the proto was somehow corrupted before reaching this point. Is there any chance the bug would reproduce under Valgrind? That could give some extra information to validate the status of the memory that the encoder is trying to read. |
Unfortunately, I don't have a concrete reproduction step yet to use Valgrind. Right now I can only retry our build process, which takes an hour to finish. Will compiling this Ruby extension with AddressSanitizer via |
I think I'll brainstorm some more the best way to debug this, I don't want to send you on too many wild goose chases. Does the issue repro fairly quickly once the process has started, or is it more intermittent? |
It's intermittent but annoying enough that happens a few times a day. It definitely happened with the upgrade to v3.15.8. We reverted to v3.14.0 and never hit the error. We had to upgrade to v3.15.8 a few weeks after the revert, and now we see the issue again. We just retry the build whenever it happens. I'm not sure if it ever happens in production, though. I'll have to look for that. |
Ah I see, if it's that intermittent then debug printf might not be the best way to tackle this one. Yes it totally makes sense that 3.15 would have been the point where this was introduced. There was a major change in 3.15, a rewrite of the data layer basically (#8184), which improved performance and simplified the code a lot. But like all new code, it's had some bugs to work through. Thanks for your patience while we figure them out. Can you give a summary of how this message (the one that is crashing in |
The message is built from scratch: https://gitlab.com/gitlab-org/gitlab/-/blob/ffed5fe4139bb4a49cf1b262dfabff16b7dd38fc/lib/gitlab/gitaly_client/commit_service.rb#L331-351 Line 349 is where the repeated call happens. In the test that hit this effort above, I think the value of The |
Thank you for the code reference; that was enough of a lead that I was able to find the bug! See the attached PR, which I am confident will fix the crashes you are seeing. I was able to reproduce the issue under Valgrind and verify that my fix removes the Valgrind error. This fix should be released in 3.17.1, which I expect will be released on Monday or Tuesday. |
@haberman Awesome, thank you very much! |
@stanhu You're welcome! Small update: this will be in 3.17.2, not 3.17.1. The 3.17.1 release was already in progress when I submitted this. But 3.17.2. should still be released in the next day or two. |
@haberman I think 3.17.1 has this fix. 😄
|
Excellent, even better. :) |
Now I wonder how long it takes for this release to land in Rubygems. https://rubygems.org/gems/google-protobuf |
I think it is up now: https://rubygems.org/gems/google-protobuf/versions/3.17.1-x86_64-linux |
Just to follow up here, I have not see a seg fault since we upgraded to v3.17.1, so thanks for all the work here. I should have remembered that funny things can happen during garbage collection. |
google-protobuf v3.15.x and v3.16.x can seg fault if repeated fields in a Hash are garbage collected: protocolbuffers/protobuf#8559
What version of protobuf and what language are you using?
Version: v3.15.8
Language: Ruby
What operating system (Linux, Windows, ...) and version?
Linux
What runtime / compiler are you using (e.g., python version or gcc version)
What did you do?
Still working on a reproduction step, but we upgraded from google-protobuf v3.14.0 to v3.15.8 and started seeing intermiittent seg faults in CI. It looks like some issue with encoding a protobuf in gRPC.
We may need to turn on debug symbols in the
protobuf.so
because we aren't able to see the function and line numbers in the backtrace.What did you expect to see
No seg fault
What did you see instead?
Seg fault
Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).
This is the relevant information from the
job.log
(job.log):The text was updated successfully, but these errors were encountered: