MPPI crashing on loading plugins #3767
This is another issue. Please compile with debug flags according to the docs here (https://navigation.ros.org/tutorials/docs/get_backtrace.html) and give me the traceback so I know where the crash is occurring. I looked over the recent changes and I can't find anything obvious that would cause this. Make sure that you do a clean build (e.g. delete the install/build spaces) in case some caching issue from a major sync update is causing this. |
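For anyone unsure how to get that backtrace, here is a minimal sketch following the linked tutorial (package/executable names are the stock Nav2 ones; the params file path is a placeholder):

```bash
# Run the controller server directly under gdb so a crash drops into the debugger.
# Package/executable names are the stock Nav2 ones; the params file is a placeholder.
ros2 run --prefix 'gdb -ex run --args' nav2_controller controller_server \
  --ros-args --params-file /path/to/nav2_params.yaml

# Once it crashes, inside gdb:
#   (gdb) backtrace
```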
I was able to get this backtrace when running mppi from the binary packages:
I hope this helps. When I build from source on the humble branch there is no problem; everything works like a charm. |
What is your computer architecture? That is also not a traceback; it doesn't show me where the issue is in particular. I still don't see anything obviously wrong to address. If you compile it from source from the humble branch, does the issue still occur? |
My computer architecture is:
I am sorry, I don't have experience with gdb or backtraces. I followed the tutorial and compiled from source. |
I met this problem as well. I found that this controller doesn't work in the version installed by 'sudo apt install ros-humble-nav2-mppi-controller', and solved it by building nav2_mppi_controller from source rather than installing the binary. Perhaps the binary version wasn't updated? |
This is really odd. I looked into this in detail and I don't see any changes in the function that @AmmarAlbakri got a trace pointing to. Nothing changed with respect to that function or its inputs -- but things around it did change. The rest of Humble and the sync look identical to me, so if it's working on local builds but not in binary form, I can only see it being caused by one of 3 things:
The second item is the one I can have you guys easily test. For the first, I've asked Intrinsic if we can re-trigger a build to test. The last is more complex and I don't want to go down that rabbit hole unless necessary. Please let me know! |
Once I purge nav2-mppi-controller, the whole navigation stack gets removed, so I reinstalled it. [controller_server-15] [INFO] [1693269845.941573472] [controller_server]: Created controller : FollowPath of type nav2_mppi_controller::MPPIController Then after the PreferForwardCritic loaded, the error occurred. As you can see, the controller server crashed after loading the critic. |
Can you re-order / remove some of the critics? Is there one in particular that makes it crash? Someone else got a backtrace that pointed to the Obstacle critic -- but maybe it's not that simple. Try with 0 critics and work up from there one by one. If purging and reinstalling doesn't work, I don't think it's the debian update process that's the issue. The binary build job looks fine to me on build.ros2.org, so I'm not sure what could cause the issue there. @StetroF what's your computer architecture? |
I tried. The first time, I deleted the ConstraintCritic and ObstacleCritic, and found that it still crashed. But I saw some logs related to PathAngleCritic's initialization, so I thought that critic must be fine. The second time, I deleted all critics except PathAngleCritic. But this time, after loading the critic, the logs were:
Then it died again. I've tried other options that launch with only 1 critic, like GoalAngleCritic, GoalCritic, etc. The logs are similar to the ones above: after 'critic loaded ......' and 'Optimizer reset' it crashed.
Then it crashed again. Lastly, my computer architecture is:
|
Fun.
That's the last line of the optimizer's initialization in the bringup: https://github.com/ros-planning/navigation2/blob/210977790bfb6e809f2c2ef3b7e1660662d630cc/nav2_mppi_controller/src/optimizer.cpp#L54 The very next thing is the path handler initialization. The following lines were added -- but all are in the 1.1.9 sync that was backported to humble.
I'm morbidly curious whether that crash goes away if you add those parameters to your parameter file for MPPI (e.g. something going on with missing parameters). Keep all the critics out of the configuration. I doubt it'll do anything, but if it does, that would be extremely relevant to know. Also, getting a backtrace for the crash that occurs when you're not using any critics would be another valuable data point: https://navigation.ros.org/tutorials/docs/get_backtrace.html. Especially if you're not seeing the exact same behavior / issue as @AmmarAlbakri. What seems odd to me is that both of these areas of code were changed recently, but neither has been changed since 1.1.9 was released -- you've mentioned you've cloned that tag, which is the exact same software, and it seems to work. It makes me think something's up in buildfarm-land. Edit: Another thing to test is https://github.com/ros-planning/navigation2/blob/main/nav2_mppi_controller/CMakeLists.txt#L38-L47 It seems that the illegal instruction complaint might be that those flags are illegal for your CPUs but legal for the build farm's. If you change that compile-option block to remove those flags (see the sketch after this comment) and compile locally, do you see any change in behavior with the locally compiled version (e.g. a compile failure, or the same binary-related issue)? I want both @AmmarAlbakri and @StetroF to try this since you may not actually be experiencing the same issue. |
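A rough sketch of that local test, assuming a standard colcon workspace (workspace path and branch are illustrative; the flag block to comment out is the one linked above):

```bash
# Build nav2_mppi_controller from source with the SIMD compile-option block disabled,
# then overlay it on top of the binary install to compare behavior.
mkdir -p ~/nav2_ws/src && cd ~/nav2_ws/src
git clone -b humble https://github.com/ros-planning/navigation2.git
# ... comment out the add_compile_options(...) lines in
#     navigation2/nav2_mppi_controller/CMakeLists.txt here ...
cd ~/nav2_ws
colcon build --packages-select nav2_mppi_controller
source install/setup.bash   # overlay the locally built package over the binary install
```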
@SteveMacenski I have the same problem; I use the ROS Humble Nav2 stack inside a Docker container. The MPPI controller launches correctly in a container created on the second-to-last Friday (18th August) and doesn't launch in a container created yesterday (29th August). I tried some of the advice you gave in the last reply:
No combination of the above worked. I didn't try to compile the package locally. I could play with it when I find some free time, but let's hope someone else checks it before then. |
Any traceback to what method specifically crashed? This would be helpful, especially if it changed between different runs / things.
Please do. I'm playing with this right now but I only have about 2 hours today to spend on it. Especially if the types of crashes are different on different people's computers, I'll need that data from another source to know if I'm barking up the right tree. I see this now that I'm testing it locally, which is good in that I can reproduce and experiment myself -- though it's difficult since it only appears in the binaries :( |
Notes:
w/ the Obstacle critic it's always:
W/ goal angle, Prefer forward critic only, Constraint critic only, path follow only, path angle only, path align only, goal angle only, or twirling only: it switches between one of these two on different runs (which seems reasonable b/c the NoiseGenerator is in another thread, so it's basically "who triggers the failure first").
Both of which complain of an illegal instruction, whereas the obstacle layer's seg faults outright. That seems to point to the ordering:
So let's look at these functions
What didn't change anything
--> So I'm pretty sure it has nothing to do with configurations being broken. My local xtensor version is ... I just tested and the Iron binaries are broken too. Beyond the fact that these are orbiting pieces of code that have changed recently, I can't find any throughline, or a reason why this would be happening only in the binaries but not in source builds -- nor why anything has illegal instructions. |
I have tested compiling with |
Just running the nav stack gave no specific error; it was just about
However, after adding
So it's that method. I'm sorry, but I probably won't have time this week to compile the package locally, so if no one gets to it before me, I will try next week. |
I'm running a re-release on Iron and Humble to see if that resolves the issue, in case it was due to a transient build farm issue. I'm not overly hopeful, but it's worth a shot. |
I was facing this issue today and was able to fix it by compiling from source. |
We have recently encountered this issue as well when deploying on new systems. I can confirm that all of our robots that have the older
I even have the older A new container or updated device using version
In all cases the order of critics in the config is always:
I can't explain this and would need some help from OSRF to make an attempt via some custom jobs to see if we can find the error. @nuclearsandwich @clalancette @wjwwood I know you're all busy, but this is above my knowledge level and all the best debugging I've been able to do points back to the build farm. See #3767 (comment)
Without help, the best I can do is just re-release and pray, or try to release under a different name to create a new binary from a new job and see if that works. But neither of those is a real solution to make sure it is resolved (and that's testing in production). I'm at a loss for direction since I don't know that there are any dials I can play with on the build farm side myself. I re-released the same version to Iron, as Chris suggested. I just tested on ros2-testing that it is still indeed crashing after a new build with the new version, so it wasn't a one-off fluke. I've been through the changes with a fine-toothed comb. Unless the issue is with some implicit casting or comparisons of different types (which I'm now fixing just in case), I'm not seeing whatever this issue is, if it's in the code at all. The next best guess is that it has to do with compiler flags (and the CPU or environment in the build farm changed?), though our flags haven't changed since 1.1.8. If @Imaniac230 @jankolkmeier @AmmarAlbakri or any others in the thread have a few minutes, I'd be curious if you could package up your install space and send it to me (x86 only). I'll try running it on my computer to see if we can replicate this without the build farm. |
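For anyone packaging up an install space, something like this is probably enough (assuming a standard colcon workspace layout):

```bash
# Archive the colcon install space so the locally built libraries can be tried on another machine.
cd ~/nav2_ws
tar czf nav2_install_$(uname -m).tar.gz install/
```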
I've compiled a list of the installed packages. Now, this produced only a list of packages and did not archive any of the binaries locally, so it will still download everything from the remote repositories. Also, any locally compiled packages were not included in this. I'm not sure if this is the desired output. If it's something else you had in mind, I'll see if I can get that working (like |
I was suggesting instead to package up the install space of the manual build, to see if I could get those libraries to run on another computer, in case some build flags were obviously causing problems on a different CPU of the same architecture. I'm not convinced it'll show us anything, but it's definitely worth a try if the issue is portability. If it does cause a problem, then that could lead to a workflow for testing solutions! |
Oh, sorry, I was going somewhere else with my thoughts. So I've done some additional experiments in the meantime. The version manually compiled from source (tag ...) works. I also tried using different generators. The system I'm testing this on now is:
I just quickly verified that with the default cmake setup the support does get detected and the compile options would get added on my system if left as is:
output:
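For a cross-check outside of CMake, the compiler itself can report which of these options -march=native would enable on a given machine (a rough sanity check; output format varies by GCC version):

```bash
# Which SIMD-related options -march=native would enable on this machine
# (look for [enabled] next to -mavx2 / -mfma in the output).
gcc -march=native -Q --help=target | grep -E 'mavx2|mfma'
# CPU-side view of the same capabilities.
grep -o -E 'avx2|fma' /proc/cpuinfo | sort -u
```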
Then I've tried a dumb hack by replacing my manually compiled lib files from
All of the critic functions pass through OK and it fails on a controller function.
This time it already crashes on a critic function. So it seems that there is something going on with both libraries. Next I can make a simple archive from the manually compiled and working libs. |
I spent a good chunk of last night also looking into MPPI and I found that the ... @Imaniac230 thanks for the debugging effort, and testing out the specific libraries is a good idea (I didn't think of that)! The two errors you bring up look a lot like the tracebacks I got in my (long) comment trying out different combinations of settings. It looks like the difference between the 2 categories I reported is which library crashes -- the logs you sent align perfectly with them. Glad to have validation as well on the -mfma and -mavx2 flags; that's what I also see! Can you send me that install space? |
Sure, I guess I shouldn't be posting it here? So expect an email from me. In the meantime I'll also try to play with the compile flags. |
The mail server seems to be rejecting the attachment. Is there a channel where I should send it instead, or should I place it here? |
... I guess it is just libs, that does look suspect as heck. You can either put it in google drive / drop box as a tar file and email me that link or join the nav2 slack (https://join.slack.com/t/navigation2/shared_invite/zt-uj428p0x-jKx8U7OzK1IOWp5TnDS2rA) and hope that it doesn't have the same constraint ... Please don't virus me, that would be unchill |
Well, I zipped the whole contents of the |
This time it passed through with just the libs. Please let me know if it's all good. |
@Imaniac230 I tried your file and it fails for me in one of the expected ways! Can you try again without the flag? I'm pleasantly hopeful that this will resolve the problem, but then equally concerned as to what that'll mean in terms of the binary's performance vs building from source. And really confused as to why this is only an issue now. |
I'm not sure I can create a difference. Both libs compiled locally on my machine work without errors, with or without the flag. However, when replacing just one of them and leaving the other one original (just as in my comment #3767 (comment)), it did crash at a different function:
Meanwhile, I did some experiments on a NUC machine:
and got some different behavior. Any possible combination I tried worked without errors on the Intel:
Now going back to the AMD machine:
I haven't removed the flag yet, though. |
It's not about compiler failures, it's about portability. If you send me libs built without that argument, I can see if that's what's causing the portability problems. Under the hood it enables a flag which could potentially be triggering this behavior! |
So I'm starting to believe this is actually narrowing down to a compatibility problem with AMD CPUs. I've double-checked that I haven't made any mistakes when porting the libs to the Intel machine, and it really does all work there. I've downloaded the latest binaries and verified that all of the machines we had these problems with do have AMD CPUs in them. Then I built the libs locally on the Intel machine, but without the flag. This might also explain why it seemed to work on some systems and not on others. Lastly, I added the flag
and did the same cross-checking. It produces exactly the same behavior as only removing or adding the flag.
It is interesting that it fails at a completely different point than any of the previously reported ones. I'm not sure if the release builds are being compiled differently for different CPUs and this is actually normal behavior, or if there is something else going on here. For additional info, I've taken a look at the specific instructions being called in each function when it crashes (in the debugger).
From the Intel documentation (https://www.intel.com/content/dam/develop/external/us/en/documents/319433-024-697869.pdf), the faulting instructions appear to come from extensions that not every CPU supports. Now, my knowledge of these instruction sets is limited, but this would also explain why the remote binaries were not working for some people, while if they compiled locally from source the compiler would use only those instructions that are actually available on their hardware. The one thing that is bugging me, however, is that all of the libs that I sent you @SteveMacenski were compiled on an AMD PC without the flag. |
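For reference, a rough way to do that kind of inspection yourself: print the faulting instruction in gdb, and scan the installed library for AVX-512 code (the library path/name is illustrative of the Humble binary install):

```bash
# Inside gdb, right after the SIGILL/SIGSEGV is raised:
#   (gdb) x/i $pc        # the exact instruction that faulted
#   (gdb) info registers

# Count AVX-512 instructions in the installed critics library (AVX-512 code uses the zmm registers).
objdump -d /opt/ros/humble/lib/libmppi_critics.so | grep -c 'zmm'
```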
I have an intel i7-8565U and 1.1.9 is broken for me. I also see this problem on my Intel CPU using the libs you provided me with
If you compiled on AMD, then my comment above invalidates that theory. See my long comment above on the exact failures I saw from the build-farm generated binaries. Supposedly also the build farm is using AMD for building the binaries https://build.ros2.org/job/Hbin_uJ64__nav2_mppi_controller__ubuntu_jammy_amd64__binary/11/consoleFull
Interesting. A hail mary I threw was 91b688d last week, which explicitly does the type conversions that I noticed were implicit in the diff between 1.1.8 and 1.1.9.
@Imaniac230 I'd be curious what you see if you were to compile from source with those changes. Still, none of this explains why this started now, unless I was literally perfect in all my other casts everywhere else, which seems awfully unlikely. But it's a good start. I can't explain that Intel - AMD relationship you mention, though -- also portability issues across Intel? Mind running that test for me using the newest changes to the MPPI controller (these two commits: b0a68bb and 91b688d)? Maybe we can resolve this by removing the illegal-instruction-generating operations themselves. I'm not sure how I'll prevent this from happening in the future, but I suppose one step at a time.
Does it work for you on your NUC if you take the binaries from the build farm and the ones built on the AMD processor with the flag? |
Actually, just pull in Humble again after #3836 is merged. This includes those changes I made to MPPI. You can just use that instead to test if my changes fixed some or all of these issues |
I was hoping a bit that it could be that simple. But all the libs I sent you were built with
I couldn't find any reference to the specific HW being used. The
Yes, they all work. I was not able to get it NOT working with any test.
So here is what I have.
Pulled
calling I also downloaded the release
So now the latest release libs of And after compiling without |
Wait, are your tests showing that compiling even on the same CPU makes it crash now? Make sure I didn't mess up part of my update... Edit: I just tested building Humble from source and it works fine for me on the same CPU... that was never an issue?
That error is notably different from the others! Can you check where that happens? The 2 dynamic casts I see are:
If you exclude the obstacle critic, are there any other traceback points that show up now? This seems like progress. If we update the layer cast to:
Does that work, if that's the last issue? Or, I'm wondering if either of those is even where that's being called -- those are both lines that haven't changed. Also, I'm curious if you tried 1.1.8 across CPUs, the last version where things worked. If you do that, do you not run into any issues either way? Just to close the loop that this test actually shows the issue we think it does (or whether there's actually something else going on here, like the CPU that the build farm was using changed). It may be updated CPUs or code changes that caused this to some degree -- but it would be good to validate whether our issues are fully internal or also external.
Huh? 1.1.9 is what broke all the things we're discussing, no? It looks like a new build of it came out, though. |
... I just did some testing and it appears that
Ahhhhhh. This may be it. I don't know the path forward. On dates, the only job to make "working" binaries recently has been 1.1.9-1jammy.20230920.000044, whereas 1.1.9-1jammy.20230822.201753 and 1.1.9-1jammy.20230807.181110 were noted broken. I can't point you to which specific build of 1.1.8 was working (all?), but I didn't start receiving issue tickets from users until 1.1.9 was released, so I think it's safe to put the range of dates at
The 1.1.10 builds since then are also broken (though I wouldn't totally rule out that I broke them; I need to go in with gdb and see if the locations look similar). The nice thing about that data is that those are all the same software versions, so there's definitely something build-farm-y that changed between those jobs or the agents building them. It is unfortunate that I had to run a release before this was resolved, adding changes (e.g. 1.1.10) into the mix for whatever experiments are run now. |
Note: we found the difference. AMD processors in the build farm build good binaries. Intel ones do not. I'll need to work with Intrinsic & Co about what to do next with this particular weirdo package. |
The attached debian is what I was able to test working, thanks to the great work of @claraberendsen. @Imaniac230 please verify this works for you across the board. I promise, it's just an output from the build farm; I left my viruses with my sick self at home 😉 ros-humble-nav2-mppi-controller_1.1.10-1jammy.20230927.201948_amd64.zip Here's my speculation after seeing the CPUs being used: https://gist.github.com/claraberendsen/998caedc6175a3b3b6da94c2d3453979 Intel has some instructions for optimization that AMD doesn't support, which would explain the AMD failures but not the Intel failures (which I see). I think the Intel CPUs being used in the cloud are more advanced, with newer versions of instructions that older machines don't have. The tests have been run on 7th, 8th, and 11th gen Core i-series processors. The instructions that were illegal were usually from AVX-512. While I'm no expert, my Google-fu tells me that this is only available on Xeon and Skylake-X CPUs (https://en.wikipedia.org/wiki/AVX-512), which are not the standard robotics CPUs -- but the build farm uses Xeon. We want AVX enabled for SIMD optimizations, and both "normal" Intel and AMD machines provide that. We can disable the use of AVX-512 specifically under the hood and use what is portable back to the broader set of computers (https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2). AVX2 is available on Xeons, so if we can tell it to use that, then I think we're potentially solved. The "why now" is half to do with the deployment of Intel machines in the build farm's pool, and a bit of russian roulette as to whether you got one assigned to your package's build that would generate the AVX-512 instructions. That's why that one random build works, but its predecessors did not: https://stackoverflow.com/questions/60815316/disabling-all-avx512-extensions Going to run a new release with this and see how it goes! |
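As an aside, checking whether a given machine's CPU advertises AVX-512 at all is quick (no output means no AVX-512 support is reported):

```bash
# List any AVX-512 feature flags this CPU reports.
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
# Or via lscpu:
lscpu | grep -io 'avx512[a-z_]*' | sort -u
```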
The only
I think it shows support for the theory nicely. If I search for I've also tried porting all of the lib versions tested so far to the older
and an
What is notable here is that neither of them show support for
The 11th gen
You trusted me with my mail attachments, so I think I can trust you as well 😀. I've tried your attached deb. All of the new segfaults that I'm getting from local builds of
|
Huh, what about the binaries in http://repo.ros2.org/ubuntu/building/pool/main/r/ros-humble-nav2-mppi-controller/ros-humble-nav2-mppi-controller_1.1.11-1jammy.20230928.130430_amd64.deb? That's the same, but built on Intel explicitly with the new flags. Can you be specific when you say that
You should never be crashing with source builds running on the same CPU they were compiled on. If that's what's happening, then that's totally unrelated to any of this and let's discuss it separately. If it's from moving the libs between machines: can you tell what instruction that is -- also an AVX thing? If you change that to a dynamic_cast and not a pointer cast like I suggested, does that change anything? Also, perhaps try removing the cast. PS: if you turn off the obstacle critic, does the rest work? The binary that you mention doesn't work on anything is the only binary that works for me on my Intel i7 (minus that other random build that works for us all):
The only variable that I've been made aware of in the build farm is the Intel vs AMD pools, so I tried both; AMD worked, Intel did not, due to the AVX-512 issue. If you're saying that that binary worked for you as it did for me, then I think we've shown that it must have been built on an AMD machine. The first binary I linked above in the zip was built on AMD. I'm shocked that it doesn't work for you -- are you sure you installed it and verified that it was the version in use? I'm hoping that's your issue... Otherwise, that's just issue one of potentially more. It may be worth, at that point, running the same checks on your local builds. |
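For the record, one way to verify which version is installed and which libraries will actually be loaded (paths below are the usual Humble install locations; adjust if yours differ):

```bash
# Installed debian version of the MPPI controller package.
dpkg -s ros-humble-nav2-mppi-controller | grep -i '^Version'
# The shared libraries the plugin loader will pick up from the binary install.
ls -l /opt/ros/humble/lib/libmppi_controller.so /opt/ros/humble/lib/libmppi_critics.so
```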
Regarding the deb install, it should be installed correctly. I've just run
And with
The dates on the raw binaries also match ->
It may certainly be that I have messed up something on my side (and managed to do so on four different machines). But, I'm following the exact same testing steps as I have been from the start, using the same environments.
The crashing instruction is just a normal memory manipulation
If I remove the Obstacle critic, I don't get the segfault, but I do get a different crash:
And it's the same thing when I install from the deb. Regarding the local builds: if I run with the local build workspace sourced instead and with the obstacles critic enabled, I do not get the segfault crashes, but I do get the same error as above. I'll try building again from the |
I'll do that when I get to the NUC. I can't currently explain why these new errors started happening now; they weren't a problem previously. But a small pattern I see is that they only started happening with versions related to
Standard is more
Humble
I think that points to a problem in your workflow if transferring libs from the built workspace and taking debians of the same thing don't result in the same outcomes.
OK! The differences from .9 to .10 are these three commits: b0a68bb, 91b688d, 1b13476. .10 to .11 adds only 2d6e9a9. Hopefully that makes it easy to test anything meaningful that's changed. You may also benefit from doing a clean restart of your previous changes to make sure you haven't landed in an odd state -- unless it really is just that. Edit: I did some research on this. |
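For the clean restart, something along these lines is usually sufficient (workspace path is illustrative):

```bash
# Wipe the MPPI package's build/install artifacts and rebuild it from scratch.
cd ~/nav2_ws
rm -rf build/nav2_mppi_controller install/nav2_mppi_controller
colcon build --packages-select nav2_mppi_controller
source install/setup.bash
```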
Yes, at least from my understanding, the
If I'm manually copying the locally compiled libs to I checked-out at the Then I completely cleaned the build and install space and re-built the whole nav2 from While at this working state, I checked the
When ported to an AMD PC, the libs worked without problems. If the Regarding the new errors I've been getting, it seems they don't relate to changes in Then, I went commit by commit. When all is built from 0cf0462 ( Going from the MPPI dependencies listed in
All of this results from just copying individual lib files around. The error I was getting when running from an environment that has the nav2 workspace sourced was in fact because I have the config set so that all of the available controller plugins get loaded at startup (including TEB).
When I remove TEB from the config and run from the environment that has the nav2 workspace sourced, I have no errors and everything works. So here, all of the re-compiled packages from 0ca14fe are within the sourced environment path and it all works. If I make similar experiments with this environment -- first building all of nav2 from ... -- a pattern I would deduce from this would be that the added pre-shutdown in 0ca14fe made a lot of the older pre-0ca14fe binaries incompatible with the new ones? So after this change, compatibility with TEB is also broken? That would explain why I started getting the errors when I install the newer MPPI libs from the debians or port the newer libs compiled locally. I'm not sure if this behavior is even an issue, since it is expected that when a dependency of a package changes and the package is ported individually without its changed dependency, it is going to cause problems. However, shouldn't minor revisions still be cross-portable independently? So this would mean that if I update my ... This was always my nav2 building environment:
So it should be just the native ros install with no additional workspaces. |
There's a lot of good stuff here, so I want to parrot back to you the highlights to make sure we're on the same page:
Agreed.
No, compatibility is only assured within a distribution. Rolling / main is bleeding edge, so anything can change at any time. You shouldn't generally expect to be able to take a package at version X and apply it to the stack at version Y off mainline. In a released distro (e.g. humble, iron, etc.) you should be able to do that. So in summary: it sounds like it's all fixed, and if I bloom up a new release, we should be good to go on this issue in its entirety. I think you mention some other TEB problem, but it's unclear to me whether that's an issue from mixing a version X package with a version Y framework or something else that should be addressed. |
Releases cut! The issue should be resolved in the next humble / iron sync! Still, follow up on TEB if there's anything actionable! |
Yes.
Yes.
Yes. If I start mixing old
Yes, it does seem like something of that nature to me. It might be due to the way some of the libraries are linked to each other during compilation and a change in one core part manifests across multiple other ones. But I'm just speculating here. It basically required most of the nav2 packages to be on the same updated versions and not just some to make it work.
Yes, I believe so. When all (or almost all) packages are at the same release version, and there are no mixtures of old
The error from TEB looks to be of the same nature as the mixing of old and new lib versions. I re-compiled the whole nav2 workspace from |
Bug report
After the binary packages sync on 18.08.2023, nav2_mppi_controller started crashing when launching (crash screenshot attached); ps: the same non-updated setup is still working.
By the way, the sync was supposed to fix this issue but it hasn't been solved yet: #3762 (comment)