Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RSDK-923] Make GPS-RTK code more robust, remove bad test #1679

Merged
merged 11 commits into from
Jan 5, 2023

Conversation

penguinland
Copy link
Member

Changes include:

  • Migrate go statements to utils.PanicCapturingGo()
  • Don't start the background goroutine unless the rest of startup succeeds (to prevent leaking goroutines)
  • Lock the mutex whenever you read or write any data that is also read/written from a different goroutine (this was the actual cause of the race conditions!)
  • Remove the test that sometimes fails due to a bug in third-party code. It sucks that it sometimes fails, but all the problems are down in the bowels of https://github.com/de-bkg/gognss, and not in code we own.

Things I kinda wanted to change but didn't:

  • RTKMovementSensor.GetStream() is nearly identical to ntripCorrectionSource.GetStream(), and the only times you call the RTK version, you always pass in the same state stored in an ntrip object. Unfortunately, it's a different ntrip object defined in the same file as ntripCorrectionSource. Should the function be moved to that class instead, or removed entirely? I dunno. Something in here is redundant, but I couldn't remove the duplication without thinking a lot more.

I ran the tests 1,000 times: no failures. It's possible I'm being overly cautious with the mutex, and if you have specific ones you want me to remove, I'm happy to do so. It's possible I'm writing non-idiomatic Go code, and I'm happy to rewrite this whole thing if you have a different style I should match.

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 16, 2022
components/movementsensor/gpsrtk/gpsrtk.go Show resolved Hide resolved
components/movementsensor/gpsrtk/gpsrtk.go Show resolved Hide resolved
return g.nmeamovementsensor.Position(ctx, extra)
}

// LinearVelocity passthrough.
func (g *RTKMovementSensor) LinearVelocity(ctx context.Context, extra map[string]interface{}) (r3.Vector, error) {
g.mu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe if we have mu defined as a RWMutex instead, we can just use a read lock here (and in other places where we're just checking the lastError property). We might otherwise have some latency issues

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gautham and I discussed: if the lock was held for a long time by multiple reading coroutines, we both agree that changing it from a Mutex to a RWMutex would be a great idea, and we also both agree that given how quickly the lock is released and how few coroutines will be using it, this isn't obviously an improvement. Neither of us feel strongly either way, so I'm going to leave it as-is until it becomes important.


g.lastError = err
}

// Start begins NTRIP receiver with specified protocol and begins reading/updating MovementSensor measurements.
func (g *RTKMovementSensor) Start(ctx context.Context) error {
if err := g.nmeamovementsensor.Start(ctx); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] could you add a TODO that requires testing out/cleaning up whether it is okay to have the gps Start before sending correction data via one of the comms protocols in the next two lines.

I have a feeling doing them in parallel is the correct ultimate driver choice, but something might be tied up in both functions right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that hadn't occurred to me! I moved this one up so that we don't leak goroutines if this line fails, but it hadn't occurred to me that this line might give bad data if we don't have the corrections up and going first (which might still be a problem in the original ordering, too). TODO added!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: for any TODOs added, please make it TODO(JIRA-XXX): where JIRA-XXX is some newly filed JIRA issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent feedback, thank you!

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 4, 2023
g.errMu.Lock()
defer g.errMu.Unlock()
g.mu.Lock()
defer g.mu.Unlock()

g.lastError = err
}

// Start begins NTRIP receiver with specified protocol and begins reading/updating MovementSensor measurements.
func (g *RTKMovementSensor) Start(ctx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're fixing issues here and this may be dumb... do we actually use RTKMovementSensor? I only see newRTKMovementSensor called in tests. Furthermore, the constructor calls Start which has a context that is passed into newRTKMovementSensor which is incorrect compared to passing in the cancelCtx. Furthermore, the cancel context is created with the context passed into new which is also wrong. It should be a standalone background context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your issue the public struct? we can make it private.

I remember that context was something that got the driver working way back when. However, this package needs a bit of cleaning but is not easily testable in real contexts as it needs proximity to a ntrip mountpoint and a wide open sky. A spring cleaning task for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue is what is this code for. There's no way to use it as a user

Copy link
Member Author

@penguinland penguinland Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look! I'm not aware of anyone currently using this code (I think it was written by an intern over the summer, and not used since?), but it's possible to create a gps-rtk sensor via app.viam.com, so somebody might be using it.

I'm right there with you that Start shouldn't take any arguments and should use g.cancelCtx instead, and have made that change. I'm not so sure that it should be a standalone context, though. If whatever creates this sensor shuts down, shouldn't that also shut this down, too? and I thought that was the point of passing these contexts around: if the parent component-wrangler goes away, its child components should, too. but I'm new to Go and contexts (and Viam!); maybe I've misunderstood?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind of, this sensor has two physical things communicating. There's a GPS chip that wants to get satellite data and use correction data from a source to get sub 2cm positioning corrections.

There is a source of these corrections that is also set up in this driver or a separate model, that starts a server or i2c/serial correction from a secondary chip that you can hoist on an antenna by yourself.

It is ugly but it works. Can it be cleaned up? Yes. Do we need to care about cleaning it up right now? No. Was the flaky test due to a mountpoint time out on the ntrip external library and not something we wrote, yes. I'd stick to just fixing that and the locks for this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm just confused why we're testing code that isn't in use anywhere. Regarding contexts, for a constructor specifically, a context is used at request time and for any startup related routines that run before construction returns

Copy link
Contributor

@edaniels edaniels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just driving by here, getting confused by code you didn't write!

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 4, 2023
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 4, 2023
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 4, 2023
@penguinland
Copy link
Member Author

The orbslam integration tests are failing, but I'm pretty sure that's unrelated to this change. I'm going to merge this anyway, and will beg forgiveness if that was the wrong choice.

@penguinland penguinland merged commit 66cdb28 into viamrobotics:main Jan 5, 2023
@penguinland penguinland deleted the bugfix_race branch January 5, 2023 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants