Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delayed processing for ProcessManager.pidToProcessInfo #321

Merged
merged 6 commits into from
Jan 24, 2025

Conversation

christos68k
Copy link
Member

@christos68k christos68k commented Jan 22, 2025

Summary

  • Renamed SymbolizationComplete to ProcessedUntil and moved to processinfo.go
  • Dropped no longer needed "symbolize now" remnants
  • Implemented delayed ProcessManager.pidToProcessInfo cleanup

Leverages #307 to ensure that process metadata is not discarded before all relevant trace events have been processed.

Fixes #278.

You may find reviewing commit-by-commit to be simpler.

@christos68k christos68k self-assigned this Jan 22, 2025
@christos68k christos68k requested review from a team as code owners January 22, 2025 23:45
if !ok {
log.Debugf("Skip process exit handling for unknown PID %d", pid)
return symbolize
return
}

// Delete all entries we have for this particular PID from pid_page_to_mapping_info.
Copy link
Member Author

@christos68k christos68k Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept this cleanup here as there's no immediate need to postpone cleaning up the eBPF map until traceCaptureKTime >= pidExitKtime (unlike pidToProcessInfo). This also speeds up execution of ProcessedUntil compared to having the map cleanup take place there.

// fast enough and this particular pid is reused again by the system.
// NOTE: Exported only for tracer.
func (pm *ProcessManager) ProcessPIDExit(pid libpf.PID) bool {
func (pm *ProcessManager) ProcessPIDExit(pid libpf.PID) {
exitKTime := times.GetKTime()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this outside the lock for improved accuracy (there's a debug log in ProcessedUntil that prints exit latency).

Uses ProcessedUntil mechanism to guarantee that process metadata
is not discarded before all relevant trace events have been
processed.
return symbolize
return
}
if pidExited {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to attempt a repeat cleanup for the same PID, if we've previously performed it.

@@ -348,34 +347,6 @@ func (pm *ProcessManager) MaybeNotifyAPMAgent(
return serviceName
}

func (pm *ProcessManager) SymbolizationComplete(traceCaptureKTime times.KTime) {
Copy link
Member Author

@christos68k christos68k Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to processinfo.go for consistency (all pidToProcessInfo accessors in one place), renamed to ProcessedUntil and updated to also cleanup pidToProcessInfo.

@@ -548,9 +553,6 @@ func (pm *ProcessManager) ProcessPIDExit(pid libpf.PID) bool {
address, pid, err)
}
}
delete(pm.pidToProcessInfo, pid)
Copy link
Member Author

@christos68k christos68k Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now taking place in ProcessedUntil, delayed until traceCaptureKTime >= exitKTime.

@@ -294,33 +293,6 @@ func (pm *ProcessManager) ConvertTrace(trace *host.Trace) (newTrace *libpf.Trace
return newTrace
}

// findMappingForTrace locates the mapping for a given host trace.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved without changes to processinfo.go for consistency.

if len(pm.interpreters[pid]) > 0 {
pidExited := false
info, pidExists := pm.pidToProcessInfo[pid]
if pidExists || (pm.interpreterTracerEnabled &&
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially same logic as before with these additions:

  1. Don't add exitKTime to pm.exitEvents if it already exists.
  2. Also add exitKTime to pm.exitEvents if pm.pidToProcessInfo[pid] exists, as we want to cleanup the latter in delayed fashion.

continue
}

delete(pm.pidToProcessInfo, pid)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same logic as before with this single-line addition.

Copy link
Contributor

@florianl florianl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first look with some comments


info, ok := pm.pidToProcessInfo[pid]
if !ok {
if !pidExists {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep the global read & write lock as short as possible, the if !pidExists {..} part should be moved before if pidExists || (pm.interpreterTracerEnabled && len(pm.interpreters[pid]) > 0) {..}.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would prevent executing if _, pidExited = ... in case (pm.interpreterTracerEnabled && len(pm.interpreters[pid]) > 0 is true.

Copy link
Member Author

@christos68k christos68k Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Tim wrote, this would alter the logic. I tried to keep as much of the original semantics the same to avoid introducing new races. Maybe here it's possible to safely say that if !pidExists then it's OK not to write exitKTime in pm.exitEvents but we'd need to carefully examine all subsystem interactions, check for race conditions etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up is done in #325

return symbolize
return
}
if pidExited {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, pidExited should be renamed to pidExitProcessed so something similar, this would it make obvious, that we want to avoid duplicate work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed

Comment on lines +704 to +710
pm.mu.Lock()
defer pm.mu.Unlock()

nowKTime := times.GetKTime()
log.Debugf("ProcessedUntil captureKT: %v latency: %v ms",
traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the lock holding as short as possible:

Suggested change
pm.mu.Lock()
defer pm.mu.Unlock()
nowKTime := times.GetKTime()
log.Debugf("ProcessedUntil captureKT: %v latency: %v ms",
traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6)
nowKTime := times.GetKTime()
log.Debugf("ProcessedUntil captureKT: %v latency: %v ms",
traceCaptureKTime, (nowKTime-traceCaptureKTime)/1e6)
pm.mu.Lock()
defer pm.mu.Unlock()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can affect the latency measurement, since we're timing before the lock.

@christos68k christos68k merged commit f094776 into main Jan 24, 2025
24 checks passed
@christos68k christos68k deleted the ck/processed-until branch January 24, 2025 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sending executable path for processes that have exited
3 participants