-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Registrar not removing files when clean_removed is configured #10747
Conversation
d := util.NewData() | ||
d.SetState(h.state) | ||
h.publishState(d) | ||
|
||
logp.Debug("harvester", "Update state: %s, offset: %v", h.state.Source, h.state.Offset) | ||
h.states.Update(h.state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on this move? I would assume the harvester itself should be the first one to know about state updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add my findings to the PR description.
The Update here is no to the harvester, but to the prospector. Thanks to shutdown racing with scan phase we sometimes send finish and remove events out of order if publishState
is not called before states.Update
.
2186da4
to
ee8c565
Compare
Windows is messing with me. The file can not be opened (access denied), which leads to the prospector never cleaning up it's state:
|
libbeat/common/file/file_windows.go
Outdated
) | ||
|
||
var ( | ||
ErrDeletePending = errors.New("delete pending") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported var ErrDeletePending should have comment or be unexported
Investigating the windows case: I updated detection of file removal on windows by calling GetFileInformationByHandleEx. This call allows us to read the This change only slightly improves the validation of Note: I wonder if we can make close_removed more 'stable' by creating a hardlink on windows. But this might not work if network shares or different volumes are used. |
@@ -44,3 +44,4 @@ func (p Pipe) Name() string { return p.File.Name() } | |||
func (p Pipe) Stat() (os.FileInfo, error) { return p.File.Stat() } | |||
func (p Pipe) Continuable() bool { return false } | |||
func (p Pipe) HasState() bool { return false } | |||
func (p Pipe) Removed() bool { return false } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported method Pipe.Removed should have comment or be unexported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the investigation and all the details around it.
I like the abstraction in the tests, this could be used in many other places to simplify the existing tests I'm thinking.
- Could you add a changelog entry?
- I assume this will also need to be backported to 7.0 and 6.6 / 6.7?
Still trying to fully understand the impact. close_, clean_removed
were not working in the cases where the file was still opened by an other service, if that was not the case, it worked as intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
On linux a file gets 'unlinked'. This means a file is not visible in the file system anymore, but the inode and data is cleared after the last process has open the file handle. After unlinking a new file with the very same name can be written to. On windows a file is not 'unlinked', but the 'Delete Pending' flag is set. The file continues to exist within the file system (can still be seen via Glob) as long as at least one process has an open file handle. The 'Delete Pending' flag is not accessible/accounted for when listing a directory or calling After stabilizing the tests, the Good news is: filebeat windows tests have been green after fixing |
Jenkins, test this. |
Still in progress. It has been a rabid hole so far. Let's see if I find more :)
+1 |
Jenkins, test this. |
22ae88e
to
3358a74
Compare
Filebeat tests have been green 5 times in a row on all platforms. |
This pull request doesn't have a |
Closes: #7690, #9215, #10606
This PR includes some changes that also did help to clean/stabilize the
text_clean_removed*
tests. Often times the tests have been stopped to early, due to races between the input file states, registrar states, and test code.After stabilizing the tests, I reliably (still needs a few runs) can trigger a data-synchronisation issue between the prospectors state and the registry. This is clearly a bug and results in state not being removed from the registry file at all.
After some more testing (yay, heisenbug :/) I found a race condition between harvester shutdown, prospector state cleaning (e.g. if clean_removed is set) and registry updates. The sequence of events goes like this:
The change proposed changes the order the harvester cleanup propagates updates by sending the 'finished event' first before updating the prospectors state. This way we postpone the file being cleaned by the prospector and guarantee the 'finished event' and 'remove event' order (The prospector can not remove the state yet, because Finished is not set until after the 'finished event' has been published).
Additions:
self.registry()
). This class is used to check/read the registrywith open(...) as f:
blocks.checkpoint
or is automatically advanced ifnext
succeeds. UsingLogState
we can wait for a particular message to appear in the log, by ignoring old messages. This removes the need to have a count on actual messages.LogState
and some minor adjustments to timings I did shave of like 10s from test_registrar.py on my machine.FB success count: