Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Registrar not removing files when clean_removed is configured #10747

Merged
merged 10 commits into from
Feb 18, 2019

Conversation

urso
Copy link

@urso urso commented Feb 14, 2019

Closes: #7690, #9215, #10606

This PR includes some changes that also did help to clean/stabilize the text_clean_removed* tests. Often times the tests have been stopped to early, due to races between the input file states, registrar states, and test code.

After stabilizing the tests, I reliably (still needs a few runs) can trigger a data-synchronisation issue between the prospectors state and the registry. This is clearly a bug and results in state not being removed from the registry file at all.

After some more testing (yay, heisenbug :/) I found a race condition between harvester shutdown, prospector state cleaning (e.g. if clean_removed is set) and registry updates. The sequence of events goes like this:

1. harvester closes file
2. harvester updates prospectors local file states (marks file as finished)
3. prospector starts GC, cleaning old states
4. prospector sends 'remove state event' to the registrar (TTL=0, Finished=True). Note: state is not removed by prospector until next scan completes.
5. harvestar sends 'finished state event' to the registrar (TTL=-1, Finished=True)
6. registrar reads applies state updates:
  1. change state to TTL=0, Finished=True
  2. change state to TTL=-1, Finished=True
  3. clean local state -> file is not removed from registry, due to TTL=-1
  4. write registry

The change proposed changes the order the harvester cleanup propagates updates by sending the 'finished event' first before updating the prospectors state. This way we postpone the file being cleaned by the prospector and guarantee the 'finished event' and 'remove event' order (The prospector can not remove the state yet, because Finished is not set until after the 'finished event' has been published).

Additions:

  • Add Registry class (accessed via self.registry()). This class is used to check/read the registry
  • Add InputLogs class for creating/apppending/removing log files used to drive the tests. This reduces path manipulations and reduces repetitive with open(...) as f: blocks.
  • Add LogState class. This class keeps an offset to start reading from. The offset can be advanced via checkpoint or is automatically advanced if next succeeds. Using LogState we can wait for a particular message to appear in the log, by ignoring old messages. This removes the need to have a count on actual messages.
  • Thanks to LogState and some minor adjustments to timings I did shave of like 10s from test_registrar.py on my machine.

FB success count:

  • test_clean_removed: 5
  • test_clean_removed_with_clean_inactive: 5

@urso urso added the in progress Pull request is currently in progress. label Feb 14, 2019
@urso urso requested a review from a team as a code owner February 14, 2019 09:06
d := util.NewData()
d.SetState(h.state)
h.publishState(d)

logp.Debug("harvester", "Update state: %s, offset: %v", h.state.Source, h.state.Offset)
h.states.Update(h.state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on this move? I would assume the harvester itself should be the first one to know about state updates?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add my findings to the PR description.

The Update here is no to the harvester, but to the prospector. Thanks to shutdown racing with scan phase we sometimes send finish and remove events out of order if publishState is not called before states.Update.

@urso urso added review Filebeat Filebeat and removed in progress Pull request is currently in progress. labels Feb 14, 2019
@urso urso changed the title [WIP] registrar flaky tests Fix Registrar not removing files when clean_removed is configured Feb 14, 2019
@urso urso force-pushed the flaky-fb-clean-removed branch from 2186da4 to ee8c565 Compare February 14, 2019 16:52
@urso
Copy link
Author

urso commented Feb 14, 2019

Windows is messing with me. The file can not be opened (access denied), which leads to the prospector never cleaning up it's state:



2019-02-14T17:17:51.799Z	DEBUG	[input]	log/input.go:477	Start harvester for new file: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1
...
2019-02-14T17:17:52.006Z	DEBUG	[harvester]	log/log.go:102	End of file reached: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1; Backoff now.
...

# file 'input1' is removed
...
2019-02-14T17:17:52.009Z	DEBUG	[input]	log/input.go:174	Start next scan
2019-02-14T17:17:52.010Z	DEBUG	[input]	log/input.go:404	Check file for harvesting: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1
2019-02-14T17:17:52.010Z	DEBUG	[input]	log/input.go:477	Start harvester for new file: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1
2019-02-14T17:17:52.011Z	ERROR	log/input.go:484	Harvester could not be started on new file: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1, Err: error setting up harvester: Harvester setup failed. Unexpected file opening error: Failed opening C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1: Error creating file 'C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1': Access is denied.
...
2019-02-14T17:17:52.012Z	DEBUG	[input]	log/input.go:226	State for file not removed because harvester not finished: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1
2019-02-14T17:17:52.012Z	DEBUG	[input]	log/input.go:216	Remove state for file as file removed or renamed: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1
2019-02-14T17:17:52.107Z	DEBUG	[harvester]	log/log.go:102	End of file reached: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1; Backoff now.
...

# this log repeats by now:
2019-02-14T17:17:52.113Z	DEBUG	[input]	log/input.go:174	Start next scan
2019-02-14T17:17:52.114Z	DEBUG	[input]	log/input.go:404	Check file for harvesting: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1
2019-02-14T17:17:52.114Z	DEBUG	[input]	log/input.go:477	Start harvester for new file: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1
2019-02-14T17:17:52.115Z	ERROR	log/input.go:484	Harvester could not be started on new file: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1, Err: error setting up harvester: Harvester setup failed. Unexpected file opening error: Failed opening C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1: Error creating file 'C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1': Access is denied.
...
2019-02-14T17:17:52.207Z	DEBUG	[harvester]	log/log.go:102	End of file reached: C:\Users\jenkins\workspace\elastic+beats+pull-request+multijob-windows\beat\filebeat\label\windows\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_registrar.Test.test_clean_removed\log\input1; Backoff now.

libbeat/common/file/file_windows.go Outdated Show resolved Hide resolved
)

var (
ErrDeletePending = errors.New("delete pending")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported var ErrDeletePending should have comment or be unexported

libbeat/common/file/file_other.go Outdated Show resolved Hide resolved
@urso urso added in progress Pull request is currently in progress. and removed review labels Feb 15, 2019
@urso
Copy link
Author

urso commented Feb 15, 2019

Investigating the windows case:
Filebeat used to call Stat in order to check if a file has been removed. This is implemented via GetFileAttributesEx. The call always succeeds because the file is still 'visible' in the directory hierarchy + plus the call only checks for the file sizes.
When a file is deleted by another process the 'DeletePending' flag is set. The file will be unlinked and deleted only after all processes have closed the file. This can unfortunately impact log rotation, as no new file with the same name can be generated, as long as the file is still open.

I updated detection of file removal on windows by calling GetFileInformationByHandleEx. This call allows us to read the DeletePending flag. With this change close_remove will ensure the file is closed if the DeletePending flag is set. This still requires the reader to actively check for file removal, which is not immediate and might be postponed for an undefinite amount of time, if the outputs are blocking.

This change only slightly improves the validation of clean_removed, if at all. Alternatively to this change we can warn/error if the user has set close_removed on windows, and skip any close_removed related system tests on windows.

Note: I wonder if we can make close_removed more 'stable' by creating a hardlink on windows. But this might not work if network shares or different volumes are used.

@@ -44,3 +44,4 @@ func (p Pipe) Name() string { return p.File.Name() }
func (p Pipe) Stat() (os.FileInfo, error) { return p.File.Stat() }
func (p Pipe) Continuable() bool { return false }
func (p Pipe) HasState() bool { return false }
func (p Pipe) Removed() bool { return false }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported method Pipe.Removed should have comment or be unexported

filebeat/input/log/file.go Outdated Show resolved Hide resolved
Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the investigation and all the details around it.

I like the abstraction in the tests, this could be used in many other places to simplify the existing tests I'm thinking.

  • Could you add a changelog entry?
  • I assume this will also need to be backported to 7.0 and 6.6 / 6.7?

Still trying to fully understand the impact. close_, clean_removed were not working in the cases where the file was still opened by an other service, if that was not the case, it worked as intended?

libbeat/common/file/file_windows.go Outdated Show resolved Hide resolved
Copy link
Contributor

@kvch kvch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@urso
Copy link
Author

urso commented Feb 15, 2019

Still trying to fully understand the impact. close_, clean_removed were not working in the cases where the file was still opened by an other service, if that was not the case, it worked as intended?

On linux a file gets 'unlinked'. This means a file is not visible in the file system anymore, but the inode and data is cleared after the last process has open the file handle. After unlinking a new file with the very same name can be written to.

On windows a file is not 'unlinked', but the 'Delete Pending' flag is set. The file continues to exist within the file system (can still be seen via Glob) as long as at least one process has an open file handle. The 'Delete Pending' flag is not accessible/accounted for when listing a directory or calling Stat. That is, filebeat did keep the file open and thusly prevented itself from noticing the file has actually been removed. Worse, as long as filebeat has the file open, no other process can access or replace the file anymore for as long as the 'Delete Pending' flag is set. Here filebeat has the chance to block other processes from making progress (if file names are to be reused). This we can't really fix.
Due to this close_removed setting was ineffective on Windows. The clean_removed setting is only effective if filebeat (or no other process) has an open file handle.

After stabilizing the tests, the test_clean_removed* tests were failing the very same way on windows, always.

Good news is: filebeat windows tests have been green after fixing close_removed. Will restart jenkins a few time and see if I can tune settings a little. I still suspect us having some flakyness due to timing and exponential backoff in the file reader.

@urso
Copy link
Author

urso commented Feb 15, 2019

Jenkins, test this.

@urso
Copy link
Author

urso commented Feb 15, 2019

Could you add a changelog entry?

Still in progress. It has been a rabid hole so far. Let's see if I find more :)

I assume this will also need to be backported to 7.0 and 6.6 / 6.7?

+1

@urso
Copy link
Author

urso commented Feb 15, 2019

Jenkins, test this.

@urso urso added review and removed in progress Pull request is currently in progress. labels Feb 18, 2019
@urso urso force-pushed the flaky-fb-clean-removed branch from 22ae88e to 3358a74 Compare February 18, 2019 07:36
@urso
Copy link
Author

urso commented Feb 18, 2019

Filebeat tests have been green 5 times in a row on all platforms.

@urso urso merged commit 1438323 into elastic:master Feb 18, 2019
@urso urso added the needs_backport PR is waiting to be backported to other branches. label Feb 18, 2019
@urso urso deleted the flaky-fb-clean-removed branch February 19, 2019 18:45
@urso urso removed the needs_backport PR is waiting to be backported to other branches. label Jul 3, 2020
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 3, 2020
@botelastic
Copy link

botelastic bot commented Jul 3, 2020

This pull request doesn't have a Team:<team> label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Filebeat Filebeat needs_team Indicates that the issue/PR needs a Team:* label review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Filebeat] Test test_registrar.test_clean_removed fails
4 participants