Use fingerprint file identity by default and migrate file state from native or path #41762

Merged
merged 35 commits into from
Dec 19, 2024

Conversation

belimawr
Contributor

@belimawr belimawr commented Nov 22, 2024

Proposed commit message

This commit changes the default file_identity from native to
fingerprint. Any previous state from native (or path) is
automatically migrated to fingerprint when Filestream starts.

The Filestream input has always had the ability to update file identifiers,
however it never worked as expected, leading to full data duplication
when changing the file identity. This commit fixes it to allow
changing the file identity from native (inode + device ID) and
path to fingerprint without any data duplication.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Because the fingerprint is the new default file identity, files are now only ingested when they reach at least 1024 bytes. The old default behaviour can be enabled by setting the file identity to native and disabling the fingerprint in the scanner.

filebeat.inputs:
  - type: filestream
    id: "8.x-default-behaviour"
    paths:
      - /tmp/flog.log
    file_identity.native: ~
    prospector:
      scanner:
        fingerprint.enabled: false

Author's Checklist

  • Test with dynamic config reload
  • Test with Kubernetes
  • Test with Elastic-Agent
  • Fix all the tests that break with the new behaviour
  • Investigate which integration tests are going to break in the Elastic-Agent repo

Regarding the Elastic-Agent integration tests, most tests actually use the log input because, when they were written, Filestream was not available as an integration package. The very few other tests that use Filestream either generate a log file large enough or are skipped as flaky.

How to test this PR locally

  1. Create a log file with at least a few log lines and more than 1 KB in size (e.g. /tmp/flog.log with 15 log lines). You can use flog with Docker:

    docker run -it --rm mingrammer/flog -n 15 > /tmp/flog.log
    
  2. Start Filebeat with the following configuration

    filebeat.yml (native)

    filebeat.inputs:
      - type: filestream
        id: "test-migrate-ID"
        paths:
          - /tmp/flog.log
        file_identity.native: ~
        prospector:
          scanner:
            check_interval: 0.1s
            fingerprint.enabled: false
    
    queue.mem:
      flush.timeout: 0s
    
    output.file:
      path: ${path.home}
      filename: "output-file"
      rotate_on_startup: false
    
    logging:
      level: debug
      selectors:
        - input
        - input.filestream
        - input.filestream.prospector
      metrics:
        enabled: false

  3. Wait until the file is fully ingested (wait for End of file reached: /tmp/flog.log; Backoff now. in the logs)

  4. Ensure all events have been published to the output (wc -l ./output-file* should return 15)

  5. Stop Filebeat

  6. Change the file identity to fingerprint. It's the new default, hence it's not explicitly set.

    filebeat.yml (fingerprint)

    filebeat.inputs:
      - type: filestream
        id: "test-migrate-ID"
        paths:
          - /tmp/flog.log
        prospector:
          scanner:
            check_interval: 0.1s
    
    queue.mem:
      flush.timeout: 0s
    
    output.file:
      path: ${path.home}
      filename: "output-file"
      rotate_on_startup: false
    
    logging:
      level: debug
      selectors:
        - input
        - input.filestream
        - input.filestream.prospector
      metrics:
        enabled: false

  7. Start Filebeat

  8. Wait until Filebeat "finds the end of the file" (wait for End of file reached: /tmp/flog.log; Backoff now. in the logs)

  9. Ensure no extra event was published (wc -l ./output-file* should still return 15)

  10. Add 10 more lines to the file:

    docker run -it --rm mingrammer/flog -n 10 >> /tmp/flog.log
    
  11. Wait until the new lines are ingested (wait for End of file reached: /tmp/flog.log; Backoff now. in the logs)

  12. Ensure all events have been published to the output with no duplication (wc -l ./output-file* should return 25)

Related issues

Use cases

Dealing with identity reuse (e.g: inode reuse) without facing re-ingestion of data with Filestream input

Screenshots

Logs

`sourceStore.UpdateIdentifiers` has always been part of
`fileProspector.Init`. Its purpose is to update the identifiers in the
registry if the file identity has changed. However, it was generating
the wrong key and not updating the in-memory
registry (`store.ephemeralStore`).

This commit fixes it and also removes `sourceStore.FixUpIdentifiers`
because it is just a working version of
`sourceStore.UpdateIdentifiers`. Now there is a single method to
manipulate identifiers in the `sourceStore`.
This commit checks whether 'source' matches the real file by calculating
the registry key using the old identifier; if they match, the registry
is updated.
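
The matching step described in these commit messages can be sketched roughly as follows. This is a simplified Go illustration, not the code from the PR: the `state` and `file` types, the `registryKey` format and the `migrate` function are all hypothetical, and a single map stands in for both the persistent registry and the in-memory store.

```go
package main

import "fmt"

// state is a stand-in for a per-file registry entry (hypothetical shape).
type state struct {
	Offset int64
}

// file describes one file found on disk with its identities precomputed.
type file struct {
	Path        string
	NativeID    string // e.g. "inode-device"; the format is an assumption
	Fingerprint string
}

// registryKey builds a registry key. The layout is an assumption made for
// this sketch and may differ from the real Filestream key format.
func registryKey(inputID, identity, value string) string {
	return fmt.Sprintf("filestream::%s::%s::%s", inputID, identity, value)
}

// migrate recomputes the key f would have had under oldIdentity ("native"
// or "path"). If an entry exists under that key, the entry belongs to this
// file, so its state is moved to the fingerprint key. It reports whether a
// migration happened.
func migrate(store map[string]state, inputID, oldIdentity string, f file) bool {
	oldValue := f.NativeID
	if oldIdentity == "path" {
		oldValue = f.Path
	}
	oldKey := registryKey(inputID, oldIdentity, oldValue)
	st, ok := store[oldKey]
	if !ok {
		// No entry under the old key: nothing to migrate for this file.
		return false
	}
	store[registryKey(inputID, "fingerprint", f.Fingerprint)] = st
	delete(store, oldKey)
	return true
}

func main() {
	store := map[string]state{
		registryKey("test-migrate-ID", "native", "42-2049"): {Offset: 1800},
	}
	f := file{Path: "/tmp/flog.log", NativeID: "42-2049", Fingerprint: "abc123"}
	fmt.Println("migrated:", migrate(store, "test-migrate-ID", "native", f))
	fmt.Println(store)
}
```

Carrying the stored offset over to the new key is what prevents re-ingestion: the input resumes from where the old identity left off instead of treating the file as new.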
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 22, 2024
Contributor

mergify bot commented Nov 22, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

Contributor

mergify bot commented Nov 22, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 22, 2024
@belimawr belimawr changed the title 40197 filestream migrate file identity Fix file identity migration on Filestream input Nov 25, 2024
@belimawr belimawr added the bug label Nov 25, 2024
@belimawr belimawr changed the title Fix file identity migration on Filestream input Enable Filestream input to change file identity to fingerprint without re-ingesting files Nov 25, 2024
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 25, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 25, 2024
@rdner
Member

rdner commented Dec 11, 2024

Let's make sure it's also tested with dynamic config reload and with the Elastic Agent control protocol.

When I worked on take_over (log->filestream input migration) I discovered that we have separate code paths for applying dynamic configuration and it requires special handling for state changes.

I'm not saying it's not handled here, just we need to include this into testing procedures.

@belimawr
Contributor Author

Let's make sure it's also tested with dynamic config reload and with the Elastic Agent control protocol.

When I worked on take_over (log->filestream input migration) I discovered that we have separate code paths for applying dynamic configuration and it requires special handling for state changes.

I'm not saying it's not handled here, just we need to include this into testing procedures.

Thanks Denis! Do you mean at least a manual test or an integration test?

The prospector initialisation happens well after any code path for starting/configuring an input; it should be totally agnostic of how the input was configured or started, so I believe those cases are also covered. However, I do agree it is good to at least perform some manual tests, just to be on the safe side.

@belimawr
Contributor Author

The Windows test failure is unrelated to this PR, I created a flaky test issue: #42059

@belimawr
Contributor Author

I merged main onto this branch/PR, let's see if CI gets green with a re-run

@cmacknz
Member

cmacknz commented Dec 18, 2024

files are now only ingested when they reach at least 1024 bytes.

This will be breaking for somebody, let's only keep the change of default identity in 9.0.

It would likely be helpful to backport the code here but keep the default identity in 8.x unchanged to make future backports easier.

@jlind23
Collaborator

jlind23 commented Dec 19, 2024

I merged main onto this branch/PR, let's see if CI gets green with a re-run

Looks like only the linter is unhappy.

@belimawr
Contributor Author

I merged main onto this branch/PR, let's see if CI gets green with a re-run

Looks like only the linter is unhappy.

I touched too many files, all those warnings are from bits of code I didn't touch. Because this PR is rather large, I wasn't planning on fixing all those lint warnings to reduce the changes that need reviewing.

However, if you insist, I can fix them.

@jlind23
Collaborator

jlind23 commented Dec 19, 2024

However, if you insist, I can fix them.

This is not what I said 👍🏼 I am merging this as you didn't touch the code yelling at you.

@belimawr
Contributor Author

files are now only ingested when they reach at least 1024 bytes.

This will be breaking for somebody, let's only keep the change of default identity in 9.0.

It would likely be helpful to backport the code here but keep the default identity in 8.x unchanged to make future backports easier.

I've been thinking about the best way to do this:

  • Get this PR merged
  • Let mergify create the backport
  • Update the backport PRs by reverting the commits that change the defaults and docs
  • Ask someone to review/test just to be on the safe side as we don't want any breaking change to slip into 8.x branch
  • Merge into 8.x once approved and CI green

filebeat/input/filestream/prospector_test.go (outdated, resolved)
@belimawr belimawr changed the title Use fingerprint file identity by default and migrate file state from native or path` Use fingerprint file identity by default and migrate file state from native or path Dec 19, 2024
@belimawr belimawr enabled auto-merge (squash) December 19, 2024 17:11
@belimawr
Contributor Author

I've enabled auto merge, it should get merged after CI runs :D

Comment on lines +185 to +186
// do not match, log it at debug level and do nothing.
if previousIdentifierKey != registryKey {
Member

According to the comment it is missing a debug log here.

@belimawr belimawr merged commit 78fe7a5 into elastic:main Dec 19, 2024
140 of 142 checks passed
mergify bot pushed a commit that referenced this pull request Dec 19, 2024
…m `native` or `path` (#41762)

This commit changes the default `file_identity` from `native` to
`fingerprint`. Any previous state from `native` (or `path`) is
automatically migrated to `fingerprint` when Filestream starts.

The Filestream input has always had the [ability to update file identifiers](https://github.com/elastic/beats/blob/4278366ab03221e8b62183dc06f9505f6ccc5209/filebeat/input/filestream/prospector.go#L104-L122),
however it never worked as expected, leading to full data duplication
when changing the file identity. This commit fixes it to allow
changing the file identity from `native` (inode + device ID) and
`path` to `fingerprint` without any data duplication.

(cherry picked from commit 78fe7a5)

# Conflicts:
#	filebeat/tests/integration/filestream_test.go
belimawr added a commit that referenced this pull request Jan 6, 2025
…`fingerprint` for Filestream inputs (#42126)

The Filestream input has always had the [ability to update file identifiers](https://github.com/elastic/beats/blob/4278366ab03221e8b62183dc06f9505f6ccc5209/filebeat/input/filestream/prospector.go#L104-L122),
however it never worked as expected, leading to full data duplication
when changing the file identity. This commit fixes it to allow
changing the file identity from `native` (inode + device ID) and
`path` to `fingerprint` without any data duplication.

---------

Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
Co-authored-by: Julien Lind <julien.lind@elastic.co>
Labels
backport-8.x Automated backport to the 8.x branch with mergify bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Development

Successfully merging this pull request may close these issues.

Use fingerprint file identity by default and migrate all existing filestream inputs to it
9 participants