[CSV feed] CSV feed flood ingestion with identical data #8588

Open
Lhorus6 opened this issue Oct 3, 2024 · 5 comments
Labels
bug (use for describing something not working as expected)
Milestone

Comments

@Lhorus6

Lhorus6 commented Oct 3, 2024

Description

CSV feed import seems buggy or not optimized.

In my case, I have an import from the Blocklist.de source, which contains around 30K IPs. For example, the source currently has 27K entries:

image

However, for this small source alone, I currently have 2.36M bundles in the queue and a very large number of works.

image

image

image

Environment

OCTI 6.3.4

Reproducible Steps

Steps to create the smallest reproducible scenario:

  1. Create this CSV Mapper:

image

  2. Create this CSV feed: https://lists.blocklist.de/lists/all.txt

image

  3. Let it run for several hours, or even 24 to 48 hours, to see how it behaves.

Additional information

It seems to me that the feed only imports the data when the file hash changes. Does that mean this source updates its file every 30 minutes? (I get a new work every 30 minutes.)

This seems unlikely. Perhaps there is a bug in the hash generation, and it takes metadata as input?
Just a guess.
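
To make the hash question concrete, here is a minimal sketch of a full-file hash check. It is not OpenCTI's actual code: the function names and the use of SHA-256 are assumptions for illustration. It shows that a hash over the whole file changes as soon as a single row changes, even though a row-level comparison would find only one new entry.

```python
import hashlib
from typing import Optional, Set


def file_hash(body: bytes) -> str:
    # Hash of the complete file content (SHA-256 is an assumption here).
    return hashlib.sha256(body).hexdigest()


def should_ingest(body: bytes, last_hash: Optional[str]) -> bool:
    # Ingest when there is no previous hash or the content hash differs.
    return last_hash is None or file_hash(body) != last_hash


def new_rows(previous: bytes, current: bytes) -> Set[str]:
    # Row-level comparison: entries present now that were absent before.
    old = set(previous.decode().split())
    return set(current.decode().split()) - old


old_body = b"1.1.1.1\n2.2.2.2\n3.3.3.3\n"
new_body = b"1.1.1.1\n2.2.2.2\n4.4.4.4\n"

print(should_ingest(new_body, file_hash(old_body)))
print(new_rows(old_body, new_body))
```

With a check like this, a source that adds or removes even one IP between two fetches would trigger a full re-import of every row, without any metadata being involved in the hash.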

If the file really does change continuously, maybe we shouldn't retrieve it every time, but only twice a day?
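
A rough back-of-the-envelope calculation of what the fetch interval means for the queue. It assumes, hypothetically, that every run re-queues one bundle per row of the file, which this issue does not confirm. The 27K rows, 2.36M bundles and 30 minute interval are the numbers reported above.

```python
ROWS_PER_FETCH = 27_000      # entries currently in the source
QUEUED_BUNDLES = 2_360_000   # bundles observed in the queue


def bundles_per_day(interval_minutes: int, rows: int = ROWS_PER_FETCH) -> int:
    # Assumption: each run queues one bundle per row of the file.
    runs_per_day = (24 * 60) // interval_minutes
    return runs_per_day * rows


# How many full re-imports would explain the observed backlog?
full_reimports = QUEUED_BUNDLES / ROWS_PER_FETCH
hours_of_runs = full_reimports * 30 / 60

print(f"{full_reimports:.1f} full re-imports, about {hours_of_runs:.1f} hours of runs")
print(bundles_per_day(30))       # one run every 30 minutes
print(bundles_per_day(12 * 60))  # two runs a day
```

Under that assumption, the backlog corresponds to roughly 87 full re-imports, or a bit under two days of 30 minute runs, and fetching twice a day would queue 54,000 bundles per day instead of 1,296,000.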

Lhorus6 added the "bug" and "needs triage" labels on Oct 3, 2024
@nino-filigran

I've started a feed to reproduce, will let you know about the output

nino-filigran added the "needs more info" label and removed the "needs triage" label on Oct 4, 2024
@richard-julien
Member

We compute the hash on the full file.
We can't really do much in terms of data control.
To prevent too many works, I am currently trying not to create any job if there is something already in the queue.
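
The guard described here could look roughly like the sketch below. This is not the code from the platform: `FeedState`, `run_feed`, the in-memory queue and the SHA-256 hash are all hypothetical stand-ins used to illustrate the idea of skipping a run while earlier bundles are still pending.

```python
import hashlib
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class FeedState:
    # Hypothetical per-feed state: last file hash and pending bundles.
    last_hash: Optional[str] = None
    queue: Deque[str] = field(default_factory=deque)


def run_feed(state: FeedState, body: bytes) -> str:
    # Guard 1: do not create a new job while the queue still holds work.
    if state.queue:
        return "skipped: queue not empty"
    # Guard 2: do not create a job when the full-file hash is unchanged.
    digest = hashlib.sha256(body).hexdigest()
    if digest == state.last_hash:
        return "skipped: file unchanged"
    # Otherwise queue one entry per non-empty row and remember the hash.
    for line in body.decode().splitlines():
        if line.strip():
            state.queue.append(line.strip())
    state.last_hash = digest
    return "queued"


state = FeedState()
print(run_feed(state, b"1.1.1.1\n2.2.2.2\n"))  # first run queues both rows
print(run_feed(state, b"1.1.1.1\n3.3.3.3\n"))  # queue not drained yet
```

The trade-off in this sketch is that a changed file is not picked up until the previous run has been fully consumed, which bounds the queue at one file's worth of bundles per feed.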

nino-filigran removed the "needs more info" label on Oct 7, 2024
@nino-filigran

I reproduced your issue @Lhorus6. So based on your comment @richard-julien, can I consider it a "won't fix"?

@Lhorus6
Author

Lhorus6 commented Oct 7, 2024

Maybe it's not the hash calculation we need to change, but something has to be done in any case IMO. Right now we're blowing up the ingestion queues.

Julien said "To prevent too much works I currently try to not create any job if there is something already in the queue", so I guess he is testing possible improvements.

@richard-julien
Member

Yes. A PR for testing is opened here: #8617

nino-filigran added this to the Bugs backlog milestone on Oct 7, 2024