Duplicated Report IDs #535

Open
fabm3n opened this issue Jul 16, 2024 · 8 comments

Comments

@fabm3n

fabm3n commented Jul 16, 2024

I just started using parsedmarc and got the same DMARC Report from Google twice:
image

According to this Reddit post, this is a TTL issue: https://www.reddit.com/r/DMARC/comments/1bafpk5/getting_multiple_identical_reports_from_google/

In my view, parsedmarc's behaviour is also wrong, because both reports with the same report ID were added to the Elasticsearch database:
image

I expected the report ID to be unique, so there should be only one document in the database.
The best approach might be to overwrite the existing document with the most recently processed DMARC report.
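
One way to picture the "overwrite instead of duplicate" behaviour is to derive the document ID from the report itself, so that saving the same report twice lands on the same key. The sketch below uses a plain dict as a stand-in for the index; `document_id` and `save_record` are hypothetical helpers for illustration, not parsedmarc functions, and the choice of key fields is an assumption.

```python
import hashlib


def document_id(org_name: str, report_id: str, record_index: int) -> str:
    """Hypothetical deterministic ID: the same inputs always give the same ID."""
    key = f"{org_name}|{report_id}|{record_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def save_record(index: dict, org_name: str, report_id: str,
                record_index: int, body: dict) -> None:
    """Store a record under its deterministic ID, replacing any earlier copy."""
    index[document_id(org_name, report_id, record_index)] = body


# A dict stands in for the Elasticsearch index (assumption for illustration).
index = {}
save_record(index, "google.com", "16651655217010351577", 0, {"source": "first mail"})
save_record(index, "google.com", "16651655217010351577", 0, {"source": "second mail"})
```

With this scheme the second save replaces the first, so only the last processed copy remains.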

@fabm3n
Author

fabm3n commented Jul 16, 2024

I checked the code and noticed that this should not happen, because there is a duplicate check:
https://github.com/domainaware/parsedmarc/blob/master/parsedmarc/elastic.py#L410

When I move one of the already processed mails from the archive back to the inbox, the check works:
WARNING:cli.py:100:An aggregate report ID 16651655217010351577 from google.com about DOMAIN with a date range of 2024-07-13 00:00:00Z UTC to 2024-07-13 23:59:59Z UTC already exists in Elasticsearch

I have enabled debug logging and will try to reproduce the issue.

@fabm3n
Author

fabm3n commented Jul 16, 2024

OK, I have already reproduced it. I cleared my index and moved both mails into the inbox.
They were then processed as a batch, and it looks like no duplicate check took place:

Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:__init__.py:1433:Found 2 messages in INBOX
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:__init__.py:1441:Processing 2 messages
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:__init__.py:1445:Processing message 1 of 2: UID 6
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:     INFO:__init__.py:1085:Parsing mail from noreply-dmarc-support@google.com on 2024-07-13 16:59:59-07:00
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:utils.py:388:IP address IPwas found in cache
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:__init__.py:1445:Processing message 2 of 2: UID 7
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:     INFO:__init__.py:1085:Parsing mail from noreply-dmarc-support@google.com on 2024-07-13 16:59:59-07:00
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:utils.py:388:IP address IPwas found in cache
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:__init__.py:1506:Moving aggregate report messages from INBOX to Archive/Aggregate
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:__init__.py:1513:Moving message 1 of 2: UID 6
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:__init__.py:1513:Moving message 2 of 2: UID 7
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:     INFO:elastic.py:369:Saving aggregate report to Elasticsearch
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:    DEBUG:elastic.py:289:Creating Elasticsearch index: dmarc_aggregate-2024-07-13
Jul 16 13:18:29 parsedmarc parsedmarc[1023588]:     INFO:elastic.py:369:Saving aggregate report to Elasticsearch
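
The log shows both messages being parsed first and both reports being saved afterwards. A minimal sketch of a duplicate check that works inside a single batch, independent of the database, could look like the following. The flat dict layout and the field names (`org_name`, `report_id`, `domain`, `begin`, `end`) are assumptions chosen to mirror the fields in the warning message above; `dedupe_batch` is a hypothetical helper, not part of parsedmarc.

```python
def dedupe_batch(reports):
    """Return the reports with in-batch duplicates removed, keeping the first copy.

    Each report is assumed to be a flat dict; the key fields are hypothetical.
    """
    seen = set()
    unique = []
    for report in reports:
        key = (
            report["org_name"],
            report["report_id"],
            report["domain"],
            report["begin"],
            report["end"],
        )
        if key in seen:
            continue  # same report already present earlier in this batch
        seen.add(key)
        unique.append(report)
    return unique


# Two mails carrying the same report, as in the log above.
report = {
    "org_name": "google.com",
    "report_id": "16651655217010351577",
    "domain": "example.com",  # placeholder domain
    "begin": "2024-07-13 00:00:00",
    "end": "2024-07-13 23:59:59",
}
batch = [dict(report), dict(report)]
unique_reports = dedupe_batch(batch)
```

Running the check over the parsed batch before any save means the second copy never reaches the database.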

@EwenBara

EwenBara commented Dec 24, 2024

I observe the same thing. But when I open the XML report from Google, it contains three "record" elements, which correspond to my three rows in Elasticsearch.

@seanthegeek
Contributor

> I observe same thing. But when I open the XML report from Google, I have three "record" corresponding to my three rows in Elasticsearch.

Correct. Each report can have multiple records, and each of those records becomes a separate event in Elastic/Splunk/CSVs/etc. So multiple entries with the same report ID are normal.
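
As a rough illustration of that one-to-many relationship, the sketch below flattens one report into one row per record, with every row carrying the same report ID. The dict layout and the `report_to_rows` helper are assumptions for illustration and do not reflect parsedmarc's actual data structures.

```python
def report_to_rows(report):
    """Flatten one aggregate report into one row per record.

    Hypothetical layout: report-level fields plus a "records" list of dicts.
    """
    return [
        {
            "org_name": report["org_name"],
            "report_id": report["report_id"],
            **record,
        }
        for record in report["records"]
    ]


# A report with three records yields three rows sharing one report ID.
example_report = {
    "org_name": "google.com",
    "report_id": "example-report-id",  # placeholder value
    "records": [
        {"source_ip": "192.0.2.1", "count": 4},
        {"source_ip": "192.0.2.2", "count": 1},
        {"source_ip": "192.0.2.3", "count": 7},
    ],
}
rows = report_to_rows(example_report)
```

Counting rows per report ID therefore does not reveal duplicates on its own; the same record appearing more than once does.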

@fabm3n
Author

fabm3n commented Dec 25, 2024

@seanthegeek My issue is not about multiple entries in one XML file, which is what @EwenBara is reporting.
I received two mails containing the same XML file (each file has one entry, and the report ID is the same), and parsedmarc imported both.

> For me, the behaviour from parsedmarc is also wrong because both reports with the same report id have been added to the Elasticsearch database

Please reread my initial issue report.

Can you please reopen my issue?

EDIT: As mentioned in my previous comment, this looks like a missing deduplication check for the report ID.
If needed, I can share my original XML files from Google with you.

seanthegeek reopened this Dec 25, 2024
@EwenBara

I ran some tests and can reproduce the issue. It happens when the two reports are parsed in the same batch.

To reproduce:

  • Identify a duplicated report for testing
  • Stop parsedmarc
  • Clear the corresponding Elasticsearch index
  • Move the reports into INBOX
  • Set batch_size to 0
  • Start parsedmarc

To confirm, I ran the same test with batch_size set to 1. In this case, the second report is not saved in Elasticsearch.
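
The thread does not establish the root cause. One possible explanation consistent with these two observations is that the existence check queries the database, and a document written a moment earlier is not yet visible to that query (Elasticsearch makes newly indexed documents searchable only after a refresh). The toy model below is a sketch of that hypothesis only; `ToyStore` and `ingest` are hypothetical names, and the assumption that a refresh happens between batches but not within one is not confirmed by the source.

```python
class ToyStore:
    """Toy store where saved documents become searchable only after refresh()."""

    def __init__(self):
        self.docs = []        # everything written so far
        self._searchable = 0  # number of documents visible to exists()

    def exists(self, report_id):
        return report_id in self.docs[: self._searchable]

    def save(self, report_id):
        self.docs.append(report_id)

    def refresh(self):
        self._searchable = len(self.docs)


def ingest(store, batches):
    """Check-then-save each report; assume a refresh happens only between batches."""
    for batch in batches:
        for report_id in batch:
            if store.exists(report_id):
                continue  # duplicate detected by the existence check
            store.save(report_id)
        store.refresh()


# Both copies in one batch: the check cannot see the first copy yet.
same_batch = ToyStore()
ingest(same_batch, [["A", "A"]])

# One copy per batch: the first copy is visible when the second is checked.
one_per_batch = ToyStore()
ingest(one_per_batch, [["A"], ["A"]])
```

Under these assumptions the single-batch run stores the report twice, while the one-message-per-batch run stores it once, matching the behaviour described above.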

seanthegeek added a commit that referenced this issue Dec 25, 2024
- Ignore aggregate DMARC reports seen within a period of one hour (#535)
@seanthegeek
Contributor

@fabm3n My apologies. I just published a release to try and solve this problem. It will keep track of up to 1 million report IDs seen in an hour and ignore duplicates.
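
For readers unfamiliar with the idea, a size-limited cache of recently seen report IDs with a one-hour lifetime can be sketched with the standard library as follows. This is not the code from the release; `SeenReportIds`, its parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time
from collections import OrderedDict


class SeenReportIds:
    """Remember report IDs for a limited time, up to a maximum number of entries."""

    def __init__(self, max_len=1_000_000, max_age_seconds=3600, clock=time.monotonic):
        self._max_len = max_len
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = OrderedDict()  # report_id -> time first seen, oldest first

    def seen_before(self, report_id):
        """Return True for a duplicate; otherwise record the ID and return False."""
        now = self._clock()
        # Entries are never re-stamped, so the oldest entry is always first.
        while self._entries:
            _, stamp = next(iter(self._entries.items()))
            if now - stamp < self._max_age:
                break
            self._entries.popitem(last=False)  # drop the expired entry
        if report_id in self._entries:
            return True
        self._entries[report_id] = now
        if len(self._entries) > self._max_len:
            self._entries.popitem(last=False)  # drop the oldest entry when full
        return False
```

A caller would skip any report for which `seen_before(report_id)` returns True. Because the cache lives in memory and does not depend on the database, it can also catch duplicates that arrive within the same batch.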

8.16.0...8.16.1#diff-a1dcb2664f7e405007ed531c6c33eb4432f86a2fe4f4782a4763c95811f5754f

Let me know if this solves the problem.

@fabm3n
Copy link
Author

fabm3n commented Dec 30, 2024

This didn't fix my issue with the duplicated report ID. For testing, I moved all of my reports (176 in total) into the inbox again.
I received the following report ID twice from Google:
image

When I search the Elasticsearch indices, I find the ID twice:
image

This is the debug log:
Dec 30 20:44:57 parsedmarc parsedmarc[669131]: DEBUG:__init__.py:1619:Processing 2 messages
Dec 30 20:44:57 parsedmarc parsedmarc[669131]: DEBUG:__init__.py:1623:Processing message 1 of 2: UID 383
Dec 30 20:44:57 parsedmarc parsedmarc[669131]: INFO:__init__.py:1199:Parsing mail from noreply-dmarc-support@google.com on 2024-08-18 16:59:59-07:00
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: DEBUG:utils.py:347:Trying to fetch reverse DNS map from https://raw.githubusercontent.com/domainaware/parsedmarc/master/parsedmarc/resources/maps/base_reverse_dns_map.csv...
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: DEBUG:utils.py:439:IP address 2a01:::1 added to cache
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: DEBUG:__init__.py:1623:Processing message 2 of 2: UID 384
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: INFO:__init__.py:1199:Parsing mail from noreply-dmarc-support@google.com on 2024-08-18 16:59:59-07:00
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: DEBUG:utils.py:406:IP address 2a01::1 was found in cache
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: DEBUG:__init__.py:1696:Moving aggregate report messages from INBOX to Archive/Aggregate
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: DEBUG:__init__.py:1704:Moving message 1 of 2: UID 383
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: DEBUG:__init__.py:1704:Moving message 2 of 2: UID 384
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: INFO:elastic.py:390:Saving aggregate report to Elasticsearch
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: DEBUG:elastic.py:312:Creating Elasticsearch index: dmarc_aggregate-2024-08
Dec 30 20:44:58 parsedmarc parsedmarc[669131]: INFO:elastic.py:390:Saving aggregate report to Elasticsearch

The debug message "Skipping duplicate report ID" is missing, so the deduplication does not work.
