Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discord parser #1859

Merged
merged 87 commits into from
Feb 27, 2024
Merged

Discord parser #1859

merged 87 commits into from
Feb 27, 2024

Conversation

felipecampanini
Copy link
Contributor

No description provided.

Performance improvement and disk usage in the CacheAddr class, now it
returns the InputStream of exactly the necessary size.
Suitability for using the corrected getInputStream method of the
CacheAddr class
The DiscordParser class is now able to parse any cache that has the
Google Chrome cache structure, and all of its subitems. When a cached
Discord chat file is found, the original JSON file is parsed separately
from the chat file. Therefore, for each chat found, there will be an
original JSON (found in cache) and a chat interpreted in HTML.
Deduplication of participant names in calls
Change to display AM/PM indication and time in 24h format for Brazilian
language.
BUGFIX - If there are two files with the same name, the parser could
show the wrong file. Now the file is fetched based on the name and id
contained in the URL.
BUGFIX - The character ")" was being printed wrongly in the HTML;
@lfcnassif
Copy link
Member

lfcnassif commented Sep 1, 2023

Thank you @felipecampanini! Have you noticed merge conflicts below? Are you comfortable to fix them?

@felipecampanini
Copy link
Contributor Author

Hello @lfcnassif, sorry for the delay. I will review the conflicts

@lfcnassif
Copy link
Member

@felipecampanini, @patrickdalla was trying to review this but he found merge conflicts in thousands of files. I took a look and seems you did a bad reverse merge of master here. I needed to force-push to fix that incorrect merge. Please take a look and check if I reverted some of your logic changes unintentionally, so @patrickdalla can proceed, thanks!

@lfcnassif
Copy link
Member

At a first look, seems the date format in localization files were fixed just into the pt-BR localization and an AM/PM marker was introduced just in the English one, missing the other 3 localization files. I suggest always using 24h in all languages.

@patrickdalla
Copy link
Collaborator

Hi @felipecampanini . It seems the implementation mixes generic chrome cache index parsing with discord specific parsing. Wouldn't it be better to create a generic chrome index cache parser, and a specific discord parser?
It seems the generic chrome parser could be very useful to parse some other content than discord content.
Maybe the discord parser could create the chats items querying the items from lucene already extracted by the generic chrome index cache parser.

@lfcnassif
Copy link
Member

lfcnassif commented Nov 1, 2023

@felipecampanini does this contain all proposed changes on #1568, right?

@lfcnassif lfcnassif linked an issue Nov 1, 2023 that may be closed by this pull request
@felipecampanini
Copy link
Contributor Author

@felipecampanini does this contain all proposed changes on #1568, right?

Exactly

@felipecampanini
Copy link
Contributor Author

Hi @felipecampanini . It seems the implementation mixes generic chrome cache index parsing with discord specific parsing. Wouldn't it be better to create a generic chrome index cache parser, and a specific discord parser? It seems the generic chrome parser could be very useful to parse some other content than discord content. Maybe the discord parser could create the chats items querying the items from lucene already extracted by the generic chrome index cache parser.

@patrickdalla, I even discussed this with lfcnassif in #1568.

The discord app uses the exact cache structure of Google Chrome. Therefore, I believe it would be ideal to use the same parser for both applications (Chrome and Discord).

However, the name should be ChromeCacheParser or something similar, but as explained by @lfcnassif, renaming the parser could cause some problems.

The same chat file that can be found in Chrome's cache can also be found in Discord's cache, so I believe they should be the same parser.

@felipecampanini
Copy link
Contributor Author

@felipecampanini, @patrickdalla was trying to review this but he found merge conflicts in thousands of files. I took a look and seems you did a bad reverse merge of master here. I needed to force-push to fix that incorrect merge. Please take a look and check if I reverted some of your logic changes unintentionally, so @patrickdalla can proceed, thanks!

I will review it calmly on Monday, sorry for the delay, this week I needed to focus on some urgent demands from the sector.

@lfcnassif
Copy link
Member

However, the name should be ChromeCacheParser or something similar, but as explained by @lfcnassif, renaming the parser could cause some problems.

One option is having 2 parsers: renaming the current one to ChromeCacheParser and creating a new one with the old name, extending the renamed parser. So things would continue working with old profiles.

The same chat file that can be found in Chrome's cache can also be found in Discord's cache, so I believe they should be the same parser.

Out of curiosity, is this situation detected and deduplicated by current code?

I will review it calmly on Monday, sorry for the delay, this week I needed to focus on some urgent demands from the sector.

I totally understand, don't worry.

@lfcnassif
Copy link
Member

lfcnassif commented Feb 22, 2024

After commit c5eb90d (propagate exception instead of silently catching it), I surprisingly got 25% more decoded chats and almost 50% more instant messages, although more parsing exceptions happen now. @felipecampanini or @patrickdalla, could some of you double check the commit to confirm it is fine?

@patrickdalla
Copy link
Collaborator

patrickdalla commented Feb 23, 2024

After commit c5eb90d (propagate exception instead of silently catching it), I surprisingly got 25% more decoded chats and almost 50% more instant messages, although more parsing exceptions happen now. @felipecampanini or @patrickdalla, could some of you double check the commit to confirm it is fine?

Are you using as sample your browser navigation? I came across some similar strange situation also, when using my own browser and discord session cache to test. My chats did not appear at first in IPED result, although I have seen them on the browser. On a latter test, many chats appeared in IPED, from channels I have subscribed, although I have not seen them on browser. It seemed some Discord background processing that works with the cache, in second case, and delayed cache flush, in first, as the real index files content used (seeing them with other tools) were consistent with IPED results. So I ignored.

If yes, could you revert your simple change on commit and test again to see if something the chat appears.

At first sight, this change isn't enough to recognize or recover any new chat.

If not, could you share your test base?

@lfcnassif
Copy link
Member

Are you using as sample your browser navigation?

No, I'm using a fixed and immutable data set extracted from real cases...

@lfcnassif
Copy link
Member

If not, could you share your test base?

Sure.

@lfcnassif
Copy link
Member

lfcnassif commented Feb 23, 2024

Before last commits:
image

After them, looks much better:
image

If there are no objections, I will merge with last commits. I just want to configure a few metadata types of new metadata keys, to avoid unneeded :number suffixes duplicating keys:
image

@patrickdalla
Copy link
Collaborator

@lfcnassif I could not repeat the first image of last msg with the test sample you sent me.
Running with code till commit de7b6a0 gave me a similar result of the second image (except that discord JSON index entries was being counted as Chat category).

One thing I can observe in it is: It does not have any "browser artifacts" category items. So it seems you have run a yet older code commit, where the "Chrome Cache" subcategory wasn't defined.

@patrickdalla
Copy link
Collaborator

patrickdalla commented Feb 24, 2024

@lfcnassif: About the removal of Whatsapp related CSS, if I well remember, I have put it to make Audio and Video attachments reproducible directly from HTML exported via html report.
I could not find any audio or video attached to any of the chats in your sample, so I suppose you have not tested this possibility.
Could you review it?

@lfcnassif
Copy link
Member

lfcnassif commented Feb 24, 2024

Thanks @patrickdalla for your tests!

I'll try to reproduce the first image I sent. About whatsapp css removal, I made it because the whatsapp chat background image was being displayed in some Discord chats, giving a very weird look. Please send me a sample with one audio or video as attachment, so I can test the change with them.

@lfcnassif
Copy link
Member

Hi @patrickdalla, you're right, I did a mvn clean install and the same number of artifacts were recovered before and after my logic changes. About the WhatsApp CSS removal, this was what motivated me, this background shows up in many Discord chats:
image

@wladimirleite
Copy link
Member

About the WhatsApp CSS removal, this was what motivated me, this background shows up in many Discord chats:

I think it would be better to have another CSS, even if it means some duplication. Otherwise changes in WhatsApp CSS (which is already complex) may cause undesired side effects in other parsers.

@lfcnassif
Copy link
Member

lfcnassif commented Feb 27, 2024

I finished my review and pushed some fixes and adjustments. Thank you very much @felipecampanini and @patrickdalla for this great work!

@lfcnassif lfcnassif merged commit 041638b into sepinf-inc:master Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Google Chrome Cache Parser
4 participants