
Reduce memory usage when reading json file #906

Merged · 9 commits · Feb 8, 2023

Conversation


@fukusuket (Collaborator) commented Feb 4, 2023

I found a place where memory usage can be reduced when reading the JSON file, so I changed it a little :)

What Changed

Process the file line by line instead of reading the whole JSON at once.
(Sorry if I have misunderstood the spec and this change is not possible ...)
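The idea above can be sketched as follows. This is a minimal illustration using only the standard library (the actual PR parses each line with serde_json and feeds it to detection); the input string and record counting are hypothetical stand-ins for the real file handling:

```rust
use std::io::{BufRead, BufReader, Cursor};

fn main() {
    // Hypothetical JSONL input; in the PR the data comes from a file on disk.
    let data = "{\"a\":1}\n{\"a\":2}\n{\"a\":3}\n";

    // Before: the whole file is read into one String, so peak memory
    // grows with the file size (plus every parsed value held at once).
    let whole: String = data.to_string();
    assert_eq!(whole.lines().count(), 3);

    // After: stream line by line; only one record's text is resident
    // at a time before it is parsed and handed to detection.
    let reader = BufReader::new(Cursor::new(data));
    let mut records = 0usize;
    for line in reader.lines() {
        let line = line.expect("read error");
        if !line.trim().is_empty() {
            records += 1; // here each line would be parsed, e.g. with serde_json
        }
    }
    assert_eq!(records, 3);
}
```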

Evidence

Environment

  • OS: macOS 13.1
  • Hardware: MacBook Air (M1, 2020), 8 GB memory, 8 cores

Test1

OTRF/Security-Datasets apt29/day2 json data.

| Version | Elapsed time | Memory (peak) | Events with hits / Total events | Output file size (bytes) |
|---|---|---|---|---|
| before | 00:02:57.420 | 12.8 GiB | 22,272 / 587,286 | 210356244 |
| This PR | 00:02:26.988 | 1.2 GiB | 22,272 / 587,286 | 210349945 |

The number of detections is the same, but the output file sizes differ for the following reason:

before
< 2020-05-02 17:25:39.884 +09:00,UTICA.dmevals.local,Sysmon,11,info,1043132,File Created,"Path: C:\Windows\Temp\{44FCCA49-A5F2-4F9C-8C69-F85E76CBBC93}, - OProcSessId.dat ¦ Proc: C:\Program Files\Common Files\Microsoft Shared\ClickToRun\OfficeC2RClient.exe ¦ PID: 8948 ¦ PGUID: {8320f18b-2de8-5ead-3b01-000000000500}"
---
This PR
> 2020-05-02 17:25:39.884 +09:00,UTICA.dmevals.local,Sysmon,11,info,1043132,File Created,Path: C:\Windows\Temp\{44FCCA49-A5F2-4F9C-8C69-F85E76CBBC93} - OProcSessId.dat ¦ Proc: C:\Program Files\Common Files\Microsoft Shared\ClickToRun\OfficeC2RClient.exe ¦ PID: 8948 ¦ PGUID: {8320f18b-2de8-5ead-3b01-000000000500}

In the main branch, when `}` was included in a log value, an invalid `,` was appended after it, so this PR removes that invalid `,`.
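For context, the correct behavior is that a CSV field containing a separator must be quoted as a whole rather than spliced with extra characters. A minimal sketch of RFC 4180-style field quoting (illustrative only, not Hayabusa's actual code; `csv_field` and the sample value are made up for this example):

```rust
// Quote a CSV field only when it contains a comma, quote, or newline;
// embedded quotes are doubled per RFC 4180.
fn csv_field(value: &str) -> String {
    if value.contains(',') || value.contains('"') || value.contains('\n') {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

fn main() {
    // A value containing `}` but no comma needs no quoting and no
    // extra separator, which is exactly what the PR fixes.
    let details = "Path: C:\\Temp\\{GUID} - file.dat ¦ PID: 8948";
    assert_eq!(csv_field(details), details);
    // A value containing a comma is quoted as one field.
    assert_eq!(csv_field("a,b"), "\"a,b\"");
}
```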

Console output

before

Results Summary:

Events with hits / Total events: 22,272 / 587,286 (Data reduction: 565,014 events (96.21%))

Total | Unique detections: 176,839 | 163
Total | Unique critical detections: 0 (0.00%) | 0 (0.00%)
Total | Unique high detections: 317 (0.18%) | 42 (25.77%)
Total | Unique medium detections: 4,681 (2.65%) | 45 (27.61%)
Total | Unique low detections: 100,105 (56.61%) | 33 (20.25%)
Total | Unique informational detections: 71,736 (40.57%) | 43 (26.38%)
...
Saved file: main2.csv (210.3 MB)
Elapsed time: 00:02:57.420
Rule Parse Processing Time: 00:00:00.899
Analysis Processing Time: 00:02:55.140
Output Processing Time: 00:00:01.380

Memory usage stats:
heap stats:    peak      total      freed    current       unit      count
  reserved:   12.8 GiB   24.2 GiB   16.3 GiB    7.9 GiB
 committed:   12.6 GiB   31.1 GiB   28.5 GiB    2.6 GiB

This PR

Results Summary:

Events with hits / Total events: 22,272 / 587,286 (Data reduction: 565,014 events (96.21%))

Total | Unique detections: 176,839 | 163
Total | Unique critical detections: 0 (0.00%) | 0 (0.00%)
Total | Unique high detections: 317 (0.18%) | 42 (25.77%)
Total | Unique medium detections: 4,681 (2.65%) | 45 (27.61%)
Total | Unique low detections: 100,105 (56.61%) | 33 (20.25%)
Total | Unique informational detections: 71,736 (40.57%) | 43 (26.38%)
...
Saved file: new.csv (210.3 MB)
Elapsed time: 00:02:26.988
Rule Parse Processing Time: 00:00:00.839
Analysis Processing Time: 00:02:25.428
Output Processing Time: 00:00:00.720

Memory usage stats:
heap stats:    peak      total      freed    current       unit      count
  reserved:    1.2 GiB    1.2 GiB      0        1.2 GiB
 committed:    1.2 GiB    8.7 GiB    7.7 GiB  948.5 MiB

Test2

OTRF/Security-Datasets apt29/day1 json data.

| Version | Elapsed time | Memory (peak) | Events with hits / Total events | Output file size (bytes) |
|---|---|---|---|---|
| before | 00:00:36.544 | 3.9 GiB | 25,957 / 196,081 | 19911194 |
| This PR | 00:00:36.623 | 864.1 MiB | 25,957 / 196,081 | 19906946 |

I would appreciate it if you could review🙏

@fukusuket fukusuket self-assigned this Feb 4, 2023
@fukusuket fukusuket added the `enhancement` (New feature or request) label Feb 4, 2023

@hitenkoku hitenkoku left a comment


@fukusuket Thanks for your Pull Request.

With your suggested approach, the input is expected to be a JSON file with one record per line.

Review thread on src/detections/utils.rs (outdated, resolved)
codecov bot commented Feb 7, 2023

Codecov Report

Base: 74.81% // Head: 74.91% // Increases project coverage by +0.09% 🎉

Coverage data is based on head (f027861) compared to base (c57e6a4).
Patch coverage: 97.77% of modified lines in pull request are covered.

❗ Current head f027861 differs from pull request most recent head e3c12c5. Consider uploading reports for the commit e3c12c5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #906      +/-   ##
==========================================
+ Coverage   74.81%   74.91%   +0.09%     
==========================================
  Files          24       24              
  Lines       15663    15722      +59     
==========================================
+ Hits        11719    11778      +59     
  Misses       3944     3944              
| Impacted Files | Coverage Δ |
|---|---|
| src/detections/utils.rs | 82.36% <97.72%> (+1.68%) ⬆️ |
| src/main.rs | 27.35% <100.00%> (ø) |
| src/afterfact.rs | 37.15% <0.00%> (-0.08%) ⬇️ |
| src/detections/configs.rs | 56.76% <0.00%> (+0.06%) ⬆️ |

☔ View full report at Codecov.


@fukusuket (Collaborator, Author) commented Feb 7, 2023

@hitenkoku @YamatoSecurity
I have fixed it with the following specs! I would appreciate it if you could review🙇

Supported JSON formats

I changed it to support the following 3 formats:

1. JSONL (converted line by line; only this pattern reduces memory usage)
   ```
   {key : val}
   {key : val}
   ```
2. JSON array (the entire file is converted at once)
   ```
   [
       {
           key : val
       },
       {
           key : val
       },
   ]
   ```
3. `jq -C` output (the entire file is converted at once)
   (Strictly speaking, this format is invalid JSON, but it is likely to be used frequently, so it is supported.)
   ```
   {
       key : val
   }
   {
       key : val
   }
   ```
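One way to distinguish the three formats is to peek at the shape of the input before choosing a reader. The sketch below is an assumed, simplified version of such detection (the function name `detect_format` and the heuristics are mine, not taken from the PR): an input starting with `[` is an array, a first line that is a complete object suggests JSONL, and anything else starting with `{` is treated as concatenated pretty-printed objects.

```rust
// Decide how to read the input by inspecting its first characters.
// Hypothetical helper; the real PR's logic may differ.
fn detect_format(head: &str) -> &'static str {
    match head.trim_start().chars().next() {
        Some('[') => "array", // pattern 2: parse the whole file at once
        Some('{') => {
            // A JSONL line is a complete object on a single line.
            let first_line = head.lines().next().unwrap_or("");
            if first_line.trim_start().starts_with('{')
                && first_line.trim_end().ends_with('}')
            {
                "jsonl" // pattern 1: stream line by line
            } else {
                "concatenated" // pattern 3: parse the whole file at once
            }
        }
        _ => "unknown",
    }
}

fn main() {
    assert_eq!(detect_format("{\"k\":1}\n{\"k\":2}\n"), "jsonl");
    assert_eq!(detect_format("[\n  {\"k\":1}\n]\n"), "array");
    assert_eq!(detect_format("{\n  \"k\": 1\n}\n"), "concatenated");
}
```

Note that a single-line heuristic like this cannot be perfect for every input; serde_json's `StreamDeserializer` is one established way to consume concatenated JSON values without this kind of guessing.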

Benchmark

OTRF/Security-Datasets apt29/day2 json data.

| Version | Elapsed time | Memory (peak) | Events with hits / Total events | Output file size (bytes) |
|---|---|---|---|---|
| main | 00:02:49.293 | 12.8 GiB | 22,305 / 587,286 | 210380953 |
| after | 00:02:30.529 | 5.4 GiB | 22,305 / 587,286 | 210374654 |

The improvement is smaller than in the earlier measurement...
It looks like it can be improved further, but investigating will take time, so I will address it in another PR.


@YamatoSecurity YamatoSecurity left a comment


I tested csv-timeline and json-timeline against the two APT29 files (2.1 GB).
Processing time and detections are the same.
Memory usage went from 12.7 GB down to 5.4 GB.
LGTM!
Thank you so much!


@hitenkoku hitenkoku left a comment


Thanks. LGTM

@hitenkoku hitenkoku merged commit 9a3b8b3 into main Feb 8, 2023
@hitenkoku hitenkoku deleted the improve-memory-usage-by-reduct-string-allocation branch February 8, 2023 00:51
@fukusuket

Thank you so much for the review and benchmark :)
