Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in_sample: duplicate record to support destructive change #4586

Merged
merged 1 commit into from
Aug 16, 2024

Conversation

kenhys
Copy link
Contributor

@kenhys kenhys commented Aug 14, 2024

Which issue(s) this PR fixes:

Fixes #4575

What this PR does / why we need it:

When in_sample plugin is used with filter parser which uses remove_key_name_field, it raises the following error repeatedly.

  #0 dump an error event: error_class=ArgumentError error="message does not exist"

This kind of error occurs when key_name and remove_key_name_field removes key from record with destructive change in filter parser affects generated sample data.

Docs Changes:

TODO: Add reuse_record parameter.

Release Note:

N/A

@kenhys
Copy link
Contributor Author

kenhys commented Aug 14, 2024

checking whether it does not break existing test cases with CI.

@kenhys kenhys added this to the v1.17.1 milestone Aug 14, 2024
@kenhys kenhys marked this pull request as ready for review August 14, 2024 06:55
@kenhys kenhys requested a review from daipom August 14, 2024 08:10
Copy link
Contributor

@daipom daipom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix!
Shouldn't we remove this dup?

if @auto_increment_key
d = d.dup
d[@auto_increment_key] = @storage.update(:auto_increment_value){|v| v + 1 }
end

@kenhys
Copy link
Contributor Author

kenhys commented Aug 14, 2024

Thanks, it should be removed.

@kenhys kenhys requested a review from daipom August 14, 2024 08:34
daipom
daipom previously approved these changes Aug 14, 2024
Copy link
Contributor

@daipom daipom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested the impact on performance.
There seems to be a noticeable load on the CPU and RES for large data flows.

It seems acceptable to me because it is hard to imagine that the performance of in_sample would be a problem.
In addition, this fix looks very natural for in_sample's design.

This may affect users using in_sample to test performance.
So, it would be enough to report at the release that the in_sample has become a bit overloaded when flowing large data for testing.

I would like to hear one more person's opinion just to be sure.


the impact on performance

<source>
  @type sample
  tag test
  size xxx
  rate xxx
</source>

size: 10000, rate: 100

Current

$ top -o RES -p xxx
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1610784 daipom    20   0  273060  61036  11152 S  41.6   0.2   0:10.14 ruby

This PR

$ top -o RES -p xxx
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1610963 daipom    20   0  274264  69680  11260 S  57.0   0.2   0:09.73 ruby

size: 100000, rate: 100

Current

$ top -o RES -p xxx
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1611452 daipom    20   0  289000  80492  11072 S 101.0   0.2   0:16.52 ruby

This PR

$ top -o RES -p xxx
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1611243 daipom    20   0  304720 148012  11152 S 100.0   0.5   0:11.36 ruby

@kenhys
Copy link
Contributor Author

kenhys commented Aug 14, 2024

Thanks for checking in point of performance.
It is expected that there is performance regression in some extent.

If user uses in_sample for their own plugin benchmark, it affects because baseline was changed.
But I guess that it is very limited use case.
(It is enough to mention in changelog explicitly IMHO)

@daipom
Copy link
Contributor

daipom commented Aug 15, 2024

I'll wait a bit to see if anyone else has an opinion.

@kenhys
Copy link
Contributor Author

kenhys commented Aug 16, 2024

Added renew_record parameter like this:

<source>
    @type sample
    tag test
    size 100000
    rate 100
    #renew_record false
    renew_record true
</source>
<match test>
  @type null
</match>

size: 10000

current master
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 201025 kenhys    20   0  340940  54800  13648 S  24.6   0.1   0:04.62 ruby

renew_record: false
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 202342 kenhys    20   0  340948  54732  13576 S  25.7   0.1   0:06.83 ruby

renew_record: true
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 203402 kenhys    20   0  343288  63556  13500 S  30.7   0.1   0:06.86 ruby

size: 100000

current master
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 204681 kenhys    20   0  364324  80880  13448 S 100.0   0.1   0:14.46 ruby

renew_record: false
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 205560 kenhys    20   0  364332  80868  13448 S  99.7   0.1   0:31.34 ruby

renew_record: true
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 207133 kenhys    20   0  365804 113320  13576 S  99.7   0.2 0:13.87 ruby

With added conditional switch, there is a slightly performance regression with previous behavior.
but it is better to give user a choice which behavior is suitable.

@kenhys
Copy link
Contributor Author

kenhys commented Aug 16, 2024

Changed to more conservative approach.

What do you think? @daipom

Copy link
Contributor

@daipom daipom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

lib/fluent/plugin/in_sample.rb Outdated Show resolved Hide resolved
lib/fluent/plugin/in_sample.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@daipom daipom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It's very clear with reuse_record!

Sorry for the minor points..., but the following are points of concern.

lib/fluent/plugin/in_sample.rb Outdated Show resolved Hide resolved
lib/fluent/plugin/in_sample.rb Outdated Show resolved Hide resolved
When in_sample plugin is used with filter parser which uses
remove_key_name_field, it raises the following error repeatedly.

  #0 dump an error event: error_class=ArgumentError error="message
  does not exist"

This kind of error occurs when key_name and remove_key_name_field
removes key from record with destructive change in filter parser.
It affects generated sample data.

To fix this issue, it is simple to just dup every record even though
it has a significant performance penalty.

Considering keeping compatibility and providing way to a workaround,
added option to enable previous behavior - reuse_record, disabled by default.

ref. fluent#4575

Here is the small benchmark:

<source>
    @type sample
    tag test
    size xxx
    rate 100
    reuse_record
</source>
<match test>
  @type null
</match>

size: 100000

 master:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   330767 kenhys    20   0  364316  81036  13620 S 100.3   0.1   0:52.10 ruby

 reuse_record: true

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   333640 kenhys    20   0  364328  80956  13560 S 100.0   0.1   0:17.04 ruby

 reuse_record: false

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   335843 kenhys    20   0  366188 113300  13536 S 100.3   0.2   0:17.24 ruby

Signed-off-by: Kentaro Hayashi <hayashi@clear-code.com>
Copy link
Contributor

@daipom daipom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@kenhys kenhys merged commit accd0c8 into fluent:master Aug 16, 2024
14 of 16 checks passed
@kenhys kenhys deleted the dup-sample branch August 16, 2024 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remove_key_name_field true when key_name is message causes ArgumentError error="message does not exist"
2 participants