-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update code, use Rufus scheduler, load dictionary better, add "iterate_on" #67
Conversation
CHANGELOG.md
Outdated
@@ -1,3 +1,11 @@ | |||
## 3.1.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are all these changes worthy of a minor bump maybe? any potential BWC issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think you are correct. There are no known BWC issues but there is a new feature.
the impact on throughput by having the refresh in the scheduler thread. | ||
Any ongoing modification of the dictionary file should be done using a | ||
copy/edit/rename or create/rename mechanism to avoid the refresh code from | ||
processing half-baked dictionary content. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
end | ||
end | ||
|
||
def read_file_into_dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't that be better using protected
instead of private
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then initialize_for_file_type
should be protected too, no?
# few intermediate objects as possible | ||
# this overwrites the value at key | ||
IO.foreach(@dictionary_path, :mode => 'r:bom|utf-8') do |line| | ||
@io.string = line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sur eI understand the purpose of this @io
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At line 7 I create one long lasting instance StringIO @io
, at line 8 I pass it into the constructor of a long lasting instance of CSV @csv
which memoizes the io
. Then at line 16 I replace the backing string of @io
this effectively resets the StringIO. I looked at the CSV code, when it gets a String arg, it wraps it in a discardable StringIO unless the arg is an IO object.
I could have done csv = CSV.new(line)
on line 16, but processing a 200K+ line file would mean creating and GC many instances.
Before, the code looked like this.
data = CSV.read(@dictionary_path).inject(Hash.new) do |acc, v|
acc[v[0]] = v[1]
acc
end
refresh_dictionary!(data)
For me the goal here is to support 2+ million line dictionaries because the jdbc_static has bad performance on that size of table. I want to be able to advise people to switch to this filter instead of using jdbc_static for simple key lookup. The seduction of using jdbc_static is not needing to prepare a local file. When this plugin gets the feature of loading a dictionary via JDBC (and later S3 perhaps) then it will be very useful, think Docker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes, I missed the CSV.new(@io)
.
So I'd be curious to actually measure the actual GC impact here. I personally don't think it would make any noticeable difference. IMO such GC related optimizations are usually not necessary unless we are dealing with very hot code paths. For code outside the hot paths I'd suggest code correctness and readability.
OTOH sometimes simple changes can actually avoid superfluous object creation which can be beneficial if there is a valid reason for such specific optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bear in mind that I am optimising for very large files.
def initialize(path, refresh_interval, exact, regex) | ||
@dictionary_path = path | ||
@refresh_interval = refresh_interval | ||
@short_refresh = @refresh_interval < 300.001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<= 300.0
?
@short_refresh = @refresh_interval < 300.001 | ||
rw_lock = java.util.concurrent.locks.ReentrantReadWriteLock.new | ||
@write_lock = rw_lock.writeLock | ||
@dictionary = Hash.new() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Hash.new
?
@fetch_strategy = FetchStrategy::File::ExactRegex.new(@dictionary, rw_lock) | ||
else | ||
@fetch_strategy = FetchStrategy::File::Exact.new(@dictionary, rw_lock) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: could be a one-line using @fetch_strategy = regex ? ... : ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could do that. I was worried about line length readability. As each class has the same constructor signature I could return the class fetch_strategy_class
and instantiate once.
fetch_strategy_class = if exact
regex ? FetchStrategy::File::ExactRegex : FetchStrategy::File::Exact
else
FetchStrategy::File::RegexUnion
end
@fetch_strategy = fetch_strategy_class.new(@dictionary, rw_lock)
Clearer? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meh. I personally prefer the original code but then this is really minor so I leave it up to you.
else | ||
@fetch_strategy = FetchStrategy::File::RegexUnion.new(@dictionary, rw_lock) | ||
end | ||
load_dictionary(raise_exception) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of declaring raise_exception = true
above for the only purpose of passing it here by name, you could also remove the above statement and just do
load_dictionary(raise_exception = true)
if the intent is to be explicit on the parameter purpose. otherwise
load_dictionary(true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was from before and I think the intention is to reveal the meaning behind the boolean.
I think I should add a comment.
load_dictionary(true) # true here means raise an exception once on initial load and not thereafter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup or a constant or as I suggested just load_dictionary(raise_exception = true)
which is a poor's man named parameters. It is completely useless code-wise but serves the purpose of describing the intent of the boolean param.
def loading_exception(e, raise_exception) | ||
msg = "Translate: #{e.message} when loading dictionary file at #{@dictionary_path}" | ||
if raise_exception | ||
raise RuntimeError.new(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure RuntimeError
is appropriate here? I'd suggest creating a custom FileError < StandardError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Previous code.
|
||
def read_file_into_dictionary | ||
content = IO.read(@dictionary_path, :mode => 'r:bom|utf-8') | ||
@dictionary.update(::JSON.load(content)) unless content.nil? || content.empty? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shoud be using LogStash::Json.load
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 reasons to leave as is:
- I benchmarked both - no significant speed up to
LogStash::Json
. - I have a long term goal of removing deps on JrJackson in LS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree, reasons:
- we specifically added the json serialization classes to avoid discrepancies across the LS code base and the plugins for json serialization. The whole code-base should be using the same serializer and if a bug/security fix/enhancement is made, everyone benefits uniformly.
- the term goal of removing deps on JrJackson will be easier achieved by swapping the implementation under our own json serialization classes. Once we are ready, only a single place to change and the whole code base benefits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I like your reasoning. I'll change it.
require 'jrjackson' | ||
|
||
module LogStash module Filters module Dictionary | ||
class JsonHandler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that JsonHandler necessary? is there a net gain performance-wise versus using a straight json deserializer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this section is below a __END__
point.
It is recorded here in case I want to use it later.
I wanted to give json_file
the same "streaming" style of dictionary load as csv and yaml but there were no performance gains in using JrJackson 😢 and so I have a long term goal of removing deps on JrJackson in LS
applied.
I looked at whether the JSON gem had recently acquired "streaming" support but it has not.
An alternative is to use a similar Jackson construct as shown it the StreamingJson experiment PR I put up but that is the subject of another PR.
logstash-filter-translate.gemspec
Outdated
|
||
if RUBY_VERSION.start_with?("1") | ||
s.add_runtime_dependency 'rake', '~> 12.1.0' | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this needed for? I haven't seen that in other plugins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to do this when I manually installed the built gem on an existing LS 5.X.X install. 😕 .
Travis does not have this problem as it builds 5.6 from scratch.
The latest file input PR did this too but pinned it at 12.2.0
I am not sure where the runtime dependency on rake comes from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah my problem here is that if we start introducing such litlle differences if every plugin we work on it will become a mess. I really wish we could keep stuff like Rakefile, gemspec, build.gradle, etc as generic as possible. some of these problem are environment dependent and should probably be only fixed locally. when we see such a problem which is not only a local env problem we should create a separate issue which not only solve it for a specific plugin but for all plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True.
My concern is that some popular plugins will get installed back in older versions of LS and if we know that it will be a problem and we do nothing it will cause more user pain.
I don't yet know of a smooth answer to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove that and open an issue to find a proper solution to this problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
@guyboertje did a first pass review. Great refactor overall 👍 |
def initialize(hash, exact, regex) | ||
if exact | ||
if regex | ||
@fetch_strategy = FetchStrategy::Memory::ExactRegex.new(hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change this too if suggestion in file.rb
is acceptable.
No memory leak testing done as yet. |
@guyboertje it shouldn't be too hard to write dictionary reloading stress tests which should reveal quickly if there are any memory leaks. We could create a separate followup issue for this. I am focusing on this because this is a very typical place where such memory leaks happens. |
@@ -33,32 +36,26 @@ def self.create(path, refresh_interval, refresh_behaviour, exact, regex) | |||
def initialize(path, refresh_interval, exact, regex) | |||
@dictionary_path = path | |||
@refresh_interval = refresh_interval | |||
@short_refresh = @refresh_interval < 300.001 | |||
@short_refresh = @refresh_interval <= 300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we make explicit what the unit is for 300
, is that seconds?
else | ||
@fetch_strategy = FetchStrategy::File::Exact.new(@dictionary, rw_lock) | ||
end | ||
@fetch_strategy = regex ? FetchStrategy::File::ExactRegex.new(*args) : FetchStrategy::File::ExactRegex.new(*args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@field = ensure_reference_format(field) | ||
@destination = ensure_reference_format(destination) | ||
@fallback = fallback | ||
@use_fallback = !fallback.nil? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trick: can also be written as @use_fallback = !!fallback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have used that trick before. In this case, fallback being false
is legitimate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok! (but its a bit counter intuitive when reading that code that @use_fallback
would be true
when fallback
is false)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case @use_fallback
is true when the user supplied a fallback
value in the config, i.e. it is not nil.
I'll add a comment.
LGTM |
https://github.com/logstash-plugins/logstash-filter-translate/pull/67/files#diff-bf70891427c2690568e84ec2c794d12dR4 Is, I believe, breaking the functionality of this plugin for us - auto tests which were working now broken with no related change and the following error when trying to start the logstash service: Couldn't find any filter plugin named 'translate'. Are you sure this is correct? Trying to load the translate filter plugin resulted in this error: no such file to load -- logstash/util/loggable I presume there is now a dependency that I need to load? I see this is also now: |
iterate_on
with its three modes of single, array of strings and array of objects.