Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update code, use Rufus scheduler, load dictionary better, add "iterate_on" #67

Merged
merged 14 commits into from
Aug 14, 2018

Conversation

guyboertje
Copy link

@guyboertje guyboertje commented Jul 23, 2018

  • Move file based dictionary support to separate classes
  • Move scheduling to use Rufus Scheduler interval and simplify the read write locking.
  • Try to read (huge) CSV, JSON and YAML files in a "streaming" way, i.e. to not read the full file as a string then build a massive hash then update the dictionary hash, but to update the dictionary hash as a key value line is read from disk.
  • Abstract the config based dictionary into the same API as the file ones.
  • Optimise the three modes of query: exact, exact + regex and regex union.
  • Add iterate_on with its three modes of single, array of strings and array of objects.

@guyboertje guyboertje self-assigned this Jul 25, 2018
CHANGELOG.md Outdated
@@ -1,3 +1,11 @@
## 3.1.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are all these changes worthy of a minor bump maybe? any potential BWC issues?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think you are correct. There are no known BWC issues but there is a new feature.

the impact on throughput by having the refresh in the scheduler thread.
Any ongoing modification of the dictionary file should be done using a
copy/edit/rename or create/rename mechanism to avoid the refresh code from
processing half-baked dictionary content.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

end
end

def read_file_into_dictionary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't that be better using protected instead of private?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then initialize_for_file_type should be protected too, no?

# few intermediate objects as possible
# this overwrites the value at key
IO.foreach(@dictionary_path, :mode => 'r:bom|utf-8') do |line|
@io.string = line
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sur eI understand the purpose of this @io here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At line 7 I create one long lasting instance StringIO @io, at line 8 I pass it into the constructor of a long lasting instance of CSV @csv which memoizes the io. Then at line 16 I replace the backing string of @io this effectively resets the StringIO. I looked at the CSV code, when it gets a String arg, it wraps it in a discardable StringIO unless the arg is an IO object.
I could have done csv = CSV.new(line) on line 16, but processing a 200K+ line file would mean creating and GC many instances.

Before, the code looked like this.

  data = CSV.read(@dictionary_path).inject(Hash.new) do |acc, v|
    acc[v[0]] = v[1]
    acc
  end
  refresh_dictionary!(data)

For me the goal here is to support 2+ million line dictionaries because the jdbc_static has bad performance on that size of table. I want to be able to advise people to switch to this filter instead of using jdbc_static for simple key lookup. The seduction of using jdbc_static is not needing to prepare a local file. When this plugin gets the feature of loading a dictionary via JDBC (and later S3 perhaps) then it will be very useful, think Docker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, I missed the CSV.new(@io).

So I'd be curious to actually measure the actual GC impact here. I personally don't think it would make any noticeable difference. IMO such GC related optimizations are usually not necessary unless we are dealing with very hot code paths. For code outside the hot paths I'd suggest code correctness and readability.
OTOH sometimes simple changes can actually avoid superfluous object creation which can be beneficial if there is a valid reason for such specific optimizations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bear in mind that I am optimising for very large files.

def initialize(path, refresh_interval, exact, regex)
@dictionary_path = path
@refresh_interval = refresh_interval
@short_refresh = @refresh_interval < 300.001
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<= 300.0 ?

@short_refresh = @refresh_interval < 300.001
rw_lock = java.util.concurrent.locks.ReentrantReadWriteLock.new
@write_lock = rw_lock.writeLock
@dictionary = Hash.new()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Hash.new ?

@fetch_strategy = FetchStrategy::File::ExactRegex.new(@dictionary, rw_lock)
else
@fetch_strategy = FetchStrategy::File::Exact.new(@dictionary, rw_lock)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: could be a one-line using @fetch_strategy = regex ? ... : ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could do that. I was worried about line length readability. As each class has the same constructor signature I could return the class fetch_strategy_class and instantiate once.

      fetch_strategy_class = if exact
          regex ? FetchStrategy::File::ExactRegex : FetchStrategy::File::Exact
        else
          FetchStrategy::File::RegexUnion
        end
      @fetch_strategy = fetch_strategy_class.new(@dictionary, rw_lock)

Clearer? WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meh. I personally prefer the original code but then this is really minor so I leave it up to you.

else
@fetch_strategy = FetchStrategy::File::RegexUnion.new(@dictionary, rw_lock)
end
load_dictionary(raise_exception)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of declaring raise_exception = true above for the only purpose of passing it here by name, you could also remove the above statement and just do

load_dictionary(raise_exception = true)

if the intent is to be explicit on the parameter purpose. otherwise

load_dictionary(true)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was from before and I think the intention is to reveal the meaning behind the boolean.
I think I should add a comment.

load_dictionary(true) # true here means raise an exception once on initial load and not thereafter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup or a constant or as I suggested just load_dictionary(raise_exception = true) which is a poor's man named parameters. It is completely useless code-wise but serves the purpose of describing the intent of the boolean param.

def loading_exception(e, raise_exception)
msg = "Translate: #{e.message} when loading dictionary file at #{@dictionary_path}"
if raise_exception
raise RuntimeError.new(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure RuntimeError is appropriate here? I'd suggest creating a custom FileError < StandardError ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
Previous code.


def read_file_into_dictionary
content = IO.read(@dictionary_path, :mode => 'r:bom|utf-8')
@dictionary.update(::JSON.load(content)) unless content.nil? || content.empty?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shoud be using LogStash::Json.load

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 reasons to leave as is:

  1. I benchmarked both - no significant speed up to LogStash::Json.
  2. I have a long term goal of removing deps on JrJackson in LS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, reasons:

  • we specifically added the json serialization classes to avoid discrepancies across the LS code base and the plugins for json serialization. The whole code-base should be using the same serializer and if a bug/security fix/enhancement is made, everyone benefits uniformly.
  • the term goal of removing deps on JrJackson will be easier achieved by swapping the implementation under our own json serialization classes. Once we are ready, only a single place to change and the whole code base benefits.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I like your reasoning. I'll change it.

require 'jrjackson'

module LogStash module Filters module Dictionary
class JsonHandler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that JsonHandler necessary? is there a net gain performance-wise versus using a straight json deserializer?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this section is below a __END__ point.

It is recorded here in case I want to use it later.

I wanted to give json_file the same "streaming" style of dictionary load as csv and yaml but there were no performance gains in using JrJackson 😢 and so I have a long term goal of removing deps on JrJackson in LS applied.

I looked at whether the JSON gem had recently acquired "streaming" support but it has not.

An alternative is to use a similar Jackson construct as shown it the StreamingJson experiment PR I put up but that is the subject of another PR.


if RUBY_VERSION.start_with?("1")
s.add_runtime_dependency 'rake', '~> 12.1.0'
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this needed for? I haven't seen that in other plugins?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to do this when I manually installed the built gem on an existing LS 5.X.X install. 😕 .
Travis does not have this problem as it builds 5.6 from scratch.
The latest file input PR did this too but pinned it at 12.2.0
I am not sure where the runtime dependency on rake comes from.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah my problem here is that if we start introducing such litlle differences if every plugin we work on it will become a mess. I really wish we could keep stuff like Rakefile, gemspec, build.gradle, etc as generic as possible. some of these problem are environment dependent and should probably be only fixed locally. when we see such a problem which is not only a local env problem we should create a separate issue which not only solve it for a specific plugin but for all plugins.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True.
My concern is that some popular plugins will get installed back in older versions of LS and if we know that it will be a problem and we do nothing it will cause more user pain.

I don't yet know of a smooth answer to this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove that and open an issue to find a proper solution to this problem?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@colinsurprenant
Copy link
Contributor

colinsurprenant commented Jul 26, 2018

@guyboertje did a first pass review. Great refactor overall 👍
Have we made any memory leaks testing to make sure reloading/refreshing dictionaries does not leak?

def initialize(hash, exact, regex)
if exact
if regex
@fetch_strategy = FetchStrategy::Memory::ExactRegex.new(hash)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change this too if suggestion in file.rb is acceptable.

@guyboertje
Copy link
Author

No memory leak testing done as yet.

@colinsurprenant
Copy link
Contributor

@guyboertje it shouldn't be too hard to write dictionary reloading stress tests which should reveal quickly if there are any memory leaks. We could create a separate followup issue for this. I am focusing on this because this is a very typical place where such memory leaks happens.

@@ -33,32 +36,26 @@ def self.create(path, refresh_interval, refresh_behaviour, exact, regex)
def initialize(path, refresh_interval, exact, regex)
@dictionary_path = path
@refresh_interval = refresh_interval
@short_refresh = @refresh_interval < 300.001
@short_refresh = @refresh_interval <= 300
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we make explicit what the unit is for 300, is that seconds?

else
@fetch_strategy = FetchStrategy::File::Exact.new(@dictionary, rw_lock)
end
@fetch_strategy = regex ? FetchStrategy::File::ExactRegex.new(*args) : FetchStrategy::File::ExactRegex.new(*args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@field = ensure_reference_format(field)
@destination = ensure_reference_format(destination)
@fallback = fallback
@use_fallback = !fallback.nil?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trick: can also be written as @use_fallback = !!fallback

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have used that trick before. In this case, fallback being false is legitimate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok! (but its a bit counter intuitive when reading that code that @use_fallback would be true when fallback is false)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case @use_fallback is true when the user supplied a fallback value in the config, i.e. it is not nil.
I'll add a comment.

@colinsurprenant
Copy link
Contributor

LGTM

@guyboertje guyboertje merged commit 675e130 into logstash-plugins:master Aug 14, 2018
@guyboertje guyboertje deleted the fix/65-66 branch August 14, 2018 07:10
@ccayg-sainsburys
Copy link

ccayg-sainsburys commented Aug 15, 2018

https://github.com/logstash-plugins/logstash-filter-translate/pull/67/files#diff-bf70891427c2690568e84ec2c794d12dR4
(require "logstash/util/loggable")

Is, I believe, breaking the functionality of this plugin for us - auto tests which were working now broken with no related change and the following error when trying to start the logstash service:

Couldn't find any filter plugin named 'translate'. Are you sure this is correct? Trying to load the translate filter plugin resulted in this error: no such file to load -- logstash/util/loggable

I presume there is now a dependency that I need to load?
But I'm not clear what?

I see this is also now:
#69

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants