Update code, use Rufus scheduler, load dictionary better, add "iterate_on" #67

guyboertje · 2018-07-23T21:08:15Z

Move file based dictionary support to separate classes
Move scheduling to use Rufus Scheduler interval and simplify the read write locking.
Try to read (huge) CSV, JSON and YAML files in a "streaming" way, i.e. to not read the full file as a string then build a massive hash then update the dictionary hash, but to update the dictionary hash as a key value line is read from disk.
Abstract the config based dictionary into the same API as the file ones.
Optimise the three modes of query: exact, exact + regex and regex union.
Add iterate_on with its three modes of single, array of strings and array of objects.
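The three iterate_on shapes listed above can be pictured with a minimal sketch. This is not the plugin's implementation: the helper names, the `field`/`destination` keys, and the fallback handling are all hypothetical, chosen only to show a single value, an array of strings, and an array of objects being looked up in the same dictionary.

```ruby
# Hypothetical helpers illustrating the three input shapes; not plugin code.

# Mode 1: a single value is looked up directly.
def translate_value(dictionary, value, fallback = nil)
  dictionary.fetch(value.to_s, fallback)
end

# Mode 2: an array of strings, each element is looked up in turn.
def translate_strings(dictionary, values, fallback = nil)
  values.map { |v| translate_value(dictionary, v, fallback) }
end

# Mode 3: an array of objects, one key of each object is looked up and the
# result is stored under another key of the same object.
def translate_objects(dictionary, objects, field, destination, fallback = nil)
  objects.map do |obj|
    obj.merge(destination => translate_value(dictionary, obj[field], fallback))
  end
end
```

All three modes share one lookup, so exact, regex, or fallback behaviour only needs to be decided in a single place.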

colinsurprenant · 2018-07-26T17:36:38Z

CHANGELOG.md

@@ -1,3 +1,11 @@
+## 3.1.1


are all these changes worthy of a minor bump maybe? any potential BWC issues?

Actually I think you are correct. There are no known BWC issues but there is a new feature.

colinsurprenant · 2018-07-26T17:39:29Z

docs/index.asciidoc

+the impact on throughput by having the refresh in the scheduler thread.
+Any ongoing modification of the dictionary file should be done using a
+copy/edit/rename or create/rename mechanism to avoid the refresh code from
+processing half-baked dictionary content.
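A minimal sketch of the create/rename approach the docs recommend, assuming the new dictionary content is already in memory. The helper name is hypothetical. The temp file is created in the same directory as the live file so that the rename stays within one filesystem, where POSIX rename replaces the target atomically.

```ruby
require 'tempfile'

# Hypothetical helper: write the new content to a temp file next to the live
# dictionary, then rename it over the live path so a reader never sees a
# partially written file.
def replace_dictionary_atomically(path, content)
  dir = File.dirname(path)
  tmp = Tempfile.create(['dictionary', File.extname(path)], dir)
  begin
    tmp.write(content)
    tmp.flush
    tmp.fsync # push the bytes to disk before the rename makes them visible
  ensure
    tmp.close
  end
  File.rename(tmp.path, path)
end
```

A scheduled refresh that fires between the write and the rename still reads the old, complete file.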


colinsurprenant · 2018-07-26T18:04:21Z

lib/logstash/filters/dictionary/file.rb

+      end
+    end
+
+    def read_file_into_dictionary


shouldn't that be better using protected instead of private?

Then initialize_for_file_type should be protected too, no?

colinsurprenant · 2018-07-26T18:07:07Z

lib/logstash/filters/dictionary/csv_file.rb

+      # few intermediate objects as possible
+      # this overwrites the value at key
+      IO.foreach(@dictionary_path, :mode => 'r:bom|utf-8') do |line|
+        @io.string = line


not sure I understand the purpose of this @io here?

At line 7 I create one long-lasting StringIO instance, @io; at line 8 I pass it into the constructor of a long-lasting CSV instance, @csv, which memoizes the io. Then at line 16 I replace the backing string of @io, which effectively resets the StringIO. I looked at the CSV code: when it gets a String arg, it wraps it in a discardable StringIO unless the arg is an IO object.
I could have done csv = CSV.new(line) on line 16, but processing a 200K+ line file would mean creating and GC'ing many instances.

Before, the code looked like this.

data = CSV.read(@dictionary_path).inject(Hash.new) do |acc, v|
  acc[v[0]] = v[1]
  acc
end
refresh_dictionary!(data)

For me the goal here is to support 2+ million line dictionaries because the jdbc_static has bad performance on that size of table. I want to be able to advise people to switch to this filter instead of using jdbc_static for simple key lookup. The seduction of using jdbc_static is not needing to prepare a local file. When this plugin gets the feature of loading a dictionary via JDBC (and later S3 perhaps) then it will be very useful, think Docker.

ah yes, I missed the CSV.new(@io).

So I'd be curious to actually measure the actual GC impact here. I personally don't think it would make any noticeable difference. IMO such GC related optimizations are usually not necessary unless we are dealing with very hot code paths. For code outside the hot paths I'd suggest code correctness and readability.
OTOH sometimes simple changes can actually avoid superfluous object creation which can be beneficial if there is a valid reason for such specific optimizations.

Bear in mind that I am optimising for very large files.
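To make the trade-off in this thread concrete, here is a minimal sketch contrasting an eager load with a row-at-a-time load. It uses the standard library's CSV.foreach rather than the reused StringIO/CSV pair from the PR, so it illustrates the "update the hash as each line is read" idea, not the exact allocation pattern debated above. Both method names are hypothetical.

```ruby
require 'csv'

# Eager: CSV.read materialises every row in an array before the hash is built.
def load_csv_eagerly(path)
  CSV.read(path).each_with_object({}) { |row, acc| acc[row[0]] = row[1] }
end

# Streaming: each row updates the target hash as it is read from disk, so no
# array of all rows and no intermediate hash is ever held in memory.
# A later row with the same key overwrites the earlier value.
def load_csv_streaming(path, dictionary)
  CSV.foreach(path) { |row| dictionary[row[0]] = row[1] }
  dictionary
end
```

Whether the difference matters depends on file size; measuring both on a representative dictionary is the only way to settle the GC question raised above.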

colinsurprenant · 2018-07-26T18:11:35Z

lib/logstash/filters/dictionary/file.rb

+    def initialize(path, refresh_interval, exact, regex)
+      @dictionary_path = path
+      @refresh_interval = refresh_interval
+      @short_refresh = @refresh_interval < 300.001


<= 300.0 ?

colinsurprenant · 2018-07-26T18:12:01Z

lib/logstash/filters/dictionary/file.rb

+      @short_refresh = @refresh_interval < 300.001
+      rw_lock = java.util.concurrent.locks.ReentrantReadWriteLock.new
+      @write_lock = rw_lock.writeLock
+      @dictionary = Hash.new()


minor: Hash.new ?

colinsurprenant · 2018-07-26T18:13:28Z

lib/logstash/filters/dictionary/file.rb

+          @fetch_strategy = FetchStrategy::File::ExactRegex.new(@dictionary, rw_lock)
+        else
+          @fetch_strategy = FetchStrategy::File::Exact.new(@dictionary, rw_lock)
+        end


minor: could be a one-line using @fetch_strategy = regex ? ... : ...

I could do that. I was worried about line length readability. As each class has the same constructor signature I could return the class fetch_strategy_class and instantiate once.

fetch_strategy_class = if exact
  regex ? FetchStrategy::File::ExactRegex : FetchStrategy::File::Exact
else
  FetchStrategy::File::RegexUnion
end
@fetch_strategy = fetch_strategy_class.new(@dictionary, rw_lock)

Clearer? WDYT?

meh. I personally prefer the original code but then this is really minor so I leave it up to you.

colinsurprenant · 2018-07-26T18:17:04Z

lib/logstash/filters/dictionary/file.rb

+      else
+        @fetch_strategy = FetchStrategy::File::RegexUnion.new(@dictionary, rw_lock)
+      end
+      load_dictionary(raise_exception)


instead of declaring raise_exception = true above for the only purpose of passing it here by name, you could also remove the above statement and just do

load_dictionary(raise_exception = true)

if the intent is to be explicit on the parameter purpose. otherwise

load_dictionary(true)

This code was from before and I think the intention is to reveal the meaning behind the boolean.
I think I should add a comment.

load_dictionary(true) # true here means raise an exception once on initial load and not thereafter

yup, or a constant, or as I suggested just load_dictionary(raise_exception = true), which is a poor man's named parameter. It is completely useless code-wise but serves the purpose of describing the intent of the boolean param.
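Another way to name the boolean, not proposed in the thread, is a Ruby keyword argument, which documents intent at the call site and is checked by the interpreter. The sketch below is a hypothetical stand-in, not the plugin's File class: `DictionaryLoader` and its callable `source` are assumptions made for illustration.

```ruby
# Hypothetical stand-in for the loader discussed above.
class DictionaryLoader
  attr_reader :entries

  # source is any callable that returns a Hash (assumption for this sketch).
  def initialize(source)
    @source = source
    @entries = {}
  end

  # raise_exception: true  -> propagate load failures (e.g. on initial load)
  # raise_exception: false -> log and keep the previous entries (on refresh)
  def load_dictionary(raise_exception: false)
    @entries = @source.call
    true
  rescue StandardError => e
    raise if raise_exception
    warn "dictionary load failed, keeping previous entries: #{e.message}"
    false
  end
end
```

A call then reads as `loader.load_dictionary(raise_exception: true)`, with no throwaway local variable.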

colinsurprenant · 2018-07-26T18:21:45Z

lib/logstash/filters/dictionary/file.rb

+    def loading_exception(e, raise_exception)
+      msg = "Translate: #{e.message} when loading dictionary file at #{@dictionary_path}"
+      if raise_exception
+        raise RuntimeError.new(msg)


not sure RuntimeError is appropriate here? I'd suggest creating a custom FileError < StandardError ?

👍
Previous code.
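A minimal sketch of the suggested custom error, assuming a free-standing method that takes the path as an argument. The `Dictionary` module name and the `path` parameter are hypothetical; the message format is copied from the diff above.

```ruby
module Dictionary
  # Dedicated error type so callers can rescue dictionary load failures
  # without also catching unrelated RuntimeErrors.
  class FileError < StandardError; end
end

# Mirrors the shape of loading_exception in the diff; path is passed in here
# instead of being read from an instance variable.
def loading_exception(e, path, raise_exception)
  msg = "Translate: #{e.message} when loading dictionary file at #{path}"
  raise Dictionary::FileError, msg if raise_exception
  warn msg
end
```

Because FileError descends from StandardError, a bare `rescue => e` elsewhere still catches it.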

colinsurprenant · 2018-07-26T18:23:17Z

lib/logstash/filters/dictionary/json_file.rb

+
+    def read_file_into_dictionary
+      content = IO.read(@dictionary_path, :mode => 'r:bom|utf-8')
+      @dictionary.update(::JSON.load(content)) unless content.nil? || content.empty?


we should be using LogStash::Json.load

2 reasons to leave as is:

I benchmarked both - no significant speed up to LogStash::Json.

I have a long term goal of removing deps on JrJackson in LS.

I disagree, reasons:

we specifically added the json serialization classes to avoid discrepancies across the LS code base and the plugins for json serialization. The whole code-base should be using the same serializer and if a bug/security fix/enhancement is made, everyone benefits uniformly.

the long-term goal of removing deps on JrJackson will be more easily achieved by swapping the implementation under our own json serialization classes. Once we are ready, there is only a single place to change and the whole code base benefits.

OK, I like your reasoning. I'll change it.
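The argument for a single serialization entry point can be sketched as a small facade. `AppJson` below is a hypothetical stand-in backed by the standard library's JSON, not LogStash::Json itself; the point is only that callers depend on the facade, so the backing parser can change in one place.

```ruby
require 'json'

# Hypothetical facade: every caller goes through AppJson, so swapping the
# underlying parser later touches only this module.
module AppJson
  def self.load(text)
    JSON.parse(text)
  end

  def self.dump(object)
    JSON.generate(object)
  end
end

# Same guard as in the diff: skip nil or empty content, otherwise merge the
# parsed key/value pairs into the existing dictionary.
def read_json_into(dictionary, content)
  return dictionary if content.nil? || content.empty?

  dictionary.update(AppJson.load(content))
end
```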

colinsurprenant · 2018-07-26T18:26:16Z

lib/logstash/filters/dictionary/json_file.rb

+require 'jrjackson'
+
+module LogStash module Filters module Dictionary
+  class JsonHandler


is that JsonHandler necessary? is there a net gain performance-wise versus using a straight json deserializer?

All of this section is below a __END__ point.

It is recorded here in case I want to use it later.

I wanted to give json_file the same "streaming" style of dictionary load as csv and yaml, but there were no performance gains in using JrJackson 😢, so my long-term goal of removing deps on JrJackson in LS took precedence.

I looked at whether the JSON gem had recently acquired "streaming" support but it has not.

An alternative is to use a similar Jackson construct as shown in the StreamingJson experiment PR I put up, but that is the subject of another PR.

colinsurprenant · 2018-07-26T18:31:08Z

logstash-filter-translate.gemspec

+
+  if RUBY_VERSION.start_with?("1")
+    s.add_runtime_dependency 'rake', '~> 12.1.0'
+  end


what is this needed for? I haven't seen that in other plugins?

I needed to do this when I manually installed the built gem on an existing LS 5.X.X install. 😕
Travis does not have this problem as it builds 5.6 from scratch.
The latest file input PR did this too but pinned it at 12.2.0
I am not sure where the runtime dependency on rake comes from.

yeah, my problem here is that if we start introducing such little differences in every plugin we work on, it will become a mess. I really wish we could keep stuff like Rakefile, gemspec, build.gradle, etc. as generic as possible. some of these problems are environment dependent and should probably only be fixed locally. when we see such a problem which is not only a local env problem, we should create a separate issue which solves it not just for a specific plugin but for all plugins.

True.
My concern is that some popular plugins will get installed back in older versions of LS and if we know that it will be a problem and we do nothing it will cause more user pain.

I don't yet know of a smooth answer to this.

can we remove that and open an issue to find a proper solution to this problem?

colinsurprenant · 2018-07-26T18:33:23Z

@guyboertje did a first pass review. Great refactor overall 👍
Have we done any memory leak testing to make sure reloading/refreshing dictionaries does not leak?

guyboertje · 2018-07-27T10:00:12Z

lib/logstash/filters/dictionary/memory.rb

+    def initialize(hash, exact, regex)
+      if exact
+        if regex
+          @fetch_strategy = FetchStrategy::Memory::ExactRegex.new(hash)


Will change this too if suggestion in file.rb is acceptable.

guyboertje · 2018-08-06T20:29:23Z

No memory leak testing done as yet.

colinsurprenant · 2018-08-07T16:58:28Z

@guyboertje it shouldn't be too hard to write dictionary reloading stress tests which should quickly reveal any memory leaks. We could create a separate followup issue for this. I am focusing on this because this is a very typical place where such memory leaks happen.
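A reload stress test along these lines could start from a sketch like the following. It is an assumption-laden outline, not the plugin's test suite: the helper names are hypothetical, the dictionary is a plain Hash refilled in place, and `GC.stat(:heap_live_slots)` is an MRI statistic that may not be available in the same form on JRuby, where a heap dump or JVM tooling would be the equivalent.

```ruby
# Hypothetical reload: clear the dictionary and refill it with fresh strings,
# standing in for one refresh cycle of a file-backed dictionary.
def reload!(dictionary, generation, entries)
  dictionary.clear
  entries.times { |n| dictionary["key-#{n}"] = "value-#{generation}-#{n}" }
end

# Run many reloads, force a GC, and report how many object slots are live.
# If reloading leaks, this number keeps growing as reloads increases.
def live_slots_after(dictionary, reloads:, entries:)
  reloads.times { |generation| reload!(dictionary, generation, entries) }
  GC.start
  GC.stat(:heap_live_slots)
end
```

Comparing the result for, say, 10 reloads against 1,000 reloads of the same dictionary gives a first signal; roughly equal numbers suggest old entries are being collected.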

colinsurprenant · 2018-08-07T17:06:31Z

lib/logstash/filters/dictionary/file.rb

@@ -33,32 +36,26 @@ def self.create(path, refresh_interval, refresh_behaviour, exact, regex)
    def initialize(path, refresh_interval, exact, regex)
      @dictionary_path = path
      @refresh_interval = refresh_interval
-      @short_refresh = @refresh_interval < 300.001
+      @short_refresh = @refresh_interval <= 300


I suggest we make explicit what the unit is for 300, is that seconds?
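One way to make the unit explicit is a named constant. The sketch below assumes the interval is in seconds, which the question above leaves open; the constant and method names are hypothetical.

```ruby
# Assumption: refresh intervals are expressed in seconds.
SHORT_REFRESH_THRESHOLD_SECONDS = 300

# True when the interval is at or below the threshold.
def short_refresh?(refresh_interval_seconds)
  refresh_interval_seconds <= SHORT_REFRESH_THRESHOLD_SECONDS
end
```

With `<=` against an integer, the earlier `< 300.001` workaround for float comparison is no longer needed.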

colinsurprenant · 2018-08-07T17:07:32Z

lib/logstash/filters/dictionary/file.rb

-        else
-          @fetch_strategy = FetchStrategy::File::Exact.new(@dictionary, rw_lock)
-        end
+        @fetch_strategy = regex ? FetchStrategy::File::ExactRegex.new(*args) : FetchStrategy::File::Exact.new(*args)


colinsurprenant · 2018-08-07T17:41:13Z

lib/logstash/filters/array_of_maps_value_update.rb

+      @field = ensure_reference_format(field)
+      @destination = ensure_reference_format(destination)
+      @fallback = fallback
+      @use_fallback = !fallback.nil?


trick: can also be written as @use_fallback = !!fallback

I have used that trick before. In this case, fallback being false is legitimate.

ok! (but it's a bit counterintuitive when reading that code that @use_fallback would be true when fallback is false)

In this case @use_fallback is true when the user supplied a fallback value in the config, i.e. it is not nil.
I'll add a comment.
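The difference between the two spellings only shows up when the fallback is `false`, which is the case raised in this thread. A minimal sketch, with hypothetical method names:

```ruby
# True whenever the user supplied any fallback, including false.
def use_fallback_nil_check(fallback)
  !fallback.nil?
end

# True only for truthy fallbacks; a configured false is treated as "not set".
def use_fallback_double_bang(fallback)
  !!fallback
end
```

For `nil` and for any string both return the same answer; for `false` the nil check returns true and the double bang returns false.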

colinsurprenant · 2018-08-13T20:01:38Z

LGTM

ccayg-sainsburys · 2018-08-15T14:00:41Z

https://github.com/logstash-plugins/logstash-filter-translate/pull/67/files#diff-bf70891427c2690568e84ec2c794d12dR4
(require "logstash/util/loggable")

Is, I believe, breaking the functionality of this plugin for us - auto tests which were working now broken with no related change and the following error when trying to start the logstash service:

Couldn't find any filter plugin named 'translate'. Are you sure this is correct? Trying to load the translate filter plugin resulted in this error: no such file to load -- logstash/util/loggable

I presume there is now a dependency that I need to load?
But I'm not clear what?

I see this is also now:
#69

Guy Boertje added 9 commits July 20, 2018 15:20

add class per dictionary type, add rufus scheduler, add update strategies

1b99cd5

benchmarking

4832ab7

take out STDERR.puts...

07f28e3

better expectation

f052fa8

add new, register, filter BM

6df5af4

tidy

8ec047e

rename foreach to iterate_on

ca661d0

last clean up

0c975d7

split out fetch_strategies

cc746ff

guyboertje added the needs reviewing label Jul 23, 2018

guyboertje mentioned this pull request Jul 23, 2018

Enhance multi-field lookup enrichment #44

Open

Guy Boertje added 2 commits July 23, 2018 23:14

Hmm, JRuby 1.7.2X file modification times are not so fine grained.

aac6ccc

improvements coming from benchmarking

b3b5ed0

guyboertje self-assigned this Jul 25, 2018

guyboertje requested a review from colinsurprenant July 25, 2018 16:24

colinsurprenant reviewed Jul 26, 2018

guyboertje commented Jul 27, 2018

jsvd removed the needs reviewing label Jul 30, 2018

updates from review.

dfb8471

colinsurprenant reviewed Aug 7, 2018

remove rake pin for JRuby 1.7.27

18c177e

colinsurprenant approved these changes Aug 13, 2018

clarify use_fallback

e11aa9a

guyboertje mentioned this pull request Aug 14, 2018

Translate for multi values fields #19

Closed

guyboertje merged commit 675e130 into logstash-plugins:master Aug 14, 2018

guyboertje deleted the fix/65-66 branch August 14, 2018 07:10

ccayg-sainsburys mentioned this pull request Aug 15, 2018

LS 2.4.1 - can't use translate filter after v3.2.0 #69

Open

guyboertje mentioned this pull request Aug 24, 2018

Add feature to identify object to translate by tag, or add wildcard/regex suppport for field value #26

Closed
