
Option to just read from files (without continuously watching them) #48

Closed
ppf2 opened this issue May 28, 2015 · 47 comments

@ppf2
Member

ppf2 commented May 28, 2015

When using the file input against files that are static (will not be appended to anymore), it would be nice to provide an option to exit the LS pipeline once the files have been read (instead of keeping the process running and watching for new data). This would allow the end user to schedule periodic runs that read from files and exit when done.

@jordansissel
Contributor

For this it may make sense to have a new input. I wonder if it's too confusing to provide "follow files forever" behavior in the same plugin, with nearly the same settings, as "read files once, from the beginning" behavior.

@ppf2
Member Author

ppf2 commented May 28, 2015

👍

@jordansissel
Contributor

Also, a confession from an implementor's perspective, the code to "Find some files, read them until EOF, then be done" is way simpler than the file input is today. :P

@MarkusMayer

+1 for this feature.
We are importing many (approx. 1 million/day) static files into ELK (on Windows). The files are written once and never appended. I assume that sincedb/watch causes a lot of overhead that is unnecessary for such a use case (sincedb_path => "/dev/null" does not work on Windows).

@jordansissel
Contributor

I am +1 on this feature. Probably a new plugin, for simplicity of interface. I'm not sure what we should call it.

Any ideas? I'm open to making the file input do this as well, but I need convincing, since this new feature won't need sincedb or the other machinery the file input uses today.

@guyboertje
Contributor

I am keen to create a simpler plugin called, maybe, logstash-input-read_file. It should reuse most of the current file input but "know" that it is not tailing.

@yehosef

yehosef commented Jan 25, 2016

It's not clear why this needs to be a different plugin. While it's definitely easier than the read-and-watch case, once you've solved that case, adding this logic doesn't seem complicated. Somewhere in the code of the current file plugin it's going to say "if EOF, wait() and try_again()" - you could just have a config called eof_action which does what you want (delete/mv are the basic use cases - you could start there - maybe add run_script to cover everything else).
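
A minimal sketch of what that could look like, assuming a hypothetical eof_action setting (this setting does not exist in the plugin; the path is a placeholder):

input {
  file {
    path => "/data/incoming/*.log"  # placeholder glob
    eof_action => "delete"          # hypothetical: delete each file once fully read
  }
}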

The advantage of this approach is that it's simpler for me to know that I have one plugin that can do different things than to keep track of how each file is going to be used and use different plugins to do almost the same thing. It also seems that you'd have to duplicate a lot of the file logic in two plugins instead of providing a pretty trivial hook into the current plugin.

I haven't looked through the code - java/ruby isn't my strong point - but it seems like we shouldn't have to go here.

@guyboertje
Contributor

What would your guess be for the percentages of use cases: tail only, read only, and mixed (tail some, read some)? My guess is that it may be as much as 49.5, 49.5, 1.

I did start a branch with a mode config of tail, read, but I stopped work on this because most of the heavy lifting in this input is done in the filewatch library. It is heavily optimised for the tail use case.
When we open a file to read it, very occasionally, if the timing is perfect, we see an empty file - we get an EOF then. Obviously this depends on how the user puts the files in the watched folder.
In the read case I would like to support zipped files and breadth- or depth-first operations. Breadth first means reading 32K from each file in turn until 'done'.

In the tail case, assume for a moment that there are 5000 files detected from the glob and they have all just been rotated and so are empty. The current tail implementation will open all 5000, see that they are empty, and loop; we do not open-read-close now. In time, as content is appended, we will loop through each file reading what we can, and there is no logical EOF. We can never consider it 'done'.

In the read case, assume the same number of content-complete files, 5000, but they are > 800K in size. If we reuse the same filewatch lib, we will open each one in turn and begin reading to EOF. It will be many minutes before we get to the 5000th file. Further, because filewatch is tailing, we will monitor those 'done' files for changes every stat_interval, and the 5000 files will stay open. This is why we are introducing the ignore_older and close_older configs and the auto_flush config on the multiline codec, but actually they are hacks to help support the read use case. If the user has, say, 20 file inputs operating on different folders, when all inputs are opening all of their files, one can run out of file handles, and then many things start going wrong, e.g. the sincedb can't be saved and filters/outputs that use files or sockets begin to fail. I suspect that some of the 'can't recreate' weirdness issues may be related to this.
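
For context, a sketch of the ignore_older and close_older settings mentioned above, assuming values in seconds (as the settings took at the time) and a placeholder path:

input {
  file {
    path => "/var/log/app/*.log"  # placeholder glob
    ignore_older => 86400         # skip files last modified more than a day ago
    close_older => 3600           # close handles on files with no new content for an hour
  }
}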

A specialised readfile input using a read-optimised filewatch can do things much more efficiently. We have no need for stat_interval or start_position. We can have a scan_mode for depth or breadth first. We know we only need to stat, open, read in a loop, and close, and when 'done' we can 'tidy up' - flush buffered multilines and 'signal' that we are done. We can operate over a smaller bunch of files, say 256 at a time, and therefore support reading many, many thousands of files in multiple readfile inputs with a low file-handle impact and lower memory usage.
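
A rough sketch of the interface being described here, using the readfile plugin name and scan_mode setting from this proposal (illustrative only; none of these names shipped as-is):

input {
  readfile {
    path => "/data/complete/*.log"  # placeholder glob
    scan_mode => "breadth"          # proposed: read a 32K block from each file in turn
  }
}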

IMO, we would not want a synchronous eof_action that runs a script or operates on a file - we just need to signal to the user that we are done with a file then the user can action this out-of-band - presuming that we can know what 'done' means.

@yehosef

yehosef commented Jan 25, 2016

@guyboertje
Thanks for the explanation - I hear some of the challenges of a combined implementation. Here are my thoughts.

For files that are being appended to or rotated, things should work as they do now - as messy as that is. These files won't have an eof_action and will work as they do today.

There will be other file globs with config that specifies an eof_action. They will read until EOF and then do whatever the eof_action says to do. Here, the main problem (as I see it) is that if I want to be able to add complete files into a directory, it takes time before a file is fully copied, and I might hit EOF while it's really still being copied. Here the stat_interval might be helpful - you might hit EOF, but you would wait until the next stat interval (or two?) to make sure the file didn't change. You could also handle this by just renaming the file when finished - file.json.tmp -> file.json - that's basically instant and avoids the problem. Not sure how big of an issue this is.

I think it is important to manage the case where Logstash dies or is killed - so I assume you do need file/position state as well - I think it's important that's not overlooked.

It seems that some of the problems related to a high number of files will automatically be taken care of - e.g. you probably won't have to handle 5000 files, because they'll be moved out when finished. You could even support an empty handler that just says "ignore additional changes" - don't watch - and let outside scripts handle it.

About the script/action - I agree it shouldn't be blocking. If you don't need the output (which I assume you don't), you can just fork the process and let it do its thing. But there are many options for how to handle this and what to support - I was just suggesting something easy and flexible, and I am sure there are other/better options.

But all of this is really just implementation detail - if in the end it is actually better, stronger, faster and able to process large log files in a single bound.. :) and it happens sooner rather than later - I'm all for it.

@jordansissel
Contributor

> Somewhere in the code of the current file plugin it's going to say "if EOF wait() and try_again()"

I wish it were this simple, but it's not clear how everyone will be doing these batch file deliveries. For example, if Logstash is always running and watching for new files to process, and a delivery of a single file that will eventually be 500GB begins, Logstash may see that file as new before the first few hundred megs are written, and it is likely that Logstash catches up to the writer, say at 500MB (in this fictional example), and gets EOF. With this approach of EOF meaning "I hit the end of the file", we would miss the remaining 499.5GB of data!

Further, our incentive to make a new plugin is partly because the use cases of files-as-a-stream and files-read-once have pretty different configuration needs. Streams need logrotation detection, file rename handling, etc. Read-once needs EOF actions and won't deal with renames or log rotation -- so if we combined these behaviors into a single plugin, you'd have a bunch of settings that are invalid for entire use cases. For example, "delete the file at EOF" wouldn't be something useful for files-as-streams, since there's no "end" to such streams.

I'm open to having it be one plugin, but not if there are so many mutually-exclusive configuration settings.

@yehosef

yehosef commented Jan 26, 2016

The case you mention, where the Logstash processing overruns the filesystem copy, I also mentioned. As I said, you could handle it in Logstash by waiting for changes after an EOF to see if more come. If after the stat_interval (or two or three) there are no more changes, that's considered the EOF and whatever is supposed to happen, happens.

I'm curious how you're going to handle this case with the proposed filereader plugin - why will it be simpler - won't you also need to poll and try again?

I looked through the docs for the file plugin, and the only option that doesn't make sense for the batch use case is start_position - you obviously want to start at the beginning. As I mentioned, I think you will still need the sincedb, because you need to know where you're up to in case Logstash dies. Or even if it doesn't die: if the file size is changing because the file is being copied in, you need to know where you're up to. Which other options are not applicable to the batch process?

@guyboertje
Contributor

@jordansissel, @yehosef - a tie-breaker argument for a different plugin is that, being in a new GH repo, the issues and PRs will be segregated. We will know that a logged issue pertains to the read use case only. Testing and manual fix verification only need to consider the read use case. We will not have to be concerned about whether a change for one affects the performance of the other.

@vmorris

vmorris commented Feb 23, 2016

+1 for this feature, just ran headlong into this.

@guyboertje
Contributor

@vmorris: I am working on a readfile input that will have many new features:

  • prioritise the order in which files are processed
  • better detection of files already seen
  • signal when done with a file
  • striped read in blocks across files (breadth first) option

@suyograo added the P3 label Apr 26, 2016
@thenom

thenom commented Jul 25, 2016

Hi, this is something I would be interested in. If we want to just process a file for testing purposes, then we would like the logstash process to end when it has finished reading. Is this something that this plugin could do?

@guyboertje
Contributor

We intend to have a signal mechanism that could be listened to and used to trigger an LS shutdown. The input plugin itself would not end the LS process.

@MarkusMayer

Hi,
do you have any rough ETA on the readfile input?
Thanks in advance,
Markus

@guyboertje
Contributor

@MarkusMayer - unfortunately no. I have a very big PR to filewatch pending that affects the current file input and the future readfile input. When this PR is merged I can proceed.

@GabrielUlici

Any news about the new plugin?

@ameade

ameade commented Nov 22, 2016

I need this, as I have files that do not end with a newline, so Logstash never processes the last line. Can't find a workaround after a few hours.

@mrkwtz

mrkwtz commented Nov 23, 2016

Great to see that this is a feature which is going to be developed. It would be perfect if I could say something like:

if you get an event with a specific flag, then close the file stream for this file and flag it as "done".

@jsarmento

Nice to know about this plugin. I'm having a huge headache that this new plugin could "cure" :)
Any idea on when it's going to be available? 👍

@vicvega

vicvega commented Mar 6, 2017

any news?

@guyboertje
Contributor

Closing

@PhaedrusTheGreek

Filebeat's close_eof closes the file but doesn't exit. I think there is still a need for something in the stack that can process a bunch of files and then exit when done.

@domak

domak commented May 31, 2017

To parse log files and exit, I use the stdin input plugin:
input { stdin { codec => plain { charset => "ISO-8859-1" } } }
and run:
logstash -f conf_file.conf < log_file.log

@gmoskovicz
Contributor

@guyboertje, reopening this issue. close_eof doesn't work the way we think, because unless the file ends with a newline character, the last line (or the single line of the file) will never be picked up. See elastic/beats#3852.

@gmoskovicz reopened this Mar 23, 2018
@guyboertje
Contributor

@gmoskovicz

Ask and you shall receive 😃
See https://github.com/logstash-plugins/logstash-input-file/pull/171/files#diff-03902eccff1b447715ec7622656f3965R52

In read mode, because we know that the file is finished with, we don't need the last newline - we just use what we have buffered since the last newline and create the final event.

@gmoskovicz
Contributor

💥 💣 ! Amazing @guyboertje !

So this means that after the #171 PR is merged, we'll be able to use that file input version in Logstash to read events, correct? So this is pending the #171 merge. This means that Logstash is getting the expected behavior.

@guyboertje
Contributor

guyboertje commented Mar 23, 2018

Read mode, gunzip, and file-completed actions [delete it|log it] too.
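
A minimal read-mode sketch using the settings from that PR (paths are placeholders; as the comment above notes, gzipped files are handled in read mode too):

input {
  file {
    path => "/data/batch/*.log"                      # placeholder glob
    mode => "read"
    file_completed_action => "log"                   # or "delete"
    file_completed_log_path => "/tmp/completed.log"  # required with the "log" action
  }
}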

@guyboertje
Contributor

It does not yet read all discovered files and then terminate LS, though.

@idarlund

Will this become a separate plugin, or will it be an option on the file input plugin?

@guyboertje
Contributor

@idarlund Same plugin; a new setting mode can be "tail" or "read".

@MauJFernandezUVR

MauJFernandezUVR commented Apr 27, 2018

Maybe this is what I have been looking for for quite some time. I have static log files (from a couple of sources) which are rotated each day and put in shared storage for me to process. In LS as it is today, I have no way of loading the last day's logs and stopping once it finishes reading them, so I think this will come in handy.

I see that #171 was already merged, any ETA on this one?

@idarlund

It's been released in 4.1.0, @MauJFernandezUVR: https://github.com/logstash-plugins/logstash-input-file/releases/tag/v4.1.0
Check the changelog here: https://github.com/logstash-plugins/logstash-input-file/blob/master/CHANGELOG.md#410

Many, many thanks @guyboertje!

@sylvainlaurent

Is there another issue to have LS terminate after completion of files?

@idarlund

idarlund commented Jun 25, 2018

Updated to logstash 6.3, which was released almost 10 days ago:
/usr/share/logstash/bin/logstash --version
logstash 6.3.0

But it still has the old logstash-input-file:
/usr/share/logstash/bin/logstash-plugin list --verbose | grep logstash-input-file
logstash-input-file (4.0.5)

When will 4.1.x be released with the upstream logstash package?

@guyboertje
Contributor

@idarlund
v4.1.X has some edge case issues. I would greatly appreciate some feedback on the new version.
Please run logstash-plugin update logstash-input-file to install it.

@idarlund

What do you want me/us to test, @guyboertje? Are there any particular issues filed here on GitHub you want me/us to look at?

@guyboertje
Contributor

@idarlund
From your first comment I take it that you are interested in read mode. Please hammer read mode as much as you can and report back.

@idarlund

Fed in 2988 files containing 285M documents successfully with 4.1.5, using the following config:
input {
  file {
    path => "/path/fo/files/**/."
    sincedb_path => "/dev/null"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/root/logstash/new-inputfile-test.log"
  }
}

@idarlund

Seems like this has been released upstream now: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html

Thanks a lot for your work on this feature @guyboertje!

@guyboertje
Contributor

Read mode has been out for a while now. Closing.

@hensansi

hensansi commented Dec 21, 2018

I'm using read mode on 4.1.9 and I'm wondering: to stop the Logstash job, I need to issue a signal, correct? I'm reading three files, and as soon as I have 3 logs (or a log with 3 lines) I issue a termination signal - or am I doing something wrong?

I'm expecting that Logstash will terminate after reading the files, the way it terminates after the jdbc plugin is done.

edit: I see that there is already an issue about it: #212

logstash 6.2.4
logstash-input-file 4.1.9

input {
  file {
    type => "file"
    path => "/usr/share/logstash/data/*.json"
    codec => json
    mode => read
    file_completed_action => log
    file_completed_log_path => "/usr/share/logstash/logs"
    sincedb_path => "/dev/null"
  }
}


output {
  elasticsearch {
    "hosts" => ["${ELASTICSEARCH_URL}"]
    "action" => "index"
    "index" => "anIndex"
    "template" => "/usr/share/logstash/pipelines/mapping.json"
    "template_name" => "anIndex"
    "template_overwrite" => true
    "document_id" => "%{key}"
    "document_type" => "_doc"
  }
}

@ppf2
Member Author

ppf2 commented Mar 11, 2019

To clarify: even with the read mode added later to the file input, Logstash does not currently exit once it has reached EOF when using read mode. See the outstanding issue.

@Bonn93

Bonn93 commented Mar 18, 2019

Just ran into this now. I'm building an application that sends everything from some folders to ES, and it's failing because there's one file with 1 line. Building a dynamic app on Kubernetes, I can't accept a "just use filebeats" response.

While I agree that no "sane" log should work this way, people use ES for MANY things...
