Option to just read from files (without continuously watching them) #48
For this it may make sense to have a new input. I wonder if it's too confusing to provide "follow files forever" behavior in the same plugin, and near the same settings, as "read files once, from the beginning" behavior.
👍
Also, a confession from an implementor's perspective: the code to "find some files, read them until EOF, then be done" is way simpler than the file input is today. :P
+1 for this feature.
I am +1 on this feature. Probably a new plugin, for simplicity of interface. I'm not sure what we should call it. Any ideas? I'm open to making the …
I am keen to create a simpler plugin called, maybe, logstash-input-read_file. It should reuse most of the current file input but "know" that it is not tailing.
It's not clear why this needs to be a different plugin. While it's definitely easier than the read-watch case, once you've solved that case, adding this logic is not complicated. Somewhere in the code of the current file plugin it's going to say "if EOF, wait() and try_again()" - you could just have a config called eof_action which can do what you want (delete/mv are the basic use cases - you could start there - maybe add run_script to cover everything else).

The advantage of this approach is that it's simpler for me to know that I have one plugin that can do different things, rather than to keep track of how the file is going to be used and use different plugins to do almost the same thing. It also seems that you'd have to duplicate a lot of the file logic in two plugins instead of providing a pretty trivial hook into the current plugin. I haven't looked through the code - java/ruby isn't my strong point - but it seems like we shouldn't have to go here.
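To make the suggestion concrete, here is a hypothetical sketch of what such a combined configuration could look like. Note that eof_action and its values are illustrative names taken from this comment, not settings the plugin ever shipped (the feature eventually landed as mode / file_completed_action):

```
input {
  file {
    path => "/var/batch/*.log"
    # Hypothetical setting from the suggestion above - never implemented
    # under this name. On reaching end-of-file, the plugin would run the
    # chosen action: "delete", "mv", or "run_script".
    eof_action => "delete"
  }
}
```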
What would your guess be for the percentages of use cases for tail only, read only, and mixed tail-some read-some? My guess is that it may be as much as 49.5, 49.5, 1. I did start a branch with a …

In the tail case, assume for a moment that there are 5000 files detected from the glob and they have all just been rotated and so are empty. The current tail implementation will open all 5000, see that they are empty, and loop; we do not open-read-close now. In time, as content is appended, we will loop through each file reading what we can, and there is no logical EOF. We can never consider it 'done'.

In the read case, assume the same number (5000) of content-complete files, but they are > 800K big. If we reuse the same filewatch lib, we will open each one in turn and begin reading to EOF. It will be many minutes before we get to the 5000th file. Further, because filewatch is tailing, we will monitor those 'done' files for changes every stat_interval. A specialised …

IMO, we would not want a synchronous …
@guyboertje For files that are being added to or rotated, things should work as they do now - as messy as that is. These files won't have an eof_action. There will be other file globs with config that will specify an eof_action.

I think it is important to manage the case where logstash dies or is killed - so I assume you do need a file/position status also - I think it's important that's not overlooked. It seems that some of the problems related to a high number of files will automatically be taken care of - eg, you probably won't have to handle 5000 files because they'll be processed out when finished. You could even support an empty handler to just say "ignore additional changes" - don't watch - and let other outside scripts handle it.

About the script/action - I agree it shouldn't be blocking - if you don't need to output (which I assume you don't), you can just fork the process and let it do its thing. But there are many options on how to handle this and what to support - I was just suggesting something easy and flexible - I am sure there are other/better options.

But all this is really just implementation details. If in the end it is actually better, stronger, faster and able to process large log files in a single bound.. :) and it happens sooner rather than later - I'm all for it.
I wish it were this simple, but it's not clear how everyone will be doing these batch file deliveries. For example, if Logstash is always running and watching for new files to process, and a delivery starts of a single file that will eventually be 500GB, Logstash may start seeing that file as new before the first few hundred megs are written, and it is likely that Logstash catches up to the writer, say, by 500MB (in this fictional example), and gets EOF. Now, with this approach of EOF meaning "I hit the end of the file", we will miss the remaining 499.5GB of data!

Further, our incentive to make a new plugin is partly because the use cases of files-as-a-stream and files-read-once have pretty different configuration needs. Streams need log rotation detection, file rename handling, etc. Read-once needs EOF actions and won't deal with renames or log rotation - so if we combined these behaviors into a single plugin, you'd have a bunch of settings that are invalid for entire use cases. For example, "delete the file at EOF" wouldn't be something useful for files-as-streams, since there's no "end" to such streams.

I'm open to having it be one plugin, but not if there are so many mutually exclusive configuration settings.
The case you mention, where the logstash processing overruns the filesystem copy, I also mentioned. As I said, you could handle it in logstash by waiting for changes after an EOF to see if more changes come. If after the stat_interval (or two or three) there are no more changes, that's considered an EOF and whatever is supposed to happen, happens. I'm curious how you're going to handle this case with the proposed filereader plugin - why will it be simpler - won't you also need to poll and try again?

I looked through the docs on the file plugin and the only option that doesn't make sense for the batch use case is the …
@jordansissel, @yehosef - a tie-breaker argument for a different plugin: being in a new GH repo, the issues and PRs will be segregated. We will know that a logged issue pertains to the read use case only. Testing and manual fix verification only need to consider the read use case. We will not have to be concerned about whether a change for one affects the performance of the other.
+1 for this feature, just ran headlong into this.
@vmorris: I am working on a readfile input that will have many new features: …
Hi, this is something I would be interested in. If we want to just process a file for testing purposes, then we would like the logstash process to end when it is finished reading. Is this something that this plugin could do?
We intend to have a signal mechanism that could be listened to and would cause an LS shutdown. The input plugin itself would not end the LS process.
Hi, …
@MarkusMayer - unfortunately no. I have a very big PR to filewatch pending that affects the current file input and the future readfile input. When this PR is merged I can proceed.
Any news about the new plugin?
I need this as I have files that do not end with a newline, so logstash never processes the last line. I can't find a workaround after a few hours.
Great to see that this is a feature which is going to be developed. It would be perfect if I could say something like: if you get an event with a specific flag, then close the filestream to this file and flag it as "done".
Nice to know about this plugin. I'm having a huge headache that this new plugin could "cure" :)
Any news?
You should use Filebeat for this.
Closing.
Filebeat …
To parse log files and exit, I use the stdin input plugin:
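The commenter's exact command did not survive extraction; a typical form of this workaround, with placeholder paths, feeds the file to Logstash on stdin so the process exits when the stream closes:

```
# pipeline.conf - parse a static file once and exit.
# Run as:  cat /var/log/app/static.log | bin/logstash -f pipeline.conf
# Logstash shuts the pipeline down when stdin reaches EOF.
input { stdin { } }
output { stdout { codec => json_lines } }
```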
@guyboertje reopening this issue.
Ask and you shall receive 😃 In read mode, because we know that the file is finished with, we don't need the last newline - we just use what we have buffered since the last newline and create the final event.
💥 💣 ! Amazing @guyboertje! So this means that after the #171 PR is merged, we'll be able to use that file input version in Logstash to read events, correct? So this is pending the #171 merge. This means that Logstash is getting the expected behavior.
Read mode, gunzip, and file completed actions [delete it|log it] too.
It does not do the "read all discovered files and terminate LS" behavior yet, though.
Will this become a separate plugin, or will it be an option on the file input plugin?
@idarlund Same plugin, new setting: mode.
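As released in 4.1.0, read mode is selected on the existing file input. A minimal example (paths are placeholders):

```
input {
  file {
    path => "/data/archive/*.log"
    mode => "read"        # default is "tail"
    # What to do with each file once it has been read to completion:
    # "delete", "log", or "log_and_delete".
    file_completed_action => "log"
    file_completed_log_path => "/data/archive/completed.log"
  }
}
```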
This may be what I was looking for for quite some time. I have static log files (a couple of sources) which are rotated each day and put in shared storage for me to process. In LS as it is today, I have no way of loading the last day's logs and stopping after it finishes reading them, so I think this will come in handy. I see that #171 was already merged - any ETA on this one?
It's been released in 4.1.0, @MauJFernandezUVR: https://github.com/logstash-plugins/logstash-input-file/releases/tag/v4.1.0 Many, many thanks @guyboertje!
Is there another issue for having LS terminate after completion of the files?
Updated to logstash 6.3, which was released almost 10 days ago, but it still has the old logstash-input-file. When will 4.1.x be released with the upstream logstash package?
@idarlund …
What do you want me/us to test, @guyboertje? Are there any particular issues filed here on GitHub you want me/us to look at?
@idarlund …
Fed in 2988 files containing 285M documents successfully with 4.1.5, using the following config:
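The actual config was not preserved in this comment; a representative read-mode bulk-load configuration for that kind of run, with hypothetical paths, would be:

```
input {
  file {
    path => "/ingest/**/*.log"
    mode => "read"
    file_completed_action => "delete"
    # Track read positions so an interrupted run can resume.
    sincedb_path => "/var/lib/logstash/sincedb_ingest"
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```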
Seems like this has been released upstream now: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html Thanks a lot for your work on this feature @guyboertje!
Read mode has been out for a while now. Closing.
I'm using read mode on 4.1.9 and I'm wondering: to stop the logstash job, I need to issue a signal, correct? I'm reading three files, and as soon as I have 3 logs (or a log with 3 lines), I issue a termination signal - or am I doing something wrong? I'm expecting that logstash will terminate after reading the files, like logstash terminates after the jdbc plugin is done.

edit: I see that there is an issue about it. (logstash 6.2.4)
To clarify: even with the read mode added later to the file input, Logstash does not currently exit once it has reached EOF in read mode. See the outstanding issue.
Just ran into this now. I'm building an application that sends everything from some folders to ES, and it's failing because there's one file with 1 line. Building a dynamic app on Kubernetes, I can't accept a "just use Filebeat" response. Whilst I agree that no "sane" log should work this way, people use ES for MANY things...
When using the file input against files that are static (will not be appended to anymore), it would be nice to provide an option to exit the LS pipeline once the files have been read (instead of keeping the process running and watching for new streams). This would allow the end user to schedule periodic runs that read from files and exit when done.