Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filebeat Fails After Power Failure #3537

Closed
andynz2017 opened this issue Feb 6, 2017 · 25 comments
Closed

Filebeat Fails After Power Failure #3537

andynz2017 opened this issue Feb 6, 2017 · 25 comments
Labels
Filebeat Filebeat

Comments

@andynz2017
Copy link

andynz2017 commented Feb 6, 2017

Issue: Filebeat running on CentOS Linux release 7.3.1611 (Core), filebeat 5.1. Server suffers power failure due to host failing. Ungraceful shutdown, VM running off remote QNAP via ISCSI. VM recovers fine but filebeat doesn't start.

Host OS: Windows Server 2016 Datacenter Evaluation.
HyperVisor: Hyper-V 10.0.14393.0.
VM OS Affected: CentOS Linux release 7.3.1611 (Core), x64 Server Edition.
Filebeat: Version 5.1.1
Filebeat Log: Default logging verbosity.

File before failure:

  • input_type: log
    • /var/log/audit/audit.log

This is the only file that was in use. /var/log/filebeat/* yields just the standard EOF error, no details around time stamp or what action the filebeat agent was taking at the time of power failure. Is there an alternate location that might tell me this information?

2017-02-05T19:21:25+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T19:21:25+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T19:21:25+13:00 INFO Max Retries set to: 3
2017-02-05T19:21:25+13:00 INFO Activated logstash as output plugin.
2017-02-05T19:21:25+13:00 INFO Publisher name: mon.testdomain.com
2017-02-05T19:21:25+13:00 INFO Flush Interval set to: 1s
2017-02-05T19:21:25+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T19:21:25+13:00 INFO filebeat start running.
2017-02-05T19:21:25+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T19:21:25+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T19:21:25+13:00 INFO Total non-zero values:
2017-02-05T19:21:25+13:00 INFO Uptime: 9.9703ms
2017-02-05T19:21:25+13:00 INFO filebeat stopped.
2017-02-05T19:21:25+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF

*** This issue has been moved to Github from https://discuss.elastic.co/t/filebeat-fails-after-power-failure/73967 ***

Happy to provide any additional detail, please just provide steps to acquire it if its not in obvious log files.

@andrewkroh
Copy link
Member

Is there another log file, like /var/log/filebeat/filebeat.1, containing data from before the ungraceful shutdown? It may not contain anything useful since the logs may not have been synced to disk.

Filebeat writes registry data to temporary file, calls fsync, then uses a move to achieve something that's close to an atomic update. So we are a bit surprised to see a zero byte file since fsync is supposed to block until the bytes are written to disk.

@andynz2017
Copy link
Author

andynz2017 commented Feb 6, 2017 via email

@andrewkroh andrewkroh added the Filebeat Filebeat label Feb 7, 2017
@a1exus
Copy link

a1exus commented Feb 8, 2017

I'm experiencing (or rather experienced) similar issue, after updating filebeat to latest (5.0.2) version, filebeat service wouldn't start anymore, however after removing /var/lib/filebeat/registry filebeat service starts without any issues)

ref: Exiting: Could not start registrar: Error loading state: Error decoding states: EOF - Beats / Filebeat - Discuss the Elastic Stack

@BenjaFriend
Copy link

I experienced something like this in a VMWare Virtual Machine of CentOS 7 64-Bit. When I power off the VM or revert to a snapshot, filebeat will not start again. The fix for me was to delete the registry file. I made a forum post about it with more information, such as the log files/syslogs.

@andynz2017
Copy link
Author

andynz2017 commented Feb 9, 2017 via email

@ruflin
Copy link
Contributor

ruflin commented Feb 10, 2017

Since version 5.0 filebeat does some checks of the registry file if it is in valid state and with 5.1 additional checks were introduced. The problem before was, that above issues were not even detected. Now if there is an invalid registry file, filebeat will not start. If we just reset the registry file this would lead to duplicated events.

@andynz2017 About the log files you posted. It seems you tried to start filebeat several times which lead to the issue that each time a new log file was written, and the one we are interested in does not exist anymore.

So far it seems, all people reporting an issue on restart and the registry file are using VM Virtual Machines. Could you guys share what file systems you are using? Did anyone find a "reliable" way to reproduce this?

@andynz2017
Copy link
Author

andynz2017 commented Feb 11, 2017 via email

@andynz2017
Copy link
Author

andynz2017 commented Feb 13, 2017 via email

@ruflin
Copy link
Contributor

ruflin commented Feb 14, 2017

@andynz2017 Thanks a lot for digging deeper into this one. I went through all the logs above, unfortunately there isn't something inside that would even indicate that filebeat got stopped. But this is kind of not expected, as the machine is not shut down properly.

Here my notes I made when going through the issue:

  • For testing the file output could be used to eliminate the dependency on ES setup.
  • Only the local file system should matter, where the registry file is stored.
  • As far as I understand, this file system is XFS?
  • It is very strange that it there is a difference between CentOS and others. Do these systems have different file systems?

Could it be that something "strange" happens in case a process has an open file handler of a file when the OS crashes?

@avleen
Copy link

avleen commented Feb 23, 2017

Hi folks,

On a hunch, I'm going to suggest the problem is a lack of fsync in here:

// writeRegistry writes the new json registry file to disk.
func (r *Registrar) writeRegistry() error {
logp.Debug("registrar", "Write registry file: %s", r.registryFile)
tempfile := r.registryFile + ".new"
f, err := os.OpenFile(tempfile, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0600)
if err != nil {
logp.Err("Failed to create tempfile (%s) for writing: %s", tempfile, err)
return err
}
// First clean up states
states := r.states.GetStates()
encoder := json.NewEncoder(f)
err = encoder.Encode(states)
if err != nil {
f.Close()
logp.Err("Error when encoding the states: %s", err)
return err
}
// Directly close file because of windows
f.Close()
err = file.SafeFileRotate(r.registryFile, tempfile)
logp.Debug("registrar", "Registry file updated. %d states written.", len(states))
registryWrites.Add(1)
statesCurrent.Set(int64(len(states)))
return err
}

Event timeline:

  1. New registry file is opened.
  2. File is written, but not synced. This means the file on disk is zero bytes, and the changes are cached are dirty pages in memory waiting for the next file system sync. On Linux (at least CentOS) this happens every 30s.
  3. The new registry file is swapped in for the real one. Remember: it's still zero bytes on disk, so now the only legit knowledge is the state is cached in memory, still waiting for the dirty pages to flush to disk.
  4. Server crashes, changes are lost. File on disk stays zero bytes.

The fix is probably to open the file with O_SYNC on L301.

@leinad13
Copy link

leinad13 commented Feb 24, 2017

Apologies I havn't had time to read the whole thread, but I think I have experienced this issue as well. I got an ELK dev stack up and running on my laptop, installed filebeat and had it working yesterday. Then when i booted this morning, checked Kibana - no logs since last night. My laptop didn't have a un-graceful shutdown so not sure what caused the issue.

`2017-02-24T10:10:27Z INFO filebeat start running.
2017-02-24T10:10:27Z INFO Registry file set to: C:\ProgramData\filebeat\registry
2017-02-24T10:10:27Z INFO Loading registrar data from C:\ProgramData\filebeat\registry
2017-02-24T10:10:27Z ERR Error decoding old state: invalid character '\x00' looking for beginning of value
2017-02-24T10:10:27Z INFO Total non-zero values:
2017-02-24T10:10:27Z INFO Uptime: 35.9984ms
2017-02-24T10:10:27Z INFO filebeat stopped.
2017-02-24T10:10:27Z CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value

EDIT : Sorry was in a bit of rush earlier - i'm running windows 10, build 14393.726 - JDK v8 u121.
`

@ruflin
Copy link
Contributor

ruflin commented Feb 24, 2017

@leinad13 That seems to be a different issue. Can you open a topic on discuss? https://discuss.elastic.co/c/beats/filebeat

@ruflin
Copy link
Contributor

ruflin commented Feb 24, 2017

@avleen Great find. Based on your input I opened #3668 so we can directly discuss further based on code.

@andynz2017
Copy link
Author

andynz2017 commented Feb 24, 2017 via email

@ruflin
Copy link
Contributor

ruflin commented Feb 27, 2017

@avleen PR just go merged. I really appreciate the work you put into this one to find the issue and point it out.

@avleen
Copy link

avleen commented Feb 27, 2017

Thanks @ruflin !

ruflin added a commit to ruflin/beats that referenced this issue Feb 28, 2017
tsg pushed a commit that referenced this issue Feb 28, 2017
@cosminpopescu
Copy link

I was having the same error as @leinad13 , after a BSOD, namely "2017-09-07T10:51:51+03:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value"

I am using filebeat 5.5.2.

The fix was to delete the files in "C:\ProgramData\filebeat" if it helps someone reading this thread.

@JoshSchreuder
Copy link

JoshSchreuder commented Nov 21, 2017

I was having the same error as @leinad13 , after a BSOD, namely "2017-09-07T10:51:51+03:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value"

We had the same issue, registry file was full of NUL characters.
I deleted registry and renamed registry.old to registry and things seem to be working now.

This was on 5.2.2

@trondhindenes
Copy link

I just had the same issue after a host failure, filebeat v 5.6.3

@ruflin
Copy link
Contributor

ruflin commented Dec 19, 2017

@cosminpopescu @JoshSchreuder @trondhindenes There must be something more there as the above fix is in all 5.x branches. Any chance you guys could provide some more informations like

  • OS
  • File system used
  • Filebeat configs
  • Potentially content of registry file (if not empty)

@Vladimir-csp
Copy link

I am still able to reproduce it with filebeat 6.1.1 and 6.2.1 on Linux and FreeBSD.

cat /dev/null > /var/db/beats/filebeat/data/registry
service filebeat start

Start fails with empty registry

beat.go:635: Exiting: Could not start registrar: Error loading state: Error decoding states: EOF

@avleen
Copy link

avleen commented Feb 28, 2018

I suggest not catting /dev/null into your registry file ;-)

@polmolea
Copy link

polmolea commented Mar 5, 2018

@ruflin I've also had the same issue with filebeat not starting after a hard shutdown on multiple systems. Here are some details:

  • Filebeat 6.2.2 installed via chocolatey on Windows Server 2016 on AWS EC2
  • Filesystem is NTFS
  • Server is stopped and started on a schedule, using the AWS API (StopInstance, StartInstance)
  • When system comes back up, the filebeat service is Stopped and the log shows the following:
2018-03-05T06:30:05.118Z    INFO    instance/beat.go:468    Home path: [C:\ProgramData\chocolatey\lib\filebeat\tools\filebeat-6.2.2-windows-x86_64] Config path: [C:\ProgramData\chocolatey\lib\filebeat\tools\filebeat-6.2.2-windows-x86_64] Data path: [C:\\ProgramData\\filebeat] Logs path: [C:\\ProgramData\\filebeat\logs]
2018-03-05T06:30:05.161Z    INFO    instance/beat.go:475    Beat UUID: 36b5d0e5-691c-4343-bb3b-b64ca4ac2ad2
2018-03-05T06:30:05.161Z    INFO    instance/beat.go:213    Setup Beat: filebeat; Version: 6.2.2
2018-03-05T06:30:05.162Z    INFO    pipeline/module.go:76   Beat name: STAGE-HOSTNAME
2018-03-05T06:30:05.391Z    INFO    instance/beat.go:301    filebeat start running.
2018-03-05T06:30:05.392Z    INFO    [monitoring]    log/log.go:97   Starting metrics logging every 30s
2018-03-05T06:30:05.404Z    INFO    registrar/registrar.go:108  Loading registrar data from C:\ProgramData\filebeat\registry
2018-03-05T06:30:05.404Z    INFO    instance/beat.go:308    filebeat stopped.
2018-03-05T06:30:21.991Z    INFO    [monitoring]    log/log.go:132  Total non-zero metrics  {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":78,"time":78},"total":{"ticks":78,"time":93,"value":0},"user":{"ticks":0,"time":15}},"info":{"ephemeral_id":"5772a22c-bf28-4ae4-8e54-97ee4b5f703d","uptime":{"ms":16483}},"memstats":{"gc_next":4473924,"memory_alloc":2891360,"memory_total":2891360,"rss":19214336}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"logstash"},"pipeline":{"clients":0,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"cpu":{"cores":2}}}}}
2018-03-05T06:30:21.991Z    INFO    [monitoring]    log/log.go:133  Uptime: 16.4832551s
2018-03-05T06:30:21.991Z    INFO    [monitoring]    log/log.go:110  Stopping metrics logging.
2018-03-05T06:30:21.994Z    ERROR   instance/beat.go:667    Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value
  • Deleting the registrar and restarting the service solves the problem, but it's not a nice solution from an automation point of view
  • Filebeat config is the following:
filebeat.prospectors:
- input_type: log
  paths:
  - C:\AppLogs\Service1\Service1.log
  multiline.pattern: '(INFO|ERROR|DEBUG|TRACE|FATAL)'
  multiline.negate: true
  multiline.match: after
- input_type: log
  paths:
    - C:\AppLogs\Service2\Service2.log
    - C:\AppLogs\Service3\Service3.log
    - C:\AppLogs\Service4\Service4.log
    - C:\AppLogs\Service5\Service5.log
    - C:\AppLogs\Service6\Service6.log
    - C:\AppLogs\Service7\Service7.log
    - C:\AppLogs\Service8\Service8.log
  #exclude_lines: ["^DBG"]
  #include_lines: ["^ERR", "^WARN"]
  #exclude_files: [".gz$"]
  #fields:
  #  level: debug
  #  review: 1
  ignore_older: 15m
  multiline.pattern: ^\d{1,2}\/\d{1,2}\/\d{4}
  multiline.negate: true
  multiline.match: after
#name:
#tags: ["service-X", "web-tier"]
fields:
  env: STAGE
#output.elasticsearch:
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"
output.logstash:
  hosts: ["logstash.example.local:5044"]
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]
  #ssl.certificate: "/etc/pki/client/cert.pem"
  #ssl.key: "/etc/pki/client/cert.key"
#logging.level: debug
#logging.selectors: ["*"]
  • The registry file is empty when the service has failed to start, but deleting it and restarting the service produces a new file with the following contetn:
[{"source":"C:\\AppLogs\\Service4\\Service4.log","offset":323454,"timestamp":"2018-03-05T09:06:20.3922249Z","ttl":-1,"type":"log","FileStateOS":{"idxhi":2228224,"idxlo":36817,"vol":3841148702}}]

Let me know if there's anything else I can get from these systems to get to the bottom of this.

@ruflin
Copy link
Contributor

ruflin commented Mar 11, 2018

@polmolea We potential have a fix for this which we applied in winlogbeat but missed in Filebeat: #2434 We will also apply it to Filebeat.

@ruflin
Copy link
Contributor

ruflin commented Apr 12, 2018

Here is the Filebeat issue for it: #6792

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Filebeat Filebeat
Projects
None yet
Development

No branches or pull requests