kworker task blocked - reading temperature? #192
It could be that the GPU had crashed for other reasons (could there have been some glitch/surge that also killed the network), and the problem wasn't specific to reading the temperature. If it happens again, check whether any other communication with the GPU is possible.
Will do. I've now hooked it up to an HDMI monitor and added a USB keyboard, so next time it happens I should be able to confirm whether the GPU has crashed or if it's just the network. It's highly likely, though, that vcgencmd was not responding correctly shortly before the crash; here's the last output from the script that I received in my ssh session:
The missing values in the final line correspond to vcgencmd calls (measure_clock arm/core/h264, measure_volts core/sdram_c). Temperature is read from /sys/class/thermal/thermal_zone0/temp. So although the network was still up, it looks like vcgencmd had already started to misbehave. After that final line there was no further output from the script, as the session disconnected at 19:42:42.
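For illustration, a poll along those lines might look like the sketch below. This is not the actual monitoring script, just a minimal Python example of the same checks; the per-call timeout is an addition so that a wedged vcgencmd shows up as a missing field instead of blocking the whole loop.

```python
#!/usr/bin/env python3
# Minimal sketch (not the original monitoring script): poll the values
# mentioned above, with a timeout so a wedged vcgencmd call shows up as a
# missing field instead of blocking the whole loop.
import subprocess

def vcgencmd(*args, timeout=5):
    """Run vcgencmd and return its output, or None if it hangs or fails."""
    try:
        result = subprocess.run(["vcgencmd", *args], capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout.strip() or None
    except (subprocess.TimeoutExpired, OSError):
        return None

def soc_temp():
    """Read the SoC temperature from sysfs (reported in millidegrees C)."""
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read()) / 1000.0

if __name__ == "__main__":
    fields = [soc_temp(),
              vcgencmd("measure_clock", "arm"),
              vcgencmd("measure_clock", "core"),
              vcgencmd("measure_clock", "h264"),
              vcgencmd("measure_volts", "core"),
              vcgencmd("measure_volts", "sdram_c")]
    print(" ".join("-" if v is None else str(v) for v in fields))
```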
The problem seems to persist with the version used in Raspbian: vcgencmd is called three times every 5 minutes by munin-node (temperature, voltages, frequencies). It has run well for several days but failed over the weekend, leaving this in syslog:
The vcgencmd processes were left as zombies on the system and the system load climbed to 700 over the weekend.
sudo rpi-update and retest?
@licaon-kter done just now. Now let's wait and see. I have no way to force this error.
I've found that the backtrace in the dmesg paste always happens when the VPU fails to respond to mailbox requests, usually when I've crashed the VPU. The exact cause can vary wildly, which makes me wonder whether popcornmix could expose a tool to make a coredump of the VPU, so we could submit the failure without needing to reproduce it?
Not really; if there's no debugger attached then the processor is just going to go off and crash through memory doing "bad stuff (tm)". You can do this (basically you read all of memory and write it to a file, starting at 0 and finishing at 512MB). Popcornmix has a method of loading that into the debugger and at least understanding what the threads were doing the last time they were run, but it doesn't actually give much in the way of useful information about the thread that actually crashed! Gordon
Without the SP register, you can't get a decent backtrace? And SP is only visible via JTAG or a working VPU mailbox API? Are there any exception interrupts that could send a crashdump to the ARM core?
The hangs still occur. I'm not sure why. The mailbox interrupt stops, kworker waits endlessly (state D), and if I try to run "vcgencmd version" it also hangs and can't be terminated (except by rebooting the RasPi).
@mtedaldi
It seems to be quite random, but it happened more while I was monitoring frequency and core temperature every 5 minutes with munin. I'm running that system headless (no display or keyboard/mouse). I could try to force the hang by requesting temperature readings in a tight loop.
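A tight loop of that kind could be as simple as the following sketch (hypothetical test code, not something already posted in this thread): it hammers the firmware with measure_temp requests and stops as soon as one call stalls, which is the symptom reported here.

```python
#!/usr/bin/env python3
# Hypothetical stress test: request the temperature in a tight loop and stop
# as soon as a call stalls (the reported symptom is vcgencmd stuck in D state).
import subprocess
import time

calls = 0
while True:
    calls += 1
    try:
        subprocess.run(["vcgencmd", "measure_temp"], capture_output=True,
                       timeout=10, check=True)
    except subprocess.TimeoutExpired:
        print("vcgencmd stalled after %d calls" % calls)
        break
    except subprocess.CalledProcessError as err:
        print("vcgencmd returned an error after %d calls: %s" % (calls, err))
        break
    if calls % 1000 == 0:
        print("%d calls OK" % calls)
    time.sleep(0.01)  # roughly 100 requests/second; remove to push harder
```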
@popcornmix I've tried to force it the whole day on Friday, with no success. Here is the syslog output from the last hang:
@mtedaldi
@popcornmix I'm not using the GPU (knowingly). There is nothing attached except the network cable (and some I2C stuff).
After the hang, can you report the output of:
to see if there are any complaints from the GPU.
So finally, my RasPi decided to kill the mailbox interrupt again (again while I was not around, so I only got to investigate it after 3 days).
@popcornmix the results of the commands:
@popcornmix
@mtedaldi
@popcornmix it took some time after that change, but now it happened again:
Does removing the underclock help?
If you use those, does the problem still occur?
@popcornmix Even using cat on the /sys/ interface blocked when trying to read in such a case. So under the hood this driver seems to rely on the same infrastructure as gencmd.
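One way to probe for that state without wedging the monitoring process itself is to do the read in a short-lived child with a timeout. The sketch below is only an illustration of the idea: the stuck child cannot actually be killed while it sits in uninterruptible sleep, but the parent can still detect and log the hang.

```python
#!/usr/bin/env python3
# Sketch: read the thermal sysfs node from a child process with a timeout.
# If the mailbox is wedged, only the child ends up stuck in D state; the
# parent gets a TimeoutExpired and can log the problem. (The stuck child
# lingers until the firmware answers, since D-state tasks ignore signals.)
import subprocess

def thermal_zone_responsive(timeout=5):
    try:
        subprocess.run(["cat", "/sys/class/thermal/thermal_zone0/temp"],
                       capture_output=True, timeout=timeout, check=True)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

if __name__ == "__main__":
    print("thermal zone responsive:", thermal_zone_responsive())
```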
@popcornmix Now THAT's interesting. I've removed the "arm_freq_min=200" statement from config.txt (commented it out) and rebooted. So it seems that the frequency scaling somehow led to this situation. Until the next crash (which probably won't occur), the problem seems to be solved.
For completeness, could you try arm_freq=710 (i.e. a very small overclock)?
With a very recent kernel and firmware on a headless RPi, the GPU hangs after a day.
Configuration files:
Some dumps after the hang:
If needed, a complete 512 MiB memory dump is available. During the last 3 days, 2 out of the 10 devices in the test have hung. The reason for the hangs is unclear.
The latest firmware update (Hexxeh/rpi-firmware@998327d) enabled assert logging in the vcdbg log. It would be useful if you could catch a hang with that firmware. A repeatable sequence of commands that results in a hang would also be useful.
Updated to the latest firmware, changed gpu_mem from 32 to 64, and after 15 hours of uptime so far no hangs. Running the test on 10 RPis. At the moment the following asserts were collected (the same on all RPis; assuming they're not critical?):
The issue is hard to reproduce, as it was also happening on an idle RPi (booted and not doing anything useful). Continuing the test, waiting to see if any hangs occur. BTW: we do have a custom kernel (same config as the latest Raspbian, but with some more options compiled in - btrfs, LUKS, aufs patch). We could provide a .config diff if useful.
Nothing too concerning in the assert log fragment.
Regarding the display asserts - at the end of boot the display is turned off, exactly as you suspected. So far no hangs. I have a feeling that the gpu_mem setting (<64MB) could somehow have been the cause of the issue...
Still no GPU hangs (great!), but on some RPis we found the following message: invalid space 0x0 in free list at ...
Could this be a hint, or is it harmless info? The RPis in the test were playing random MP3s from a Java application.
vcdbg reads GPU memory in a cache-incoherent, non-locking way, so its output needs to be taken with a pinch of salt. Always run:
and only believe it if you see the same results twice.
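The exact command was elided above, but the "run it twice" advice could be scripted along these lines (a rough sketch; the vcdbg invocation shown is only an assumption, substitute whatever command is actually being checked):

```python
#!/usr/bin/env python3
# Sketch of the "only believe it if you see the same results twice" advice:
# run the same vcdbg invocation twice and report whether the output matches.
# The subcommand below is an assumption - substitute the command actually
# being checked.
import subprocess

CMD = ["sudo", "vcdbg", "log", "assert"]  # assumed invocation

def run_once():
    return subprocess.run(CMD, capture_output=True, text=True).stdout

first = run_once()
second = run_once()
if first == second:
    print("Same output on both runs - probably trustworthy.")
else:
    print("Output differs between runs - take it with a pinch of salt.")
```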
Can anyone with this issue try an rpi-update? I think the key piece of information is that I2C causes the issue.
I also use I²C and get kworker problems... interesting! My Pi is still reachable over SSH, but the time is frozen and a reboot attempt makes it crash until I power it down. See my log:
@Denyuu the kernel log doesn't tell us much - just that the GPU didn't respond to mailbox requests.
I have the same problem on a RasPi A. Mine is running a small Python script which communicates with an RFM69 radio module over I2C. Will do a firmware update and report back...
@ingoha can you post the script in its entirety? One of the discussions here was whether simply hammering I2C was enough to crash the GPU, or whether some other action was necessary.
#!/usr/bin/env python2
from RFM69 import RFM69
from RFM69.RFM69registers import *
import paho.mqtt.client as mqtt
import struct
import sys
import signal
import time
import Queue
VERSION = "2.1"
NETWORKID = 100
KEY = "blanked"
FREQ = RF69_868MHZ #options are RF69_915MHZ, RF69_868MHZ, RF69_433MHZ, RF69_315MHZ
writeQ = Queue.Queue()
class Message(object):
def __init__(self, message = None):
self.nodeID = 0
self.sensorID = 0
self.uptime = 0L
self.data = 0.0
self.battery = 0.0
self.s = struct.Struct('hhLff')
self.message = message
if message:
self.getMessage()
def setMessage(self, nodeID = None, sensorID = None, uptime = 0, data = 0.0, battery = 0.0):
if nodeID:
self.message = self.s.pack(nodeID, sensorID, uptime, data, battery)
else:
self.message = self.s.pack(self.nodeID, self.sensorID, self.uptime, self.data, self.battery)
self.getMessage()
def getMessage(self):
try:
self.nodeID, self.sensorID, self.uptime, self.data, self.battery = \
self.s.unpack_from(buffer(self.message))
except:
print "could not extract message"
class Gateway(object):
def __init__(self, freq, networkID, key):
self.mqttc = mqtt.Client()
self.mqttc.on_connect = self.mqttConnect
self.mqttc.on_message = self.mqttMessage
self.mqttc.connect("127.0.0.1", 1883, 60)
self.mqttc.loop_start()
print "mqtt init complete"
self.radio = RFM69.RFM69(freq, 1, networkID, True)
self.radio.rcCalibration()
self.radio.encrypt(key)
print "radio init complete"
def receiveBegin(self):
self.radio.receiveBegin()
def receiveDone(self):
return self.radio.receiveDone()
def mqttConnect(self, client, userdata, flags, rc):
self.mqttc.subscribe("home/rfm_gw/sb/#")
def mqttMessage(self, client, userdata, msg):
message = Message()
if len(msg.topic) == 27:
message.nodeID = int(msg.topic[19:21])
message.devID = int(msg.topic[25:27])
message.payload = str(msg.payload)
if message.payload == "READ":
message.cmd = 1
statMess = message.devID in [5, 6, 8] + range(16, 31)
realMess = message.devID in [0, 2, 3, 4] + range(40, 71) and message.cmd == 1
intMess = message.devID in [1, 7] + range(32, 39)
strMess = message.devID == 72
if message.nodeID == 1:
if message.devID == 0:
try:
with open('/proc/uptime', 'r') as uptime_file:
uptime = int(float(uptime_file.readline().split()[0]) / 60)
except:
uptime = 0
self.mqttc.publish("home/rfm_gw/nb/node01/dev00", uptime)
elif message.devID == 3:
self.mqttc.publish("home/rfm_gw/nb/node01/dev03", VERSION)
return
else:
if statMess:
if message.payload == "ON":
message.intVal = 1
elif message.payload == "OFF":
message.intVal = 0
else:
#invalid status command
self.error(3, message.nodeID)
return
elif realMess:
try:
message.fltVal = float(message.payload)
except:
pass
elif intMess:
if message.cmd == 0:
message.intVal = int(message.payload)
elif strMess:
pass
else:
#invalid devID
self.error(4, message.nodeID)
return
message.setMessage()
writeQ.put(message)
def processPacket(self, packet):
message = Message(packet)
print "Message from node %d, sensorID %d, uptime %u, data %e, battery %e" % (message.nodeID, message.sensorID, message.uptime, message.data, message.battery);
# send sensor data
self.mqttc.publish("home/rfm_gw/nb/node%02d/dev%02d/data" % (message.nodeID, message.sensorID), message.data)
# send uptime
self.mqttc.publish("home/rfm_gw/nb/node%02d/dev%02d/uptime" % (message.nodeID, message.sensorID), message.uptime)
# send battery state
self.mqttc.publish("home/rfm_gw/nb/node%02d/dev%02d/battery" % (message.nodeID, message.sensorID), message.battery)
def sendMessage(self, message):
if not self.radio.sendWithRetry(message.nodeID, message.message, 5, 30):
self.mqttc.publish("home/rfm_gw/nb/node%02d/dev90" % (message.nodeID, ),
"connection lost node %d" % (message.nodeID))
def error(self, code, dest):
self.mqttc.publish("home/rfm_gw/nb/node01/dev91", "syntax error %d for node %d" % (code, dest))
def stop(self):
print "shutting down mqqt"
self.mqttc.loop_stop()
print "shutting down radio"
self.radio.shutdown()
def handler(signum, frame):
print "\nExiting..."
gw.stop()
sys.exit(0)
signal.signal(signal.SIGINT, handler)
gw = Gateway(FREQ, NETWORKID, KEY)
if __name__ == "__main__":
while True:
gw.receiveBegin()
while not gw.receiveDone():
try:
message = writeQ.get(block = False)
gw.sendMessage(message)
except Queue.Empty:
pass
time.sleep(.1)
if gw.radio.ACK_RECEIVED:
continue
packet = bytearray(gw.radio.DATA)
if gw.radio.ACKRequested():
gw.radio.sendACK()
gw.processPacket(packet)
If you need the source of the RFM module, go to https://github.com/ingoha/RFM69
On a small testbed (two RPi B+, 5 various I2C sensors connected to each RPi) I have just done rpi-update. Measurements from the sensors are collected as fast as possible.
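For reference, "as fast as possible" in this kind of test could be something as simple as the sketch below: a tight read loop against one attached sensor. The bus number and device address are placeholders, not the actual sensors used here.

```python
#!/usr/bin/env python3
# Hypothetical I2C hammer test: read one byte from a sensor in a tight loop
# and count iterations until something fails. Bus number and device address
# are placeholders - adjust for whatever sensor is actually attached.
from smbus2 import SMBus  # third-party: pip install smbus2

BUS = 1      # /dev/i2c-1 on most Pi models
ADDR = 0x48  # placeholder sensor address

reads = 0
with SMBus(BUS) as bus:
    while True:
        try:
            bus.read_byte(ADDR)
            reads += 1
        except OSError as err:
            print("I2C read failed after %d reads: %s" % (reads, err))
            break
```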
@ingoha When you say "hung", we're trying to see if the GPU has crashed - not whether the SD card has gone away. In the hung state, can you do ...
Hm, when my Pi is in the "hung" state, I am not able to connect via ssh (it is headless).
@ingoha, you definitely have a different issue then.
ok :-(
@popcornmix The bug seems fixed, thank you!
Actually we found another instance that can cause this bug, so there will be a firmware update later today with an additional fix in.
Referenced commits (see raspberrypi/firmware#192):
- firmware: threadx: Avoid calling a NULL interrupt handler
- firmware: arm_loader: Load standard touchscreen overlay
- firmware: di_adv: Remove dma and copy non interlaced lines from shader
- firmware: di_adv: Add vector code for copying top/bottom lines
- firmware: di_adv: Do not deinterlace first frame and have one less frame of latency
Another fix has just been pushed, so please rpi-update and test again.
@popcornmix Now running with updated firmware (and regular temperature checks) again, about 20 hours so far without issue. Will report again after 1 week!
@popcornmix No freeze for more than 73 hours now.
rpi-update done yesterday, no issue since then!
Good to hear. Our test board is fine too.
Three positive results and fixed in our use case. Closing.
I left a headless 512MB Pi (16MB GPU, wired network on eth0) running with the latest firmware for several days, and it has just hung with the following backtrace:
Current firmware:
As with issue #132, I'm polling the temperature every 10 seconds, and the backtrace references bcm2835_get_temp, so maybe there's still a problem there? However, this doesn't appear to be a regression of #132 - running multiple concurrent processes reading the temperature does not provoke a backtrace.
A few minutes prior to this latest backtrace, the NUT client began reporting a communication problem with the network UPS which may be related (there is nothing wrong with the UPS, so it looks like an eth0 communication problem on the Pi side).
The Pi didn't actually crash, but it did lose all network access so had to be power cycled.
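A concurrent-reader test of that kind might look like the following rough sketch (worker count and duration are arbitrary illustrative choices, not necessarily how it was actually run):

```python
#!/usr/bin/env python3
# Rough sketch of a concurrent-reader test: several processes reading the
# thermal zone in parallel for a fixed time. Worker count and duration are
# arbitrary choices for illustration.
import multiprocessing
import time

def reader(seconds=60):
    deadline = time.time() + seconds
    while time.time() < deadline:
        with open("/sys/class/thermal/thermal_zone0/temp") as f:
            f.read()

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=reader) for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("all readers finished without blocking")
```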