Exception on shutdown #9

Open
skuntz opened this issue Nov 11, 2021 · 16 comments
Labels
wontfix This will not be worked on
@skuntz

skuntz commented Nov 11, 2021

We are sometimes seeing an exception on shutdown with the latest RedisGraph (@jasonriedy can chime in). For reference, I do not see this issue on condor6 with my old build.

condor6:n0: /root/skuntz/RG_bulk
redis-server v6.0.9
https://github.com/skuntz/RedisGraph/tree/skkBulkTiming
I'm not sure how to tell which sc-gc-api version it is, since it was compiled into redisgraph.so on May 26.
I tried two different LGB versions (the May version and the one Minoo compiled a week or so ago) and both worked, so that doesn't appear to be the issue.

@jasonriedy

I am 99.9% certain this is in my changes to the RedisGraph dumping code and not the latest sc-driver, @yujenjuan & @mparivash1234. The latest sc-driver recovers from this. I need to poke at more of the test suite and ensure it uses shutdown nosave or the Python equivalent.

How to reproduce given the attached:
hang-test.tar.gz

. ./condor6-setup.sh

# rsync -a /var/log/emutechnology log-0
./redis-server --loadmodule module/redisgraph.so >server-log.1 2>&1 &
sleep 30
cat <<EOF | redis-cli # This works.
graph.query test "CREATE (a:person {name:'A'}), (b:person {name:'B'})"
shutdown
EOF
wait
ls -l dump.rdb 

# rsync -a /var/log/emutechnology log-1
./redis-server --loadmodule module/redisgraph.so >server-log.2 2>&1 &
sleep 30
cat <<EOF | redis-cli # This dies.
graph.query test "CREATE (a:person {name:'A'}), (b:person {name:'B'})"
graph.query test "MATCH (src:person) CREATE (src)-[e:knows]->(dest {name:'C'}) RETURN src,e,dest ORDER BY ID(src) DESC LIMIT 1"
shutdown
EOF
wait
rm -f dump.rdb

# rsync -a /var/log/emutechnology log-2
./redis-server --loadmodule module/redisgraph.so >server-log.3 2>&1 &
sleep 30
cat <<EOF | redis-cli # This also dies.
graph.query test "CREATE (a:person {name:'A'}), (b:person {name:'B'})"
graph.query test "MATCH (src:person) CREATE (src)-[e:knows]->(dest {name:'C'}) RETURN src,e,dest ORDER BY ID(src) DESC LIMIT 1"
shutdown
EOF
wait

./redis-server --loadmodule module/redisgraph.so 2>&1 | tee server-log.4 # crashes because dump.rdb is junk.

@jasonriedy

I spoke too soon. Something is still leaving this in the "application is already running" state. Augh. I don't know how that didn't show up above...
OH! That last segfault happens after loading the redisgraph.so module, so after the mwx side has run. But then redis crashes. Its fancy ASCII-graphic crash thingy runs, but it doesn't bother to unload the modules. So I need to fetch clean logs before and after that step if needed.
So sc-driver needs to track the process using it, or we need an SC-side wrapper like multinode_exec that handles things properly when the child dies, as in the sketch below.
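
For concreteness, here is a hedged sketch of what such a wrapper could look like; the structure and the semaphore names are placeholders, not an existing sc-driver tool:

/* Hypothetical sketch of the wrapper idea: run the child, and if it
 * dies on a signal, unlink the named semaphores it would normally
 * close itself. The semaphore names are placeholders. */
#include <semaphore.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]);
        return 2;
    }

    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[1], argv + 1);  /* child: e.g. redis-server ... */
        perror("execvp");
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);

    if (WIFSIGNALED(status)) {
        /* The child crashed (e.g. SIGSEGV), so its exit handlers never
         * ran. Force the stale semaphores closed for the next run. */
        sem_unlink("/gossamer64");
        sem_unlink("/x10_device_semaphore");
        fprintf(stderr, "child died on signal %d; cleaned up\n",
                WTERMSIG(status));
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}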

@yujenjuan

Please leave the system in that hung state the next time you run into it, so I can do a quick examination and figure out what the problem is.

@jasonriedy

Ok, node 7 is in that state.

@yujenjuan

I'd like to see it when it gets into that state the first time, not recreated afterwards. Basically, there is a mutex that isn't given up when something crashes.

@jasonriedy

I rebooted before recreating it. All I need to recreate it is the dump.rdb in /root/jriedy/hang-test.

@jasonriedy

A simplified way to re-create this issue: from the directory in the attached file, run ./redis-server --loadmodule module/redisgraph.so. That will try to load the broken dump file, segfault, and leave the SC/GC interface unable to run another program. It probably needs the LD_LIBRARY_PATH addition.
hang-test-simplified.tar.gz

@yujenjuan

When redis-server crashes with a segfault, nothing is sent to sc-gc-api to tell it to shut things down. As a result, two named semaphores are left open in /dev/shm. I don't see a way for sc-gc-api to intercept such segfault signals, so I think you may have to clean up manually by doing rm -f /dev/shm/*
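
For context on why those files linger: a named POSIX semaphore lives in /dev/shm and its count survives the owning process. A minimal sketch of the failure mode, assuming illustrative names rather than sc-driver's real ones:

/* Minimal sketch of the failure mode. The name and the simulated
 * crash are illustrative, not sc-driver's actual code.
 * Build: cc wedge.c -o wedge -lpthread */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* sem_open(O_CREAT) backs the semaphore with a file named
     * /dev/shm/sem.demo_sem on Linux. */
    sem_t *s = sem_open("/demo_sem", O_CREAT, 0644, 1);
    if (s == SEM_FAILED) { perror("sem_open"); return 1; }

    sem_wait(s);   /* take it: the count drops from 1 to 0 */

    /* Simulate the segfault: the process dies holding the semaphore.
     * The kernel does NOT restore the count, so the next process's
     * sem_wait("/demo_sem") blocks forever until someone runs
     * sem_unlink("/demo_sem") or rm -f /dev/shm/sem.demo_sem. */
    abort();
}

On Linux the backing file for "/demo_sem" is /dev/shm/sem.demo_sem, which is why deleting the files under /dev/shm clears the wedge.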

@jasonriedy

It looks like the two semaphores in question could be replaced by files with locking. Those would clean themselves up on a crash.

@yujenjuan

That's a good bit of low-level surgery to do on sc-driver. I don't want to revisit this at this point, given other priorities.

@jasonriedy

If I find myself waiting on other things, I may poke at it. Assuming the semaphores are only used in sc-driver, there aren't many calls to sem_*: sem_wait becomes flock(fd, LOCK_EX), sem_post becomes flock(fd, LOCK_UN), and sem_trywait ORs in LOCK_NB and checks errno.
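
A minimal sketch of that mapping, assuming a single global lock file (the path and helper names are hypothetical, not sc-driver's API):

/* Minimal sketch of the sem_* to flock(2) mapping, assuming a single
 * global lock file. Path and helper names are hypothetical. */
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>

static int lock_fd = -1;  /* e.g. opened on /var/lock/gossamer64.lock */

int lock_init(const char *path) {
    lock_fd = open(path, O_RDWR | O_CREAT, 0644);
    return lock_fd < 0 ? -1 : 0;
}

/* sem_wait equivalent: block until we hold the exclusive lock. */
int lock_acquire(void) { return flock(lock_fd, LOCK_EX); }

/* sem_post equivalent: release it. */
int lock_release(void) { return flock(lock_fd, LOCK_UN); }

/* sem_trywait equivalent: OR in LOCK_NB and inspect errno.
 * Returns 0 if acquired, 1 if busy, -1 on error. */
int lock_tryacquire(void) {
    if (flock(lock_fd, LOCK_EX | LOCK_NB) == 0) return 0;
    return errno == EWOULDBLOCK ? 1 : -1;
}

The crash robustness comes from the kernel releasing an flock(2) lock when the last descriptor on the file is closed, which happens on any process death, segfault included.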

@yujenjuan

yujenjuan commented Nov 16, 2021

These two are not the semaphores used in sc-gc-api. They are used at the lowest level of sc-driver to govern access to hardware.

@jasonriedy

So what does deleting the files in /dev/shm do? Those are only used in sc-driver. The x10_device one is low-level, I'm sure, but the gossamer64 one is what controls kernel launching.

@yujenjuan

These two are deliberately not deleted by sc-driver programs at exit, since I have no way of knowing whether another process that has them open is still running (for example, running emu_diagnostic_tool to peek & poke registers and memories while emu_handler_and_loader is running). When everything behaves properly, the semaphores are closed on exit.

What happens with redisgraph.so segfaulting is that the gossamer semaphore is opened but never closed (since the process never got to run its exit function). When you delete the semaphores from the command line, you force them closed. Since there are no other processes using them, they get closed and deleted, and they are then recreated and opened on the next invocation of an sc-driver program.

The x10_device semaphore governs individual register & memory accesses, while the gossamer64 semaphore governs one level above it with the gossamer interface. Both are used by sc-driver at a low level and they are in many ways redundant, but since we are moving to pfi on x86 and a good amount of the low-level code will be completely different, I figure I'll think about whether we can make do with just one.
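
A small sketch of the close-versus-unlink distinction being described (the name is illustrative, not necessarily sc-driver's real one):

/* Sketch of the close-versus-unlink distinction. The name is
 * illustrative, not necessarily sc-driver's real one. */
#include <fcntl.h>
#include <semaphore.h>

void clean_exit_path(void) {
    sem_t *s = sem_open("/gossamer64", O_CREAT, 0644, 1);
    if (s == SEM_FAILED) return;

    /* sem_close() drops only this process's handle. The name (and
     * /dev/shm/sem.gossamer64) persists, so a concurrent tool such
     * as emu_diagnostic_tool holding the same name is unaffected. */
    sem_close(s);

    /* sem_unlink() removes the name itself. Existing holders keep a
     * working semaphore until they close it, like unlinking an open
     * file; rm -f /dev/shm/sem.* has the same effect from the shell. */
    /* sem_unlink("/gossamer64"); */
}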

@yujenjuan

After examining the code further, I think the gossamer semaphore can be replaced with an advisory file lock easily, so I am going to make the change.

@yujenjuan yujenjuan added the wontfix This will not be worked on label Nov 16, 2021
@yujenjuan

Well, changing from a semaphore to a lock doesn't make it any more robust. When redis-server segfaults, there is always the possibility that sc-driver is in the middle of accessing a register on hardware, and thus the x10 semaphore may not be given up. Since I've already done the work to convert the gossamer semaphore to a file lock, I am not reverting it.

sc-driver has been changed to give you a clearer indication that an interface wasn't shut down cleanly.

You'll have to manually delete the lock file at /var/lock/gossamer64.lock and the semaphore /dev/shm/sem.x10_device_semaphore to clean up. If you don't, the next run could hang (but you can Ctrl-C to get out).

The bottom line is that the calling program has to be bug-free. It should not be sc-driver's or sc-gc-api's job to clean up when the caller crashes.

@jasonriedy jasonriedy removed their assignment Aug 19, 2024