Parallel compile_sql requests to dbt RPC server cause tasks to never exit the running state. #2484
Comments
Thank you for this very detailed writeup and excellent repro instructions! I'm able to reproduce this, though I have no idea what the cause could be. |
Happy to help solve the mystery with you @beckjake, but I'm not a python native. So, the best I can probably do is verify any patches you release. Looking forward to a fix. 🙏 |
@beckjake, curious, in your experience do you think it's possible for us to hook-into the
> python3 compile_sql.py "select {{ 1 + 1 }} as id"
select 2 as id
This could be a stop-gap for us, depending on how complex the rpc fix is. |
If your queries are all simple and don't use |
I've had very little time to look at this, but: this works fine on macos (once you raise the ulimit high enough to not get errors about too many open file descriptors). A key feature difference between running on macos and linux is that on macos we use the Some things to explore as part of tracking this down:
|
This gives us a great starting point @beckjake. thx! 🙇♂️ |
We modified |
That is great news, thank you for reporting back! |
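For readers following along, the "force spawn" workaround mentioned in this thread can be illustrated with the standard library's multiprocessing contexts. This is only a minimal sketch of the general technique, not the actual patch applied to dbt; `make_process` is a hypothetical helper name.

```python
import multiprocessing as mp


def make_process(target, args=(), start_method="spawn"):
    """Build (but do not start) a child process with an explicit start method.

    "spawn" launches a fresh interpreter instead of fork()-ing the parent,
    so the child cannot inherit a lock that some other thread was holding.
    make_process is a hypothetical helper for illustration only.
    """
    ctx = mp.get_context(start_method)
    return ctx.Process(target=target, args=args)
```

When actually starting such a process, call `.start()` and `.join()` from under an `if __name__ == "__main__":` guard, because spawn re-imports the main module in the child. The trade-off is that spawn gives up fork()'s copy-on-write sharing of already-loaded state.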
@beckjake & @drewbanin, excited to see this is tagged w/ the 0.17.1 milestone! @hochoy and I came across another problem with dbt RPC today, even with the spawn/fork patch applied. We suspect it's related to the underlying issues described above. In this instance the dbt RPC server will return compiled SQL for some other request that is either happening concurrently, or has happened in the past. First, you must force spawn in the container to prevent #2484. Then, re-using most of the setup above, make some changes to the
#!/bin/bash
xargs -I % -P 8 ./compile_and_poll.sh % < <(printf '%s\n' {1..10})
I wanted to prove to myself it was an issue outside of JS, Python etc, so as before, went for simple
#!/bin/bash
id=$1
compile_task_id="${id}_compile"
compile_input_sql="select ${id}"
compile_b64=$(echo $compile_input_sql | base64)
compile_query_json=$(cat my_first_query.json | sed --expression "s/TASK_ID/$compile_task_id/g" | sed --expression "s/B64_STR/$compile_b64/g")
compile_token=$(curl -s -X POST -H 'Content-Type: application/json' -d "$compile_query_json" http://localhost:8580/jsonrpc | jq -r .result.request_token)
sleep 5
# now poll, everything will have finished...
poll_task_id="${id}_poll"
poll_json=$(cat poll.json | sed --expression "s/TOKEN/$compile_token/g" | sed --expression "s/TASK_ID/$poll_task_id/g")
poll_response=$(curl -s -X POST -H 'Content-Type: application/json' -d "$poll_json" http://localhost:8580/jsonrpc | jq -r .)
poll_output_sql=$(echo $poll_response | jq -r .result.results[0].compiled_sql)
echo "input: ${compile_input_sql} output: ${poll_output_sql}"
# if this conditional executes, ruh roh.
if [[ "$compile_input_sql" != "$poll_output_sql" ]]
then
echo $poll_response > "${compile_task_id}.json"
fi
You'll also need to modify your
{
"jsonrpc": "2.0",
"method": "compile_sql",
"id": "TASK_ID",
"params": {
"timeout": 60,
"sql": "B64_STR",
"name": "TASK_ID"
}
}
Now, execute
> ./parallel.sh
input: select 2 output: select 2
input: select 4 output: select 4
input: select 3 output: select 3
input: select 6 output: select 4 <--
input: select 5 output: select 8 <--
input: select 7 output: select 4 <--
input: select 8 output: select 8
input: select 1 output: select 8 <--
input: select 9 output: select 9
input: select 10 output: select 10
Note lines 4, 5, 6, and 8 in the output. These are instances when the input select statement did not match the output ( The script above also writes the problematic polling payloads to disk. You should be able to see their We first observed this using JavaScript. Thanks again! |
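The request template and the input/output comparison in the comment above can also be sketched in Python. This is an illustrative restatement, not part of the original repro: `build_compile_request` and `mismatched_lines` are hypothetical helper names, and unlike `echo ... | base64` the encoding here does not append a trailing newline to the SQL.

```python
import base64


def build_compile_request(task_id, sql):
    """Fill the compile_sql template; the SQL travels base64-encoded."""
    encoded = base64.b64encode(sql.encode("utf-8")).decode("ascii")
    return {
        "jsonrpc": "2.0",
        "method": "compile_sql",
        "id": task_id,
        "params": {"timeout": 60, "sql": encoded, "name": task_id},
    }


def mismatched_lines(pairs):
    """Return 1-based positions where the compiled output differs from the input."""
    return [i for i, (sent, got) in enumerate(pairs, start=1) if sent != got]


# (input, output) values copied from the run shown in the comment above.
observed = [(2, 2), (4, 4), (3, 3), (6, 4), (5, 8),
            (7, 4), (8, 8), (1, 8), (9, 9), (10, 10)]
print(mismatched_lines(observed))  # [4, 5, 6, 8]
```

The positions printed match the lines flagged with `<--` in the output above.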
I've spent some time on this and I figured I'd get it up here for you all to chew on! There are really two issues here:
One, there is definitely a deadlock going on here on process fork. This has to do with a number of apparently well-known issues in Python where forking a new process in a multithreaded environment is bad. I thought this was only true on macos, but it's actually a problem on all OSes that support fork() - macos is just more obvious. Basically, if python does a fork()-without-exec() in one thread while another is holding an important internal lock, the forked process will copy the memory, but not the thread. Crucially, the lock will still be held (because that's a process-level memory item) but the thread that will unlock it in the parent doesn't exist in the child. There are assorted places this happens in Python, and recent releases of Python 3.x appear to have been a long game of whack-a-mole on the relevant bugs, culminating in a documentation note that threads forking processes is basically a terrible idea. Noted!
This is a fundamental design flaw in the RPC server. We're going to talk in the coming days about how exactly to mitigate this/come up with a timeline. As fork()'s copy-on-write semantics were really desirable (it's much faster for large manifests!), I'm thinking of a custom fork-server-like model. The idea is we'd fork a process early, before spinning up the webserver (and therefore threads!), and have it fork new tasks for requests. We'd make that early-forked process control the manifest so when it called fork() its children would receive it.
Two, the results are mismatching the inputs even when using Update: And it does! The issue is that there's a race where |
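The lock-inheritance problem described in the first point can be reproduced in isolation. The following is a minimal sketch, assuming a POSIX system where `os.fork()` is available; it uses an ordinary `threading.Lock` as a stand-in for the internal locks mentioned above, and `child_can_acquire_after_fork` is a hypothetical name. The child uses a timeout so the demonstration reports the problem instead of hanging.

```python
import os
import threading


def child_can_acquire_after_fork(hold_lock_in_thread):
    """Fork once and report whether the child could take the lock."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def holder():
        # Stand-in for "another thread holding an important lock".
        with lock:
            holding.set()
            release.wait()

    thread = None
    if hold_lock_in_thread:
        thread = threading.Thread(target=holder)
        thread.start()
        holding.wait()  # make sure the lock is held before forking

    pid = os.fork()
    if pid == 0:
        # Child: memory was copied, but only the forking thread exists here,
        # so nothing in this process will ever release the copied lock.
        acquired = lock.acquire(timeout=0.5)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    if thread is not None:
        thread.join()
    return os.WEXITSTATUS(status) == 0
```

Calling `child_can_acquire_after_fork(False)` returns `True`, while `child_can_acquire_after_fork(True)` returns `False`: without the timeout, that second child would block forever, which is the shape of the hang described in this issue. Recent Python versions may also emit a DeprecationWarning when forking from a multithreaded process.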
closing via #2554, but we may need to revisit this issue if the implemented fix results in intractable performance characteristics |
thx @beckjake & @drewbanin !! |
^ Thanks everyone, amazing request and response! |
Describe the bug
Parallel compile_sql requests (at just moderate volumes) to dbt RPC server cause tasks to never exit the running state. We also notice several child dbt rpc processes getting created, which may or may not be symptomatic of the perpetual running state.
Steps To Reproduce
Use the following Dockerfile, which extends the official dbt image to use the root user, install the ps linux utility, and bootstrap a project w/ dbt init:
On the host, cd into the directory that contains this Dockerfile and run:
With the rpc server running, in another terminal, find the running Docker container, and bash-in so we can watch the process list with ps:
Now, inside the container, "watch" the instance of dbt rpc, of which there will be only one for now:
watch 'ps aux | grep rpc'
Now, back on the host machine, let's hit the RPC server w/ multiple, serial compile_sql requests to show it works.
Create a file called my_first_query.json w/ contents (straight out of the official docs). Note the TASK_ID, which we will fill in w/ a random number for each request, using sed, in a moment.
Then create a poll.json:
And two shell scripts, one to compile_sql and the other to poll for results:
compile_sql.sh:
poll.sh (which takes a token as its only arg):
Now, chmod 'em and try them:
Great! Also note that the process list still contains a single dbt rpc.
Now, let's go parallel. Make a parallel.sh script w/ the following contents. This will invoke the compile_sql.sh script 20 times with max processes of 8.
Then run it:
Now, notice your ps output. It will likely contain child dbt rpc processes. This could be expected behaviour, we're not sure; but we did find that once this happens, that's when the polls go into a perpetual running state. If you do not have multiple dbt rpc processes, run this again until you do.
Finally, now for the star of the show... try polling again. Try some of the tokens printed to your CLI:
./poll.sh 5b1d98de-7d71-40c7-b0ff-4ce9296db977
"running" 89.967127
Note, this is my_first_query.sql, i.e., a simple select {{ 1 + 1 }} as id. It has been running for 89 seconds now, and will never terminate and provide results.
In addition, the dbt rpc server is now toast. 🍞 Any subsequent compile_sql or poll requests will never return results.
Sorry, that is a bit long-winded. It's a specific workflow that needs to be followed to produce this behaviour.
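The parallel step and the "poll until it stops running" check from the steps above can be sketched in Python as well. This is a hedged illustration, not the original shell scripts: the endpoint is the one used by the curl commands in the comments on this issue, the function names are hypothetical, and `poll` is any callable you supply that returns the task state as a string (the state name "running" is taken from the output above; other state names are assumptions).

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

RPC_URL = "http://localhost:8580/jsonrpc"  # endpoint from the scripts in this thread


def post_json(payload, url=RPC_URL):
    """POST one JSON-RPC payload and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def run_parallel(payloads, send=post_json, max_workers=8):
    """Send payloads concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, payloads))


def wait_until_done(poll, token, deadline_s=60.0, interval_s=1.0):
    """Call poll(token) until the state is no longer "running".

    Raises TimeoutError once the deadline passes, which is how the
    never-finishing task described above would surface to a caller.
    """
    start = time.monotonic()
    while True:
        state = poll(token)
        if state != "running":
            return state
        if time.monotonic() - start >= deadline_s:
            raise TimeoutError(f"task {token} still running after {deadline_s}s")
        time.sleep(interval_s)
```

Passing `send` and `poll` as parameters keeps the sketch usable without a live server; against a real dbt RPC server, `run_parallel` with 20 payloads and `max_workers=8` mirrors the shape of the parallel.sh run.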
Expected behavior
At any volume, dbt rpc should eventually complete compile_sql requests.
System information
Which database are you using dbt with?
Technically redshift, but only because dbt init sets up a Redshift profiles.yml by default. We are running the compile_sql dbt RPC function only, so there are no real dbt runs happening on any warehouse.
The output of dbt --version:
docker run dbt_issue_2484 dbt --version
installed version: 0.16.1
latest version: 0.16.1
Up to date!
The operating system you're using:
The one that underlies dbt blessed Docker:
The output of python --version:
Additional context
This is actually affecting a (small, but important part of a) production application for us. We're happy to hot-patch files rather than wait until the next release, if possible. So, if you have any easy fixes, please let us know.
Possibly related to #1848
Happy Friday 🎉