Parallel compile_sql requests to dbt RPC server cause tasks to never exit the running state. #2484

Closed

jars opened this issue May 22, 2020 · 13 comments

Labels: bug (Something isn't working), rpc (Issues related to dbt's RPC server)

@jars

jars commented May 22, 2020

Describe the bug

Parallel compile_sql requests (at even moderate volumes) to the dbt RPC server cause tasks to never exit the running state. We also notice several child dbt rpc processes getting created, which may or may not be symptomatic of the perpetual running state.

Steps To Reproduce

Use the following Dockerfile, which extends the official dbt image to run as the root user, installs the ps Linux utility (procps), and bootstraps a project with dbt init:

FROM fishtownanalytics/dbt:0.16.1
USER root
# We need ps to see the spawned dbt rpc processes.
RUN apt-get update && apt-get install -y procps
WORKDIR /
RUN dbt init my_project
WORKDIR /my_project

On the host, cd into the directory that contains this Dockerfile and run:

# build the image:
docker build --no-cache -t dbt_issue_2484 .
# run the dbt rpc server, exposing port 8580 to the host
docker run --publish 8580:8580 -it dbt_issue_2484 dbt rpc

With the rpc server running, in another terminal, find the running Docker container and exec into it with bash so we can watch the process list with ps:

[sudo] docker ps
<get container id>
[sudo] docker exec -it <container id> bash 

Now, inside the container, "watch" the dbt rpc processes, of which there will be only one, for now:

watch 'ps aux | grep rpc'

Now, back on the host machine, let's hit the RPC server w/ multiple, serial compile_sql requests to show it works.

Create a file called my_first_query.json with the following contents (straight out of the official docs). The sql value is base64-encoded; it decodes to select {{ 1 + 1 }} as id. Note the TASK_ID placeholder, which we will fill in with a random number for each request, using sed, in a moment.

{
    "jsonrpc": "2.0",
    "method": "compile_sql",
    "id": "TASK_ID",
    "params": {
        "timeout": 60,
        "sql": "c2VsZWN0IHt7IDEgKyAxIH19IGFzIGlk",
        "name": "my_first_query"
    }
}

Then create a poll.json:

{
    "jsonrpc": "2.0",
    "method": "poll",
    "id": "TASK_ID",
    "params": {
        "request_token": "TOKEN",
        "logs": false,
        "logs_start": 0
    }
}

And two shell scripts, one to compile_sql and the other to poll for results:

compile_sql.sh:

#!/bin/bash

id=$RANDOM
# get the contents of my_first_query and change TASK_ID to a random int.
query_json=$(cat my_first_query.json | sed --expression "s/TASK_ID/$id/g")
# we only really need the token back.
curl -s -X POST -H 'Content-Type: application/json' -d "$query_json" http://localhost:8580/jsonrpc | jq -r .result.request_token

poll.sh (which takes a token as its only arg):

#!/bin/bash

token=$1
id=$RANDOM

poll_json=$(cat poll.json | sed --expression "s/TOKEN/$token/g" | sed --expression "s/TASK_ID/$id/g")
# truncates response to just state and elapsed time.
curl -s -X POST -H 'Content-Type: application/json' -d "$poll_json" http://localhost:8580/jsonrpc | jq '.result | .state, .elapsed'

Now, chmod +x both scripts and try them:

> ./compile_sql.sh
1c5368d0-d610-4efd-99e7-dda256d3ede0
> ./poll.sh 1c5368d0-d610-4efd-99e7-dda256d3ede0
"success"
0.113002

Great! Also note that the process list still contains a single dbt rpc.

Now, let's go parallel. Make a parallel.sh script w/ the following contents. This will invoke the compile_sql.sh script 20 times with a maximum of 8 concurrent processes.

#!/bin/bash

xargs -I % -P 8 ./compile_sql.sh \
< <(printf '%s\n' {1..20})

Then run it:

./parallel.sh
a28e2ceb-a6c5-41d7-bbc6-b03eed263f1e
da6339a3-37ed-4e66-9fd4-d51dd247812a
450fd50d-5a4f-4649-b7d7-9ac2dda89f67
1cc385d0-7a27-4193-bf11-126b3f9b0490
93039ba2-d2fb-4b75-a84a-d7e1bd27b9ab
02fa72e4-c75e-42fe-ac60-36d184399031
920c0796-eac3-4106-b1ed-e62f2d29e498
e45bed44-3908-45b9-9a03-bc55bb19fcdb
788bb3f9-5e6c-4001-8990-cd74450117bf
6e41947d-0b23-4f01-9edb-8b823e86644a
b699d934-07ba-4378-9b4e-61d6805eb629
e37b1a70-a63b-4142-a8eb-434e6bb8d947
e7a9101c-5625-48d3-a313-4438bd77bee0
5b1d98de-7d71-40c7-b0ff-4ce9296db977
...
...

Now, notice your ps output. It will likely contain child dbt rpc processes. This could be expected behaviour, we're not sure; but we did find that once this happens, that's when the polls go into a perpetual running state. If you do not have multiple dbt rpc processes, run this again until you do.

Finally, now for the star of the show... try polling again. Try some of the tokens printed to your CLI:

./poll.sh 5b1d98de-7d71-40c7-b0ff-4ce9296db977
"running"
89.967127

Note: this is my_first_query, i.e. a simple select {{ 1 + 1 }} as id. It has been running for 89 seconds now, and will never terminate and provide results.

In addition, the dbt rpc server is now toast. 🍞 Any subsequent compile_sql or poll requests will never return results.


Sorry, that is a bit long-winded. It's a specific workflow that needs to be followed to produce this behaviour.

Expected behavior

At any volume, dbt rpc should eventually complete compile_sql requests.

System information

Which database are you using dbt with?

  • redshift

Technically Redshift, but only because dbt init sets up a Redshift profiles.yml by default. We are using the compile_sql dbt RPC function only, so there are no real dbt runs happening on any warehouse.

The output of dbt --version:

docker run dbt_issue_2484 dbt --version
installed version: 0.16.1
   latest version: 0.16.1

Up to date!

The operating system you're using:

The one that underlies the dbt-blessed Docker image:

docker run dbt_issue_2484 lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10
Codename:	buster

The output of python --version:

docker run dbt_issue_2484 python --version
Python 3.8.1

Additional context

This is actually affecting a (small, but important part of a) production application for us. We're happy to hot-patch files rather than wait for the next release, if possible. So, if you have any easy fixes, please let us know.

Possibly related to #1848

Happy Friday 🎉

@jars jars added bug Something isn't working triage labels May 22, 2020
@beckjake
Contributor

Thank you for this very detailed writeup and excellent repro instructions! I'm able to reproduce this, though I have no idea what the cause could be.

@jars
Author

jars commented May 22, 2020

Happy to help solve the mystery with you @beckjake, but I'm not a python native. So, the best I can probably do is verify any patches you release. Looking forward to a fix. 🙏

@jars
Author

jars commented May 22, 2020

@beckjake, curious: in your experience, do you think it's possible for us to hook into the compile_sql compilation process by importing some of the compilation classes/modules into a custom python script and invoking it like...

> python3 compile_sql.py "select {{ 1 + 1 }} as id"
select 2 as id

This could be a stop-gap for us, depending on how complex the rpc fix is.

@beckjake
Contributor

If your queries are all simple and don't use ref or anything like that, you can hook into dbt.clients.jinja.get_rendered. If they aren't simple and you care about models/refs/sources, it's going to be a lot more complicated.
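For simple queries, a minimal sketch of that stop-gap might look like this (assuming dbt 0.16.x, where dbt.clients.jinja.get_rendered(string, ctx) is importable; the compile_sql.py wrapper and the empty rendering context are illustrative choices, not dbt's own tooling):

#!/usr/bin/env python3
# compile_sql.py -- minimal sketch against dbt 0.16.x internals.
# Renders plain Jinja SQL with an empty context; anything that needs
# a manifest (ref, source, var, project macros) will NOT resolve here.
import sys

from dbt.clients.jinja import get_rendered

if __name__ == "__main__":
    sql = sys.argv[1]                 # e.g. "select {{ 1 + 1 }} as id"
    print(get_rendered(sql, ctx={}))  # -> "select 2 as id"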

@drewbanin drewbanin removed the triage label May 26, 2020
@beckjake
Contributor

I've had very little time to look at this, but: this works fine on macos (once you raise the ulimit high enough to not get errors about too many open file descriptors). A key difference between running on macos and linux is that on macos we use the spawn protocol, which behaves fundamentally differently from fork.

Some things to explore as part of tracking this down:

  • see how spawn on linux fares. Obviously performance will be much worse because we're copying the manifest around, but maybe something bad is happening with locks/forking?
  • alternative image bases. Maybe something is up with debian? Similarly, try using bare-metal linux, maybe this is a docker-level issue.
  • tweak gc settings and see if that fixes things. if we never gc, does this problem never happen? Then all we have to do is track down what's getting lost in gc-land. That would be nice!
  • add lots more logging around this stuff so we can trace what's happening to it.

@jars
Author

jars commented May 26, 2020

This gives us a great starting point @beckjake. thx! 🙇‍♂️

@jars
Author

jars commented May 27, 2020

We modified flags.py to force spawn, and although slower, the rpc server does not create additional copies of itself, and all compile_sql calls do eventually finish w/ pollable results. Initial testing looks good!
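For anyone following along, the gist of the change (dbt's actual flags.py is structured differently; this is just a generic sketch of the multiprocessing calls involved):

# generic sketch only -- dbt's real flags.py differs.
# "spawn" starts a fresh interpreter per worker instead of fork()ing
# a copy of the current (possibly multithreaded) process.
import multiprocessing

def work():
    print("hello from a spawned worker")

if __name__ == "__main__":
    # set the process-wide default early, before any threads exist...
    multiprocessing.set_start_method("spawn", force=True)

    # ...or hand a spawn-based context to whatever creates workers.
    ctx = multiprocessing.get_context("spawn")
    p = ctx.Process(target=work)
    p.start()
    p.join()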

@beckjake
Contributor

That is great news, thank you for reporting back!

@drewbanin drewbanin modified the milestones: Marian Anderson, 0.17.1 Jun 8, 2020
@drewbanin drewbanin added the rpc Issues related to dbt's RPC server label Jun 8, 2020
@jars
Author

jars commented Jun 10, 2020

@beckjake & @drewbanin, excited to see this is tagged w/ the 0.17.1 milestone!

@hochoy and I came across another problem with dbt RPC today, even with the spawn/fork patch applied. We suspect it's related to the underlying issues described above.

In this instance the dbt RPC server will return compiled SQL for some other request that is either happening concurrently, or has happened in the past.


First, you must force spawn in the container to prevent #2484.

Then, re-using most of the setup above, make some changes to the parallel.sh script, allowing it to pass the integers 1-10 into a new script, compile_and_poll.sh:

parallel.sh:

#!/bin/bash
xargs -I % -P 8 ./compile_and_poll.sh % < <(printf '%s\n' {1..10})

I wanted to prove to myself it was an issue outside of JS, Python, etc., so, as before, I went for simple curl commands. It makes the shell script a bit messy, but it still illustrates the issue. This script takes an integer as input and carries it through the compile and poll process:

compile_and_poll.sh:

#!/bin/bash

id=$1

compile_task_id="${id}_compile"
compile_input_sql="select ${id}"
# -n so a trailing newline isn't base64-encoded into the payload
compile_b64=$(echo -n "$compile_input_sql" | base64)
compile_query_json=$(cat my_first_query.json | sed --expression "s/TASK_ID/$compile_task_id/g" | sed --expression "s/B64_STR/$compile_b64/g")
compile_token=$(curl -s -X POST -H 'Content-Type: application/json' -d "$compile_query_json" http://localhost:8580/jsonrpc | jq -r .result.request_token)

sleep 5

# now poll; everything will have finished...
poll_task_id="${id}_poll"
poll_json=$(cat poll.json | sed --expression "s/TOKEN/$compile_token/g" | sed --expression "s/TASK_ID/$poll_task_id/g")
poll_response=$(curl -s -X POST -H 'Content-Type: application/json' -d "$poll_json" http://localhost:8580/jsonrpc | jq -r .)
poll_output_sql=$(echo "$poll_response" | jq -r '.result.results[0].compiled_sql')

echo "input: ${compile_input_sql} output: ${poll_output_sql}"

# if this conditional executes, ruh roh.
if [[ "$compile_input_sql" != "$poll_output_sql" ]]
then
  echo "$poll_response" > "${compile_task_id}.json"
fi

You'll also need to modify your my_first_query.json to allow base64 encoded dynamic SQL:

my_first_query.json:

{
    "jsonrpc": "2.0",
    "method": "compile_sql",
    "id": "TASK_ID",
    "params": {
        "timeout": 60,
        "sql": "B64_STR",
        "name": "TASK_ID"
    }
}

Now, execute parallel.sh again:

> ./parallel.sh
input: select 2 output: select 2
input: select 4 output: select 4
input: select 3 output: select 3
input: select 6 output: select 4 <--
input: select 5 output: select 8 <--
input: select 7 output: select 4 <--
input: select 8 output: select 8
input: select 1 output: select 8 <--
input: select 9 output: select 9
input: select 10 output: select 10

Note lines 4, 5, 6, and 8 of the output (marked with arrows): these are instances where the input select statement did not match the output (compiled_sql) select statement.

The script above also writes the problematic polling payloads to disk. You should be able to see their id and result.results[0].compiled_sql fields do not match.


We first observed this using JavaScript.

Thanks again!

@drewbanin drewbanin modified the milestones: Marian Anderson, 0.17.1 Jun 10, 2020
@beckjake
Contributor

beckjake commented Jun 16, 2020

I've spent some time on this and I figured I'd get it up here for you all to chew on! There are really two issues here:

One, there is definitely a deadlock going on here on process fork. This has to do with a number of apparently well-known issues in Python where forking a new process in a multithreaded environment is bad. I thought this was only true on macos, but it's actually a problem on all OSes that support fork() - macos is just where it's most obvious. Basically, if python does a fork()-without-exec() in one thread while another is holding an important internal lock, the forked process will copy the memory, but not the threads. Crucially, the lock will still be held (because that's a process-level memory item) but the thread that would unlock it in the parent doesn't exist in the child. There are assorted places this happens in Python, and recent releases of Python 3.x appear to have been a long game of whack-a-mole on the relevant bugs, culminating in a documentation note that mixing threads and fork() is, basically, a terrible idea. Noted! This is a fundamental design flaw in the RPC server.
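A contrived, standalone illustration of the hazard (not dbt code; the sleeps just force the bad interleaving):

# demo: fork() while another thread holds a lock -> the child deadlocks
import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    with lock:
        time.sleep(2)  # hold the lock while the main thread forks

threading.Thread(target=hold_lock).start()
time.sleep(0.5)  # make sure the worker thread has acquired the lock

pid = os.fork()
if pid == 0:
    # Child: memory was copied with the lock held, but the thread that
    # would release it was not copied -- a plain acquire() blocks forever.
    acquired = lock.acquire(timeout=5)  # timeout only so the demo terminates
    print(f"child acquired lock: {acquired}")  # -> False
    os._exit(0)
os.waitpid(pid, 0)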

We're going to talk in the coming days about how exactly to mitigate this/come up with a timeline. As fork()'s copy-on-write semantics were really desirable (it's much faster for large manifests!), I'm thinking of a custom fork-server-like model. The idea is we'd fork a process early to fork new tasks for requests before spinning up the webserver (and therefore threads!). We'd make that early-forked process control the manifest so when it called fork() its children would receive it.
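For reference, Python's stdlib has a built-in version of this idea: the forkserver start method forks a clean, single-threaded server process early, and all workers are forked from it rather than from the threaded parent:

import multiprocessing

# children are forked from the single-threaded fork server,
# not from the (threaded) web server process itself
ctx = multiprocessing.get_context("forkserver")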

Two, the results mismatch the inputs even when using spawn (and forkserver). I've only started tracking this down, but here's what I do know: internally, the server wraps every request/task/result in a RequestTaskHandler object that knows the arguments, and I've determined that it definitely does have the correct arguments for each task! However, the results for each task are unique but reflect the "wrong" argument it got, so the issue must happen somewhere between handling the http request and performing the task dispatch.

Update: And it does! The issue is that there's a race where set_args is writing to the same object for each task, because it hasn't yet kicked off the process. I'll open a PR to fix this for 0.17.1, at least.
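A schematic of that race (illustrative only, not dbt's actual code): if every request mutates one shared args object before its task kicks off, a later request can overwrite an earlier one's arguments, producing exactly the input/output mismatches in the repro above.

# ten "requests" share one args dict; the sleep models the gap
# before the task process actually starts
import threading
import time

shared_args = {}
results = []

def handle_request(sql):
    shared_args["sql"] = sql  # racy: every request writes the same object
    time.sleep(0.01)          # window before the "task" reads it back
    results.append((sql, shared_args["sql"]))

threads = [threading.Thread(target=handle_request, args=(f"select {i}",))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# any pair where input != output is the bug
print([pair for pair in results if pair[0] != pair[1]])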

@drewbanin
Contributor

closing via #2554, but we may need to revisit this issue if the implemented fix results in intractable performance characteristics

@jars
Author

jars commented Jun 21, 2020

thx @beckjake & @drewbanin !!

@hochoy
Contributor

hochoy commented Jun 21, 2020

^ Thanks everyone, amazing request and response!
