Local Docker ETL with local inputs/outputs #1606

Closed
4 tasks done
Tracked by #1177
zaneselvans opened this issue Apr 29, 2022 · 13 comments
Labels: cloud (Stuff that has to do with adapting PUDL to work in cloud computing context.)

Comments

@zaneselvans
Member

zaneselvans commented Apr 29, 2022

Given a Docker container with our CI environment (#1605):

  • Add local volumes to the container to point at PUDL_IN and PUDL_OUT. Maybe with docker compose?
  • Get the container to run the equivalent of tox -e ci while reading & writing data on the local volume.
  • Capture the logs and other outputs from the ETL for later review.
  • Run equivalent of tox -e nuke: all CI, full ETL, and data validation, reading & writing data on the local volume.
@zaneselvans zaneselvans self-assigned this Apr 29, 2022
@zaneselvans zaneselvans added the cloud Stuff that has to do with adapting PUDL to work in cloud computing context. label Apr 29, 2022
@zaneselvans zaneselvans mentioned this issue Apr 29, 2022
9 tasks
@zaneselvans
Member Author

Should we actually use Tox (and yet another layer of virtual Python environments), or should we have a separate script (that would duplicate a lot of what Tox is doing and thus be at risk of getting out of sync)?

@zaneselvans
Member Author

I've got it running and able to do the equivalent of tox -e ci inside the container, but it doesn't produce any logging output until the entire process has finished, which isn't going to be helpful for longer running processes (and remote processes). Need to study up on Docker logging output tomorrow.

@zaneselvans zaneselvans linked a pull request May 3, 2022 that will close this issue
@zaneselvans
Member Author

zaneselvans commented May 3, 2022

Here's a good overview of Docker logging, which outlines several available logging strategies. It seems like using a Docker logging driver is probably the right option for us, and there is a dedicated Google Cloud Logging driver.

In theory the stdout and stderr from the container are sent to the logs, but maybe this only works while the container is running? I don't seem to be able to get anything out of the logs locally by doing e.g.

docker logs pudl_etl

either when the tests are running or after they've completed.

I wonder if this might be affected by the fact that tox/pytest are sitting between the process and the logging? Maybe I should try running pudl_etl directly (and also have it generate some real outputs).

@zaneselvans
Member Author

To get some direct logging output (not going through tox/pytest), and also to test whether the ETL scripts can write to the mounted PUDL_OUT directory, I decided to run the ETL scripts rather than the tests in the docker-compose.yml command.

For some reason the scripts run fine when they're writing to a directory inside the container, but when an external directory is mounted they have trouble writing there. SQLAlchemy complains that it can't open the ferc1.sqlite database.

However, log files do get written into the PUDL_OUT directory, and the scripts can successfully create the epacems and sqlite output directories on the host filesystem. So it seems like:

  • The conda environment and .pudl.yml paths appear to be functioning as intended (the software runs, and it knows where to try and read and write).
  • There seems to be no problem reading raw data from PUDL_IN.
  • The catalyst user seems to have write permissions in the mounted directory.
  • However, in some contexts a path inside the container and a path to a mounted external directory behave differently.

@zaneselvans
Member Author

Logs are still being buffered and not getting output until the container stops. I've tried:

  • Setting PYTHONUNBUFFERED=1 environment variable in the container
  • Switching to local rather than json-file logging.

And I still get the same behavior no matter what: no logs are output until the container shuts down.
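
For context, here is a minimal sketch of what PYTHONUNBUFFERED=1 is expected to change when Python's stdout is a pipe rather than a terminal. It assumes nothing about the PUDL container: the child program and the first_line_latency helper are made up for illustration. It only shows the Python-level buffering that the variable controls, so it says nothing about buffering introduced by anything else sitting between the process and the log.

```python
import os
import subprocess
import sys
import time

# Hypothetical child program: print a line, wait a second, print another.
CHILD = "import time; print('first'); time.sleep(1.0); print('second')"


def first_line_latency(unbuffered: bool) -> float:
    """Seconds until the child's first line reaches the parent through a pipe."""
    cmd = [sys.executable]
    if unbuffered:
        cmd.append("-u")  # equivalent to setting PYTHONUNBUFFERED=1
    cmd += ["-c", CHILD]
    # Make sure the buffered case does not inherit PYTHONUNBUFFERED from us.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    start = time.monotonic()
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, env=env) as proc:
        proc.stdout.readline()  # blocks until the first line is delivered
        latency = time.monotonic() - start
        proc.stdout.read()  # drain the rest so the child can exit cleanly
    return latency


buffered = first_line_latency(unbuffered=False)
unbuffered = first_line_latency(unbuffered=True)
print(f"buffered: {buffered:.2f}s, unbuffered: {unbuffered:.2f}s")
```

In the buffered case the first line only shows up once the child exits, which is the same symptom as the delayed container logs.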

I'm tailing the logs using

docker logs -f pudl_etl

It turns out that PYTHONUNBUFFERED was not getting set by the environment: list in the docker-compose.yaml file, for reasons I don't understand. Here's what the file looks like:

services:
  pudl-etl:
    environment:
      - PYTHONUNBUFFERED=1
      - API_KEY_EIA
    image: catalystcoop/pudl-etl:hello-docker
    container_name: pudl_etl
    logging:
      driver: local
    command: bash -c "for i in `ls`; do echo $i; sleep 1; done"

However, if I set PYTHONUNBUFFERED=1 in my shell and pass it through to the container by simply listing its name under environment:, then it does get set. That doesn't seem to line up with the docker compose documentation, since setting it explicitly is supposed to work. At the same time, passing in PYTHONUNBUFFERED=1 from my shell doesn't actually fix the delayed logging problem.

The for loop command up there also has weird behavior. It does provide output in real time, one line per second! But it does not set the variable $i to anything, so rather than a slow directory listing, I end up getting a bunch of blank lines, which makes no sense to me.

So it seems like:

  • there's some kind of environment variable problem inside the container.
  • the problem isn't that the container is failing to send the logs as they're generated, it's that they're all getting generated "at once" as far as the container is concerned, and that isn't because of Python output buffering either. 😭 wtf

@zaneselvans
Member Author

zaneselvans commented May 3, 2022

A 2-year long thread about possibly related environment variable issues: docker/compose#7423

Note: I'm using docker compose with compose version 2.4.1

@zaneselvans
Member Author

However, if I run docker compose run pudl-etl env I get

PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=26a15196c191
TERM=xterm
PYTHONUNBUFFERED=1
API_KEY_EIA=*****REDACTED*****
CONDA_DIR=/opt/conda
LANG=C.UTF-8
LC_ALL=C.UTF-8
HOME=/home/catalyst

@bendnorman
Member

bendnorman commented May 3, 2022

I think your bash for loop and PYTHONUNBUFFERED variables weren't being evaluated as variables because you need an additional dollar sign:

You can use a $$ (double-dollar sign) when your configuration needs a literal dollar sign. This also prevents Compose from interpolating a value, so a $$ allows you to refer to environment variables that you don’t want processed by Compose.

I was able to remove the variable warning by adding an additional $ sign.
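
As a loose analogy only (this is not how Compose is implemented), Python's string.Template happens to use the same `$name` substitution and `$$` escape convention, so it can illustrate why the loop printed blank lines. The sketch assumes unset variables are replaced with empty strings, which matches the blank-line behavior described earlier in this thread; the defaultdict stands in for that.

```python
from collections import defaultdict
from string import Template

# Assumption: any variable that isn't set is replaced by an empty string.
unset_vars = defaultdict(str)

raw = "for i in `ls`; do echo $i; sleep 1; done"
escaped = "for i in `ls`; do echo $$i; sleep 1; done"

# "$i" is substituted before bash ever sees it, leaving a bare "echo".
raw_result = Template(raw).substitute(unset_vars)
# "$$i" collapses to a literal "$i", which bash then expands itself.
escaped_result = Template(escaped).substitute(unset_vars)

print(raw_result)      # for i in `ls`; do echo ; sleep 1; done
print(escaped_result)  # for i in `ls`; do echo $i; sleep 1; done
```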

@zaneselvans
Member Author

Ahhh, well. I guess that's one less thing to be confused by!

@bendnorman
Member

I think conda is the culprit for holding back the logs. When I run:

python -c 'from time import sleep

for i in range(5):
    print(i)
    sleep(1)'

in the container, it outputs the logs in real time. When I run the same code in the conda env the logs get held back:

conda run --prefix /home/catalyst/env python -c 'from time import sleep

for i in range(5):
    print(i)
    sleep(1)'

@bendnorman
Member

Adding conda run --no-capture-output outputs the logs in real time.
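
A rough sketch of the difference, assuming that conda run by default collects the child's stdout/stderr and only prints it after the process exits (which is what the behavior above suggests). The two helpers are hypothetical stand-ins written for illustration, not conda's actual code. The child here runs with -u, so any delay comes from the wrapper and not from Python's own buffering.

```python
import subprocess
import sys

# Same idea as the loop above: print a number, then pause.
CHILD = [
    sys.executable, "-u", "-c",
    "import time\nfor i in range(3):\n    print(i)\n    time.sleep(0.2)",
]


def run_captured(cmd):
    """Collect all output and emit it only once the child has exited."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    sys.stdout.write(result.stdout)
    return result.stdout


def run_streaming(cmd):
    """Let the child inherit our stdout, so lines appear as they are printed."""
    subprocess.run(cmd, check=True)


run_captured(CHILD)   # nothing for ~0.6s, then all three lines at once
run_streaming(CHILD)  # one line roughly every 0.2s
```

The streaming variant is the behavior that --no-capture-output gives us.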

@zaneselvans
Member Author

How did none of my searching find this?

@zaneselvans
Member Author

Okay, the problem with not being able to write to PUDL_OUT was that there were no pre-populated sqlite or parquet output folders in there, because the pudl_setup script is run during the build of the container, inside the container. Duh.
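
A small sketch of that failure mode using plain sqlite3 rather than SQLAlchemy or the PUDL scripts. The temporary directory stands in for the mounted PUDL_OUT, the sqlite/ferc1.sqlite layout is an assumption pieced together from the names mentioned in this thread, and can_open_db is a hypothetical helper.

```python
import sqlite3
import tempfile
from pathlib import Path


def can_open_db(db_path: Path) -> bool:
    """Return True if SQLite can open (or create) a database at db_path."""
    try:
        conn = sqlite3.connect(db_path)
    except sqlite3.OperationalError:
        # Raised as "unable to open database file" when the folder is missing.
        return False
    conn.close()
    return True


with tempfile.TemporaryDirectory() as tmp:
    pudl_out = Path(tmp)  # stand-in for an empty, freshly mounted PUDL_OUT
    db_path = pudl_out / "sqlite" / "ferc1.sqlite"  # assumed layout

    before = can_open_db(db_path)  # the sqlite/ folder doesn't exist yet
    db_path.parent.mkdir(parents=True, exist_ok=True)
    after = can_open_db(db_path)  # works once the folder is there

print(before, after)  # False True
```

SQLite will create a missing database file but not a missing parent directory, so the output folders have to exist in the mounted directory before the ETL runs.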

@zaneselvans zaneselvans changed the title PUDL Docker local ETL Local Docker ETL with local inputs/outputs May 4, 2022