ARROW-1281: [C++/Python] Add Docker setup for testing HDFS IO in C++ and Python #895
Conversation
I made some initial comments, but this still seems quite rough, so I'll wait to comment further until it settles.
python/testing/hdfs/Dockerfile
Outdated
```dockerfile
FROM cpcloud86/impala:metastore

RUN sudo apt-add-repository -y ppa:ubuntu-toolchain-r/test && \
    sudo apt-get update && \
```
Since you normally run as root inside Docker, sudo should not be required here.
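For illustration, a minimal sketch of the suggestion, reusing the base image and packages from the excerpt above; note the reply below mentions the base image had switched to a non-root user, so the build may first need `USER root`:

```dockerfile
FROM cpcloud86/impala:metastore

# Docker builds run as root by default, so sudo is unnecessary;
# USER root is only needed because this base image switches users
USER root
RUN apt-add-repository -y ppa:ubuntu-toolchain-r/test && \
    apt-get update
```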
Thanks. I had to change the user back to root, as the base image switches to a non-root user.
```bash
export ARROW_TEST_WEBHDFS_USER=ubuntu

docker stop $ARROW_TEST_NN_HOST
docker rm $ARROW_TEST_NN_HOST
```
Adding `--rm` to `docker run` should make this line redundant.
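A hedged sketch of the suggested change; the image name and environment variable are taken from elsewhere in this PR:

```bash
# --rm tells Docker to remove the container automatically when it exits,
# so the explicit `docker rm` becomes unnecessary
docker run --rm -d --name $ARROW_TEST_NN_HOST arrow-hdfs-test

# stopping the container now also removes it
docker stop $ARROW_TEST_NN_HOST
```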
fixed
I had some issues when I removed this line and then added `--rm` to the `docker run` command. Perhaps we can tweak this in a subsequent PR when adding more features to these ad hoc Docker tests.
Thanks. I'm adding the Python HDFS tests and will remove the WIP
This is ready to go. It's good for one-off use, but we should look at refactoring our CI scripts so we can share code more easily with this kind of thing; it's a little tricky because of the various interdependent environment variables. As part of ARROW-1213 I will add S3 testing to this setup so you can check things out locally against an access/secret key for a cloud bucket.
@xhochy let me know if you have any more comments on this. I'm going to look at the Parquet RC in the meantime.
+1; I'm going to keep adding some more ad hoc tests.
Added some comments about best practices I've gathered from developing a lot with Docker containers.
```bash
# under the License.

use_gcc() {
  export CC=gcc-4.9
```
It's better to set these variables in the Dockerfile via `ENV CC gcc-4.9`.
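For reference, what the suggestion would look like in the Dockerfile (the CXX line is an added assumption to match the CC setting):

```dockerfile
# Bake a default compiler into the image
ENV CC gcc-4.9
ENV CXX g++-4.9
```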
I wasn't sure about this, as the idea was to be able to easily switch between compilers depending on the test script.
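For context, a sketch of that switching approach; `use_gcc` comes from the excerpt above, while `use_clang` and its version are assumptions for illustration:

```bash
# Each test script exports the toolchain it wants before building
use_gcc() {
  export CC=gcc-4.9
  export CXX=g++-4.9
}

use_clang() {
  export CC=clang-4.0
  export CXX=clang++-4.0
}

use_gcc   # or use_clang, depending on the test script
```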
```bash
set -e

export PATH="$MINICONDA/bin:$PATH"
conda update -y -q conda
```
In my Dockerfiles, I normally include the conda installation in the image; that way I get faster iteration on repeated test runs.
It is also helpful to separate the base installation and the project-specific dependencies into different layers of the Docker image so they can be shared between similar images.
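A hedged sketch combining both suggestions: Miniconda and general build tooling go in early layers that similar images can share, while project-specific packages sit in a later layer that is the only one rebuilt when requirements change. The installer URL, install path, and package lists are assumptions:

```dockerfile
# Base layer: bake Miniconda into the image so test runs don't reinstall it
RUN wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/miniconda && \
    rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH /opt/miniconda/bin:$PATH

# Shared layer: general build tooling, reusable across similar images
RUN conda update -y -q conda && \
    conda install -y -q cmake

# Project layer: Arrow-specific dependencies; changing these only
# invalidates this layer, not the ones above
RUN conda install -y -q numpy pandas
```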
Good point, I will do that in the next patch
```bash
set -ex

docker build -t arrow-hdfs-test -f hdfs/Dockerfile .
bash hdfs/restart_docker_container.sh
```
For multiple Docker containers, have a look at docker-compose. It lets you start multiple containers and wire them together.
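For illustration, a minimal docker-compose.yml along those lines; the service names here are assumptions, not part of this PR, though `ARROW_TEST_NN_HOST` comes from the scripts above:

```yaml
version: '2'
services:
  hdfs:
    image: arrow-hdfs-test
    hostname: hdfs
  tests:
    build: .
    depends_on:
      - hdfs
    environment:
      - ARROW_TEST_NN_HOST=hdfs
```

With this, `docker-compose up` starts both containers on a shared network where they can reach each other by service name.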
Will do, thank you
We aren't testing this in Travis CI because spinning up an HDFS cluster is a bit heavyweight, but this will at least make it easier to do ongoing validation that this functionality works properly.