gluepyspark errors on local development #25
I have almost the same issue and have spent a fair bit of time tweaking things, but with no resolution. With the following pyspark script:
I can get the above pyspark script to work with ./bin/gluesparksubmit if I don't add the glue jars to the spark driver classpath, by commenting out the following line in ./bin/glue-setup.sh:
Obviously, I then can't use any glue objects. If I do add the glue jars to the spark driver classpath, I get the following error:
I am trying to get the whole thing running inside a CentOS docker container, but I don't think that makes a difference. |
I found a hacky workaround: @rairaman I was able to resolve it. However, @darrenleeweber's issue seems to be a bit different. Don't forget to comment out line 17 of aws-glue-libs/bin/glue-setup.sh (at commit 968179f) before running
./bin/gluepyspark or ./bin/gluesparksubmit, as these will re-download the jars.
|
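The comment-out step in that workaround can be scripted. A minimal sketch — the helper below is hypothetical and not part of aws-glue-libs; the target line number (17 of bin/glue-setup.sh) comes from the comment above:

```python
def comment_out_line(path, lineno):
    """Comment out a single 1-indexed line of a shell script.

    Hypothetical helper for the workaround above, e.g.
    comment_out_line("bin/glue-setup.sh", 17). Idempotent: a line
    that already starts with '#' is left untouched.
    """
    with open(path) as fh:
        lines = fh.readlines()
    if not lines[lineno - 1].lstrip().startswith("#"):
        lines[lineno - 1] = "# " + lines[lineno - 1]
    with open(path, "w") as fh:
        fh.writelines(lines)
```

Running it twice is safe, so it can be re-applied after glue-setup.sh is re-fetched.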
@svajiraya Thanks for having a look. I did try something similar (deleted the netty jars in SPARK_HOME instead), but no luck. When I tried your workaround, I got a different issue:
Googling that error, it seems to be related to incompatible jars between hadoop and spark. I'll give the potential solutions for that issue a try someday. I've put my Dockerfile here if you're interested: https://github.com/rairaman/docker-aws-glue/blob/master/Dockerfile |
Hi @rairaman, did you by any chance add duplicate jars manually to the classpath? This usually happens when you have two or more classes in the same package with different signature data. Can you try using the URLs from the latest commit? I took a look at the
I built docker images for glue-0.9 and glue-1.0 with OpenJDK as base image (avoids the hassle of setting up JAVA Vars). You can find them here: https://cloud.docker.com/repository/docker/svajiraya/glue-dev-0.9 and https://cloud.docker.com/repository/docker/svajiraya/glue-dev-1.0 |
Had the exact same issue. As mentioned by @svajiraya, it seems to be caused by duplicate jars on the classpath. The spark distribution comes with its own jars in
I believe that, in general, having the same jars on the classpath is not a problem if they are identical, or different versions. But a bunch of
I suppose the ones on maven are different from the ones frozen in EDIT: as for a workaround, I've decided to |
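The duplicate-jar diagnosis in the last few comments can be checked mechanically. A minimal sketch, assuming the two directories to compare are $SPARK_HOME/jars and the directory of Glue jars that glue-setup.sh downloads — the helper name and approach are illustrative, not from the thread:

```python
import hashlib
import os

def find_conflicting_jars(*jar_dirs):
    """Report jar basenames that appear in more than one directory
    with different content -- the classpath-conflict situation
    described above (e.g. two different netty jars)."""
    seen = {}       # basename -> (first dir seen, sha256 of contents)
    conflicts = []  # (basename, first dir, conflicting dir)
    for jar_dir in jar_dirs:
        for name in sorted(os.listdir(jar_dir)):
            if not name.endswith(".jar"):
                continue
            path = os.path.join(jar_dir, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            if name in seen and seen[name][1] != digest:
                conflicts.append((name, seen[name][0], jar_dir))
            else:
                seen.setdefault(name, (jar_dir, digest))
    return conflicts
```

Identical copies of a jar are ignored; only same-named jars with different bytes are flagged, which matches the observation that identical duplicates are harmless.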
For version glue-1.0, modifying |
I forked this repo and made small changes in files https://github.com/akoltsov-spoton/aws-glue-libs/tree/glue-1.0,
a bit ugly, but it works. At the moment I'm trying to understand how I can connect to it with the aws cli (boto3). |
There's no need to implement workarounds anymore, as the AWS Glue team has updated the dependency pom files.
This fixes the issue permanently. The change that fixes the netty-jar-related issue is the exclusion of
@dazza-codes can you please test this and mark the issue as resolved? |
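For reference, a Maven dependency exclusion of the kind mentioned above looks like the fragment below. This is purely illustrative: the thread does not quote the actual pom change, so the artifact names and versions here (spark-core, io.netty:netty-all) are assumptions based on the error being netty-related, not the Glue team's real diff.

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.3</version>
  <exclusions>
    <!-- Assumed exclusion: stops Maven pulling a second netty jar
         that would conflict with the one already on the classpath -->
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty-all</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```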
Can confirm that the issue has been fixed for v1.0, thank you! However, the pattern of overwriting old versions (in this case v1.0.0) with new ones is very disturbing. This fix has broken our releases (because of the workarounds we had implemented). This shouldn't happen, and wouldn't if the version were bumped whenever changes were made. |
@svajiraya can you update the Dockerfile to reflect this fix? |
When downloading from the pom I get this error: Downloading from central: https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.pom (6.6 kB at 22 kB/s)
Downloading from aws-glue-etl-artifacts: s3://aws-glue-etl-artifacts/release/org/apache/httpcomponents/httpcomponents-client/4.5.9/httpcomponents-client-4.5.9.pom
[WARNING] s3://aws-glue-etl-artifacts/release - Connection refused
[INFO] Logged off - aws-glue-etl-artifacts |
Hi @webysther I have updated the
|
Great job @svajiraya! Today I will publish a new repository for a more complete glue dev setup, with support for the Glue Python Shell and other add-ons. I will use your docker image as part of this. |
@dazza-codes can you confirm this is fixed as cited by @GytisZ ? |
Hey @svajiraya, I created a new docker image with multiple versions as tags, based on your docker image, with a python version for the Glue Python Shell and some improvements as mentioned in the README: https://github.com/webysther/aws-glue-docker
webysther/aws-glue tags: Python Shell, Spark. The Spark tag doesn't have pip-installed packages like pandas; this avoids misleading development.
Getting started:
# register alias
alias glue='docker run -v $PWD:/app -v ~/.aws:/home/docker/.aws -u $(id -u ${USER}):$(id -g ${USER}) -it webysther/aws-glue "$@"'
alias glue-spark='docker run -v $PWD:/app -v ~/.aws:/home/docker/.aws -u $(id -u ${USER}):$(id -g ${USER}) -it webysther/aws-glue:spark "$@"'
# bash
glue
# Glue Python Shell
# /app is your current folder
glue python
glue python /app/script.py
# Glue PySpark (REPL)
glue-spark pyspark
# Glue PySpark
# /app is your current folder
glue-spark sparksubmit /app/spark_script.py
# Pyspark vanilla (without glue lib)
glue-spark
${SPARK_HOME}/bin/spark-submit spark_script.py
# Test
glue pytest
# aliases (backwards compatibility)
gluesparksubmit == sparksubmit
gluepyspark == pyspark
gluepytest == pytest |
Since the issue has already been solved in the above conversation, resolving. |
Original issue details: errors, versions and tracebacks, using master and the bare instructions.