gluepyspark errors on local development #25

Closed
dazza-codes opened this issue Sep 3, 2019 · 16 comments
@dazza-codes commented Sep 3, 2019

errors

[WARNING] Could not transfer metadata net.minidev:json-smart/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
[WARNING] Could not transfer metadata commons-codec:commons-codec/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
19/09/02 20:34:14 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.

versions and tracebacks

Using master and the bare setup instructions:

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"

$ ls -1 /opt/
apache-maven-3.6.0
spark-2.2.1-bin-hadoop2.7

$ echo $SPARK_HOME
/opt/spark-2.2.1-bin-hadoop2.7

$ which mvn
/opt/apache-maven-3.6.0/bin/mvn
$ mvn --version
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T11:41:47-07:00)
Maven home: /opt/apache-maven-3.6.0
Java version: 1.8.0_201, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-58-generic", arch: "amd64", family: "unix"

$ which spark-shell
/opt/spark-2.2.1-bin-hadoop2.7/bin/spark-shell

$ git remote -v
origin	git@github.com:awslabs/aws-glue-libs.git (fetch)
origin	git@github.com:awslabs/aws-glue-libs.git (push)

$ git ll
* 968179f - (HEAD -> master, origin/master, origin/HEAD) Use AWSGlueETL jars to run the glue python shell/submit locally (5 days ago) <Vinay Kumar Vavili>
* 19c4d84 - Update year to 2019. (7 months ago) <Ben Sowell>
* 7e76cc9 - Update AWS Glue ETL Library to latest version (01/2019). (7 months ago) <Ben Sowell>
* 21ff9e2 - Adding standard files (1 year, 2 months ago) <Henri Yandell>


$ ./bin/gluepyspark 

...

[WARNING] Could not transfer metadata net.minidev:json-smart/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
[WARNING] Could not transfer metadata commons-codec:commons-codec/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory

...

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.911 s
[INFO] Finished at: 2019-09-02T20:34:12-07:00
[INFO] ------------------------------------------------------------------------
mkdir: cannot create directory ‘/home/joe/src/jupiter/jupiter-glue/aws-glue-libs/conf’: File exists
Python 3.6.7 | packaged by conda-forge | (default, Jul  2 2019, 02:18:42) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dlweber/src/jupiter/jupiter-glue/aws-glue-libs/jars/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark-2.2.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/02 20:34:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/09/02 20:34:14 WARN Utils: Your hostname, weber-jupiter resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
19/09/02 20:34:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/09/02 20:34:14 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/shell.py", line 45, in <module>
    spark = SparkSession.builder\
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 334, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 180, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 273, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchMethodError: io.netty.util.ResourceLeakDetector.addExclusions(Ljava/lang/Class;[Ljava/lang/String;)V
	at io.netty.buffer.AbstractByteBufAllocator.<clinit>(AbstractByteBufAllocator.java:34)
	at org.apache.spark.network.util.NettyUtils.createPooledByteBufAllocator(NettyUtils.java:112)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:107)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:70)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:453)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:56)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:246)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:236)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/shell.py", line 54, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 334, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 180, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 273, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class io.netty.buffer.PooledByteBufAllocator
	at org.apache.spark.network.util.NettyUtils.createPooledByteBufAllocator(NettyUtils.java:112)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:107)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:70)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:453)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:56)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:246)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:236)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

>>> 
@rairaman

I have almost the same issue and I have spent a fair bit of time tweaking things, but with no resolution.

With the following pyspark script:

from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# Simple word count: read a local text file, count words, write the results out.
fileContents = sc.textFile("file:///tmp/plain.txt")
wordCounts = fileContents.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.saveAsTextFile("output")

I can get the above pyspark script to work with ./bin/gluesparksubmit if I don't add the glue jars to the Spark driver classpath, by commenting out the following line in ./bin/glue-setup.sh:

echo "spark.driver.extraClassPath $GLUE_JARS_DIR/*" >> $SPARK_CONF_DIR/spark-defaults.conf

Obviously, I then can't use any Glue objects. If I do add the glue jars to the Spark driver classpath, I get the following error:

Traceback (most recent call last):
  File "/tmp/testsp.py", line 4, in <module>
    sc = SparkContext.getOrCreate()
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 334, in getOrCreate
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 180, in _do_init
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 273, in _initialize_context
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.AbstractMethodError: io.netty.util.concurrent.MultithreadEventExecutorGroup.newChild(Ljava/util/concurrent/Executor;[Ljava/lang/Object;)Lio/netty/util/concurrent/EventExecutor;
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:84)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:58)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:47)
	at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
	at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
	at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
	at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:51)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:103)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:70)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:453)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:56)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:246)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:236)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

I am trying to get the whole thing running inside a CentOS Docker container, but I don't think that makes a difference.

@svajiraya (Contributor)

I found a hacky workaround:

@rairaman I was able to resolve the py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.AbstractMethodError: io.netty.util.concurrent.MultithreadEventExecutorGroup.newChild error and get it working by deleting the aws-glue-libs/jars/netty* jars. Ideally, we should not be doing this.

However, @darrenleeweber's issue seems to be a bit different. For Could not initialize class io.netty.buffer.PooledByteBufAllocator, one likely cause is multiple versions of the same jar being imported into the classpath. Compare the jars in $SPARK_HOME/jars and aws-glue-libs/jars/ and remove any conflicting jars from the classpath.

Don't forget to comment out

mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jars dependency:copy-dependencies

in ./bin/glue-setup.sh before running ./bin/gluepyspark or ./bin/gluesparksubmit, as it will re-download the jars.
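If it helps, here is a minimal shell sketch of the whole workaround (not an official fix). It assumes an aws-glue-libs checkout at /opt/aws-glue-libs; adjust the path to your setup, and note the jar directory is jars on master and jarsv1 on the glue-1.0 branch.

#!/usr/bin/env bash
# Sketch of the hacky workaround described above; paths are assumptions.
GLUE_LIBS=/opt/aws-glue-libs      # assumed checkout location
GLUE_JARS="$GLUE_LIBS/jars"       # "jarsv1" on the glue-1.0 branch

# Show artifacts that exist (possibly in different versions) in both the Spark
# distribution and the Glue jar directory, so conflicts are visible first.
comm -12 \
  <(ls "$SPARK_HOME/jars" | sed 's/-[0-9].*\.jar$//' | sort -u) \
  <(ls "$GLUE_JARS" | sed 's/-[0-9].*\.jar$//' | sort -u)

# Delete the netty jars that clash with Spark's own netty.
rm -f "$GLUE_JARS"/netty*

# Comment out the mvn copy-dependencies line in glue-setup.sh so the next run
# of gluepyspark/gluesparksubmit does not re-download the deleted jars.
sed -i '/^mvn /s/^/#/' "$GLUE_LIBS/bin/glue-setup.sh"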

@rairaman commented Sep 18, 2019

@svajiraya Thanks for having a look. I did try something similar, deleting the netty jars in SPARK_HOME instead, but had no luck. When I tried your workaround, however, I got a different issue:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package

Googling for that error, it seems to be caused by incompatible jars between Hadoop and Spark. I will give the potential solutions for that issue a try at some point.
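As a quick check (a rough sketch, assuming the same /opt paths used earlier in this thread), this lists every jar on the two classpaths that ships the conflicting servlet class:

# List every jar on the Spark and Glue classpaths that contains the class whose
# signer information clashes; more than one hit usually means a conflict.
for jar in "$SPARK_HOME"/jars/*.jar /opt/aws-glue-libs/jars/*.jar; do
  if unzip -l "$jar" 2>/dev/null | grep -q 'javax/servlet/FilterRegistration'; then
    echo "$jar"
  fi
done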

I've put my Dockerfile here if you're interested: https://github.com/rairaman/docker-aws-glue/blob/master/Dockerfile

@svajiraya (Contributor) commented Sep 23, 2019

Hi @rairaman,

Did you by any chance add duplicate jars manually into the classpath? This usually happens when you have two or more classes in the same package with different signer information. Can you try using the URLs from the latest commit?

I took a look at the Dockerfile you posted and made some changes. The modified version is working for me. Here's the Dockerfile:

FROM centos as builder

#Dependencies
RUN yum -y update && yum install -y python java-1.8.0-openjdk-devel curl git zip vim

#Maven install
RUN curl -fsSL https://archive.apache.org/dist/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz -o /opt/apache-maven-3.6.0-bin.tar.gz
RUN tar -xvf /opt/apache-maven-3.6.0-bin.tar.gz -C /opt

#Spark install
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz -o /opt/spark-2.2.1-bin-hadoop2.7.tgz
RUN tar -xvf /opt/spark-2.2.1-bin-hadoop2.7.tgz -C /opt

#AWS glue scripts
WORKDIR /opt
RUN git clone https://github.com/awslabs/aws-glue-libs.git

# #Env setup
ENV M2_HOME=/opt/apache-maven-3.6.0
RUN echo "export JAVA_HOME=$(ls -d /usr/lib/jvm/*openjdk*) >> ~/.bash_profile"
ENV SPARK_HOME=/opt/spark-2.2.1-bin-hadoop2.7
ENV PATH="${PATH}:${M2_HOME}/bin"

#Run gluesparksubmit once to download dependent jars
RUN echo "print('Get Dependencies')" > /tmp/maven.py
RUN bash -l -c "bash ~/.bash_profile" && /opt/aws-glue-libs/bin/gluesparksubmit /tmp/maven.py

# Create final image
FROM centos
RUN yum -y update && yum install -y python java-1.8.0-openjdk-devel zip
COPY --from=builder /opt/ /opt/
COPY --from=builder /root/.m2/ /root/.m2/
RUN rm -rf /opt/aws-glue-libs/conf

# Wacky workaround to get past the py4j error (credit @svajiraya - https://github.com/awslabs/aws-glue-libs/issues/25)
RUN rm -rf /opt/aws-glue-libs/jars/netty*
RUN sed -i /^mvn/s/^/#/ /opt/aws-glue-libs/bin/glue-setup.sh

# Env VAR setup
ENV M2_HOME=/opt/apache-maven-3.6.0
ENV SPARK_HOME=/opt/spark-2.2.1-bin-hadoop2.7
RUN echo 'export JAVA_HOME=$(ls -d /usr/lib/jvm/*openjdk*)' >> ~/.bash_profile && sed -i -e "/enableHiveSupport()/d" $SPARK_HOME/python/pyspark/shell.py && rm -vf /opt/aws-glue-libs/jars/netty-all-4.0.23.Final.jar
ENV PATH="${PATH}:${M2_HOME}/bin"
WORKDIR /opt/aws-glue-libs/
CMD ["bash", "-l", "-c", "./bin/gluepyspark"]
#Entrypoint for submitting scripts
# ENTRYPOINT ["/opt/aws-glue-libs/bin/gluesparksubmit"]
# CMD []

I built Docker images for glue-0.9 and glue-1.0 with OpenJDK as the base image (which avoids the hassle of setting up the Java environment variables). You can find them here: https://cloud.docker.com/repository/docker/svajiraya/glue-dev-0.9 and https://cloud.docker.com/repository/docker/svajiraya/glue-dev-1.0

@GytisZ commented Oct 5, 2019

Had the exact same issue. As mentioned by @svajiraya, it seems to be caused by duplicate jars on the classpath. The Spark distribution comes with its own jars in $SPARK_HOME/jars, and then ./glue-setup.sh downloads a bunch of duplicates into aws-glue-libs/jarsv1 and adds jarsv1 to the CLASSPATH via spark-defaults.conf.

Running mv aws-glue-libs/jarsv1/* $SPARK_HOME/jars and commenting out the mvn part in ./glue-setup.sh solved the main issue. There are still some copies of jars in different versions, among them different versions of netty causing the multiple-bindings message. But most importantly, pyspark works and, from my limited tests, Glue works as well.

I believe that, in general, having the same jars on the classpath is not a problem if they are identical or are different versions. But a bunch of spark-* jars have the same version while their md5 hashes differ:

-a9d47c1cc3c880bdce35f44209fd494d  ./spark-catalyst_2.11-2.4.3.jar
-27a72c8e2695f3f520291d1fe6f80e78  ./spark-core_2.11-2.4.3.jar
-543aff0bec30760e3c3e9910de68d965  ./spark-hive_2.11-2.4.3.jar
-0498c1676777f9b0e2266be7415f5e24  ./spark-kvstore_2.11-2.4.3.jar
-e2638fbb848805fd4102ff328bcc1064  ./spark-launcher_2.11-2.4.3.jar
-9863f17dafa6b7b1e36e75f0f3a1be7b  ./spark-network-common_2.11-2.4.3.jar
-cb6ee31bec38d8a9bbe28acdab7cc77c  ./spark-network-shuffle_2.11-2.4.3.jar
-b028c7f7ce0efe03b05449c2bc224d81  ./spark-sketch_2.11-2.4.3.jar
-9246bc79591546d2835714cca403ee88  ./spark-sql_2.11-2.4.3.jar
-023c465d826f0e71f817ec4d1a09dbc0  ./spark-tags_2.11-2.4.3.jar
-cebbaa18189e2506ea5d4deeb2ea3a84  ./spark-unsafe_2.11-2.4.3.jar
+438c00c0ff216504d568fb6d10aed85a  ./spark-catalyst_2.11-2.4.3.jar
+a934c360f4acb5840ff8a9d42bd722a2  ./spark-core_2.11-2.4.3.jar
+891e0d9a59434c33e61d9d7e5d7ae4e5  ./spark-hive_2.11-2.4.3.jar
+dda80bc2470a3c022ff5a403b39e33c3  ./spark-kvstore_2.11-2.4.3.jar
+33600f3910e97a28a934503d1f3b18a8  ./spark-launcher_2.11-2.4.3.jar
+63630a2f4d744263fd94037681d43e24  ./spark-network-common_2.11-2.4.3.jar
+d94c909b2cc70986d37610585e6ccba4  ./spark-network-shuffle_2.11-2.4.3.jar
+7ddc9cd0905e3e4e392225f8c74bb4b7  ./spark-sketch_2.11-2.4.3.jar
+6d26de0366bd2fa76e1dfbd1d9293c40  ./spark-sql_2.11-2.4.3.jar
+a32c98eab49c90aeecb3e88ac46f9614  ./spark-tags_2.11-2.4.3.jar
+b02b201d604454b63a668715ae0adcc5  ./spark-unsafe_2.11-2.4.3.jar

I suppose the ones on Maven are different from the ones frozen in gluesparkhadoop.tar.gz.

EDIT: as for a workaround, I've decided to ln -s ${SPARK_HOME}/jars ${AWSGLUELIB_DIR}/jarsv1, since we're installing it in a Dockerfile, and that seems a bit more straightforward than copying jars and commenting out mvn. A sketch is below.
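Roughly, the steps look like this (a sketch, reusing the /opt paths from earlier in this thread; AWSGLUELIB_DIR stands for the aws-glue-libs checkout and is an assumption):

AWSGLUELIB_DIR=/opt/aws-glue-libs   # assumed checkout location
# Move the Glue-provided jars into Spark's jar directory, then make jarsv1 a
# symlink to that single directory so the classpath entry in spark-defaults.conf
# still resolves but no duplicate copies remain on the classpath.
mv "$AWSGLUELIB_DIR"/jarsv1/*.jar "$SPARK_HOME/jars/"
rm -rf "$AWSGLUELIB_DIR/jarsv1"
ln -s "$SPARK_HOME/jars" "$AWSGLUELIB_DIR/jarsv1"
# Comment out the mvn line in glue-setup.sh so nothing gets re-downloaded.
sed -i '/^mvn /s/^/#/' "$AWSGLUELIB_DIR/bin/glue-setup.sh"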

@mvaniterson

For glue-1.0, modifying glue-setup.sh fixed it for me. Just rm the two netty jars from jarsv1 after the mvn command:

# Run mvn copy-dependencies target to get the Glue dependencies locally
mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jarsv1 dependency:copy-dependencies

# ADD THESE TWO LINES
rm ${GLUE_JARS_DIR}/netty-3.6.2.Final.jar
rm ${GLUE_JARS_DIR}/netty-all-4.0.23.Final.jar

@akoltsov-spoton

I forked this repo and made small changes in a few files: https://github.com/akoltsov-spoton/aws-glue-libs/tree/glue-1.0. This is my Dockerfile:

FROM openjdk:8-jdk-slim-buster

#Dependencies
RUN apt-get update -y && apt-get install -y python3 curl git zip vim
RUN cp /usr/bin/python3 /usr/bin/python
#Maven install
ADD https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz  /opt/apache-maven-3.6.0-bin.tar.gz
RUN tar -xvf /opt/apache-maven-3.6.0-bin.tar.gz -C /opt

#Spark install
ADD https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz  /opt/spark-2.4.3-bin-hadoop2.8.tgz
RUN tar -xvf /opt/spark-2.4.3-bin-hadoop2.8.tgz -C /opt

#AWS glue scripts
WORKDIR /opt
RUN git clone -b glue-1.0 --single-branch https://github.com/akoltsov-spoton/aws-glue-libs.git

# #Env setup
ENV M2_HOME=/opt/apache-maven-3.6.0
ENV SPARK_HOME=/opt/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
ENV PATH="${PATH}:${M2_HOME}/bin"

#Run gluesparksubmit once to download dependent jars
# RUN echo "print('Get Dependencies')" > /tmp/maven.py
# RUN bash -l -c /opt/aws-glue-libs/bin/gluesparksubmit /tmp/maven.py

# Wacky workaround to get past the py4j error (credit @svajiraya - https://github.com/awslabs/aws-glue-libs/issues/25)
RUN rm -rf /opt/aws-glue-libs/jars/netty*
RUN sed -i /^mvn/s/^/#/ /opt/aws-glue-libs/bin/glue-setup.sh

# Env VAR setup
# RUN echo 'export JAVA_HOME=$(ls -d /usr/lib/jvm/*openjdk*) >> ~/.bash_profile' && sed -i -e "/enableHiveSupport()/d" $SPARK_HOME/python/pyspark/shell.py
WORKDIR /opt/aws-glue-libs/
CMD ["bash", "-l", "-c", "./bin/gluepyspark"]
#Entrypoint for submitting scripts
# ENTRYPOINT ["/opt/aws-glue-libs/bin/gluesparksubmit"]
# CMD []

It's a bit ugly, but it works. At the moment I'm trying to understand how I can connect to it with the AWS CLI (boto3).

@svajiraya (Contributor)

There's no need to implement workarounds anymore, as the AWS Glue team has updated the dependency pom files.

This fixes the issue permanently.

The change that fixes the netty-jar-related issue is the exclusion of io.netty:netty-all from org.apache.hadoop:hadoop-hdfs:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.8.5</version>
      <exclusions>
        <exclusion>
          <groupId>io.netty</groupId>
          <artifactId>netty-all</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
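To pick up the updated pom and check the result, a quick sketch (assuming the checkout lives at /opt/aws-glue-libs and the jarsv1 output directory used by glue-1.0):

cd /opt/aws-glue-libs
# Re-resolve the Glue dependencies against the updated pom ...
mvn -f pom.xml -DoutputDirectory=jarsv1 dependency:copy-dependencies
# ... then inspect which netty jars, if any, were pulled in alongside Spark's.
ls jarsv1 | grep -i netty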

@dazza-codes can you please test this and mark the issue as resolved?

@GytisZ commented May 7, 2020

I can confirm that the issue has been fixed for v1.0.

Thank you!

However, the pattern of overwriting old versions (in this case v1.0.0) with new ones is very disturbing. This fix has broken our releases (because of the workarounds we had implemented). This shouldn't happen, and it wouldn't if the version were bumped whenever changes were made.

@webysther

@svajiraya can you update the Dockerfile to reflect this fix?

@webysther

When downloading from the pom I get this error:

Downloading from central: https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.pom (6.6 kB at 22 kB/s)
Downloading from aws-glue-etl-artifacts: s3://aws-glue-etl-artifacts/release/org/apache/httpcomponents/httpcomponents-client/4.5.9/httpcomponents-client-4.5.9.pom
[WARNING] s3://aws-glue-etl-artifacts/release - Connection refused
[INFO] Logged off - aws-glue-etl-artifacts

@svajiraya (Contributor)

@svajiraya can you update the Dockerfile to reflect this fix?

Hi @webysther

I have updated the Dockerfile to reflect this change and I have pushed new versions of the image to Docker Hub.

@webysther

Great job @svajiraya! Today I will publish a new repository for a more complete Glue dev setup, with support for the Glue Python Shell and other add-ons. I will use your Docker image as part of it.

@webysther

@dazza-codes can you confirm this is fixed, as cited by @GytisZ?

@webysther commented May 15, 2020

Hey @svajiraya, I created a new Docker image with multiple versions as tags, based on your Docker image, with a Python version for the Glue Python Shell and some improvements, as mentioned in the README:

https://github.com/webysther/aws-glue-docker

webysther/aws-glue tags:

Glue Spark: doesn't have pip-installed packages like pandas; this avoids misleading development

Python Shell

Spark

Getting started

# register alias
alias glue='docker run -v $PWD:/app -v ~/.aws:/home/docker/.aws -u $(id -u ${USER}):$(id -g ${USER}) -it webysther/aws-glue "$@"'
alias glue-spark='docker run -v $PWD:/app -v ~/.aws:/home/docker/.aws -u $(id -u ${USER}):$(id -g ${USER}) -it webysther/aws-glue:spark "$@"'

# bash
glue

# Glue Python Shell
# /app is your current folder
glue python
glue python /app/script.py

# Glue PySpark (REPL) 
glue-spark pyspark

# Glue PySpark
# /app is your current folder
glue-spark sparksubmit /app/spark_script.py

# Pyspark vanilla (without glue lib)
glue-spark
$ ${SPARK_HOME}/bin/spark-submit spark_script.py

# Test
glue pytest

# aliases (backwards compatibility)
gluesparksubmit == sparksubmit
gluepyspark == pyspark
gluepytest == pytest

@moomindani (Contributor)

Since the issue has already been solved in the above conversation, resolving.
Note: the issue won't happen in the newer images: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/
