gluepyspark errors on local development #25

Closed
dazza-codes opened this issue Sep 3, 2019 · 16 comments
@dazza-codes commented Sep 3, 2019

errors

[WARNING] Could not transfer metadata net.minidev:json-smart/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
[WARNING] Could not transfer metadata commons-codec:commons-codec/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
19/09/02 20:34:14 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.

versions and tracebacks

Using master and the bare setup instructions:

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"

$ ls -1 /opt/
apache-maven-3.6.0
spark-2.2.1-bin-hadoop2.7

$ echo $SPARK_HOME
/opt/spark-2.2.1-bin-hadoop2.7

$ which mvn
/opt/apache-maven-3.6.0/bin/mvn
$ mvn --version
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T11:41:47-07:00)
Maven home: /opt/apache-maven-3.6.0
Java version: 1.8.0_201, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-58-generic", arch: "amd64", family: "unix"

$ which spark-shell
/opt/spark-2.2.1-bin-hadoop2.7/bin/spark-shell

$ git remote -v
origin	git@github.com:awslabs/aws-glue-libs.git (fetch)
origin	git@github.com:awslabs/aws-glue-libs.git (push)

$ git ll
* 968179f - (HEAD -> master, origin/master, origin/HEAD) Use AWSGlueETL jars to run the glue python shell/submit locally (5 days ago) <Vinay Kumar Vavili>
* 19c4d84 - Update year to 2019. (7 months ago) <Ben Sowell>
* 7e76cc9 - Update AWS Glue ETL Library to latest version (01/2019). (7 months ago) <Ben Sowell>
* 21ff9e2 - Adding standard files (1 year, 2 months ago) <Henri Yandell>


$ ./bin/gluepyspark 

...

[WARNING] Could not transfer metadata net.minidev:json-smart/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
[WARNING] Could not transfer metadata commons-codec:commons-codec/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory

...

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.911 s
[INFO] Finished at: 2019-09-02T20:34:12-07:00
[INFO] ------------------------------------------------------------------------
mkdir: cannot create directory ‘/home/joe/src/jupiter/jupiter-glue/aws-glue-libs/conf’: File exists
Python 3.6.7 | packaged by conda-forge | (default, Jul  2 2019, 02:18:42) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dlweber/src/jupiter/jupiter-glue/aws-glue-libs/jars/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark-2.2.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/02 20:34:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/09/02 20:34:14 WARN Utils: Your hostname, weber-jupiter resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
19/09/02 20:34:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/09/02 20:34:14 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/shell.py", line 45, in <module>
    spark = SparkSession.builder\
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 334, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 180, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 273, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchMethodError: io.netty.util.ResourceLeakDetector.addExclusions(Ljava/lang/Class;[Ljava/lang/String;)V
	at io.netty.buffer.AbstractByteBufAllocator.<clinit>(AbstractByteBufAllocator.java:34)
	at org.apache.spark.network.util.NettyUtils.createPooledByteBufAllocator(NettyUtils.java:112)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:107)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:70)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:453)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:56)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:246)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:236)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/shell.py", line 54, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 334, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 180, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 273, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class io.netty.buffer.PooledByteBufAllocator
	at org.apache.spark.network.util.NettyUtils.createPooledByteBufAllocator(NettyUtils.java:112)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:107)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:70)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:453)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:56)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:246)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:236)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

>>> 
@rairaman

I have almost the same issue and I have spent a fair bit of time tweaking things, but with no resolution.

With the following pyspark script:

from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# Simple word count: read a local text file, count words, write the results out.
fileContents = sc.textFile("file:///tmp/plain.txt")
wordCounts = fileContents.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.saveAsTextFile("output")

I can get the above pyspark script to work with ./bin/gluesparksubmit if I don't add the glue jars to the Spark driver classpath, by commenting out the following line in ./bin/glue-setup.sh:

echo "spark.driver.extraClassPath $GLUE_JARS_DIR/*" >> $SPARK_CONF_DIR/spark-defaults.conf

Obviously, I then can't use any Glue objects. If I do add the glue jars to the Spark driver classpath, I get the following error:

Traceback (most recent call last):
  File "/tmp/testsp.py", line 4, in <module>
    sc = SparkContext.getOrCreate()
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 334, in getOrCreate
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 180, in _do_init
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 273, in _initialize_context
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.AbstractMethodError: io.netty.util.concurrent.MultithreadEventExecutorGroup.newChild(Ljava/util/concurrent/Executor;[Ljava/lang/Object;)Lio/netty/util/concurrent/EventExecutor;
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:84)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:58)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:47)
	at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
	at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
	at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
	at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:51)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:103)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:70)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:453)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:56)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:246)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:236)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

I am trying to get the whole thing running inside a CentOS Docker container, but I don't think that makes a difference.

@svajiraya (Contributor)

I found a hacky workaround:

@rairaman I was able to resolve the py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.AbstractMethodError: io.netty.util.concurrent.MultithreadEventExecutorGroup.newChild error and get it working by deleting the aws-glue-libs/jars/netty* jars. Ideally, we should not be doing this.

However, @darrenleeweber's issue seems to be a bit different. For Could not initialize class io.netty.buffer.PooledByteBufAllocator, one likely cause is multiple versions of the same jar being imported into the classpath. Compare the jars in $SPARK_HOME/jars and aws-glue-libs/jars/ and remove any conflicting jars from the classpath.

Don't forget to comment out

mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jars dependency:copy-dependencies

in ./bin/glue-setup.sh before running ./bin/gluepyspark or ./bin/gluesparksubmit, as it will re-download the jars.
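If it helps, here is a minimal shell sketch of the whole workaround (not an official fix). It assumes an aws-glue-libs checkout at /opt/aws-glue-libs; adjust the path to your setup, and note the jar directory is jars on master and jarsv1 on the glue-1.0 branch.

#!/usr/bin/env bash
# Sketch of the hacky workaround described above; paths are assumptions.
GLUE_LIBS=/opt/aws-glue-libs      # assumed checkout location
GLUE_JARS="$GLUE_LIBS/jars"       # "jarsv1" on the glue-1.0 branch

# Show artifacts that exist (possibly in different versions) in both the Spark
# distribution and the Glue jar directory, so conflicts are visible first.
comm -12 \
  <(ls "$SPARK_HOME/jars" | sed 's/-[0-9].*\.jar$//' | sort -u) \
  <(ls "$GLUE_JARS" | sed 's/-[0-9].*\.jar$//' | sort -u)

# Delete the netty jars that clash with Spark's own netty.
rm -f "$GLUE_JARS"/netty*

# Comment out the mvn copy-dependencies line in glue-setup.sh so the next run
# of gluepyspark/gluesparksubmit does not re-download the deleted jars.
sed -i '/^mvn /s/^/#/' "$GLUE_LIBS/bin/glue-setup.sh"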

@rairaman commented Sep 18, 2019

@svajiraya Thanks for having a look. I did try something similar, deleting the netty jars in SPARK_HOME instead, but had no luck. When I tried your workaround, however, I got a different issue:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package

Googling for that error, it seems to be caused by incompatible jars between Hadoop and Spark. I will give the potential solutions for that issue a try at some point.
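As a quick check (a rough sketch, assuming the same /opt paths used earlier in this thread), this lists every jar on the two classpaths that ships the conflicting servlet class:

# List every jar on the Spark and Glue classpaths that contains the class whose
# signer information clashes; more than one hit usually means a conflict.
for jar in "$SPARK_HOME"/jars/*.jar /opt/aws-glue-libs/jars/*.jar; do
  if unzip -l "$jar" 2>/dev/null | grep -q 'javax/servlet/FilterRegistration'; then
    echo "$jar"
  fi
done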

I've put my Dockerfile here if you're interested: https://github.com/rairaman/docker-aws-glue/blob/master/Dockerfile

@svajiraya (Contributor) commented Sep 23, 2019

Hi @rairaman,

Did you by any chance add duplicate jars manually into the classpath? This usually happens when you have two or more classes in the same package with different signer information. Can you try using the URLs from the latest commit?

I took a look at the Dockerfile you posted and made some changes. The modified version is working for me. Here's the Dockerfile:

FROM centos as builder

#Dependencies
RUN yum -y update && yum install -y python java-1.8.0-openjdk-devel curl git zip vim

#Maven install
RUN curl -fsSL https://archive.apache.org/dist/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz -o /opt/apache-maven-3.6.0-bin.tar.gz
RUN tar -xvf /opt/apache-maven-3.6.0-bin.tar.gz -C /opt

#Spark install
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz -o /opt/spark-2.2.1-bin-hadoop2.7.tgz
RUN tar -xvf /opt/spark-2.2.1-bin-hadoop2.7.tgz -C /opt

#AWS glue scripts
WORKDIR /opt
RUN git clone https://github.com/awslabs/aws-glue-libs.git

# #Env setup
ENV M2_HOME=/opt/apache-maven-3.6.0
RUN echo "export JAVA_HOME=$(ls -d /usr/lib/jvm/*openjdk*) >> ~/.bash_profile"
ENV SPARK_HOME=/opt/spark-2.2.1-bin-hadoop2.7
ENV PATH="${PATH}:${M2_HOME}/bin"

#Run gluesparksubmit once to download dependent jars
RUN echo "print('Get Dependencies')" > /tmp/maven.py
RUN bash -l -c "bash ~/.bash_profile" && /opt/aws-glue-libs/bin/gluesparksubmit /tmp/maven.py

# Create final image
FROM centos
RUN yum -y update && yum install -y python java-1.8.0-openjdk-devel zip
COPY --from=builder /opt/ /opt/
COPY --from=builder /root/.m2/ /root/.m2/
RUN rm -rf /opt/aws-glue-libs/conf

# Wacky workaround to get past the py4j error (credit @svajiraya - https://github.com/awslabs/aws-glue-libs/issues/25)
RUN rm -rf /opt/aws-glue-libs/jars/netty*
RUN sed -i /^mvn/s/^/#/ /opt/aws-glue-libs/bin/glue-setup.sh

# Env VAR setup
ENV M2_HOME=/opt/apache-maven-3.6.0
ENV SPARK_HOME=/opt/spark-2.2.1-bin-hadoop2.7
RUN echo 'export JAVA_HOME=$(ls -d /usr/lib/jvm/*openjdk*)' >> ~/.bash_profile && sed -i -e "/enableHiveSupport()/d" $SPARK_HOME/python/pyspark/shell.py && rm -vf /opt/aws-glue-libs/jars/netty-all-4.0.23.Final.jar
ENV PATH="${PATH}:${M2_HOME}/bin"
WORKDIR /opt/aws-glue-libs/
CMD ["bash", "-l", "-c", "./bin/gluepyspark"]
#Entrypoint for submitting scripts
# ENTRYPOINT ["/opt/aws-glue-libs/bin/gluesparksubmit"]
# CMD []

I built Docker images for glue-0.9 and glue-1.0 with OpenJDK as the base image (which avoids the hassle of setting up the Java environment variables). You can find them here: https://cloud.docker.com/repository/docker/svajiraya/glue-dev-0.9 and https://cloud.docker.com/repository/docker/svajiraya/glue-dev-1.0

@GytisZ commented Oct 5, 2019

Had the exact same issue. As mentioned by @svajiraya, it seems to be caused by duplicate jars on the classpath. The Spark distribution comes with its own jars in $SPARK_HOME/jars, and then ./glue-setup.sh downloads a bunch of duplicates into aws-glue-libs/jarsv1 and adds jarsv1 to the CLASSPATH via spark-defaults.conf.

Running mv aws-glue-libs/jarsv1/* $SPARK_HOME/jars and commenting out the mvn part in ./glue-setup.sh solved the main issue. There are still some copies of jars in different versions, among them different versions of netty causing the multiple-bindings message. But most importantly, pyspark works and, from my limited tests, Glue works as well.

I believe that, in general, having the same jars on the classpath is not a problem if they are identical or are different versions. But a bunch of spark-* jars have the same version while their md5 hashes differ:

-a9d47c1cc3c880bdce35f44209fd494d  ./spark-catalyst_2.11-2.4.3.jar
-27a72c8e2695f3f520291d1fe6f80e78  ./spark-core_2.11-2.4.3.jar
-543aff0bec30760e3c3e9910de68d965  ./spark-hive_2.11-2.4.3.jar
-0498c1676777f9b0e2266be7415f5e24  ./spark-kvstore_2.11-2.4.3.jar
-e2638fbb848805fd4102ff328bcc1064  ./spark-launcher_2.11-2.4.3.jar
-9863f17dafa6b7b1e36e75f0f3a1be7b  ./spark-network-common_2.11-2.4.3.jar
-cb6ee31bec38d8a9bbe28acdab7cc77c  ./spark-network-shuffle_2.11-2.4.3.jar
-b028c7f7ce0efe03b05449c2bc224d81  ./spark-sketch_2.11-2.4.3.jar
-9246bc79591546d2835714cca403ee88  ./spark-sql_2.11-2.4.3.jar
-023c465d826f0e71f817ec4d1a09dbc0  ./spark-tags_2.11-2.4.3.jar
-cebbaa18189e2506ea5d4deeb2ea3a84  ./spark-unsafe_2.11-2.4.3.jar
+438c00c0ff216504d568fb6d10aed85a  ./spark-catalyst_2.11-2.4.3.jar
+a934c360f4acb5840ff8a9d42bd722a2  ./spark-core_2.11-2.4.3.jar
+891e0d9a59434c33e61d9d7e5d7ae4e5  ./spark-hive_2.11-2.4.3.jar
+dda80bc2470a3c022ff5a403b39e33c3  ./spark-kvstore_2.11-2.4.3.jar
+33600f3910e97a28a934503d1f3b18a8  ./spark-launcher_2.11-2.4.3.jar
+63630a2f4d744263fd94037681d43e24  ./spark-network-common_2.11-2.4.3.jar
+d94c909b2cc70986d37610585e6ccba4  ./spark-network-shuffle_2.11-2.4.3.jar
+7ddc9cd0905e3e4e392225f8c74bb4b7  ./spark-sketch_2.11-2.4.3.jar
+6d26de0366bd2fa76e1dfbd1d9293c40  ./spark-sql_2.11-2.4.3.jar
+a32c98eab49c90aeecb3e88ac46f9614  ./spark-tags_2.11-2.4.3.jar
+b02b201d604454b63a668715ae0adcc5  ./spark-unsafe_2.11-2.4.3.jar

I suppose the ones on Maven are different from the ones frozen in gluesparkhadoop.tar.gz.

EDIT: as for a workaround, I've decided to ln -s ${SPARK_HOME}/jars ${AWSGLUELIB_DIR}/jarsv1, since we're installing it in a Dockerfile, and that seems a bit more straightforward than copying jars and commenting out mvn. A sketch is below.
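Roughly, the steps look like this (a sketch, reusing the /opt paths from earlier in this thread; AWSGLUELIB_DIR stands for the aws-glue-libs checkout and is an assumption):

AWSGLUELIB_DIR=/opt/aws-glue-libs   # assumed checkout location
# Move the Glue-provided jars into Spark's jar directory, then make jarsv1 a
# symlink to that single directory so the classpath entry in spark-defaults.conf
# still resolves but no duplicate copies remain on the classpath.
mv "$AWSGLUELIB_DIR"/jarsv1/*.jar "$SPARK_HOME/jars/"
rm -rf "$AWSGLUELIB_DIR/jarsv1"
ln -s "$SPARK_HOME/jars" "$AWSGLUELIB_DIR/jarsv1"
# Comment out the mvn line in glue-setup.sh so nothing gets re-downloaded.
sed -i '/^mvn /s/^/#/' "$AWSGLUELIB_DIR/bin/glue-setup.sh"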

@mvaniterson

For glue-1.0, modifying glue-setup.sh fixed it for me. Just rm the two netty jars from jarsv1 after the mvn command:

# Run mvn copy-dependencies target to get the Glue dependencies locally
mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jarsv1 dependency:copy-dependencies

# ADD THESE TWO LINES
rm ${GLUE_JARS_DIR}/netty-3.6.2.Final.jar
rm ${GLUE_JARS_DIR}/netty-all-4.0.23.Final.jar

@akoltsov-spoton

I forked this repo and made small changes in a few files: https://github.com/akoltsov-spoton/aws-glue-libs/tree/glue-1.0. This is my Dockerfile:

FROM openjdk:8-jdk-slim-buster

#Dependencies
RUN apt-get update -y && apt-get install -y python3 curl git zip vim
RUN cp /usr/bin/python3 /usr/bin/python
#Maven install
ADD https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz  /opt/apache-maven-3.6.0-bin.tar.gz
RUN tar -xvf /opt/apache-maven-3.6.0-bin.tar.gz -C /opt

#Spark install
ADD https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz  /opt/spark-2.4.3-bin-hadoop2.8.tgz
RUN tar -xvf /opt/spark-2.4.3-bin-hadoop2.8.tgz -C /opt

#AWS glue scripts
WORKDIR /opt
RUN git clone -b glue-1.0 --single-branch https://github.com/akoltsov-spoton/aws-glue-libs.git

# #Env setup
ENV M2_HOME=/opt/apache-maven-3.6.0
ENV SPARK_HOME=/opt/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
ENV PATH="${PATH}:${M2_HOME}/bin"

#Run gluesparksubmit once to download dependent jars
# RUN echo "print('Get Dependencies')" > /tmp/maven.py
# RUN bash -l -c /opt/aws-glue-libs/bin/gluesparksubmit /tmp/maven.py

# Wacky workaround to get past the py4j error (credit @svajiraya - https://github.com/awslabs/aws-glue-libs/issues/25)
RUN rm -rf /opt/aws-glue-libs/jars/netty*
RUN sed -i /^mvn/s/^/#/ /opt/aws-glue-libs/bin/glue-setup.sh

# Env VAR setup
# RUN echo 'export JAVA_HOME=$(ls -d /usr/lib/jvm/*openjdk*) >> ~/.bash_profile' && sed -i -e "/enableHiveSupport()/d" $SPARK_HOME/python/pyspark/shell.py
WORKDIR /opt/aws-glue-libs/
CMD ["bash", "-l", "-c", "./bin/gluepyspark"]
#Entrypoint for submitting scripts
# ENTRYPOINT ["/opt/aws-glue-libs/bin/gluesparksubmit"]
# CMD []

It's a bit ugly, but it works. At the moment I'm trying to understand how I can connect to it with the AWS CLI (boto3).

@svajiraya (Contributor)

There's no need to implement workarounds anymore, as the AWS Glue team has updated the dependency pom files.

This fixes the issue permanently.

The change that fixes the netty-jar-related issue is the exclusion of io.netty:netty-all from org.apache.hadoop:hadoop-hdfs:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.8.5</version>
      <exclusions>
        <exclusion>
          <groupId>io.netty</groupId>
          <artifactId>netty-all</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
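To pick up the updated pom and check the result, a quick sketch (assuming the checkout lives at /opt/aws-glue-libs and the jarsv1 output directory used by glue-1.0):

cd /opt/aws-glue-libs
# Re-resolve the Glue dependencies against the updated pom ...
mvn -f pom.xml -DoutputDirectory=jarsv1 dependency:copy-dependencies
# ... then inspect which netty jars, if any, were pulled in alongside Spark's.
ls jarsv1 | grep -i netty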

@dazza-codes can you please test this and mark the issue as resolved?

@GytisZ commented May 7, 2020

I can confirm that the issue has been fixed for v1.0.

Thank you!

However, the pattern of overwriting old versions (in this case v1.0.0) with new ones is very disturbing. This fix has broken our releases (because of the workarounds we had implemented). This shouldn't happen, and it wouldn't if the version were bumped whenever changes were made.

@webysther

@svajiraya can you update the Dockerfile to reflect this fix?

@webysther

When downloading from the pom I get this error:

Downloading from central: https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.pom (6.6 kB at 22 kB/s)
Downloading from aws-glue-etl-artifacts: s3://aws-glue-etl-artifacts/release/org/apache/httpcomponents/httpcomponents-client/4.5.9/httpcomponents-client-4.5.9.pom
[WARNING] s3://aws-glue-etl-artifacts/release - Connection refused
[INFO] Logged off - aws-glue-etl-artifacts

@svajiraya (Contributor)

@svajiraya can you update the Dockerfile to reflect this fix?

Hi @webysther

I have updated the Dockerfile to reflect this change and I have pushed new versions of the image to Docker Hub.

@webysther

Great job @svajiraya! Today I will publish a new repository for a more complete Glue dev setup, with support for the Glue Python Shell and other add-ons. I will use your Docker image as part of it.

@webysther

@dazza-codes can you confirm this is fixed, as cited by @GytisZ?

@webysther commented May 15, 2020

Hey @svajiraya, I created a new Docker image with multiple versions as tags, based on your Docker image, with a Python version for the Glue Python Shell and some improvements, as mentioned in the README:

https://github.com/webysther/aws-glue-docker

webysther/aws-glue tags:

Glue Spark: doesn't have pip-installed packages like pandas; this avoids misleading development

Python Shell

Spark

Getting started

# register alias
alias glue='docker run -v $PWD:/app -v ~/.aws:/home/docker/.aws -u $(id -u ${USER}):$(id -g ${USER}) -it webysther/aws-glue "$@"'
alias glue-spark='docker run -v $PWD:/app -v ~/.aws:/home/docker/.aws -u $(id -u ${USER}):$(id -g ${USER}) -it webysther/aws-glue:spark "$@"'

# bash
glue

# Glue Python Shell
# /app is your current folder
glue python
glue python /app/script.py

# Glue PySpark (REPL) 
glue-spark pyspark

# Glue PySpark
# /app is your current folder
glue-spark sparksubmit /app/spark_script.py

# Pyspark vanilla (without glue lib)
glue-spark
$ ${SPARK_HOME}/bin/spark-submit spark_script.py

# Test
glue pytest

# aliases (backwards compatibility)
gluesparksubmit == sparksubmit
gluepyspark == pyspark
gluepytest == pytest

@moomindani (Contributor)

Since the issue has already been solved in the above conversation, resolving.
Note: the issue won't happen in the newer images: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/
