
Spark patch #139

Open · wants to merge 15 commits into main

Conversation

DoubleMindy

No description provided.

@CLAassistant

CLAassistant commented Sep 19, 2023

CLA assistant check
All committers have signed the CLA.

@nickitat nickitat self-assigned this Sep 23, 2023
spark/.gitkeep Outdated
@@ -0,0 +1 @@

Member

looks like this file could be removed
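
For reference, dropping the placeholder from the branch is a one-liner (the commit message here is only illustrative):

git rm spark/.gitkeep
git commit -m "remove spark/.gitkeep placeholder"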

StructField("RefererHash", LongType, nullable = false),
StructField("URLHash", LongType, nullable = false),
StructField("CLID", IntegerType, nullable = false))
)
Member

there is no way to create an index, right?

val timeElapsed = (end - start) / 1000000
println(s"Query $itr | Time: $timeElapsed ms")
itr += 1
})
Member

pls upload the results

Author

Uploaded in log.txt file

spark/benchmark.sh Outdated (resolved)

# For Spark 3.0.1 installation:
# wget --continue https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
Member

tar -xzf spark*

wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
#gzip -d hits.tsv.gz
chmod 777 ~ hits.tsv
$HADOOP_HOME/bin/hdfs dfs -put hits.tsv /
Member

But how do I set this variable?

$ echo $HADOOP_HOME

Member

I cannot find it:

find spark-3.5.0-bin-hadoop3 -name hdfs
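
The hdfs binary ships with the Hadoop distribution, not with the Spark binary package, which is why it is absent from spark-3.5.0-bin-hadoop3. A minimal sketch of making it available, assuming the Hadoop 3.3.6 tarball that is used later in this thread:

wget --continue https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop   # hdfs lives in $HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin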

@DoubleMindy
Author

Added Spark & HDFS deployment details to the benchmark.sh script. Added an example log.txt file from an HPC environment.

@alexey-milovidov
Member

alexey-milovidov commented Feb 7, 2024

The script benchmark.sh is incomplete:

ubuntu@ip-172-31-17-231:~$ mv hadoop-3* /usr/local/hadoop
mv: target '/usr/local/hadoop' is not a directory

Should be:

sudo mkdir -p /usr/local/hadoop
sudo mv hadoop-3* /usr/local/hadoop

@alexey-milovidov
Member

ubuntu@ip-172-31-17-231:~$ cd /usr/local/hadoop/etc/hadoop
-bash: cd: /usr/local/hadoop/etc/hadoop: No such file or directory

@DoubleMindy
Author

Updated Spark & HDFS directory creation

@alexey-milovidov
Member

I started editing your script to make it self-sufficient, but after fixing the errors, it does not work.

ubuntu@ip-172-31-40-233:~$ hdfs namenode -format
ERROR: JAVA_HOME is not set and could not be found.

(screenshot: Screenshot_20240207_223918)

Then:

ubuntu@ip-172-31-40-233:~$ export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
ubuntu@ip-172-31-40-233:~$ hdfs namenode -format
WARNING: /usr/local/hadoop/logs does not exist. Creating.
2024-02-07 21:36:45,767 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ip-172-31-40-233.eu-central-1.compute.internal/172.31.40.233
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.3.6
STARTUP_MSG:   classpath = /usr/local/hadoop/etc/hadoop:... (long list of Hadoop jar paths elided)
STARTUP_MSG:   build = https://github.com/apache/hadoop.git -r 1be78238728da9266a4f88195058f08fd012bf9c; compiled by 'ubuntu' on 2023-06-18T08:22Z
STARTUP_MSG:   java = 1.8.0_392
************************************************************/
2024-02-07 21:36:45,773 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2024-02-07 21:36:45,824 INFO namenode.NameNode: createNameNode [-format]
2024-02-07 21:36:46,061 INFO namenode.NameNode: Formatting using clusterid: CID-f777c88a-1c41-460b-ab1c-cb7cd004ce76
2024-02-07 21:36:46,088 INFO namenode.FSEditLog: Edit logging is async:true
2024-02-07 21:36:46,116 INFO namenode.FSNamesystem: KeyProvider: null
2024-02-07 21:36:46,117 INFO namenode.FSNamesystem: fsLock is fair: true
2024-02-07 21:36:46,118 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2024-02-07 21:36:46,121 INFO namenode.FSNamesystem: fsOwner                = ubuntu (auth:SIMPLE)
2024-02-07 21:36:46,121 INFO namenode.FSNamesystem: supergroup             = supergroup
2024-02-07 21:36:46,121 INFO namenode.FSNamesystem: isPermissionEnabled    = true
2024-02-07 21:36:46,121 INFO namenode.FSNamesystem: isStoragePolicyEnabled = true
2024-02-07 21:36:46,121 INFO namenode.FSNamesystem: HA Enabled: false
2024-02-07 21:36:46,153 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2024-02-07 21:36:46,234 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit : configured=1000, counted=60, effected=1000
2024-02-07 21:36:46,235 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2024-02-07 21:36:46,237 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2024-02-07 21:36:46,237 INFO blockmanagement.BlockManager: The block deletion will start around 2024 Feb 07 21:36:46
2024-02-07 21:36:46,238 INFO util.GSet: Computing capacity for map BlocksMap
2024-02-07 21:36:46,238 INFO util.GSet: VM type       = 64-bit
2024-02-07 21:36:46,239 INFO util.GSet: 2.0% max memory 6.8 GB = 139.6 MB
2024-02-07 21:36:46,239 INFO util.GSet: capacity      = 2^24 = 16777216 entries
2024-02-07 21:36:46,249 INFO blockmanagement.BlockManager: Storage policy satisfier is disabled
2024-02-07 21:36:46,249 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.999
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManager: defaultReplication         = 1
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManager: maxReplication             = 512
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManager: minReplication             = 1
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManager: redundancyRecheckInterval  = 3000ms
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
2024-02-07 21:36:46,253 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
2024-02-07 21:36:46,276 INFO namenode.FSDirectory: GLOBAL serial map: bits=29 maxEntries=536870911
2024-02-07 21:36:46,276 INFO namenode.FSDirectory: USER serial map: bits=24 maxEntries=16777215
2024-02-07 21:36:46,276 INFO namenode.FSDirectory: GROUP serial map: bits=24 maxEntries=16777215
2024-02-07 21:36:46,276 INFO namenode.FSDirectory: XATTR serial map: bits=24 maxEntries=16777215
2024-02-07 21:36:46,284 INFO util.GSet: Computing capacity for map INodeMap
2024-02-07 21:36:46,284 INFO util.GSet: VM type       = 64-bit
2024-02-07 21:36:46,284 INFO util.GSet: 1.0% max memory 6.8 GB = 69.8 MB
2024-02-07 21:36:46,284 INFO util.GSet: capacity      = 2^23 = 8388608 entries
2024-02-07 21:36:46,397 INFO namenode.FSDirectory: ACLs enabled? true
2024-02-07 21:36:46,397 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2024-02-07 21:36:46,397 INFO namenode.FSDirectory: XAttrs enabled? true
2024-02-07 21:36:46,397 INFO namenode.NameNode: Caching file names occurring more than 10 times
2024-02-07 21:36:46,401 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2024-02-07 21:36:46,402 INFO snapshot.SnapshotManager: SkipList is disabled
2024-02-07 21:36:46,405 INFO util.GSet: Computing capacity for map cachedBlocks
2024-02-07 21:36:46,405 INFO util.GSet: VM type       = 64-bit
2024-02-07 21:36:46,406 INFO util.GSet: 0.25% max memory 6.8 GB = 17.4 MB
2024-02-07 21:36:46,406 INFO util.GSet: capacity      = 2^21 = 2097152 entries
2024-02-07 21:36:46,411 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2024-02-07 21:36:46,412 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2024-02-07 21:36:46,412 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2024-02-07 21:36:46,414 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2024-02-07 21:36:46,414 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2024-02-07 21:36:46,415 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2024-02-07 21:36:46,415 INFO util.GSet: VM type       = 64-bit
2024-02-07 21:36:46,415 INFO util.GSet: 0.029999999329447746% max memory 6.8 GB = 2.1 MB
2024-02-07 21:36:46,415 INFO util.GSet: capacity      = 2^18 = 262144 entries
2024-02-07 21:36:46,431 INFO namenode.FSImage: Allocated new BlockPoolId: BP-920567602-172.31.40.233-1707341806426
2024-02-07 21:36:46,519 INFO common.Storage: Storage directory /tmp/hadoop-ubuntu/dfs/name has been successfully formatted.
2024-02-07 21:36:46,639 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-ubuntu/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2024-02-07 21:36:46,709 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-ubuntu/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2024-02-07 21:36:46,719 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2024-02-07 21:36:46,741 INFO namenode.FSNamesystem: Stopping services started for active state
2024-02-07 21:36:46,742 INFO namenode.FSNamesystem: Stopping services started for standby state
2024-02-07 21:36:46,744 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2024-02-07 21:36:46,744 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-172-31-40-233.eu-central-1.compute.internal/172.31.40.233
************************************************************/
ubuntu@ip-172-31-40-233:~$ start-dfs.sh
Starting namenodes on [localhost]
localhost: Warning: Permanently added 'localhost' (ED25519) to the list of known hosts.
localhost: ubuntu@localhost: Permission denied (publickey).
Starting datanodes
localhost: ubuntu@localhost: Permission denied (publickey).
Starting secondary namenodes [ip-172-31-40-233]
ip-172-31-40-233: Warning: Permanently added 'ip-172-31-40-233' (ED25519) to the list of known hosts.
ip-172-31-40-233: ubuntu@ip-172-31-40-233: Permission denied (publickey).
ubuntu@ip-172-31-40-233:~$ sudo start-dfs.sh
sudo: start-dfs.sh: command not found
ubuntu@ip-172-31-40-233:~$ touch test.tsv
ubuntu@ip-172-31-40-233:~$ hdfs dfs -put test.tsv /
put: Call From ip-172-31-40-233.eu-central-1.compute.internal/172.31.40.233 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

PS. The current version of the script is:

#!/bin/bash

sudo apt-get update
sudo apt-get -y install openjdk-8-jdk-headless

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

### If there is no HDFS and Spark on your system:

wget --continue https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
wget --continue https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf hadoop-3.3.6.tar.gz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
sudo mv hadoop-3.3.6 /usr/local/hadoop
sudo mv spark-3.5.0-bin-hadoop3 /usr/local/spark

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

### To configure HDFS:

cp /usr/local/hadoop/etc/hadoop/core-site.xml /usr/local/hadoop/etc/hadoop/core-site.xml.bak
cp /usr/local/hadoop/etc/hadoop/hdfs-site.xml /usr/local/hadoop/etc/hadoop/hdfs-site.xml.bak

echo "
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
" | tee /usr/local/hadoop/etc/hadoop/core-site.xml

echo "
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
" | tee /usr/local/hadoop/etc/hadoop/hdfs-site.xml

### To configure Spark:

cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh

echo "export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop" | tee --append /usr/local/spark/conf/spark-env.sh
echo "export SPARK_MASTER_HOST=localhost" | tee --append /usr/local/spark/conf/spark-env.sh

### To run Spark and HDFS:

start-master.sh
start-slave.sh spark://localhost:7077
hdfs namenode -format
start-dfs.sh

wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
gzip -d hits.tsv.gz
chmod 777 ~ hits.tsv
hdfs dfs -put hits.tsv /

$SPARK_HOME/bin/spark-shell --master local -i ClickBenchRunner.scala

@alexey-milovidov alexey-milovidov self-assigned this Feb 7, 2024
@DoubleMindy
Author

@alexey-milovidov We assume that passwordless SSH to localhost is already configured (in other words, running ssh localhost from localhost does not prompt for a password).

Please clarify the following details:

  1. Should we add the ssh-keygen setup steps to this script?
  2. Do we assume that HDFS is located on localhost, or is it better to make the target host an external variable?

@alexey-milovidov
Member

  1. Yes.
  2. Yes.

I do it this way: create a fresh VM on AWS and run the commands one by one.
"Reproducibility" means that the commands should succeed. The script should be self-contained to localhost.

@alexey-milovidov
Member

@DoubleMindy, let's continue.

@DoubleMindy
Author

Added full HDFS deployment; on a fresh VM there is no problem with putting the file.

@alexey-milovidov
Member

Sorry, but the script still does not reproduce. I'm copy-pasting the commands one by one, and getting this:
https://pastila.nl/?002ddf82/4eb61c74ae88429085d2360f926f231e#YmAhevhatmLeuU+SXJohJA==

@alexey-milovidov
Member

We need a reproducible script to install Spark. It should run by itself.

@rschu1ze
Member

@DoubleMindy It would be appreciated if you continued with this, but for now I'll close the PR (for cleanup reasons).

@rschu1ze rschu1ze closed this Sep 18, 2024
@alexey-milovidov
Member

I'm very interested in the results of Spark, but we need at least one person who can install it.

If a system cannot be easily installed, it is game over.
I'm surprised that there are people who have managed to use Spark.

SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 10;
SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND TraficSourceID IN (-1, 6) AND RefererHash = 3594120000172545465 GROUP BY URLHash, EventDate ORDER BY PageViews DESC LIMIT 10;
SELECT WindowClientWidth, WindowClientHeight, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND DontCountHits = 0 AND URLHash = 2868770270353813622 GROUP BY WindowClientWidth, WindowClientHeight ORDER BY PageViews DESC LIMIT 10;
SELECT DATE_TRUNC('minute', EventTime) AS M, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15' AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY DATE_TRUNC('minute', EventTime) ORDER BY DATE_TRUNC('minute', EventTime) LIMIT 10;
Member

Here, the OFFSET clause was removed, which is incorrect.

It should be either LIMIT 1010, to get the closest result, or a subquery with ROW_NUMBER.
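
For illustration, the ROW_NUMBER variant could look like the sketch below, assuming the spark-sql CLI from the Spark distribution is on PATH; the inner query is a simplified stand-in, not the exact benchmark text:

spark-sql -e "
SELECT URL, PageViews FROM (
  SELECT URL, COUNT(*) AS PageViews,
         ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS rn
  FROM hits
  GROUP BY URL
) numbered
WHERE rn > 1000 AND rn <= 1010;  -- emulates LIMIT 10 OFFSET 1000
"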
