
Add script to run spark cluster #143

Merged: 10 commits merged into opensource4you:main on Dec 10, 2021

Conversation

chia7712 (Contributor) commented on Dec 8, 2021

This is preparation for the next topic: this script lets us start a Spark cluster.

chia7712 self-assigned this on Dec 8, 2021
chia7712 (Contributor, Author) commented on Dec 8, 2021

@harryteng9527 Please help test this.

harryteng9527 (Collaborator) commented:

@chia7712

Test environment:

Docker version: 20.10.7
OS: Ubuntu

Master and worker communication problem

After starting the master and the worker, the master's Web UI shows no workers, as in the screenshot below:
[Screenshot: spark-master Web UI]

The worker's Web UI is shown below:
[Screenshot: spark-slave1 Web UI]

The connection appears to be abnormal.

Worker shuts down automatically

After a worker has been running for about 10 minutes, it shuts itself down. I am not sure whether this behavior is normal.

chia7712 (Contributor, Author) commented on Dec 9, 2021

@harryteng9527 Thanks for testing.

After a worker has been running for about 10 minutes, it shuts itself down. I am not sure whether this behavior is normal.

Could you paste the logs for me? (docker logs xxx)

Also, could you pull the latest version and try again? I have fixed a connection-related bug.
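
A minimal sketch of the requested commands; the filter pattern and the container placeholder are illustrative, since the actual container names depend on what the script assigns:

# List running containers to find the worker container's ID or name.
docker ps --filter "name=spark"
# Dump that container's stdout/stderr; -f follows new output as it appears.
docker logs -f <worker-container-id>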

harryteng9527 (Collaborator) commented:

@chia7712

Could you paste the logs for me?

After checking docker logs worker_containerID, I found that the automatic shutdown is caused by the worker failing to reach the master.
The worker retries the connection to the master multiple times (16 attempts); if none succeeds, it throws an error and shuts itself down.

The log is attached below:

21/12/09 17:10:16 INFO Worker: Retrying connection to master (attempt # 16)
21/12/09 17:10:16 INFO Worker: Connecting to master 127.0.1.1:11587...
21/12/09 17:10:16 WARN Worker: Failed to connect to master 127.0.1.1:11587
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
    at org.apache.spark.deploy.worker.Worker$$anon$1.run(Worker.scala:298)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Failed to connect to /127.0.1.1:11587
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:287)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
    ... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.1.1:11587
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
21/12/09 17:10:55 ERROR Worker: All masters are unresponsive! Giving up.

harryteng9527 (Collaborator) commented:

@chia7712

Do I need to modify the IP to test?

[Screenshot: address]
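
For context, the 127.0.1.1 address in the worker log above is the loopback alias that Ubuntu conventionally assigns to the machine's hostname in /etc/hosts; other machines and containers cannot reach it. A quick diagnostic sketch (not part of the original report) to check what the hostname resolves to and which LAN address to use instead:

# Show how the local resolver maps the hostname; on Ubuntu this often
# prints the unreachable 127.0.1.1 alias seen in the worker log.
getent hosts "$(hostname)"
# Show the host's actual LAN addresses, one of which should be used instead.
hostname -I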

chia7712 (Contributor, Author) commented on Dec 9, 2021

Do I need to modify the IP to test?

Right, that is probably the root cause.

harryteng9527 (Collaborator) commented:

@chia7712

After changing the hostname to the host's IP, the script runs normally.
[Screenshot: changed hostname]
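
For reference, a minimal sketch of the equivalent fix on a plain Spark distribution, assuming the standard sbin scripts; the docker script in this PR presumably does the same internally, and the IP below is the one from this test:

# Advertise the master on the LAN address instead of the hostname's
# 127.0.1.1 alias; SPARK_MASTER_HOST is honored by start-master.sh.
export SPARK_MASTER_HOST=192.168.103.64
./sbin/start-master.sh
# Point the worker at that address (7077 is Spark's default master port;
# the script here appears to pick a random port such as the 17941 above).
./sbin/start-worker.sh spark://192.168.103.64:7077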

Test procedure

I ran one master and one worker node and submitted the SparkPi example; the job shows as complete in the Web UI.

Command used:

~/spark/bin$ ./spark-submit \
    --deploy-mode cluster \
    --master spark://192.168.103.64:17941 \
    --class org.apache.spark.examples.SparkPi \
    ~/spark/examples/jars/spark-examples_2.12-3.1.2.jar 10

Master Web UI

[Screenshot: masternodeweb]

Worker Web UI

[Screenshot: workernodeweb]

Since I could submit a job as above and view the execution result in the Web UI, I believe this script works correctly.

Question about creating workers

Does this currently support creating only one master and one worker?
When I tried to create two workers, I found that running multiple spark workers on the same node is not allowed. Or is there some parameter I should set in order to create multiple workers?

[Screenshot: twoworker]

chia7712 (Contributor, Author) commented:

Does this currently support creating only one master and one worker?
When I tried to create two workers, I found that running multiple spark workers on the same node is not allowed. Or is there some parameter I should set in order to create multiple workers?

This is an intentional restriction, and it also follows Spark's conventions. Spark is a resource-management service: deploying multiple workers on the same node leads to confusion, for example resources that are already occupied getting allocated again, which causes quite a few performance problems down the road.

Therefore, to create multiple workers, they must run on different nodes.
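
A short sketch of the one-worker-per-node layout this implies; the script name start_spark.sh and its argument are illustrative guesses, not taken from this PR:

# On the master machine (hypothetical invocation):
./start_spark.sh
# On each additional machine, start exactly one worker pointed at the
# master's URL (address and port taken from the test above):
./start_spark.sh spark://192.168.103.64:17941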

chia7712 merged commit 3da644c into opensource4you:main on Dec 10, 2021
chia7712 deleted the add_spark_script branch on December 29, 2021