Table of Contents
  1. Error Collection
    1.1. ClosedChannelException
    1.2. Lost Executors et al.
    1.3. NumberFormatException when running the bundled SparkPi example
    1.4. Wrong HDFS port
    1.5. java.net.BindException: Address already in use
    1.6. InvalidResourceRequestException: too many cores requested
    1.7. Incompatible clusterIDs
    1.8. One possible cause of a mid-run java.nio.channels.ClosedChannelException
    1.9. Notes on YARN resource allocation
    1.10. Remote RPC client disassociated
    1.11. NoClassDefFoundError when running a Spark program in IDEA
  2. References

I have been running Spark jobs a lot lately, mostly for distributed machine learning and distributed deep learning. The models are often large (VGG, for example), the cluster rarely has many idle nodes, so jobs sometimes struggle to run, and on top of that there are all kinds of baffling installation, configuration, and runtime problems. I am collecting them here for my own future reference and in the hope that they save others some time.

Error Collection

ClosedChannelException

1 ERROR YarnClientSchedulerBackend:70 - Yarn application has already exited with state FINISHED!
2 ERROR SparkContext:91 - Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
3 ERROR TransportClient:245 - Failed to send RPC 7202466410763583466 to /xx.xx.xx.xx:54864: java.nio.channels.ClosedChannelException
4 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map()) to AM was unsuccessful

The errors above are usually thrown together.
【Cause Analysis】
The memory allocated to each node is probably too small. By default Spark launches two executors with 1 GB of memory each; when the data is too large, YARN kills the executors outright, their IO channels are closed along with them, and the ClosedChannelException appears.
According to this error analysis [error 1], it may also be caused by Java 8's excessive memory allocation strategy.
【Solution】
Following this article, add the following configuration to yarn-site.xml:

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

Alternatively, pass --driver-memory 5g --executor-memory 5g when submitting, to explicitly give the job more memory.
Or add the following properties to spark/conf/spark-defaults.conf:

spark.driver.memory      5g
spark.executor.memory    5g

You can go further and also add the following properties:

spark.yarn.executor.memoryOverhead    4096
spark.yarn.driver.memoryOverhead      8192
spark.akka.frameSize                  700
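
For reference, a minimal sketch of the command-line approach, submitting to YARN with the memory options passed explicitly (the main class and jar path are placeholders, not from the original job):

spark-submit --master yarn --deploy-mode client \
    --driver-memory 5g --executor-memory 5g \
    --class com.example.MyApp /path/to/my-app.jar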

Lost Executors et al.

5. ERROR YarnScheduler:70 - Lost executor 3 on simple23: Container marked as failed: container_1490797147995_0000004 on host: simple23. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

[Stage 16:===========================================> (6 + 2) / 8]
6. ERROR TaskSetManager:70 - Task stage 17.2 failed 4 times; aborting job
7. ERROR DistriOptimizer$:655 - Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task age 17.2 failed 4 times, most recent failure: Lost task 0.3 in stage 17.2 (TID 90, simple21, executor 4): java.util.concurrent.EnException:

[Stage 23:> (0 + 3) / 3]
8. ERROR YarnScheduler:70 - Lost executor 4 on simple21: Container marked as failed: container_1490797147995_0004_01_000005 on host: simple21. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

[Stage 23:> (0 + 3) / 3]
9. ERROR TransportResponseHandler - Still have 1 requests outstanding when connection from /xx.xx.xx.22:51442 is closed

【Cause Analysis】
The messages show that YARN lost the executors, again most likely because they were killed, so once more check whether your driver-memory and executor-memory are large enough.
【Solution】
Same as for the previous error.

NumberFormatException when running the bundled SparkPi example

Running the bundled SparkPi example under Spark as follows:

./bin/run-example org.apache.spark.examples.SparkPi spark://10.39.2.31:7077

produces the error:

Exception in thread "main" java.lang.NumberFormatException: For input string: "spark://10.39.2.31:7077"

【Cause Analysis】
The arguments to SparkPi are wrong: the first argument, if present, must be the number of cores (slices); otherwise no argument should be given.
【Solution】
Run it in one of the following ways:

./bin/run-example org.apache.spark.examples.SparkPi 2 spark://10.39.2.31:7077     # run on the standalone cluster
./bin/run-example org.apache.spark.examples.SparkPi 2 local                       # run locally


Or with no arguments at all:

./bin/run-example org.apache.spark.examples.SparkPi

Wrong HDFS port

The program reads a file from HDFS with the path set to hdfs://122.3.2.20:2320/data, where 2320 is the hadoop client port, and the following error is thrown:

java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.; Host Details : local host is: "Server-020/122.3.2.20"; destination host is: "Server-020":2320;

【Cause Analysis】
The defaultFS configured for HDFS is hdfs://122.3.2.20:9000, so the port used when reading data should be 9000, not 2320.
【Solution】
Change the read path in the program to hdfs://122.3.2.20:9000/data.
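
Before touching the code, you can confirm the correct port by asking Hadoop for the configured defaultFS and listing the path through it (this assumes the hadoop binaries are on the PATH):

hdfs getconf -confKey fs.defaultFS          # should print hdfs://122.3.2.20:9000
hdfs dfs -ls hdfs://122.3.2.20:9000/data    # verifies the path is reachable on that port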

java.net.BindException: Address already in use

2017-07-08 15:14:41,668 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: 地址已在使用; For more details see:  http://wiki.apache.org/hadoop/BindException
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: 地址已在使用; For more details see: http://wiki.apache.org/hadoop/BindException
......
Caused by: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: 地址已在使用; For more details see: http://wiki.apache.org/hadoop/BindException
...
... 13 more
Caused by: java.net.BindException: 地址已在使用
...
... 21 more
2017-07-08 15:14:41,669 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...

【Cause Analysis】
Running netstat -ant | grep "8040" to see what occupies port 8040 shows:

tcp6       0      0 :::8040                 :::*                    LISTEN

This is most likely the result of shutting down a NodeManager, Worker, DataNode, etc. abnormally (e.g. with kill -9).

【Solution】
The PID is unknown but we want to kill the process anyway, so use:

sudo fuser -k 8040/tcp

which kills it successfully.
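
If you prefer to see which process owns the port before killing anything, either of the following works (shown for port 8040; run with enough privileges so the owning process is visible):

sudo netstat -antp | grep 8040    # the last column shows PID/program name
sudo lsof -i :8040                # lists the owning process directly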

InvalidResourceRequestException: too many cores requested

Running a BigDL program as follows:
dist/bin/bigdl.sh -- spark-submit --master yarn --deploy-mode cluster --driver-memory 12g --driver-cores 12 --executor-memory 12g --num-executors 4 --executor-cores 16 --class com.intel.analytics.bigdl.models.resnet.Train dist/lib/bigdl-0.1.0-jar-with-dependencies.jar -f cifar-10/ --batchSize 2048 --optnet true --depth 20 --classes 10 --shortcutType A --nEpochs 100 --learningRate 0.1 --momentum 0.9

throws the following error:
Exception in thread "main" org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=12, maxVirtualCores=8
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=12, maxVirtualCores=8

【Cause Analysis】
Running in cluster mode, YARN reports a maximum of 8 virtual cores; the requested number of virtual cores exceeds this maximum, hence the invalid resource request error.
The relevant YARN parameters (quoted from 《YARN 资源分配的配置参数》) are:

CPU resources
yarn.nodemanager.resource.cpu-vcores: the number of virtual CPUs YARN may use on this node, default 8
yarn.scheduler.minimum-allocation-vcores: the minimum number of virtual CPUs a single task may request, default 1
yarn.scheduler.maximum-allocation-vcores: the maximum number of virtual CPUs a single task may request, default 3

So the number of virtual CPUs YARN may use on a node defaults to 8, and setting our request higher than that easily makes the job fail.

【Solution】
Set yarn.nodemanager.resource.cpu-vcores in yarn-site.xml; for example, if the machine has 24 virtual cores, set it to 23 (leaving one core for other programs), or whatever fits your nodes.
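
A sketch of the corresponding yarn-site.xml entries, assuming a node with 24 virtual cores as in the example above; yarn.scheduler.maximum-allocation-vcores is the per-container cap from the quoted table and may also need raising, depending on your scheduler:

<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>23</value>    <!-- leave one core for other processes -->
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>23</value>    <!-- optional: raise the per-container request cap as well -->
</property>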

Incompatible clusterIDs

start-dfs.sh produces:

2017-07-09 16:58:57,796 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /home/experiment/huqiu/hadoop-2.7.2/hdfs/datanode: namenode clusterID = CID-ba940db1-9873-4468-822b-34b591a95fcd; datanode clusterID = CID-010385dd-b520-482c-bd47-e69632479501
2017-07-09 16:58:57,796 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to slave025/192.168.100.35:9000. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:745)
2017-07-09 16:58:57,798 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to slave025/192.168.100.35:9000
2017-07-09 16:58:57,898 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
2017-07-09 16:58:59,899 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2017-07-09 16:58:59,901 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2017-07-09 16:58:59,903 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

【Cause Analysis】
The clusterIDs do not match, probably because the namenode was reformatted after HDFS had already been set up; formatting the namenode does not touch the datanodes, hence the mismatch.

【Solution】
1) Stop HDFS, delete the old data and name directories (the local directories where the datanode and namenode store their data), and reformat.
2) Update the clusterID in the VERSION file under the datanode directory's current subdirectory, e.g. as sketched below.
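
A rough sketch of option 2, using the datanode directory from the log above; the namenode directory path is an assumption, so adjust it to your dfs.namenode.name.dir:

# 1. Read the namenode's clusterID (path is an assumption)
grep clusterID /path/to/hadoop-2.7.2/hdfs/namenode/current/VERSION
# 2. Write that clusterID into the datanode's VERSION file
#    (the ID below is the namenode clusterID reported in the error above)
sed -i 's/^clusterID=.*/clusterID=CID-ba940db1-9873-4468-822b-34b591a95fcd/' \
    /home/experiment/huqiu/hadoop-2.7.2/hdfs/datanode/current/VERSION
# 3. Restart the datanode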

One possible cause of a mid-run java.nio.channels.ClosedChannelException

2017-07-09 18:00:15 ERROR TransportClient:245 - Failed to send RPC 6773876705814091919 to /192.168.100.36:57255: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-07-09 18:00:15 ERROR YarnScheduler:70 - Lost executor 2 on slave026: Slave lost
2017-07-09 18:00:15 ERROR TransportClient:245 - Failed to send RPC 6158813764515683544 to /192.168.100.36:57255: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-07-09 18:00:15 ERROR YarnScheduler:70 - Lost executor 6 on slave026: Slave lost
[Stage 59:> (0 + 7) / 8]2017-07-09 18:00:15 ERROR TransportClient:245 - Failed to send RPC 8408959789263061264 to /192.168.100.36:57255: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2017-07-09 18:00:15 ERROR YarnScheduler:70 - Lost executor 1 on slave027: Slave lost
2017-07-09 18:00:15 ERROR TransportClient:245 - Failed to send RPC 7607614848097733030 to /192.168.100.36:57255: java.nio.channels.ClosedChannelException
......

Checking the log file yarn-experiment-nodemanager-slave029.log shows:

2017-07-09 19:54:49,434 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Unauthorized request to start container.
This token is expired. current time is 1499601289434 found 1499601228024

【Cause Analysis】
The system clocks of the master and the slaves differ by too much (more than 10 minutes), so the token expires and the job cannot continue.

【Solution】
1) The expiry time is controlled by the yarn.resourcemanager.rm.container-allocation.expiry-interval-ms parameter, 10 minutes by default. We can increase it, as follows:

<property>
    <name>yarn.resourcemanager.rm.container-allocation.expiry-interval-ms</name>
    <value>1000000</value>
</property>

2) A better approach is to synchronize the system clocks across the cluster. This requires root; one way is simply to set the time on every machine with a command of the form date -s "2017-09-08 20:00:00".
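
A minimal sketch of doing this by hand; the hostnames are taken from or modeled on the logs above and should be replaced with your own:

# compare the clocks of the machines involved
for h in master slave026 slave027 slave029; do ssh "$h" date; done
# on each node whose clock has drifted, set the time as root, e.g.:
sudo date -s "2017-09-08 20:00:00"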

Notes on YARN resource allocation

【TODO】
See 《Apache Spark Jobs 性能调优(二)》 on tuning resource allocation.

We started out with:

--driver-memory 12g --driver-cores 12 --executor-memory 12g --num-executors 4 --executor-cores 16

Remote RPC client disassociated

In Spark standalone mode, the following error appears:

Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

【Cause Analysis】 Insufficient memory: either --executor-memory was not set at submission time, so Spark fell back to the default (1 GB), or it was set too small. Consider increasing it.

【Solution】 Explicitly increase the amount of memory given to the job.
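
For example, in standalone mode the memory can be raised directly at submission time (the master URL, class, and jar path are placeholders):

spark-submit --master spark://master:7077 \
    --driver-memory 4g --executor-memory 4g \
    --class com.example.MyApp /path/to/my-app.jar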

NoClassDefFoundError when running a Spark program in IDEA

Running a Spark program in IDEA produces:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at rftest$.main(rftest.scala:9)
at rftest.main(rftest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more

【Cause Analysis】 The project probably declares spark-core with the provided scope, so the spark-core classes are missing at run time.
【Solution】 Remove the provided scope from the spark-core dependency in pom.xml.
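
A sketch of what the dependency entry looks like after removing the scope; the Scala suffix and version are assumptions, so match them to your project:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>   <!-- adjust the Scala suffix to your build -->
    <version>2.1.0</version>                   <!-- adjust to your Spark version -->
    <!-- no <scope>provided</scope> here, so IDEA keeps spark-core on the run classpath -->
</dependency>

If the jar is later submitted to a cluster with spark-submit, the provided scope is usually what you want there; removing it is just the quick way to make runs inside IDEA work.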

References