Table of Contents

1. Error Collection
   1. ClosedChannelException
   2. Lost Executors et al.
   3. NumberFormatException when running the bundled SparkPi example
   4. Wrong HDFS port
   5. java.net.BindException: Address already in use
   6. InvalidResourceRequestException: requested cores too high
   7. Incompatible clusterIDs
   8. One possible cause of a mid-run java.nio.channels.ClosedChannelException
   9. Notes on YARN resource allocation
   10. Remote RPC client disassociated
   11. NoClassDefFoundError when running a Spark program in IDEA
   12. "key not found" error on first pyspark run
   13. Running the pyspark example: FileNotFoundException
2. References
    I have been running Spark programs a lot recently, mostly distributed machine learning and distributed deep learning. The models are often large (VGG, for instance) while the cluster rarely has many idle nodes, so jobs sometimes struggle, and on top of that come all sorts of baffling installation, configuration, and runtime problems. This post collects them for my own future reference, and hopefully saves others who hit the same issues some time.

    Error Collection

    ClosedChannelException
    1 ERROR YarnClientSchedulerBackend:70 - Yarn application has already exited with state FINISHED!
    2 ERROR SparkContext:91 - Error initializing SparkContext.
    java.lang.IllegalStateException: Spark context stopped while waiting for backend
    3 ERROR TransportClient:245 - Failed to send RPC 7202466410763583466 to /xx.xx.xx.xx:54864: java.nio.channels.ClosedChannelException
    4 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map()) to AM was unsuccessful

    The errors above are usually thrown together.
    [Analysis]
    The memory allocated to each node may be too small: by default Spark starts two executors with 1 GB of memory each, and if the data is too large, YARN kills the executors outright, closing their IO channels along with them, hence the ClosedChannelException.
    According to the analysis linked here, error 1 may also be caused by Java 8's excessive memory allocation strategy.
    [Solution]
    Following this article, add the following properties to yarn-site.xml:

    <property>
      <name>yarn.nodemanager.pmem-check-enabled</name>
      <value>false</value>
    </property>

    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>

    Alternatively, pass --driver-memory 5g --executor-memory 5g when submitting, explicitly increasing the memory available to the job.
    Or add the following properties to spark/conf/spark-defaults.conf:

    spark.driver.memory     5g
    spark.executor.memory   5g

    You can even add the following properties on top of that:

    spark.yarn.executor.memoryOverhead   4096
    spark.yarn.driver.memoryOverhead     8192
    spark.akka.frameSize                 700
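
    For reference, a single spark-submit invocation combining these settings might look like the sketch below (the main class and jar name are placeholders; the overhead properties go through --conf):

    spark-submit --master yarn \
      --driver-memory 5g \
      --executor-memory 5g \
      --conf spark.yarn.driver.memoryOverhead=8192 \
      --conf spark.yarn.executor.memoryOverhead=4096 \
      --class com.example.MyApp my-app.jar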

    Lost Executors et al.
    5. ERROR YarnScheduler:70 - Lost executor 3 on simple23: Container marked as failed: container_1490797147995_0000004 on host: simple23. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
    Container exited with a non-zero exit code 143
    Killed by external signal

    [Stage 16:===========================================> (6 + 2) / 8]
    6. ERROR TaskSetManager:70 - Task stage 17.2 failed 4 times; aborting job
    7. ERROR DistriOptimizer$:655 - Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task stage 17.2 failed 4 times, most recent failure: Lost task 0.3 in stage 17.2 (TID 90, simple21, executor 4): java.util.concurrent.EnException:

    [Stage 23:> (0 + 3) / 3]
    8. ERROR YarnScheduler:70 - Lost executor 4 on simple21: Container marked as failed: container_1490797147995_0004_01_000005 on host: simple21. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
    Container exited with a non-zero exit code 143
    Killed by external signal

    [Stage 23:> (0 + 3) / 3]
    9. ERROR TransportResponseHandler - Still have 1 requests outstanding when connection from /xx.xx.xx.22:51442 is closed

    [Analysis]
    The messages show that YARN lost the executors, again most likely because they were killed, so check once more whether driver-memory and executor-memory are large enough.
    [Solution]
    Same as the previous error.

    NumberFormatException when running the bundled SparkPi example

    Running the bundled SparkPi example under Spark as follows:

    ./bin/run-example org.apache.spark.examples.SparkPi spark://10.39.2.31:7077

    produces the error:

    Exception in thread "main" java.lang.NumberFormatException: For input string: "spark://10.39.2.31:7077"

    [Analysis]
    The arguments to SparkPi are wrong: the first argument, if given, must be the number of slices (cores), not a master URL; alternatively it can be run with no arguments at all.
    [Solution]
    Run it like this:

    ./bin/run-example org.apache.spark.examples.SparkPi 2 spark://10.39.2.31:7077     # standalone
    ./bin/run-example org.apache.spark.examples.SparkPi 2 local                       # local


    Or with no arguments at all:

    ./bin/run-example org.apache.spark.examples.SparkPi

    Wrong HDFS port

    The program reads an HDFS file at hdfs://122.3.2.20:2320/data, where 2320 is the hadoop client port, and fails with:

    java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.; Host Details : local host is: "Server-020/122.3.2.20"; destination host is: "Server-020":2320;

    [Analysis]
    fs.defaultFS in the HDFS configuration is hdfs://122.3.2.20:9000, so the port used when reading data should be 9000, not 2320.
    [Solution]
    Change the read path in the program to hdfs://122.3.2.20:9000/data.
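
    Before hardcoding the address, you can ask HDFS which filesystem it is actually configured with (assuming the hadoop binaries are on the PATH):

    # prints the configured default filesystem, e.g. hdfs://122.3.2.20:9000
    hdfs getconf -confKey fs.defaultFS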

    java.net.BindException: Address already in use
    2017-07-08 15:14:41,668 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
    org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
    ......
    Caused by: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
    ...
    ... 13 more
    Caused by: java.net.BindException: Address already in use
    ...
    ... 21 more
    2017-07-08 15:14:41,669 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...

    [Analysis]
    netstat -ant | grep "8040" shows what is occupying port 8040:

    tcp6       0      0 :::8040                 :::*                    LISTEN

    This is most likely left over from shutting down the NodeManager, Worker, or DataNode abnormally (e.g. with kill -9).

    [Solution]
    The PID is unknown, but the process can be killed by port:

    sudo fuser -k 8040/tcp

    which kills it successfully.
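
    If fuser is unavailable, lsof gives the PID first and you kill it yourself (a sketch; both steps need root for processes owned by other users):

    # find the PID listening on 8040, then kill it
    sudo lsof -i :8040
    sudo kill -9 <PID>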

    InvalidResourceRequestException: requested cores too high
    Running a BigDL program as follows:
    dist/bin/bigdl.sh -- spark-submit --master yarn --deploy-mode cluster --driver-memory 12g --driver-cores 12 --executor-memory 12g --num-executors 4 --executor-cores 16 --class com.intel.analytics.bigdl.models.resnet.Train dist/lib/bigdl-0.1.0-jar-with-dependencies.jar -f cifar-10/ --batchSize 2048 --optnet true --depth 20 --classes 10 --shortcutType A --nEpochs 100 --learningRate 0.1 --momentum 0.9

    fails with:
    Exception in thread "main" org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=12, maxVirtualCores=8
    Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=12, maxVirtualCores=8

    [Analysis]
    Running in cluster mode, the message says the maximum is 8 virtual cores, and the requested virtual cores exceed that maximum, hence the invalid resource request.
    The relevant YARN parameters (quoted from 《YARN 资源分配的配置参数》):

    CPU resources:
    yarn.nodemanager.resource.cpu-vcores — number of virtual CPUs YARN may use on this node; default 8
    yarn.scheduler.minimum-allocation-vcores — minimum virtual CPUs a single task may request; default 1
    yarn.scheduler.maximum-allocation-vcores — maximum virtual CPUs a single task may request; default 3

    Since the number of virtual CPUs YARN may use per node defaults to 8, requesting more than that makes the run fail.

    [Solution]
    Set yarn.nodemanager.resource.cpu-vcores in yarn-site.xml. For example, on a machine with 24 virtual cores it could be set to 23 (leaving one core for other processes), or whatever fits.
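
    A sketch of the corresponding yarn-site.xml entry, assuming the 24-core machine above (note that yarn.scheduler.maximum-allocation-vcores can also cap per-container requests and may need raising as well):

    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>23</value>
    </property>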

    Incompatible clusterIDs

    start-dfs.sh produces:

    2017-07-09 16:58:57,796 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /home/experiment/huqiu/hadoop-2.7.2/hdfs/datanode: namenode clusterID = CID-ba940db1-9873-4468-822b-34b591a95fcd; datanode clusterID = CID-010385dd-b520-482c-bd47-e69632479501
    2017-07-09 16:58:57,796 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to slave025/192.168.100.35:9000. Exiting.
    java.io.IOException: All specified directories are failed to load.
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
    at java.lang.Thread.run(Thread.java:745)
    2017-07-09 16:58:57,798 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to slave025/192.168.100.35:9000
    2017-07-09 16:58:57,898 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
    2017-07-09 16:58:59,899 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
    2017-07-09 16:58:59,901 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
    2017-07-09 16:58:59,903 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

    [Analysis]
    The clusterIDs do not match, probably because the namenode was reformatted after HDFS had already been started; formatting the namenode does not touch the datanodes, hence the mismatch.

    [Solution]
    1) Stop HDFS, delete the old data and name directories (the local directories where the datanode and namenode store their data), and reformat, as sketched below.
    2) Or update the clusterID in the VERSION file under the datanode's current directory.
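
    A sketch of option 1; the datanode path is taken from the log above, the namenode path is an assumption to match, and this wipes all HDFS data, so use it only when the data is expendable:

    stop-dfs.sh
    # on every node: remove the old storage directories
    rm -rf /home/experiment/huqiu/hadoop-2.7.2/hdfs/datanode/*
    rm -rf /home/experiment/huqiu/hadoop-2.7.2/hdfs/namenode/*
    hdfs namenode -format
    start-dfs.sh

    For option 2, copy the namenode clusterID shown in the error message into the datanode's VERSION file (here hdfs/datanode/current/VERSION) instead.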

    One possible cause of a mid-run java.nio.channels.ClosedChannelException
    2017-07-09 18:00:15 ERROR TransportClient:245 - Failed to send RPC 6773876705814091919 to /192.168.100.36:57255: java.nio.channels.ClosedChannelException
    java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
    2017-07-09 18:00:15 ERROR YarnScheduler:70 - Lost executor 2 on slave026: Slave lost
    2017-07-09 18:00:15 ERROR TransportClient:245 - Failed to send RPC 6158813764515683544 to /192.168.100.36:57255: java.nio.channels.ClosedChannelException
    java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
    2017-07-09 18:00:15 ERROR YarnScheduler:70 - Lost executor 6 on slave026: Slave lost
    [Stage 59:> (0 + 7) / 8]2017-07-09 18:00:15 ERROR TransportClient:245 - Failed to send RPC 8408959789263061264 to /192.168.100.36:57255: java.nio.channels.ClosedChannelException
    java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
    2017-07-09 18:00:15 ERROR YarnScheduler:70 - Lost executor 1 on slave027: Slave lost
    2017-07-09 18:00:15 ERROR TransportClient:245 - Failed to send RPC 7607614848097733030 to /192.168.100.36:57255: java.nio.channels.ClosedChannelException
    ......

    The log file yarn-experiment-nodemanager-slave029.log shows:

    2017-07-09 19:54:49,434 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Unauthorized request to start container.
    This token is expired. current time is 1499601289434 found 1499601228024

    [Analysis]
    The system clocks of the master and slaves differ too much (by more than 10 minutes), so the token expires and the job cannot continue.

    [Solution]
    1) The expiry time is controlled by yarn.resourcemanager.rm.container-allocation.expiry-interval-ms, which defaults to 10 minutes. It can be increased as follows:

    <property>
      <name>yarn.resourcemanager.rm.container-allocation.expiry-interval-ms</name>
      <value>1000000</value>
    </property>

    2) The better fix, though, is to synchronize the system clocks across the cluster, which requires root. One way is simply to set the time on each machine:
    date -s "2017-09-08 20:00:00" (same format on every node).
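
    If the machines can reach an NTP server, syncing the clocks automatically is more reliable than setting them by hand (a sketch, assuming ntpdate is installed and root is available):

    # run on every node
    sudo ntpdate pool.ntp.org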

    Notes on YARN resource allocation

    [TODO]
    See 《Apache Spark Jobs 性能调优(二)》 on tuning resource allocation.

    At first we assumed

    --driver-memory 12g --driver-cores 12 --executor-memory 12g --num-executors 4 --executor-cores 16

    Remote RPC client disassociated

    In Spark standalone mode, the job fails with:

    Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

    [Analysis] Caused by insufficient memory: either --executor-memory was not set at submission time, so Spark used its default (1 GB), or it was set too low; consider increasing it.

    [Solution] Explicitly increase the memory.
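
    For example (master URL, sizes, and jar are placeholders):

    spark-submit --master spark://master:7077 \
      --driver-memory 4g \
      --executor-memory 4g \
      my-app.jar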

    NoClassDefFoundError when running a Spark program in IDEA

    Running a Spark program from IDEA throws:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
    at rftest$.main(rftest.scala:9)
    at rftest.main(rftest.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more

    [Analysis] The project was probably built with spark-core in provided scope, so the spark-core classes are missing at runtime.
    [Solution] Remove the provided scope from the spark-core dependency in pom.xml, as in the sketch below.
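
    A hypothetical pom.xml fragment (the artifact and version here are examples; match them to your build):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.1.0</version>
      <!-- <scope>provided</scope> removed so spark-core is on IDEA's runtime classpath -->
    </dependency>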

    "key not found" error on first pyspark run

    Running pyspark locally:

    conf = SparkConf().setAppName("xx")
    sc = SparkContext(conf = conf)

    throws:

    19/04/24 16:03:22 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
    java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1228)
    at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
    at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Traceback (most recent call last):
    File "/home/huqiu/SPARE/EnronPySpark/project/test-driver.py", line 25, in <module>
    sc = SparkContext(conf = conf)
    File "/home/huqiu/programs/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    File "/home/huqiu/programs/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
    File "/home/huqiu/programs/anaconda3/lib/python3.6/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
    File "/home/huqiu/programs/anaconda3/lib/python3.6/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
    Exception: Java gateway process exited before sending its port number

    This may be a version mismatch: pyspark is at 2.4.1 while spark is still at 2.1.0.
    If this happens in PyCharm, also check whether SPARK_HOME is set under Edit Configurations, or set it directly in code: os.environ["SPARK_HOME"] = '/path/to/spark'.
    [Solution] Upgrade spark, or downgrade pyspark; set SPARK_HOME as described above.
    If necessary, also set:

    os.environ['PYSPARK_PYTHON'] = '/path/to/python'
    os.environ['PYSPARK_DRIVER_PYTHON'] = '/path/to/python'
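
    Equivalently, these can be exported in the shell before launching (paths are placeholders):

    export SPARK_HOME=/path/to/spark
    export PYSPARK_PYTHON=/path/to/python
    export PYSPARK_DRIVER_PYTHON=/path/to/python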

    Running the pyspark example: FileNotFoundException

    Running the example ./bin/run-example SparkPi 10 produces:

    java.io.FileNotFoundException: File file:/tmp/spark-events does not exist

    [Solution] Create a spark-events directory under /tmp.
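
    The path comes from spark.eventLog.dir, which defaults to file:///tmp/spark-events when event logging is enabled, so creating the directory is enough:

    mkdir /tmp/spark-events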
    Running again succeeds:

    ./bin/run-example SparkPi 10
    19/04/24 16:01:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/04/24 16:01:53 WARN Utils: Your hostname, WHATBEG resolves to a loopback address: 127.0.1.1; using 114.212.83.163 instead (on interface enx502b73dc8a6a)
    19/04/24 16:01:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    Pi is roughly 3.1404231404231404

    References