
Background

Our offline scheduling stack is Azkaban with Spark 2.4. One morning a large number of jobs suddenly started failing; the execution logs showed the following exception:

22-10-2024 10:28:51 CST task_name INFO - org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /user/azkaban/spark2.4/history is exceeded: limit=1048576 items=1048576
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2248)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2336)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:2304)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:2087)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addFile(FSDirectory.java:390)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:3015)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2890)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2774)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:610)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:117)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274)
22-10-2024 10:28:51 CST task_name INFO - 	at java.security.AccessController.doPrivileged(Native Method)
22-10-2024 10:28:51 CST task_name INFO - 	at javax.security.auth.Subject.doAs(Subject.java:422)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2272)
22-10-2024 10:28:51 CST task_name INFO - 
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.ipc.Client.call(Client.java:1470)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.ipc.Client.call(Client.java:1401)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
22-10-2024 10:28:51 CST task_name INFO - 	at com.sun.proxy.$Proxy14.create(Unknown Source)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:295)
22-10-2024 10:28:51 CST task_name INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
22-10-2024 10:28:51 CST task_name INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
22-10-2024 10:28:51 CST task_name INFO - 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
22-10-2024 10:28:51 CST task_name INFO - 	at java.lang.reflect.Method.invoke(Method.java:498)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
22-10-2024 10:28:51 CST task_name INFO - 	at com.sun.proxy.$Proxy15.create(Unknown Source)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1721)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1657)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1582)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:397)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:393)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:393)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:337)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:120)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:522)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
22-10-2024 10:28:51 CST task_name INFO - 	at scala.Option.getOrElse(Option.scala:121)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:308)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:157)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
22-10-2024 10:28:51 CST task_name INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
22-10-2024 10:28:51 CST task_name INFO - 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
22-10-2024 10:28:51 CST task_name INFO - 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
22-10-2024 10:28:51 CST task_name INFO - 	at java.lang.reflect.Method.invoke(Method.java:498)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
22-10-2024 10:28:51 CST task_name INFO - 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

As the trace shows, the directory `/user/azkaban/spark2.4/history` has hit the HDFS per-directory item limit (`dfs.namenode.fs-limits.max-directory-items`, default 1048576): every Spark application writes an event log file here, and the directory finally filled up.

Solution

Modify `hdfs-site.xml` as follows and restart the NameNode (this limit is enforced on the NameNode side, so the DataNodes do not strictly need a restart):

<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>3200000</value>
  <description>Defines the maximum number of items that a directory may
      contain. Cannot set the property to a value less than 1 or more than
      6400000.</description>
</property>
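Raising the limit only buys time: Spark keeps writing one event log per application into this directory, so it will fill up again. Spark 2.4 ships a History Server cleaner that periodically purges old event logs; a sketch for `spark-defaults.conf` (the retention values below are examples, tune them to your cluster):

```
# enable periodic cleanup of event logs in the history directory
spark.history.fs.cleaner.enabled   true
# how often the cleaner runs
spark.history.fs.cleaner.interval  1d
# event logs older than this are deleted
spark.history.fs.cleaner.maxAge    7d
```

Note the cleaner runs inside the Spark History Server process, so that service must be running for the purge to happen.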

An aside

Before raising the limit, I tried to delete some files so the jobs could run again, executing the following on the gateway machine:

[exakit@10.10.10.10 ~]$ hdfs dfs -rm -r /user/azkaban/spark2.4/history/*
24/10/22 10:31:23 INFO retry.RetryInvocationHandler: Exception while invoking getListing of class ClientNamenodeProtocolTranslatorPB over namenode.com/10.10.10.3:8020. Trying to fail over immediately.
java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:597)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
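This OOM is client-side: to expand the `*` glob, the HDFS client has to pull the entire million-entry listing into its own JVM before issuing a single delete. Workarounds are to raise the client heap (e.g. `HADOOP_CLIENT_OPTS="-Xmx8g"`), delete the directory itself without a glob, or delete in bounded batches. A minimal sketch of the batching idea in Python; the `paths` list is a stand-in for a real HDFS listing, and handing each batch to one `hdfs dfs -rm -r` invocation is left as a comment:

```python
def chunked(items, size):
    """Yield successive fixed-size batches from a list of paths."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Stand-in for the real directory listing (hypothetical application IDs).
paths = [f"/user/azkaban/spark2.4/history/app-{i:04d}" for i in range(10)]

for batch in chunked(paths, 4):
    # In practice, each batch would go to one bounded shell call, e.g.:
    # subprocess.run(["hdfs", "dfs", "-rm", "-r", *batch], check=True)
    print(f"deleting {len(batch)} paths in this batch")
```

Each invocation then holds at most `size` paths in memory instead of the whole listing.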

Once the configuration change was in place, restarting the failed jobs in Azkaban let them run normally again.
