Solved : Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

Sudhir Pradhan Troubleshoot July 9, 2018 | 0

Post Views: 2,805

Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.27.10.191:50010,DS-51d68378-35de-4c70-b27e-7a98a53919cc,DISK], DatanodeInfoWithStorage[172.27.10.223:50010,DS-f172c682-713d-4a8f-b8af-69198ddc6756,DISK]], original=[DatanodeInfoWithStorage[172.27.10.191:50010,DS-51d68378-35de-4c70-b27e-7a98a53919cc,DISK], DatanodeInfoWithStorage[172.27.10.223:50010,DS-f172c682-713d-4a8f-b8af-69198ddc6756,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via ‘dfs.client.block.write.replace-datanode-on-failure.policy’ in its configuration.

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:925)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:988)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1156)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:454)

[2017-11-10 07:08:03,097] ERROR Task hdfs-sink-prqt-stndln-rpro-fct-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:455)

java.lang.RuntimeException: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Invalid partition for default.revpro-perf-oracle-jdbc-stdln-raw-KFK_RPRO_RI_WF_SUMM_GV: partition=0

at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:226)

at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:103)

at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:435)

at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:251)

at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:180)

at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)

at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)

at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

Caused by: java.util.concurrent.ExecutionException: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Invalid partition for default.revpro-perf-oracle-jdbc-stdln-raw-KFK_RPRO_RI_WF_SUMM_GV: partition=0

at java.util.concurrent.FutureTask.report(FutureTask.java:122)

at java.util.concurrent.FutureTask.get(FutureTask.java:192)

at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:220)

… 12 more

Caused by: io.confluent.connect.hdfs.errors.HiveMetaStoreException: Invalid partition for default.revpro-perf-oracle-jdbc-stdln-raw-KFK_RPRO_RI_WF_SUMM_GV: partition=0

at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:107)

at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:662)

at io.confluent.connect.hdfs.TopicPartitionWriter$3.call(TopicPartitionWriter.java:659)

… 4 more

Caused by: InvalidObjectException(message:default.revpro-perf-oracle-jdbc-stdln-raw-KFK_RPRO_RI_WF_SUMM_GV table not found)

at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51619)

at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result$append_partition_by_name_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:51596)

at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$append_partition_by_name_with_environment_context_result.read(ThriftHiveMetastore.java:51519)

at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)

at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1667)

at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1651)

at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:606)

at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:600)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:152)

at com.sun.proxy.$Proxy51.appendPartition(Unknown Source)

at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:97)

at io.confluent.connect.hdfs.hive.HiveMetaStore$1.call(HiveMetaStore.java:91)

at io.confluent.connect.hdfs.hive.HiveMetaStore.doAction(HiveMetaStore.java:87)

at io.confluent.connect.hdfs.hive.HiveMetaStore.addPartition(HiveMetaStore.java:103)

… 6 more

Root cause :

This error specially occures when clusters with just 3 or less Datanodes may experience Append failures under heavy load. Here’s the config parameter to fix it.

Solution :

Append the following set of configuration to your client (not name node or data node) in hdfs-site.xml file

<property>
<name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
<value>true</value>
<description>
If there is a datanode/network failure in the write pipeline,
DFSClient will try to remove the failed datanode from the pipeline
and then continue writing with the remaining datanodes. As a result,
the number of datanodes in the pipeline is decreased. The feature is
to add new datanodes to the pipeline.

This is a site-wide property to enable/disable the feature.

When the cluster size is extremely small, e.g. 3 nodes or less, cluster
administrators may want to set the policy to NEVER in the default
configuration file or disable this feature. Otherwise, users may
experience an unusually high rate of pipeline failures since it is
impossible to find new datanodes for replacement.

See also dfs.client.block.write.replace-datanode-on-failure.policy
</description>
</property>

<property>
<name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
<value>DEFAULT</value>
<description>
This property is used only if the value of
dfs.client.block.write.replace-datanode-on-failure.enable is true.

ALWAYS: always add a new datanode when an existing datanode is removed.

NEVER: never add a new datanode.

DEFAULT:
Let r be the replication number.
Let n be the number of existing datanodes.
Add a new datanode only if r is greater than or equal to 3 and either
(1) floor(r/2) is greater than or equal to n; or
(2) r is greater than n and the block is hflushed/appended.
</description>
</property>

<property>
<name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
<value>false</value>
<description>
This property is used only if the value of
dfs.client.block.write.replace-datanode-on-failure.enable is true.

Best effort means that the client will try to replace a failed datanode
in write pipeline (provided that the policy is satisfied), however, it
continues the write operation in case that the datanode replacement also
fails.

Suppose the datanode replacement fails.
false: An exception should be thrown so that the write will fail.
true : The write should be resumed with the remaining datandoes.

Note that setting this property to true allows writing to a pipeline
with a smaller number of datanodes. As a result, it increases the
probability of data loss.
</description>
</property>

This article explanation more regarding the properties.

You can enroll now !We are giving 30% discount on our Internship Program

Don’t miss the chance to participate in the upcoming Internship Program which will be done using Microsoft Dot Net Web Development Full Stack Technology. The new batch will be starting from May 20, 2024. We will have most experienced trainers for you to successfully complete the internship with live project experience.

Why to choose Our Internship Program?

Industry-Relevant Projects
Tailored Assignments: We offer projects that align with your academic background and career aspirations.
Real-World Challenges: Tackle industry-specific problems and contribute to meaningful projects that make a difference.

Professional Mentorship
Guidance from Experts: Benefit from one-on-one mentorship from seasoned professionals in your field.
Career Development Workshops: Participate in workshops that focus on resume building, interview skills, and career planning.

Networking Opportunities
Connect with Industry Leaders: Build relationships with professionals and expand your professional network.
Peer Interaction: Collaborate with fellow interns and exchange ideas, fostering a supportive and collaborative environment.

Skill Enhancement
Hands-On Experience: Gain practical skills and learn new technologies through project-based learning.
Soft Skills Development: Enhance communication, teamwork, and problem-solving skills essential for career success.

Book Your Seat Now

Free Demo Class Available

aws bad datanode confluent Confluent Kafka EMR hadoop hdfs java kafka linux