Error when writing a file to an S3 bucket from an EMRFS-enabled Spark cluster
Error:
18/03/02 01:42:17 INFO RetryInvocationHandler: Exception while invoking ConsistencyCheckerS3FileSystem.mkdirs over null. Retrying after sleeping for 10000ms. com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: Directory 'bucket/folder/_temporary' present in the metadata but not s3 at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:506)
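A quick way to confirm the mismatch is to compare what EMRFS has recorded against what actually exists in S3. This is a minimal check, assuming you run it on the EMR master node (where the emrfs and AWS CLIs are available); the bucket and folder below are placeholders taken from the error message.

# What is physically present in S3 under the affected prefix
aws s3 ls s3://<bucket>/folder/_temporary/

# Differences between the EMRFS metadata and S3 for the same prefix;
# entries present only in the metadata reproduce the error condition above
emrfs diff s3://<bucket>/folder/_temporary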
Root cause:
Most consistency problems like this arise from:
- Manual deletion of files and directories from the S3 console, which leaves stale entries in the EMRFS metadata.
- Retry logic in Spark and Hadoop jobs.
- A file creation on S3 that failed after the corresponding entry had already been written to DynamoDB.
- Hadoop restarting the process: because the entry is already present in DynamoDB, the consistency check throws the error above.
Before touching anything, you can inspect the metadata store itself, as shown below.
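A minimal sketch of that inspection, assuming the default EMRFS metadata table name (EmrFSMetadata) and that you are on the EMR master node:

# Show the DynamoDB table EMRFS is using, its status, and item counts
emrfs describe-metadata

# Optionally inspect the same table directly in DynamoDB
aws dynamodb describe-table --table-name EmrFSMetadata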
Solution:
Clean up the EMRFS metadata in DynamoDB, then re-run the Spark job. Follow these steps to clean up and restore the intended directory in the S3 bucket.
1. Delete all the metadata for the path. emrfs delete removes every record under the path, but because it deletes records by hash it may also remove entries you still need; that is why the import and sync steps follow.
emrfs delete s3://<bucket>/path
2. Import the metadata for the objects that are physically present in S3 back into DynamoDB.
emrfs import s3://<bucket>/path
3. Sync the data between S3 and the metadata.
emrfs sync s3://<bucket>/path
4. Finally, check that the objects are consistent between S3 and the metadata; emrfs diff lists any remaining differences.
emrfs diff s3://<bucket>/path
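Putting the four steps together, the sketch below runs the whole cleanup for one affected prefix. The bucket and path are placeholders to replace with the directory from your error message; run it on the EMR master node. As noted above, emrfs delete may drop neighbouring records, which the subsequent import and sync repair.

#!/bin/bash
set -euo pipefail

PREFIX="s3://<bucket>/path"   # replace with the affected directory

emrfs delete "$PREFIX"   # drop the stale metadata records
emrfs import "$PREFIX"   # re-register the objects that exist in S3
emrfs sync "$PREFIX"     # reconcile metadata with S3
emrfs diff "$PREFIX"     # should report no metadata-only entries

Once the diff comes back clean, re-run the Spark job.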