Tips & Tricks – Spark Streaming and Amazon S3
As we all know, Amazon S3 is an excellent storage service for persisting both hot and cold data in this big-data era. It has a claimed 99.99% uptime according to Amazon; you can follow Amazon's documentation for more details.
While we can all agree that S3 is easy to use, resilient, accessible, and cost effective, using S3 as the persistence layer for streaming technologies such as Spark Streaming or Kafka streaming gets a little tricky.
Let me explain my production use case, the problems I faced, and the solutions I implemented to mitigate the pitfalls.
Use case:
There is a Kafka cluster on which data keeps accumulating continuously. To consume the data from its topics, I had to write a Spark Streaming service that reads the data/logs from the Kafka cluster and stores them in an Amazon S3 bucket. From there the data is indexed into HBase.
Now you might be wondering why we deal with S3 at all when we already have an HBase store. Fair point, but as I said, S3 can hold both hot and cold data: S3 keeps all data since launch, whereas HBase holds only the last 7 days' worth for dashboard purposes.
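To make the setup concrete, here is a minimal sketch of such a consumer using the Kafka 0.10 direct-stream integration for Spark Streaming; the broker addresses, topic name, group id and bucket paths are made up purely for illustration.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// 60-second micro-batches; all names below are hypothetical
val conf = new SparkConf().setAppName("kafka-to-s3-consumer")
val ssc  = new StreamingContext(conf, Seconds(60))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092,broker2:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "s3-sink",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("app-logs"), kafkaParams))

// write each micro-batch under a time-stamped prefix in the bucket
stream.map(record => record.value).foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty())
    rdd.saveAsTextFile(s"s3a://my-bucket/raw/app-logs/batch=${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()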
Problems I faced when dealing with streaming data in S3:
- Large number of small files getting created
- Duplicate data getting scattered all over the different files
- Continuous writing to same file
- Slow S3 directory listing
- Issues flushing streaming data to the S3 store
- File / directory naming convention
Note: Amazon S3 is an object store, not a regular file system, and its objects are immutable in nature.
To overcome the difficulties mentioned above, here are a few tips and tricks that might help along the way.
1. Use the correct S3 client / URI scheme
S3A (s3a://) is the recommended S3 client over S3N for Hadoop 2.7 and later. S3A performs better, supports larger files (up to 5 TB), and supports multipart uploads; many of the bugs in s3n have been fixed in the s3a client.
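In the job code, the only change is the path scheme. The bucket and prefixes below are hypothetical, and spark is assumed to be an existing SparkSession.

// s3a:// instead of s3n:// -- same bucket, same key layout
val logs = spark.read.text("s3a://my-bucket/raw/app-logs/")
logs.write.mode("overwrite").parquet("s3a://my-bucket/parquet/app-logs/")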
2. Use the correct library versions
For Spark 2.0.1 to work with S3A, use hadoop-aws-2.7.3.jar, aws-java-sdk-1.7.4.jar and joda-time-2.9.3.jar.
Note: don't hardcode the AWS keys or the S3A file system class in your application's conf file; set them through environment variables or pass them in at submit time instead:
spark.hadoop.fs.s3a.access.key XXXXXXX
spark.hadoop.fs.s3a.secret.key XXXXXXX
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
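For example, a minimal sketch that picks the keys up from the standard AWS environment variables when the SparkSession is built, so nothing sensitive is committed to a conf file:

import org.apache.spark.sql.SparkSession

// credentials come from the environment (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY)
val spark = SparkSession.builder()
  .appName("kafka-to-s3-consumer")
  .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()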
3. Try .cache() or s3distcp for better read performance
First write the files to the EMR cluster's local HDFS and then copy them to S3 using S3 distributed copy (s3distcp), so that the job benefits from the better file I/O of a real file system.
The --groupBy option of s3distcp is a great way to tackle the small-file problem, since it merges a large number of small files as they are copied (see the sketch below).
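Continuing the earlier consumer sketch, here is a rough outline of this pattern; all of the paths, the regex and the target size are illustrative only.

// stage each micro-batch on the cluster's HDFS instead of writing straight to S3
stream.map(_.value).foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty())
    rdd.saveAsTextFile(s"hdfs:///staging/app-logs/batch=${time.milliseconds}")
}

// then, as a periodic EMR step, merge the small files while copying them up to S3:
//   s3-dist-cp --src hdfs:///staging/app-logs \
//              --dest s3://my-bucket/merged/app-logs \
//              --groupBy '.*(app-logs).*' \
//              --targetSize 128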
4. Customize your client code to deal with the small-file problem
Try writing a data-merging service that runs at a regular interval and compacts the small files up to a target size threshold.
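Here is a hedged sketch of such a periodic compaction job; the bucket, prefixes and the target partition count are placeholders, and the glob assumes the batch layout used in the earlier sketch.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-small-file-compactor").getOrCreate()

// glob over the micro-batch directories written by the streaming job
val src = "s3a://my-bucket/raw/app-logs/batch=*"
val dst = "s3a://my-bucket/compacted/app-logs/"

// read the many small files and rewrite them as a handful of larger ones
val df = spark.read.text(src)
df.repartition(8).write.mode("overwrite").text(dst)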
5. Tune Parquet read/write performance
Set the configuration options
spark.sql.parquet.filterPushdown to true
spark.sql.parquet.mergeSchema to false
These settings push filters down to the Parquet scan and skip the schema-merge pass over all the part files, which otherwise really slows the job down.
In Spark 2.0 and later, these are the default values.
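If you want to set them explicitly anyway, a small example, assuming an existing SparkSession named spark and an illustrative bucket path:

// already the defaults in Spark 2.0+, shown explicitly for clarity
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.mergeSchema", "false")

// mergeSchema can also be switched off per read
val events = spark.read.option("mergeSchema", "false").parquet("s3a://my-bucket/parquet/app-logs/")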
Unlike a regular file system, S3 is not a file store but an object store, so operating on S3 is a bit trickier than on a regular file system. To get the most performance out of S3, take the following basic points into consideration:
- Choosing a region
- Choosing an Object Key
- Optimizing GET / PUT operations
- Optimizing list operations
I will discuss these in more depth in a separate post. For now, keep enjoying exploring AWS S3.