Tips & Tricks - Spark Streaming and Amazon S3

Tips & Tricks – Spark Streaming and Amazon S3

Sudhir Pradhan Trending May 9, 2019 | 0

Post Views: 1,971

As we all know, the Amazon S3 is an amazing storage to deal with persisting the hot and cold data in this big-data era. It has 99.99% uptime which has been claimed by amazon. You can follow the documentation from amazon for more details.

When we all agreeing upon the S3 storage is easiest, resilient , accessibility and cost effectivity, but when using s3 as a persistence layer for some streaming techniques like Spark streaming / Kafka streaming becomes little bit tricky.

Let me explain my use case in production , problems I have faced and the best solution I have implemented to mitigate the pitfalls.

Use-case :

There is a Kafka streaming cluster running where data keep getting accumulated in a continuous basis. Hence to consume the data from topics, I had to write spark streaming service which consumes the data/logs from Kafka cluster and stores into amazon S3 bucket. Then the data will be indexed into the HBase storage.

Now you might be wondering if we already have HBase store, why we are dealing with the s3 storage again. I buy the argument, however as I said in s3 we can store hot and cold data. So s3 will keep the since launch data , where as Hbase with hold only the last 7 day’s data for dash board purpose.

Problems I have faced when dealing with streaming data in S3:

Large number of small files getting created
Duplicate data getting scattered all over the different files
Continuous writing to same file
Slow S3 directory listing
Streaming data flushing issue to S3 store
File / directory naming convention

Note: Amazon S3 is an object store and not a regular file system. Its immutable in nature.

So to overcome such difficulties as mentioned above , I am putting down couple of tips and tricks here together which might help on your way.

Use correct S3 client object access URI schema
1. S3a (s3a://) is the recommended S3 client over S3n for Hadoop 2.7 and later.S3a is more performant and supports larger files(upto 5TB) and has support for multipart upload. There were multiple bugs in s3n those have been fixed in the s3a client.
Use correct version of library
1. For Spark 2.0.1 work with S3a , use hadoop-aws-2.7.3.jar, aws-java-sdk-1.7.4.jar, joda-time-2.9.3.jar .
2. Note : don’t hardcode any AWS keys and the S3A FileSystemClass in the conf file, rather set these as environment variable.

Spark.hadoop.fs.s3a.access.key XXXXXXX
spark.hadoop.fs.s3a.secret.key XXXXXXX
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem

3. Try using the .cache or s3distcp for road performance

First write the files into local EMR cluster and copy to S3 using s3 distributed copy (s3distcp) to benefit from the better file read performance of a real file system.

The groupBy option of s3distcp is a great option to solve the small file problem by merging a large number of small files.

4. Customize you client code to deal with the small file problem

Try writing a data merging service which should run in a regular time interval and merge the small files to a specific threshold.

5. Tuning write performance

Set the configuration option

spark.sql.parquet.filterPushdown to true
spark.sql.parquet.mergeSchema to false

This configurations help to avoid schema merge process during writes which really slows down you write stage).

In the Spark 2.0 , the above configs are default.

Unlike regular file system, the s3 is not a file store rather its an object store. Hence the operation on the s3 storage is bit tricky compare to the regular one. To get the most out of the performance when operating on s3 storage , one should consider the following basic points into consideration,

Choosing a region
Choosing an Object Key
Optimizing GET / PUT operations
Optimizing list operations

Will discuss little bit deeper in a different thread. For now keep enjoying on exploring the AWS S3 storage.

You can enroll now !We are giving 30% discount on our Internship Program

Don’t miss the chance to participate in the upcoming Internship Program which will be done using Microsoft Dot Net Web Development Full Stack Technology. The new batch will be starting from May 20, 2024. We will have most experienced trainers for you to successfully complete the internship with live project experience.

Why to choose Our Internship Program?

Industry-Relevant Projects
Tailored Assignments: We offer projects that align with your academic background and career aspirations.
Real-World Challenges: Tackle industry-specific problems and contribute to meaningful projects that make a difference.

Professional Mentorship
Guidance from Experts: Benefit from one-on-one mentorship from seasoned professionals in your field.
Career Development Workshops: Participate in workshops that focus on resume building, interview skills, and career planning.

Networking Opportunities
Connect with Industry Leaders: Build relationships with professionals and expand your professional network.
Peer Interaction: Collaborate with fellow interns and exchange ideas, fostering a supportive and collaborative environment.

Skill Enhancement
Hands-On Experience: Gain practical skills and learn new technologies through project-based learning.
Soft Skills Development: Enhance communication, teamwork, and problem-solving skills essential for career success.

Book Your Seat Now

Free Demo Class Available

aws aws s3 s3 file system s3 object s3 tips and tricks smartechie spark + s3 spark 2.0 spark 2.0 s3 tips spark streaming spark streaming with s3 spark with s3