
Interview with Sean Li, New Apache Spark™ Committer

Sean Li of the Spark Technology Center was recently named an official committer for the Apache Spark™ project, one of just 47 committers worldwide. The honor reflects Sean's sustained and outstanding contributions to Spark and his commitment to the community at large.

STC: In the past year, you've been an active contributor to Spark SQL. What motivates you to contribute?

Sean: I'll answer by giving a bit of my background. When I was a child, my parents founded a company called Richpeace Group. Over the last 24 years, we all experienced a lot of adversity — but through it all, one thing never changed: maintaining a startup spirit in the business. That might be why I enjoy working on Apache Spark, which of course is the most active big data open-source project. At all hours of the day and night, the passionate community is adding new contributions, and I love being actively involved in the discussions and contributing code to help improve the project at large.

STC: You've worked for IBM since 2010. What did you do before joining the Spark Technology Center — and how has that previous work experience guided you?

Sean: Six years ago, I started at IBM as a co-op student with a focus on database replication. The replication team was small but highly efficient. In fact, it felt like a startup because the developers needed to take on many roles: product design, software implementation and maintenance, quality assurance, technical sales, customer service, and cross-organization collaboration.

Many Fortune 100 customers were using our product in mission-critical systems, so as a developer, the pressure was high. A major bug could provoke a lawsuit from a client, but even small bugs could seriously impact a customer's business. I remember a data loss bug impacting the production system of a Japanese bank. It was a particularly difficult bug because it was triggered only once per week — and we were only able to reproduce it by running four systems in parallel. After a whole month of effort — and many late nights — we finally found the root cause just before the contract was up for renewal.

Those experiences with traditional infrastructure products help a lot when I'm contributing to Apache Spark. Spark's quality control and test coverage are actually weaker than what we faced on that previous team. But as a committer, I know contributors understand the significance of test cases. They know that product quality really depends on the test case coverage. They also know that to make Apache Spark enterprise-ready, there's a lot still to do. I'm hoping my ongoing contributions can help Apache Spark become more and more enterprise-ready over the course of the next few years.

STC: Do you remember the first time you used Spark?

Sean: Yeah! I finished my first Spark application in an internal Spark hackathon. When the votes were tallied, we'd placed third, and I was hooked. The application we created replicated data from a mainframe database to Spark in near real time. I was surprised how easy it was to learn Spark from scratch. (We even got some attention from the global sales team, who wanted to know when the application would be ready to sell to customers.)

STC: Is that when you decided to join the Spark Technology Center team?

Sean: Many things triggered the decision. In 2015, our executive Rob Thomas regularly shared his thoughts about big data and open source through his personal blog. After attending both a Hadoop summit and a Spark summit, I could sense the changes happening across the data management community. I could see that Spark was likely to shake up data analytics, just as Linux had shaken up operating systems. I knew that the Spark Technology Center was being created to fulfill that vision — and I knew I wanted to be a part of the change.

STC: Your focus is Spark SQL. Tell us more about that.

Sean: First, I think the name Spark SQL is confusing. Many people incorrectly list Spark SQL and DataFrame/Dataset as separate modules, but in fact the DataFrame/Dataset APIs and the SQL interface are both front ends to the same Spark SQL engine. They are user-friendly, high-level Spark APIs. Internally, Spark SQL contains the Catalyst optimizer, the Tungsten execution engine, and data source integration support. It now powers all user-facing components, including the machine learning library, streaming support, and graph-parallel computation. In other words, our improvements to Spark SQL benefit all other components. That's why Spark SQL is the most active component in Spark. Maybe we should rename it to Spark Core. [SMILE]
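To make that concrete, here is a minimal sketch (assuming a local SparkSession and a small in-memory "people" table invented for illustration) showing that the SQL interface and the DataFrame API are two front ends to the same engine: both queries go through the same Catalyst optimizer, which explain() makes visible.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlFrontEnds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-front-ends")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // A small in-memory DataFrame standing in for a real table.
    val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // The SQL interface...
    val viaSql = spark.sql("SELECT name FROM people WHERE age > 30")
    // ...and the DataFrame API express the same query.
    val viaApi = people.filter($"age" > 30).select("name")

    // Both are planned by Catalyst; explain() prints the physical plan,
    // and the two plans match.
    viaSql.explain()
    viaApi.explain()

    spark.stop()
  }
}
```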

STC: We've seen a lot of comparison between Spark SQL and traditional databases like Oracle and DB2. Can Spark SQL replace them?

Sean: Yes and no. It's worth noting that Spark SQL doesn't target OLTP scenarios; generally, it's impractical to use Spark as an OLTP engine. In the industry, more and more users are employing Spark to handle OLAP workloads, since traditional database products remain expensive. With the Spark 2.0 release, Spark is becoming mature and stable. I expect more and more Hive users will switch to Spark SQL in the next few years.
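As a rough sketch of the kind of OLAP workload Sean describes, the following runs a scan-heavy aggregation through Spark SQL against an existing Hive table. The "sales" table and its columns are hypothetical, and Hive support is assumed to be configured in the cluster.

```scala
import org.apache.spark.sql.SparkSession

object OlapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("olap-sketch")
      .enableHiveSupport() // read existing Hive tables through Spark SQL
      .getOrCreate()

    // A typical analytical query: scanning and aggregating many rows at
    // once, rather than the small point reads and writes that an OLTP
    // engine is optimized for. "sales" is a hypothetical Hive table.
    spark.sql(
      """SELECT region, SUM(amount) AS revenue
        |FROM sales
        |GROUP BY region
        |ORDER BY revenue DESC""".stripMargin
    ).show()

    spark.stop()
  }
}
```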

But we shouldn't view Spark SQL as just a database. It's a new core of Spark, and Spark is a general-purpose data processing and analytics engine. We can think about Spark the way we think about smartphones: smartphones don't have the best cameras, and they're not the best game consoles, eBook readers, or music/video players. But how many people still buy standalone cameras, game consoles, or MP3 or MP4 players?

STC: That's an interesting point. We've seen almost all the major IT players integrating Spark into their offerings or providing Spark as a service. Are you saying the Spark ecosystem resembles the Android ecosystem? If so, is there an Apple?

Sean: I don't think there's an Apple-like company in the Big Data world. An Apple-like company would have a very hard time, I think, because the future belongs to open source. Eventually, Spark can be like Linux. That's my personal vision. I have a lot of confidence in the future of Spark.
