Reddit AMA with Matei Zaharia, creator of Apache Spark

Apache Spark creator

Matei Zaharia is the creator of Apache Spark, a project that is revolutionizing the world of Hadoop. He is the CTO of Databricks and an assistant professor at MIT.

On April 3rd, 2015, he gave an “Ask Me Anything” (AMA) on the website reddit.com.

Here are the main questions and answers from that AMA:

Q: I have yet to learn Spark but plan to do so in the coming months. Have any advice for a novice like myself?

A: You should try running Spark on your laptop even before taking a course, using the steps at http://spark.apache.org/docs/latest/quick-start.html. There’s also a book on it now that might be worth a look.
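For a concrete starting point, here is a minimal sketch of the kind of first program the quick-start guide walks through, run in Spark’s interactive Scala shell (./bin/spark-shell); the file name is just an example.

    // Count the lines of a local text file that mention "Spark".
    // In spark-shell, the SparkContext is already available as `sc`.
    val lines = sc.textFile("README.md")                      // any local text file works here
    val sparkLines = lines.filter(line => line.contains("Spark"))
    println(sparkLines.count())                               // number of matching lines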

Q: Is your vision that Spark might eventually replace Hadoop (e.g. doing everything Hadoop does but better), or that it would be better to see it as a complement of Hadoop (e.g. improve on Hadoop in some areas but there is still room for Hadoop)?
By Hadoop, I mostly meant MapReduce/Hive/Pig, and basically the whole data processing layer. In your opinion, in a few years there would be no place for MapReduce anymore (aside maybe for some very specialized use cases), and the Spark processing layer would become the de facto standard?

A: Depends what you mean by Hadoop. The MapReduce part of Hadoop is likely to be replaced by new engines, and has already been replaced in various domains. However, “Hadoop” as a whole generally means an entire ecosystem of software, including HDFS and HBase for storage, data formats like Parquet and ORC, cluster schedulers like YARN, etc. Spark doesn’t come with its own variants of those (e.g. it doesn’t come with a distributed filesystem), so it’s not meant to replace those, but rather it interoperates with them. We designed it from the start to fit into this ecosystem.
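As a rough illustration of that interoperability (the paths and cluster layout below are assumptions, not from the AMA), a single Spark program can read data stored in HDFS or as Parquet and run under YARN, without Spark shipping any of those pieces itself:

    // In spark-shell (Spark 1.3+), `sc` and `sqlContext` are pre-defined.
    val rawLogs = sc.textFile("hdfs://namenode:8020/logs/2015/04/03")          // plain files in HDFS
    val users = sqlContext.parquetFile("hdfs://namenode:8020/warehouse/users") // Parquet read via Spark SQL
    println(rawLogs.count() + " log lines, " + users.count() + " user records")
    // Running on a YARN cluster is just a deployment choice, e.g.:
    //   spark-submit --master yarn-cluster myApp.jar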

I do think that Spark has a chance to replace those layers, though we’ll have to see how it plays out. The main benefits of Spark over the current layers are 1) a unified programming model (you don’t need to stitch together many different APIs and languages; you can call everything as functions in one language) and 2) performance optimizations that you can get from seeing the code for a complete pipeline and optimizing across the different processing functions. People also like 1) because it means far fewer tools to learn.
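A small sketch of what that unified model can look like in practice, assuming a Spark 1.3-era SQLContext; the file, table, and column names here are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("unified-example"))
    val sqlContext = new SQLContext(sc)

    // Load JSON into a DataFrame and expose it to SQL.
    val events = sqlContext.jsonFile("events.json")
    events.registerTempTable("events")

    // A declarative SQL query...
    val clicks = sqlContext.sql("SELECT userId, url FROM events WHERE action = 'click'")

    // ...flows straight into ordinary Scala functions on the same engine.
    val domains = clicks.rdd.map(row => row.getString(1).split("/")(2)).distinct()
    domains.take(10).foreach(println)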

Q: As someone who does most of my data science work in the “medium data” world, where it can be reasonably batched in-memory on one or a few machines, but whose work is tending toward bigger and bigger data sets, one of my big concerns about big data technology is fragmentation. When I was a grad student, you had maybe one or two big, popular ways to parallelize things across multiple machines – run a Beowulf cluster or maybe Condor, and either distribute batch processes which save to files or use MPI to communicate.
Now, though, it seems like new technologies, all doing slightly different versions of the same thing, pop up nearly weekly. What are your thoughts on what this fragmentation means to the industry, and how it will affect companies who are trying to navigate the buzzwords and settle on a technology that will provide meaningful benefit in the medium-long term? Obviously, you would think Spark would be a good choice for a lot of people, but how do you see things shaking out in the industry, and does Databricks have any strategies for combating the fragmentation?

A: The big data space is indeed one with lots of software projects, which can make it confusing to pick the tool(s) to use. I personally do think efforts will consolidate, and I hope many will consolidate around Spark :). But one of the reasons why there were so many projects is that people built new computing engines for each problem they wanted to parallelize (e.g. graph problems, batch processing, SQL, etc). In Spark we sought instead to design a general engine that unifies these computing models and captures the common pieces that you’d have to rebuild for each one (e.g. fault tolerance, locality-aware scheduling, etc). We were fairly successful, in that we can generally match the performance of specialized engines while saving a lot of programming effort by reusing 90% of our code across them. So I do think that unified / common engines will reduce the amount of fragmentation here.

Q: What is one thing that you know about Silicon Valley and startups that you did not know going in, and that could be helpful to people looking to start a tech company?

A: I guess I sort of knew this, but it’s the importance of a team and the huge amount of time you’ll have to spend recruiting (as you grow your company). If you do plan to hire people, expect to spend a lot of your time doing that as you ramp up, and really try to see which people from your network could help you. Having a great team is crucial to getting over any challenges the company will face. Luckily at Databricks we already had a great team that had worked together in the past at Berkeley, so we were able to ramp up fairly quickly.

Q: What’s your expectation for the next generation of large-scale data processing platforms?

A: I think a few pieces of technology will play an important role:

  • Cloud: with prices of public clouds falling and more data moving there, designing systems that take advantage of the cloud (e.g. scale up and down elastically or request different machine types for different things) is likely to be interesting.

  • New storage media like SSDs and non-volatile RAM, which may allow low-latency access to larger data volumes.

  • New processing platforms like manycore chips and GPUs may also be interesting to incorporate, especially given the point about clouds (which means you might grab them just to do one specific computation).

Q: I love Spark; I think the computational model is fantastic. But I am wondering: why is the Spark project not using Impala as its SQL engine?

A: One of our goals in Spark is to have a unified engine where you can combine SQL with other types of processing (e.g. custom functions in Scala / Python or machine learning libraries), so we can’t use a separate engine just for SQL. However, Spark SQL uses a lot of the optimizations found in modern analytical databases and Impala, such as columnar storage, code generation, etc.
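For example, a custom Scala function can be registered and used directly inside a SQL query on the same engine. This sketch reuses the hypothetical events table from the earlier example; the UDF name and logic are made up for illustration:

    // Register a Scala function as a SQL UDF (udf.register is available as of Spark 1.3).
    sqlContext.udf.register("domainOf", (url: String) => url.split("/")(2))

    // Use the custom function inside a plain SQL query.
    val topDomains = sqlContext.sql(
      "SELECT domainOf(url) AS domain, COUNT(*) AS hits FROM events GROUP BY domainOf(url)")
    topDomains.show()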

Q: What would the next generation of Spark look like?

A: One of the largest areas of focus now is rich standard libraries that match the ease of use of data processing libraries for “small data” (e.g. pandas, R, scikit-learn). Some cool ones in this space were recently added to Spark, including DataFrames and ML pipelines. The SparkR project for calling Spark from R is also likely to be merged soon.
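As a hedged sketch of the ML pipelines API mentioned here (the spark.ml package, which appeared around Spark 1.2/1.3), the following chains feature extraction and a classifier into a single pipeline; the tiny training set is invented:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // A toy labeled dataset as a DataFrame (id, text, label).
    val training = sqlContext.createDataFrame(Seq(
      (0L, "spark makes cluster computing simple", 1.0),
      (1L, "completely unrelated text", 0.0)
    )).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // The whole feature-extraction + training flow becomes one reusable object.
    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)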

Apart from that, I think there’s still a bunch of interesting stuff to do in the core engine that will help all the libraries. One of the main areas we’re working on at Databricks is better debugging tools that let you see what your application is doing. We’ve also done a bunch of work on memory efficiency and network performance, and we plan to continue on those.


Here is the full Reddit AMA

He also gave an interesting keynote at StrataConf 2015