Apache Spark


This article is an introduction to Spark. In the last few articles, we studied Hadoop. The main limitation of Hadoop is that it is a batch-processing framework and cannot handle streaming data. What is batch processing? You submit a job and wait for the result; after a few minutes or hours you get the result, so it is time-consuming. Before Spark, there were specialized systems such as Giraph, Storm, Mahout, Tez, GraphLab, Impala, and Kafka for performing iterative, interactive, machine learning, streaming, graph, and SQL workloads on data faster than Hadoop. But instead of learning each specialized system separately, where each tool can take a month or more to master, Spark solves the problem by offering these capabilities as APIs on a single platform.

History of Spark:

The Spark project started in 2009 as a research project in the UC Berkeley RAD Lab, which was later handed over to the AMPLab. In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark) and Spark Streaming. Spark was first open sourced in March 2010 and was transferred to the Apache Software Foundation in June 2013, where it is now a top-level project.

Spark Overview:

Spark is written in Scala and runs on the JVM.
It is a fast, general-purpose cluster computing platform.
Spark extends the popular MapReduce model.
It provides high-level APIs in Java, Scala, Python, and SQL.
It is mainly used where data is placed in memory and computation is performed on it; that is why it is faster than Hadoop.
Spark is also used for iterative algorithms in machine learning, data mining, and data processing.
The key benefit it offers is caching intermediate data in memory for better access time.
Spark's primary abstraction is the Resilient Distributed Dataset (RDD). What an RDD is, we will see in detail in future articles.

Spark provides:

Spark SQL - for SQL and structured data processing.
MLlib - for machine learning.
GraphX - for graph processing.
Spark Streaming - for processing live data streams.

