Introduction to Apache Spark with Scala:


Apache Spark is a mature engine for large-scale data processing that runs computations in parallel across thousands of machines, maximizing the processing capacity available in a cluster. Spark can handle a wide range of data processing tasks, including complex data analytics, streaming analytics, graph analytics, and scalable machine learning on huge volumes of data, on the order of terabytes and beyond.

What is Apache Spark?

Apache Spark is an open-source cluster computing framework that was initially developed at UC Berkeley in the AMPLab.
Compared to Hadoop's disk-based, two-stage MapReduce, Spark's in-memory primitives can provide up to 100 times faster performance for certain applications.
This makes it well suited to machine learning algorithms, which load a working dataset into cluster memory and query it repeatedly.
A Spark project contains various components, such as Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX.
What is Scala?
Scala is a modern, multi-paradigm programming language designed to express general programming patterns in an elegant, precise, and type-safe way. One of its prime features is that it smoothly integrates both object-oriented and functional programming.
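As a small sketch of that blend of paradigms, the hypothetical example below defines a case class (object-oriented) and processes a collection of its instances with higher-order functions (functional):

```scala
// A case class models immutable structured data (object-oriented side).
case class Tweet(user: String, text: String)

object ParadigmDemo {
  // Higher-order functions like filter and map (functional side)
  // keep long tweet texts, discarding the short ones.
  def longTexts(tweets: Seq[Tweet], minLen: Int): Seq[String] =
    tweets.filter(_.text.length >= minLen).map(_.text)
}
```

This same style of chaining transformations is exactly how Spark's Scala API is typically used, which is one reason the two fit together so well.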

Spark Data Processing Capabilities.

Structured SQL for Complex Analytics with basic SQL
A well-known capability of Apache Spark is that it lets data scientists easily perform analysis in an SQL-like fashion over very large amounts of data. Building on Spark Core internals and an abstraction over the underlying RDD, Spark provides DataFrames, an abstraction that integrates relational processing with Spark's functional programming API. This is done by adding structural information to the data: a schema with column names gives the data semi-structure or full structure, so a dataset can be queried directly by column name, opening up another level of data processing.
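The sketch below illustrates this idea; the column names and sample rows are assumptions for illustration. A schema is attached to plain tuples, and the result can then be queried by column name or with plain SQL:

```scala
import org.apache.spark.sql.SparkSession

object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameDemo")
      .master("local[*]") // run locally; on a cluster this is set by spark-submit
      .getOrCreate()
    import spark.implicits._

    // Attach a schema (column names) to otherwise unstructured tuples.
    val df = Seq(("alice", 34), ("bob", 45), ("carol", 29))
      .toDF("name", "age")

    // Query directly by column name through the DataFrame API...
    df.filter($"age" > 30).select("name").show()

    // ...or register a view and use plain SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```

Both queries return the same rows; the DataFrame API and Spark SQL are two front-ends to the same optimized execution engine.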
GraphX Graph Processing Engine.
The fourth data processing capability is Spark's ability to perform analysis on graph data, e.g. in social network analysis. Spark's GraphX API is a collection of ETL operations and graph algorithms optimized for large-scale processing.
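As a minimal sketch, the example below builds a toy "follows" network and runs PageRank, one of GraphX's built-in graph algorithms; the users and edges are assumptions for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GraphXDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GraphXDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Vertices: (id, username); edges: "follows" relationships.
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")))

    val graph = Graph(users, follows)

    // PageRank scores each user by the structure of the follow graph.
    val ranks = graph.pageRank(tol = 0.0001).vertices
    ranks.join(users).collect().foreach { case (_, (rank, name)) =>
      println(f"$name%-6s $rank%.4f")
    }

    spark.stop()
  }
}
```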
Conclusion.
To conclude this introduction to Spark, a sample Scala application, word count over tweets, is provided, developed with the Scala API. The application can be run in your favourite IDE such as IntelliJ, or in a notebook such as Databricks or Apache Zeppelin.
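A minimal sketch of that word count is shown below; the input file `tweets.txt` (one tweet per line) is an assumption for illustration:

```scala
import org.apache.spark.sql.SparkSession

object TweetWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TweetWordCount")
      .master("local[*]") // run locally; on a cluster this is set by spark-submit
      .getOrCreate()

    // Hypothetical input: a text file with one tweet per line.
    val tweets = spark.sparkContext.textFile("tweets.txt")

    val counts = tweets
      .flatMap(_.toLowerCase.split("\\s+")) // split each tweet into words
      .filter(_.nonEmpty)                   // drop empty tokens
      .map(word => (word, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)                   // sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

The chain of flatMap, map, and reduceByKey runs in parallel across the cluster's partitions, which is what makes the same few lines scale from a laptop to terabytes of tweets.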
In this article, some major points covered are:
·         Description of Spark as a next-generation data processing engine
·         The underlying technology that gives Spark its capability
·         The data processing APIs that exist in Spark
·         How to work with the data processing APIs
·         A simple example to get a taste of Spark's processing power


