Apache Spark is a powerful, fast, and cost-efficient tool for Big Data problems, with components such as Spark Streaming, Spark SQL, and Spark MLlib. In that sense, Spark is like the Swiss Army knife of the Big Data world.
Moreover, Spark has another very important feature: horizontal scaling. In other words, Spark supports a standalone cluster deploy mode. A single Spark cluster has one master and any number of workers. Workers can run their own individual processes on separate machines in a horizontal Spark cluster, or on the same machine with vertical scaling.
Technology is developing every day, even every second, without stopping, and it is improving our lives in many different ways. In the 1990s, a 1-megabyte RAM chip was a big revolution for the Amiga 500 home computer; with that capacity we could even play popular games like Street Fighter or Mortal Kombat 😀.
A lot has changed since those years, from megabytes to petabytes, even to exabytes… According to the CSIRO, in the next decade astronomers expect to be processing 10 petabytes of data every hour from the Square Kilometre Array (SKA) telescope. …
In the previous article, we looked at Apache Spark Discretized Streams (DStreams), a basic concept of Spark Streaming. In this article, we will look at the structured part of Spark Streaming.
Structured Streaming is built on top of the Spark SQL engine of Apache Spark, which keeps the query running as data continues to arrive. Like the other Spark engines, it is scalable as well as fault-tolerant. Structured Streaming enhances the Spark DataFrame API with streaming features.
Structured Streaming also ensures recovery from any failure as soon as possible with the help of checkpoints…
Try to imagine this: every single second, nearly 9,000 tweets are sent, 1,000 photos are uploaded to Instagram, over 2,000,000 emails are sent, and nearly 80,000 searches are performed, according to Internet Live Stats.
So much data is generated without stopping from many sources and sent to other sources simultaneously, in small packages.
Many applications also generate continuously updated data: sensors used in robotics, vehicles, and many other industrial and electronic devices stream data for monitoring progress and performance.
That’s why the great amounts of data generated every second have to be processed and…
In the previous article, we looked at Spark RDDs, the fundamental (unstructured) part of Spark Core. In this article, we will look at the structured part of Spark Core: Spark SQL and DataFrames. Spark SQL is the Spark module for processing structured data, which also uses DataFrames.
A DataFrame is a structured collection of data rows distributed across the worker nodes (executors) of Spark. Fundamentally, DataFrames are like tables in a relational database, with their own schemas and headers.
DataFrames consist of data rows created from different data formats, such as files (text, CSV, JSON, …) or Spark’s own RDDs.
In this article, I will…
Although beginners are advised to learn and use the high-level APIs (DataFrame, SQL, Dataset), the low-level API, the resilient distributed dataset (RDD), is the basis of Spark programming. Mainly, an RDD is a collection of elements partitioned across the nodes (workers) of a cluster, which makes parallel operations on those nodes easy.
RDDs can be created in only two ways: either by parallelizing an already existing dataset, a collection in your driver, or from external storage that provides data sources like Hadoop InputFormats (HDFS, HBase, Cassandra, …), or by transforming already created RDDs.
Spark RDDs can be created in two ways.
The first way is to use
Data Scientist, Electrical Engineer and Commercial Pilot