MPP is an essential tool for any Data Warehousing and Big Data use case, and Amazon Redshift outperforms its peers in this space thanks to its ease of use, performance, and scalability. Optimization can easily be termed one of the key steps in the Data Warehousing and Big Data world.

The following fundamental Redshift optimization techniques will help in tackling uneven query performance.

Prioritize Compression

Compression has a first-order impact on Redshift cluster performance. The vital effects of compression are listed below; a quick sketch for surveying candidate encodings follows the list.

  1. Reduces storage utilization.
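
A quick way to see where compression will pay off is Redshift's ANALYZE COMPRESSION command, which samples a table and recommends an encoding per column. Below is a minimal sketch over plain JDBC in Scala; the cluster endpoint, credentials, and table name are placeholders, and the Redshift JDBC driver is assumed to be on the classpath.

    import java.sql.DriverManager

    object CompressionCheck {
      def main(args: Array[String]): Unit = {
        // Placeholder endpoint and credentials -- replace with your cluster's values.
        val url  = "jdbc:redshift://my-cluster.example.redshift.amazonaws.com:5439/dev"
        val conn = DriverManager.getConnection(url, "awsuser", "secret")
        try {
          // ANALYZE COMPRESSION samples the table and suggests an encoding per column.
          // Note: it takes an exclusive lock on the table while sampling.
          val rs = conn.createStatement()
            .executeQuery("ANALYZE COMPRESSION public.sales")
          while (rs.next()) {
            // Report columns: table, column, recommended encoding, estimated reduction (%).
            println(s"${rs.getString(2)} -> ${rs.getString(3)} (${rs.getString(4)}%)")
          }
        } finally conn.close()
      }
    }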


The Apache Spark community recently released a preview of Spark 3.0, which holds many significant new features that will help Spark, already used by a wide range of enterprise users and developers, make a powerful mark in this Big Data and Data Science era.

In the new release, the Spark community has ported some functions from Spark SQL to the programmatic Scala API (org.apache.spark.sql.functions) …


As the Spark DataFrame has become the de facto standard for data processing in Spark, it is a good idea to know the key Spark SQL functions that most Data Engineers/Scientists will need in their data transformation journey.

callUDF

One of the most vital functions in Spark, and one I find useful in my day-to-day work. callUDF is used not only to call user-defined functions, but also to invoke Spark SQL functions that are not part of the Spark functions object.

Example:

For example, parse_url is a part of Spark SQL but not available…
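
A minimal, self-contained sketch of the idea: parse_url can be reached from the Scala API through callUDF even though org.apache.spark.sql.functions has no parse_url function of its own (the sample URL below is made up for illustration).

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{callUDF, lit}

    object CallUdfExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("callUDF-example")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val df = Seq("https://www.datakaresolutions.com/blog?topic=spark").toDF("url")

        // parse_url lives in Spark SQL, not in the functions object,
        // so we reach it by name through callUDF.
        df.select(callUDF("parse_url", $"url", lit("HOST")).as("host"))
          .show(false)   // prints www.datakaresolutions.com

        spark.stop()
      }
    }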


Island of Isolation

The garbage collector is one of the major primitives of the Java world: the tool that clears unused/unreachable objects from memory. An object is said to be eligible for garbage collection when no references to it remain. Contrary to this assertion, there is a state where objects can be garbage collected even while they are still referenced. Such a scenario is known as the Island of Isolation.

To put it in layman’s terms, an Island of Isolation is a scenario where some objects hold each other’s references, but none of them can be reached from the active application, as the…
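
A minimal sketch of the scenario (the class and variable names are made up for illustration): two objects point at each other, yet once the outer references are dropped the whole cycle becomes unreachable and therefore collectible.

    object IslandOfIsolation {
      class Node {
        var partner: Node = _   // reference to another Node
      }

      def main(args: Array[String]): Unit = {
        var a = new Node
        var b = new Node
        a.partner = b   // a -> b
        b.partner = a   // b -> a: a cycle of mutual references

        // Drop the only references reachable from the running application.
        a = null
        b = null

        // The two Nodes still reference each other, but no GC root can reach
        // them any more -- an island of isolation. The JVM's tracing collector
        // can reclaim the whole cycle.
        System.gc()   // just a hint; collection timing is up to the JVM
      }
    }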


Developing a Spark application is fairly simple and straightforward, as Spark provides feature-packed APIs. The tedious task, however, is deploying it on the cluster in an optimal way that yields ideal performance, while following best practices for writing Spark jobs. Here at DataKare Solutions we often need to get our hands dirty to tune Spark applications. Throughout this article I will lay out the best practices we follow at DataKare Solutions to optimize Spark applications.

Data Serialization

Spark supports two types of serialization: Java serialization, which is the default, and Kryo serialization. Kryo serialization…
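
Switching to Kryo is a configuration change; registering your classes in advance lets Kryo write compact numeric IDs instead of full class names. A minimal sketch, with a made-up Sensor class standing in for your own types:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Made-up domain class, used only to illustrate registration.
    case class Sensor(id: String, value: Double)

    object KryoExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("kryo-example")
          .setMaster("local[*]")
          // Replace the default Java serialization with Kryo.
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // Registered classes are serialized with a numeric ID, not a class name.
          .registerKryoClasses(Array(classOf[Sensor]))

        val spark = SparkSession.builder().config(conf).getOrCreate()
        // ... build and run jobs as usual; shuffled and cached data now uses Kryo ...
        spark.stop()
      }
    }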


This is the second chapter in the “Structured Streaming” series, which covers all the essential details to set up a Structured Streaming query. Read the previous chapter here for an introduction to Structured Streaming.

Sources

Sources in Structured Streaming refer to the streaming data sources that bring data into Structured Streaming. As of Spark 2.4, the built-in data sources are as follows (a short file-source sketch follows the list):

  • Kafka
  • File source
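
To give a flavour of the file source, here is a minimal sketch that watches a directory for newly arriving CSV files; the path and schema are made up for illustration, and note that streaming file sources require an explicit schema:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    object FileSourceExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("file-source-example")
          .master("local[*]")
          .getOrCreate()

        // File sources need the schema up front.
        val schema = new StructType()
          .add("id", LongType)
          .add("event", StringType)

        // Every new CSV file dropped into the directory becomes new streaming data.
        val events = spark.readStream
          .schema(schema)
          .csv("/tmp/incoming")   // placeholder directory

        events.writeStream
          .format("console")
          .start()
          .awaitTermination()
      }
    }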


Introduction

As streaming frameworks mature, they let developers concentrate on business challenges rather than on the underlying streaming analytics issues. Structured Streaming is a part of the Apache Spark project, built on top of the Spark SQL engine for streaming analytics. Structured Streaming is fault tolerant and scalable, and above all it conceals the streaming complexities from the developer, which makes development easier.

Structured Streaming was introduced in Spark 2.0 as a micro-batch stream processing engine within the Apache Spark project. It was marked as stable in Spark 2.2, which makes Structured…


This article focuses on how to integrate Spark’s new stream processing engine, Structured Streaming, with Apache Kafka brokers 0.10 and higher, along with all the necessary configuration details.

Apache Kafka

Apache Kafka is a distributed, reliable, fault-tolerant publish-subscribe messaging system. Kafka works on top of two major primitives: producer and consumer clients. Kafka stores data as topics; for parallelism, topics are divided into partitions.

The producer client publishes messages to certain topics, which are in turn consumed by one or more consumer groups. As records are published to Kafka topics, Kafka assigns a sequential “ID” to each record, known…
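
To make the integration concrete, here is a minimal sketch of a Structured Streaming query reading from Kafka; the broker address and topic name are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath:

    import org.apache.spark.sql.SparkSession

    object KafkaSourceExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-source-example")
          .master("local[*]")
          .getOrCreate()

        val records = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
          .option("subscribe", "events")                       // placeholder topic
          .option("startingOffsets", "latest")
          .load()

        // Kafka records arrive as binary key/value pairs plus metadata
        // (topic, partition, offset, timestamp).
        records
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "offset")
          .writeStream
          .format("console")
          .start()
          .awaitTermination()
      }
    }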

Arun Jijo

Data engineer at DataKare Solutions with expertise in Apache NiFi, Kafka, and Spark, and a passion for Java.
