Broken Backend

Tech: Spark


How to optimise loading partitioned JSON data in Spark?

 Posted on August 30, 2020  |  7 minutes  |  1321 words  |  Wissem

In this tutorial, we will explore ways to optimise loading partitioned JSON data in Spark.

I have used the SF Bay Area Bike Share dataset; you can find it here. The original data (status.csv) has gone through a few transformations. The result looks like:

[Read More]
Tech: Spark  Topic: Optimisation  Format: Howto 

How to add row numbers to a Spark DataFrame?

 Posted on August 20, 2020  |  8 minutes  |  1518 words  |  Wissem

In this tutorial, we will explore a couple of ways to add a consecutive, sequential row number to a DataFrame.

For example, let this be our DataFrame (taken from the Spark: The Definitive Guide GitHub repo):

[Read More]
Tech: Spark  Format: Howto 

Spark DataFrame - two ways to count the number of rows per partition

 Posted on August 15, 2020  |  2 minutes  |  331 words  |  Wissem

Sometimes, we need to compute the number of rows in each partition. There are two ways to do this:

  • The first way is using Dataframe.mapPartitions().

  • The second way (the faster of the two, in my observations) is using the spark_partition_id() function, followed by a group-by count aggregation.

[Read More]
Tech: Spark  Format: Howto 

Wissem  • © 2025  •  Broken Backend

Hugo v0.151.0 powered  •  Theme Beautiful Hugo adapted from Beautiful Jekyll