Broken Backend

Tech: Spark


How to optimise loading partitioned JSON data in Spark?

 Posted on August 30, 2020  |  7 minutes  |  1321 words  |  Wissem

In this tutorial, we will explore ways to optimise loading partitioned JSON data in Spark.

I have used the SF Bay Area Bike Share dataset; you can find it here. The original data (status.csv) has gone through a few transformations. The result looks like:

[Read More]
Tech: Spark  Topic: Optimisation  Format: Howto 

How to add row numbers to a Spark DataFrame?

 Posted on August 20, 2020  |  8 minutes  |  1518 words  |  Wissem

In this tutorial, we will explore a couple of ways to add a consecutive, sequential row number to a DataFrame.

For example, let this be our DataFrame (taken from the Spark: The Definitive Guide GitHub repo):

[Read More]
Tech: Spark  Format: Howto 

Spark DataFrame - two ways to count the number of rows per partition

 Posted on August 15, 2020  |  2 minutes  |  331 words  |  Wissem

Sometimes, we need to compute the number of rows in each partition. There are two ways to do this:

  • The first way is using Dataframe.mapPartitions().

  • The second way (the faster of the two, in my observations) is using the spark_partition_id() function, followed by a group-by count aggregation.

[Read More]
Tech: Spark  Format: Howto 

Wissem  • © 2025  •  Broken Backend

Hugo v0.151.0 powered  •  Theme Beautiful Hugo adapted from Beautiful Jekyll