How to optimise loading partitioned JSON data in Spark?

In this tutorial we will explore ways to optimise loading partitioned JSON data in Spark.

I have used the SF Bay Area Bike Share dataset, which you can find here. The original data (status.csv) has gone through a few transformations. The result looks like:

[Figure: partitioned JSON data]

Loading from partitioned JSON files


We will load the data, filtered by station and month:

val df1 = spark.read
	.json("file:///data/bike-data-big/partitioned_status.json") // scan the partitioned JSON dataset
	.filter("station_id = 10 and (month in ('2013-08', '2013-09'))") // keep one station, two months

Although the code above does not contain any action yet, Spark starts three jobs that take a few minutes to complete (on a local setup, with 8 cores and 32 GB of RAM). This up-front work is mostly partition discovery and JSON schema inference: without a user-supplied schema, Spark has to read through the JSON files just to figure out their structure.

[Figure: slow JSON loading]
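The schema-inference pass is the easiest one to eliminate: supply the schema yourself. Here is a minimal sketch, assuming the transformed status data keeps the original status.csv columns plus station_id and month as partition columns; the field names and types are assumptions, so adjust them to your actual layout:

import org.apache.spark.sql.types._

// Hypothetical schema for the transformed status data; field names
// and types are assumptions, adjust them to match your files.
val statusSchema = StructType(Seq(
	StructField("bikes_available", IntegerType),
	StructField("docks_available", IntegerType),
	StructField("time", StringType),
	StructField("station_id", IntegerType), // partition column
	StructField("month", StringType)        // partition column
))

val df2 = spark.read
	.schema(statusSchema) // no inference pass: Spark trusts this schema
	.json("file:///data/bike-data-big/partitioned_status.json")
	.filter("station_id = 10 and (month in ('2013-08', '2013-09'))")

With an explicit schema, Spark no longer has to scan the files before the first action, and the filter on the partition columns still lets it prune every directory except station 10 for 2013-08 and 2013-09 (df2.explain() should show those predicates as PartitionFilters in the FileScan node). If you prefer to keep inference, the JSON reader's samplingRatio option at least limits how much data the inference pass reads.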
