In this tutorial we will explore ways to optimise loading partitioned JSON data in Spark.
I have used the SF Bay Area Bike Share dataset; you can find it here. The original data (status.csv) has gone through a few transformations; the result is a JSON dataset partitioned by station and month.
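The exact preprocessing is not shown here, but a minimal sketch of it might look like the following — the column names are taken from the raw status.csv, while the timestamp parse pattern and output path are assumptions:

import org.apache.spark.sql.functions._

// Hypothetical preprocessing: derive a month column from the reading
// timestamp, then write the result as JSON partitioned by station and month.
val status = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:///data/bike-data-big/status.csv")

status
  // Adjust the parse pattern if the raw `time` format differs.
  .withColumn("month", date_format(to_timestamp(col("time")), "yyyy-MM"))
  .write
  .partitionBy("station_id", "month")
  .json("file:///data/bike-data-big/partitioned_status.json")

Partitioning on the two columns used in the filter below is what makes partition pruning possible when the data is read back.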
Loading from partitioned JSON files
We will load the data filtered by station and month:
val df1 = spark.read
  .json("file:///data/bike-data-big/partitioned_status.json")
  .filter("station_id = 10 and (month in ('2013-08', '2013-09'))")
Even though the code above does not contain any action yet, Spark launches three jobs that take a few minutes to complete (on a local setup with 8 cores and 32 GB of RAM):
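These up-front jobs are the price of schema inference: JSON files carry no schema, so Spark scans the data to derive one before the DataFrame can even be constructed. A common way to avoid that scan is to supply an explicit schema. Below is a minimal sketch; the field names and types are assumptions based on the raw status.csv columns and the filter above, so adjust them to match the actual data:

import org.apache.spark.sql.types._

// Assumed schema: the raw status.csv fields plus the derived
// partition columns (station_id, month).
val statusSchema = StructType(Seq(
  StructField("bikes_available", IntegerType),
  StructField("docks_available", IntegerType),
  StructField("time", StringType),
  StructField("station_id", IntegerType),
  StructField("month", StringType)
))

val df1 = spark.read
  .schema(statusSchema) // no inference scan needed
  .json("file:///data/bike-data-big/partitioned_status.json")
  .filter("station_id = 10 and (month in ('2013-08', '2013-09'))")

Spark still lists the partition directories, but it no longer has to read the files themselves just to infer a schema.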