Sometimes the DataFrame we get by reading/parsing JSON cannot be used as-is for our processing or analysis. The `explode` function comes to the rescue. When `df.printSchema()` shows a column that is an array of structs, using `explode` is a little trickier than it is for an array of plain elements.

Here is a sample script that worked for me to explode an array of structs:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('test-explode').getOrCreate()

# Read the JSON file into a DataFrame (spark.read.json replaces the
# older SQLContext-based approach)
df = spark.read.json("<json file name>")

# explode() turns each element of the "names" array into its own row;
# the exploded struct lands in a column named "col" by default
exploded_df = df.select("id", explode("names")) \
    .select("id", "col.first_name", "col.middle_name", "col.last_name")
exploded_df.show()
```

To filter the exploded rows on a condition, use `filter` on the new columns, e.g. `male_names_list = exploded_df.filter(exploded_df.gender == 'M')` (this assumes the struct also carries a `gender` field, so add `"col.gender"` to the select above if you need it).
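For concreteness, here is a minimal, self-contained sketch of the kind of input this handles. The data and field values are made up for illustration, and the DataFrame is built in memory instead of from a file, but the shape (an `id` plus a `names` array of structs) matches the script above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('explode-demo').getOrCreate()

# Two name records nested under a single id (invented sample data)
data = [
    '{"id": 1, "names": ['
    '{"first_name": "John", "middle_name": "K", "last_name": "Doe"},'
    ' {"first_name": "Jane", "middle_name": "M", "last_name": "Doe"}]}'
]
df = spark.read.json(spark.sparkContext.parallelize(data))

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- names: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- first_name: string (nullable = true)
#  |    |    |-- last_name: string (nullable = true)
#  |    |    |-- middle_name: string (nullable = true)

# Each struct in the array becomes its own row after the explode
exploded_df = df.select("id", explode("names")) \
    .select("id", "col.first_name", "col.middle_name", "col.last_name")
exploded_df.show()
# +---+----------+-----------+---------+
# | id|first_name|middle_name|last_name|
# +---+----------+-----------+---------+
# |  1|      John|          K|      Doe|
# |  1|      Jane|          M|      Doe|
# +---+----------+-----------+---------+
```

Note that `explode` drops rows where the array is null or empty; if you need to keep them, `explode_outer` is the variant to reach for.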