

Showing posts from January, 2020

Explode function using PySpark

Sometimes the data frame we get by reading or parsing JSON cannot be used as-is for processing or analysis. The explode function comes to the rescue. When df.printSchema() shows a column as an array of structs, using explode is a little trickier than with an array of plain elements.

Sample script that worked for me to explode an array of structs:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('test-explode').getOrCreate()

# spark.read replaces the older SQLContext reader
df = spark.read.json("<json file name>")

# explode() emits one row per array element; the new column is named "col" by default
exploded_df = (
    df.select("id", explode("names"))
      .select("id", "col.first_name", "col.middle_name", "col.last_name")
)
exploded_df.show()
```

To filter based on a condition:

male_names_list = exploded_df.filter(exploded_df.GENDER=='M')...