
Using Glue as a Common Metadata Store for Hive, Presto, and Redshift

In this post, you will learn how to use Glue crawlers to read the schema from files on S3 and build a common metadata store for other AWS services such as Hive, Presto, and Redshift Spectrum.

Before the Glue Data Catalog (GDC) was available, it was painful to create schemas for the same file because the schema definitions differ between Hive and Redshift. Creating multiple schemas for the same file structure not only threatens data integrity but also involves a lot of manual work. Most of us already know Glue as an ETL framework on AWS, but its cataloging features quickly add value to the business and help streamline technical workflows as well.

Populating Glue Data Catalog (Source: AWS documentation https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html)

In AWS, no matter what kind of data we deal with, S3 is usually the primary ingestion/landing zone for our ETL processing. Glue crawlers read the data from S3 and build a data catalog, and we can use that same catalog as our Hive metastore, which integrates with Presto as well.
Here are the steps:
  • Drop the dataset on S3, or assume the data is already there.
  • Set up a Glue crawler to read the schema from the file on S3.

Glue Crawler to read the schema from S3 file

Once the Glue crawler has run, we should be able to see the Glue database/catalog being created. We can get the schema definition by accessing the tables under that Glue catalog.
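If you prefer to script this step instead of using the console, a rough AWS CLI sketch looks like the following. The crawler name, IAM role, Glue database name, and S3 path are placeholders for illustration:

# Rough sketch with placeholder names/paths: create a crawler that scans an
# S3 prefix and writes the inferred schema into a Glue database, then run it.
aws glue create-crawler \
  --name demo-s3-crawler \
  --role AWSGlueServiceRole-demo \
  --database-name demo_glue_db \
  --targets '{"S3Targets": [{"Path": "s3://my-bucket/landing/dataset/"}]}'
aws glue start-crawler --name demo-s3-crawler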
  • Now it's time to spin up your EMR cluster and make sure these configuration settings are applied: configure hive-site to use the Glue Data Catalog as its metastore, and enable presto-connector-hive and spark-hive-site as well (a minimal CLI sketch follows the configuration screenshots below).

EMR Cluster Configuration Details
EMR Cluster Configuration Settings — to integrate Hive and Presto with GDC(Glue Data Catalog)
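For reference, here is a minimal AWS CLI sketch of launching such a cluster. The cluster name, key pair, release label, and instance sizes are placeholder assumptions; the three classifications are the settings that point Hive, Spark, and Presto at the Glue Data Catalog:

# Minimal sketch with placeholder names/sizes: launch an EMR cluster whose
# Hive, Spark, and Presto metastores all point at the Glue Data Catalog.
aws emr create-cluster \
  --name "glue-catalog-demo" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=Hive Name=Presto Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --configurations '[
    {"Classification": "hive-site",
     "Properties": {"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},
    {"Classification": "spark-hive-site",
     "Properties": {"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},
    {"Classification": "presto-connector-hive",
     "Properties": {"hive.metastore": "glue"}}
  ]'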

Note:
~ If you are unable to SSH using the master public DNS, try SSH with the private IP address. (Get the IP address of the master node from the Hardware tab -> Master Instance Details, where you will find the EC2 instance ID along with its status, public DNS name, private DNS name, public IP address, and private IP address.)
~ Download the private key file (.pem) associated with the cluster. Then, use: ssh -i <name of your .pem file> hadoop@<DNS name or IP address>

SSH to the EMR Cluster

  • Access Hive on the EMR cluster; we should be able to see the Glue databases here and access the Glue tables as Hive tables.

Accessing Hive Console

Now, we should be able to see the Glue Data Catalog database as one of the Hive databases. Use that database to run all your queries; a small example follows the caption below.

Running a simple count(*) on the hive table
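For example, from the Hive CLI on the master node the check can be as simple as this (demo_glue_db and my_table are hypothetical names; use the database and table your crawler created):

# Hypothetical database/table names: list the Glue databases exposed through
# Hive, switch to one of them, and run a simple count.
hive -e "
SHOW DATABASES;
USE demo_glue_db;
SELECT COUNT(*) FROM my_table;
"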

  • Access the Presto CLI on EMR; we can use the Glue catalog as our catalog and start querying the dataset on S3 directly (see the sketch after the captions below).

Accessing Presto-CLI on EMR
Running a simple select count(*) on presto
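The same count through the Presto CLI might look like this (again, demo_glue_db and my_table are placeholders; on EMR the Glue databases show up under the hive catalog):

# Placeholder schema/table names: query the Glue-backed table through
# Presto's hive catalog, reading the data on S3 directly.
presto-cli --catalog hive --schema demo_glue_db \
  --execute "SELECT COUNT(*) FROM my_table;"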

  • Now create external tables on Redshift using an IAM role (which must have permissions to access the S3 and Glue services), since we will create and access Redshift tables through the Glue catalog:
create external schema <Redshift schema name>
from data catalog
database '<glue catalog DB name>'
region 'us-east-1'
iam_role '<iam role arn>';
Note: Creating external schemas or tables on Redshift means we are using Redshift Spectrum, so we will be billed at Spectrum rates (usually based on the amount of data each query scans).
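Once the external schema is created, the Glue tables can be queried from Redshift like any other table. A hypothetical example (the schema and table names are placeholders):

-- Placeholder names: the external schema defined above exposes the Glue
-- tables to Redshift Spectrum, so a plain query scans the data on S3.
select count(*)
from my_spectrum_schema.my_table;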
Thank you for your time. 
