Hive(EMR) and DynamoDB Integration

Follow below steps to move data from Hive to Dynamo DB.
Source link: https://docs.amazonaws.cn/en_us/amazondynamodb/latest/developerguide/EMRforDynamoDB.html

Below is just a quick summary from the above documentation link.

1. Create a hive table. (Ignore this step if hive table is already created)

CREATE TABLE hive_features
(feature_id BIGINT,
feature_name STRING ,
feature_class STRING ,
state_alpha STRING,
prim_lat_dec DOUBLE ,
prim_long_dec DOUBLE ,
elev_in_ft BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';

2. Now create another external hive table which just references to the hive table (but provides the mapping between Hive and DDB tables)

CREATE EXTERNAL TABLE ddb_features
(feature_id BIGINT,
feature_name STRING,
feature_class STRING,
state_alpha STRING,
prim_lat_dec DOUBLE,
prim_long_dec DOUBLE,
elev_in_ft BIGINT)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES(
"dynamodb.table.name" = "Features",
"dynamodb.column.mapping"="feature_id:Id,feature_name:Name,feature_class:Class,state_alpha:State,prim_lat_dec:Latitude,prim_long_dec:Longitude,elev_in_ft:Elevation"
);

3. Now insert data into Dynamo DB table from Hive table (for which Hive internally uses Map-Reduce Job to do this copy)

INSERT OVERWRITE TABLE ddb_features
SELECT
feature_id,
feature_name,
feature_class,
state_alpha,
prim_lat_dec,
prim_long_dec,
elev_in_ft
FROM hive_features;

4. we can use hive QL to query DDB table

SELECT DISTINCT feature_class
FROM ddb_features
ORDER BY feature_class;

SELECT feature_name, state_alpha
FROM ddb_features
WHERE feature_class = 'Lake'
AND feature_name LIKE 'M%'
ORDER BY feature_name;

SELECT state_alpha, feature_class, COUNT(*)
FROM ddb_features
WHERE elev_in_ft > 5280
GROUP by state_alpha, feature_class
HAVING COUNT(*) >= 3
ORDER BY state_alpha, feature_class;

Source link:

https://docs.amazonaws.cn/en_us/amazondynamodb/latest/developerguide/EMRforDynamoDB.html

Comments

AWS Connect: Reporting and Visualizations

Amazon connect offers: - built in reports i.e., historical and real-time reports. We can customize these reports, schedule them and can integrate with any BI tool of our requirement to query and view the connect data. Sample solution provided by AWS: 1. Make sure Connect is exporting the CTR data using Kinesis Data Stream 2. Use Kinesis Firehose to deliver the CTR that are in KDS to S3. (CTR's can be delivered as batch of records, so one s3 object might have multiple CTR's). AWS Lambda is used to add a new line character to each record, which makes object easier to parse. 3. s3 Event Notifications are used to send an event to modify the CTR record and saves it in S3. 4. Athena queries the modified CTR's using SQL. Use partitions to restrict the amount of data scanned by each query, improving performance and reducing cost. Lambda function is used to maintain the partitions. 5. Quicksight is used to visualize the modified CTRs. Solution variations: Convert re...

Must use VS Code Extensions for anyone working on Cloud

Here are the list of VS Code extensions that anyone working on cloud technologies can use to speed up their development. To download any extension, refer to the extension tab on your VS code window: As we will manage all our cloud resources using Terraform, we will start with Terraform Autocomplete Extension. 1. Terraform Extensions Terraform: to manage terraform resources directly from VS Code. Terraform Autocomplete : useful when we are creating terraform resources. 2. Docker : To build, manage and deploy docker containers from VS Code. 3. Python : extension that provides python interpreter 4. Prettier-Code formatter : 5. Markdown Preview : 6. Git : Git History : Git Graph : Now we can select the below extensions, and click on install. AWS VSCode Extensions: 1. AWS Toolkit : To interact with AWS resources directly from VS Code. Helpful in taking a look of AWS resources without having to login into console, provides us with a very cool UI to g...

SoleTechie: Setting up Gitlab

- Created a gitlab group: Group name : SoleTechie Group URL : http://gitlab.com/soletechie1 Visibility level : private Group ID : 52826632 - Creating a Gitlab project: Project name : cicd-demo Project URL : https://gitlab.com/soletechie/ Project slug : cicd-demo Project description : Setting up git lab runners and trying to implement the CI-CD flow Project deployment target (optional) : Infrastructure provider (Terraform) Visibility level : Private

soletechie

Search This Blog