How to become a data engineer?

To everyone out there who wants to become a data engineer: keep following this blog, as I am on the same path as you are. I am interested in solving data challenges of any size, big or small. Exposure to many tools and technologies is nice to have, but what is a must is understanding the underlying concepts, the technical architecture, and the internals of each tool. We become better data engineers only by trying things out, learning something new, and gaining hands-on experience with new technology. Only when we know what each tool does, and the pros and cons of using it, can we select the right tool for the right problem. So I want to catalog everything I learn here, in case it helps someone else who is on the same path as me. Just sharing :)


Primary skills to become a data engineer:

1. Programming skills (Java/Python/Scala)

2. Querying skills (SQL/HiveQL/Spark SQL)

3. ETL architectures (Batch/Streaming)

4. Data warehousing concepts / Database Design

5. Cloud computing (AWS/GCP/Azure)

6. Big Data (Hadoop/Spark)

7. Familiarity with scripting/automation (Python/Shell)
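
To make the first few items a little more concrete, here is a minimal sketch of a batch ETL job that combines programming, SQL, and the extract/transform/load pattern. It only uses the Python standard library. The sample data, the orders table, and the function names are all made up for illustration; a real pipeline would read from files, APIs, or queues and load into a proper warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input, inlined so the sketch is self-contained.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.50
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast column types."""
    return [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in rows
        if r["amount"]
    ]


def load(conn, records):
    """Load: write the cleaned records into a (hypothetical) orders table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_batch(conn, text):
    """Run one batch end to end, then query the result with SQL."""
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
    print(run_batch(conn, RAW_CSV))
```

Running it prints [('alice', 15.0)], because the row with the missing amount is dropped during the transform step. A streaming version would apply the same transform to records as they arrive instead of to one complete batch.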


Nice-to-have skills:

1. Version control tools (Git)

2. Automating deployments (Jenkins)

3. Writing efficient stored procedures and functions (SQL). Yes, I mean those hundreds of lines of SQL code.

4. Tools (Databricks, Pentaho, Sqoop, Online Editors)

5. Building data lakes and data warehouses (it really helps to build one the traditional way first and then try migrating it to the cloud).

