Key Responsibilities
The following outlines the primary responsibilities for an AWS Data Engineer, but the role will require flexibility and adaptation as the business expands.
Software Engineering - Fundamentals
A fundamental software engineering skill set underpins all engineering work at cloudandthings.io.
- Experience with modern operating systems, particularly Linux.
- Experience working with terminals / the command line (CLI).
- Experience with version control software, particularly Git.
- Software fundamentals, such as problem-solving, data structures and algorithms, software development methodologies, and common design patterns and best practices.
- Experience with Python and SQL; experience with additional programming languages is preferred.
Cloud - Core
- Ability to identify serverless, managed, and roll-your-own options, and to weigh their strengths and weaknesses.
- Development experience with Terraform or CloudFormation (IaC) to provision and maintain data infrastructure.
- Familiarity with the AWS Well-Architected Framework and experience implementing its principles.
Data Engineering - General
- Working knowledge of Big Data: Volume, Variety, Velocity, etc.
Data Engineering - Collection
- Good experience collecting data in hybrid environments: on-premises to cloud, and cloud to on-premises.
- Real-time: AWS Kinesis Data Streams (KDS), Kafka / MSK (see the sketch after this list).
- Near real-time: AWS Kinesis Data Firehose (KDF).
- Batch: AWS DataSync, Storage Gateway, Transfer Family (FTP / SFTP / MFT), Snowball.
- Databases: ODBC / JDBC, database replicas and replication tools, migration tools such as the Database Migration Service (DMS) and Schema Conversion Tool (SCT).
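For illustration only, the sketch below shows the kind of real-time ingestion code this work involves: pushing a record onto a Kinesis data stream with boto3. The stream name, region, and payload are hypothetical.

```python
import json
import boto3

# Hypothetical stream and region; assumes AWS credentials are already configured.
kinesis = boto3.client("kinesis", region_name="eu-west-1")

event = {"order_id": 123, "amount": 49.99, "currency": "ZAR"}

# Each record needs a partition key; here the order id spreads records across shards.
response = kinesis.put_record(
    StreamName="example-orders-stream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["order_id"]),
)
print(response["SequenceNumber"])
```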
Data Engineering - Storage
- Basic experience working with on-premises storage solutions: NFS / SMB, NAS / DAS, etc.
- Cloud Storage: Amazon S3.
- Data Formats: Parquet, CSV, Avro, JSON, etc.; compression and partitioning (see the sketch after this list).
- NoSQL Databases: AWS DynamoDB, MongoDB, etc.
- Relational Databases: AWS RDS or similar, MySQL / PostgreSQL, Aurora.
- Massively Parallel Processing: Redshift.
- Search Databases: AWS OpenSearch (formerly Elasticsearch).
- Caching: Redis / Memcached.
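A minimal sketch of the data formats and partitioning listed above, assuming pandas and pyarrow are available; the bucket, prefix, and columns are hypothetical.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical example data: daily events to be stored partitioned by date.
df = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [1, 2, 3],
    "amount": [10.0, 20.5, 7.25],
})

table = pa.Table.from_pandas(df)

# Write Hive-style partitions (event_date=.../part-*.parquet) to S3.
# Parquet is compressed (snappy by default); bucket name is hypothetical.
pq.write_to_dataset(
    table,
    root_path="s3://example-data-lake/events/",
    partition_cols=["event_date"],
)
```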
Data Engineering - Processing
- Strong experience developing ETL processes and integrating with source and destination systems.
- Strong experience using Python, Spark (e.g. PySpark), and SQL to work with data (see the sketch after this list).
- Basic experience with lakehouse technologies such as Apache Hudi, Apache Iceberg, or Databricks Delta Lake.
- AWS Lambda for file / stream / event processing, ETL, and triggers.
- General ETL, cataloguing of data, and access control: AWS Glue ETL, the Glue Data Catalog, Lake Formation.
- Hadoop-style processing on AWS Elastic MapReduce (EMR): mainly Spark and Hive; instance types, cluster sizing, and job sizing.
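A minimal PySpark ETL sketch along the lines described above: read raw CSV from S3, clean and type the data, and write partitioned Parquet for downstream querying. Bucket names and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-orders-etl").getOrCreate()

# Extract: read raw CSV landed in S3 (hypothetical bucket).
raw = spark.read.option("header", True).csv("s3://example-raw/orders/")

# Transform: type-cast, filter out invalid rows, derive a partition column.
orders = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)
       .withColumn("order_date", F.to_date("order_ts"))
)

# Load: write partitioned Parquet for downstream querying (e.g. Athena / Redshift Spectrum).
orders.write.mode("overwrite").partitionBy("order_date").parquet("s3://example-curated/orders/")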
Data Engineering - Analysis
- Basic understanding of cloud data warehouse architecture and data integration: AWS Redshift and Redshift Spectrum.
- Data modelling skills: normalisation, facts and dimensions.
- Experience implementing data quality checks.
- On-object-store querying: AWS Athena, Glue Crawlers (see the sketch after this list).
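A minimal sketch of on-object-store querying with Athena via boto3; the database, table, and results bucket are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Hypothetical database, table, and results location.
response = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "example_curated"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```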
Data Engineering - Security
- Basic experience with authentication and identity federation, authorisation, and RBAC as they pertain to data.
- Basic knowledge of cloud network security: AWS VPC, VPC endpoints, subnets, Direct Connect.
- Identity and Access Management: AWS IAM, STS, and cross-account access (see the sketch after this list).
- Encryption for data at rest and data in motion across all services used: AWS KMS / SSE, TLS, etc.
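A minimal sketch of cross-account access with STS via boto3; the role ARN and bucket name are hypothetical.

```python
import boto3

sts = boto3.client("sts")

# Assume a role in another account to obtain temporary credentials (ARN is hypothetical).
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/example-data-reader",
    RoleSessionName="data-engineer-session",
)
creds = assumed["Credentials"]

# Use the temporary credentials to access a resource in the other account.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket="example-cross-account-bucket", MaxKeys=5))
```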
Data Engineering - Operations
- Orchestration of data pipelines: AWS Step Functions, Managed Workflows for Apache Airflow (MWAA), Glue, etc. (see the sketch after this list).
- Basic knowledge of the Well-Architected pillars and how to apply them:
- Operational Excellence, Security, and Reliability.
- Performance Efficiency, Cost Optimisation, and Sustainability.
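A minimal orchestration sketch, assuming Apache Airflow 2.x (as run on MWAA); the DAG name and task logic are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task logic standing in for real extract / transform / load steps.
def extract():
    print("pull data from source systems")

def transform():
    print("run Spark / SQL transformations")

def load():
    print("publish curated data for analytics")

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Run the steps in order.
    t1 >> t2 >> t3
```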
Advantageous technical skills
- Any other data-related experience, e.g. working with Hadoop, databases, analytics software, etc.
- Experience working with a second cloud vendor, e.g. both AWS and Azure.
- Experience working with Docker / containers / CI/CD pipelines for data.
- Experience working with and contributing to open-source projects.