
Senior Data Engineer

Maincode
Full-time
On-site
Melbourne, Victoria, Australia
$150,000 – $180,000 AUD per year

Overview

Maincode is building sovereign AI models in Australia. We are training foundation models from scratch, designing new reasoning architectures, and deploying them on state-of-the-art GPU clusters. Our models are built on datasets we create ourselves: curated, cleaned, and engineered for performance at scale. This is not buying off-the-shelf corpora or scraping without thought. This is building world-class datasets from the ground up.

As a Senior Data Engineer, you will lead the design and construction of these datasets. You will work hands-on to source, clean, transform, and structure massive amounts of raw data into training-ready form. You will design the architecture that powers data ingestion, validation, and storage for multi-terabyte to petabyte-scale AI training. You will collaborate with AI Researchers and Engineers to ensure every byte is high quality, relevant, and optimised for training cutting-edge large language models and other architectures.

This is a deep technical role. You will be writing code, building pipelines, defining schemas, and debugging unusual data edge cases at scale. You will think like both a data scientist and a systems engineer, designing for correctness, scalability, and future-proofing. If you want to build the datasets that power sovereign AI from first principles, this is your team.


What you’ll do

  • Design and build large-scale data ingestion and curation pipelines for AI training datasets

  • Source, filter, and process diverse data types, including text, structured data, code, and multimodal content, from raw form to model-ready format

  • Implement robust quality control and validation systems to ensure dataset integrity, relevance, and ethical compliance

  • Architect storage and retrieval systems optimised for distributed training at scale

  • Build tooling to track dataset lineage, reproducibility, and metadata at all stages of the pipeline

  • Work closely with AI Researchers to align datasets with evolving model architectures and training objectives

  • Collaborate with DevOps and ML engineers to integrate data systems into large-scale training workflows

  • Continuously improve ingestion speed, preprocessing efficiency, and data freshness for iterative training cycles


Who you are

  • Passionate about building world-class datasets for AI training, from raw source to training-ready form

  • Experienced in Python and data engineering frameworks such as Apache Spark, Ray, or Dask

  • Skilled in working with distributed data storage and processing systems, such as HDFS and cloud object storage like S3

  • Strong understanding of data quality, validation, and reproducibility in large-scale ML workflows

  • Familiar with ML frameworks like PyTorch or JAX, and how data pipelines interact with them

  • Comfortable working with multi-terabyte or larger datasets

  • Hands-on and pragmatic, you like solving real data problems with code and automation

  • Motivated to help build sovereign AI capability in Australia


Why Maincode

We are a small team building some of the most advanced AI systems in Australia. We create new foundation models from scratch, not just fine-tune existing ones, and we build the datasets they are trained on from the ground up.

We operate our own GPU clusters, run large-scale training, and integrate research and engineering closely to push the frontier of what is possible.

You will be surrounded by people who:

  • Care deeply about data quality and architecture, not just volume

  • Build systems that scale reliably and repeatably

  • Take pride in learning, experimenting, and shipping

  • Want to help Australia build independent, world-class AI systems