Principal / Senior Data Engineer

Troveo AI
Full-time
On-site
San Francisco, California, United States

Troveo is the largest licensable video library for AI model training. We partner with thousands of content licensors—ranging from top-tier studios and production houses to leading YouTube creators—to supply video content to the world’s foremost research labs. Our mission is to rapidly deliver massive volumes of video content, to exact specifications, fueling next-generation generative and world-understanding AI models.

Data Engineering is central to our success. Each week, we process petabytes of video data—quickly, cost-effectively, and with uncompromising quality. As a data engineer at Troveo, you’ll focus on:

  • Lowering costs and reducing turnaround times for processing content.

  • Enhancing and transforming video data to make it easier to discover and more valuable for our customers.

We are seeking a Principal or Senior Data Engineer with demonstrated expertise in Python and large-scale data management. Practical experience with AWS services (S3, EC2, etc.), search, and large databases is essential. Familiarity with video data is a plus, but not required.

Responsibilities

  • Data Pipeline Development: Design, build, and maintain scalable, efficient data pipelines in Python.

  • AWS Ecosystem: Leverage AWS services for storage, retrieval, and processing in production environments — S3 for data storage across multiple storage tiers, and EC2 for compute (currently running clusters of 50k G-type instances).

  • Big Data Handling: Develop and optimize systems to handle petabyte-scale datasets with a focus on performance, reliability, and cost-effectiveness.

  • Metadata Generation: Leverage self-hosted open-source LLMs and managed APIs to generate reliable metadata that powers discovery and enhances the value of the content we deliver.

  • Discovery: Build search capabilities from the ground up, leveraging visual, semantic, and taxonomic data to deliver the right content to our customers.

  • Monitoring & Reliability: Implement robust monitoring, alerting, and logging to ensure smooth data flow and quickly troubleshoot issues.

  • Collaboration: Work cross-functionally with data scientists, software engineers, and product teams to understand data needs and deliver optimized solutions.

  • Video Processing (Preferred): If applicable, process and manage video data for analytics, quality control, and other use cases.

Required Qualifications

  • Python Proficiency: Strong coding skills in Python (including familiarity with libraries for data manipulation and analysis).

  • AWS Expertise: Hands-on experience using core AWS services (S3, EC2, possibly Lambda, EMR, or ECS).

  • Big Data Skills: Demonstrated ability to work with large-scale datasets (petabyte-level), ensuring high performance and scalability.

  • Database & Storage: Familiarity with large Postgres databases.

  • Automation & Scripting: Comfortable building CI/CD pipelines and automating repetitive tasks.

Nice to Have

  • Video Processing: Experience handling or transforming video data (e.g., transcoding, extracting metadata, compiling FFmpeg).

  • Machine Learning Pipelines: Familiarity with ML and Computer Vision workflows or frameworks (OpenCV, TensorFlow, PyTorch, etc.).

  • Security Best Practices: Understanding of AWS IAM, encryption, and SOC 2 compliance standards.

What We Offer

  • An opportunity to work with massive datasets and cutting-edge cloud technologies, serving the biggest companies in tech as they build the next generation of AI models.

  • A collaborative environment with a talented, diverse team of engineers and data experts.

  • Competitive compensation and benefits with room for career growth and professional development.

  • This role is remote (work from home), with the option to meet up from time to time if you are located in the SF Bay Area.