Applied ML Engineer, Data
51 - 200 employees
Founded by Sean Parker
Artificial Intelligence • Gaming • Social Impact
Cantina is a company that specializes in creating advanced AI characters that can talk, feel, and capture their adventures with selfies. It offers a platform where users can unleash their AI bots in online communities, allowing these lifelike, social creatures to interact with humans. Cantina focuses on building networks of AI influencers and encourages users to explore and build their own collections of AI bots. The company's mission is to foster an interactive universe, inviting creativity and social interaction through digital personalities and AI technology.
📋 Description
• Build and maintain data pipelines for large video generation models, including data ingestion, parsing, filtering, preprocessing, and dataset curation at scale, using tools such as AWS S3 and DynamoDB.
• Design and run annotation workflows across crowd platforms such as MTurk (Mechanical Turk) and Prolific, including task design, quality control, and label validation.
• Train, evaluate, and improve smaller supporting models used for data filtering, quality assessment, preprocessing, or other parts of the ML pipeline.
• Partner closely with research and engineering teams to turn experimental workflows into scalable, repeatable systems that support model training and evaluation.
• Own data quality across the pipeline by identifying bottlenecks, failure modes, and low-quality sources, and continuously improving tooling and processes.
• Build internal tools and automation that make it easier to prepare datasets, launch annotation jobs, monitor outputs, and support model development end to end.
• Drive larger pipeline projects from start to finish, such as new dataset creation efforts or upgrades to labeling and preprocessing infrastructure.
• Work within a Kubernetes-based training infrastructure, ensuring datasets are properly prepared, formatted, and delivered to training clusters.
• Profile and optimize research model inference scripts used in preprocessing, ensuring that model-driven filtering and transformation stages run within practical time and cost constraints when applied to large-scale raw data.
🎯 Requirements
• 3+ years of experience in machine learning, applied ML, data pipelines, or related engineering roles, ideally on large-scale multimodal, video, or vision-based systems.
• Strong Python programming skills and solid experience building reliable data processing and preprocessing pipelines for ML workflows.
• Hands-on experience preparing training data for ML models, including parsing, filtering, dataset curation, quality control, and large-scale data handling with tools such as AWS S3 and DynamoDB.
• Familiarity with annotation and labeling workflows, including task design, vendor or crowd-platform orchestration (e.g., MTurk or Prolific), and methods for ensuring label quality.
• Experience with Kubernetes for orchestrating distributed workloads, including data preprocessing, pipeline execution, and dataset delivery to training clusters.
• Comfort working across cloud and on-demand compute environments such as AWS and RunPod, with the ability to port and optimize pipelines across infrastructure.
• Familiarity with distributed data processing frameworks and experience designing systems that operate reliably at scale across many nodes or workers.
• Working knowledge of PyTorch and the broader deep learning stack, with the ability to read, debug, and optimize research model inference code for use in production preprocessing pipelines.
• Ability to work cross-functionally with research and engineering teams and translate experimental ideas into robust, scalable systems.
• Bachelor's, Master's, or PhD in Computer Science, Machine Learning, Engineering, Mathematics, or a related technical field; experience in generative video, computer vision, or multimodal ML is strongly preferred.
• Bonus: experience training, evaluating, or fine-tuning smaller ML models used for classification, filtering, ranking, quality assessment, or other supporting tasks in an ML pipeline.