Data Engineer

Posted at: 06/27/2025

Cupertino, CA

Full Remote - IT - Development / Other Technologies - Contract - Job ID: 25-14750

ABOUT THIS FEATURED OPPORTUNITY

Join our Data Operations Team as a Python Engineer, supporting machine learning and AI teams that depend on high-quality datasets to train their models. You'll work at the intersection of data engineering, automation, and operational excellence, delivering datasets across approximately 200 projects per year. These include use cases such as image generation, animation, and other generative AI applications. Many projects are highly confidential, so engineers must be able to assess data quality and relevance even without full visibility into the end use case.

We're looking for someone who can design and manage data pipelines, debug issues efficiently, and operate independently across multiple fast-paced projects. Strong communication and attention to detail are essential: you'll need to respond quickly, handle issues proactively, and deliver accurate work the first time. Mistakes or rework can pose serious risks to project timelines, so precision and accountability are critical. The ideal candidate will be highly responsive, reliable, and thorough in communication, and must be available to work 9am–4pm PST, even if located in a different state.

THE OPPORTUNITY FOR YOU

Work on 3–4 projects to start, scaling up to 6–10 during peak season
Contribute to data collection, annotation, and generation pipelines using Python and distributed systems (Spark)
Collaborate with a tight-knit and highly responsive team, engaging in biweekly check-ins with team leads
Gain experience with confidential, multimodal, and LLM-related datasets across a high volume of AI/ML projects
Influence how large-scale datasets are prepared for training models across an enterprise AI org

KEY SUCCESS FACTORS

2+ years of experience in data engineering or Python development, with a strong foundation in Computer Science or Data Science
Proficiency in distributed systems (e.g., Spark) and a solid understanding of multithreading vs. multiprocessing
Demonstrated ability to design scalable pipelines, handle diverse data structures, and manage large-scale workflows
Comfortable operating under pressure, context-switching across multiple projects, and working with ambiguity

NICE TO HAVES

Familiarity with Airflow, Spark, or Flask for scalable API/UI development
Experience with Docker, containerization, and CI/CD tools (e.g., Jenkins)
Exposure to LLMs, multimodal data, or generative AI workflows
Prior involvement in designing tools to automate or scale ML data pipelines
Ability to collaborate in a high-volume, high-trust environment; your work will power some of the most impactful ML use cases in the organization