Data Engineer – New Grad – OpenAI
OpenAI is the world's leading AI safety and research company and the creator of ChatGPT, GPT-4, DALL-E, Sora, and Whisper. With 200+ million weekly ChatGPT users and $3B+ in annual revenue, OpenAI is building artificial general intelligence (AGI) to benefit all of humanity. Data infrastructure is critical to that work: the quality, scale, and diversity of training data directly determine the capabilities of the world's most powerful AI models. Data engineers at OpenAI work on some of the most high-stakes data pipelines ever built, processing petabytes of web data, human feedback, and model-generated content that shape frontier AI capabilities. We are hiring New Grad Data Engineers to build the data infrastructure that enables OpenAI's mission.
Responsibilities
- Build pre-training data pipelines that process and filter web-scale text corpora (Common Crawl, books, code, scientific papers) for large language model training
- Develop the RLHF data pipeline that collects, processes, and quality-scores the human preference feedback used to align GPT models with human values through reinforcement learning
- Implement deduplication and contamination detection pipelines that identify and remove duplicate, low-quality, and benchmark-overlapping content from training datasets at petabyte scale
- Build model evaluation data infrastructure for generating and managing benchmark datasets that measure capabilities across coding, reasoning, and safety dimensions
- Develop the usage analytics data platform that processes ChatGPT user interaction logs for product analytics, content policy enforcement, and model improvement signals
- Implement data privacy pipelines ensuring compliance with GDPR, CCPA, and OpenAI's responsible data use commitments across training and product data systems
Requirements
- Bachelor's degree in Computer Science, Data Engineering, or Machine Learning
- Strong Python skills for large-scale data processing (PySpark, Dask, Ray)
- SQL proficiency and experience with cloud data platforms (GCP BigQuery, Snowflake)
- Understanding of ML data concepts: pre-training data, RLHF, data quality, and model evaluation
- Genuine passion for AI safety and the responsible development of artificial general intelligence
Benefits
- Compensation among the most competitive in the technology industry, including OpenAI equity
- Work at the most consequential technology company in the world
- Comprehensive medical, dental, and vision benefits with 100% premium coverage
- 401(k) with OpenAI matching
- San Francisco Mission District headquarters and OpenAI's collaborative, mission-driven culture