Software Engineer, Infrastructure Reliability at OpenAI

San Francisco, California, United States

Compensation: $255,000 – $405,000
Experience Level: Senior (5 to 8 years), Expert & Leadership (9+ years)
Job Type: Full Time
Visa: Unknown
Industries: AI, Technology

Requirements

  • 4+ years of relevant industry experience, with 2+ years leading large scale, complex projects or teams as an engineer or tech lead
  • Deep understanding of distributed systems principles and proven track record in building and operating scalable and reliable systems
  • Keen eye for performance and optimization in complex, globally distributed systems
  • Experience operating orchestration systems such as Kubernetes at scale and building abstractions over cloud platforms
  • Comfortable working in Linux environments, and with tools like Kubernetes, Terraform, CI/CD pipelines, and modern observability stacks
  • Experience collaborating with cross-functional teams to ensure reliability and scalability in design and development
  • Strong proficiency in cloud infrastructure (such as AWS, GCP, or Azure) and IaC tools such as Terraform
  • Proficiency in programming/scripting languages
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and ELK stack
  • Experience with microservices architecture and service mesh technologies
  • Knowledge of security best practices in cloud environments
  • Strong understanding of distributed systems, networking, and database technologies
  • Excellent problem-solving skills and ability to work in a fast-paced environment
  • Passion for distributed systems at scale with focus on reliability, scalability, security, and continuous improvement
  • Proven experience as a reliability engineer, production engineer, or similar role in a fast-paced, rapidly scaling company
  • Humble attitude, eagerness to help colleagues, and desire to do whatever it takes to make the team succeed
  • Ability to own problems end-to-end and willingness to pick up missing knowledge
  • Comfortable with ambiguity and rapid change

Responsibilities

  • Design, build, and operate reliable and performant systems used across engineering
  • Identify and fix performance bottlenecks and inefficiencies, ensuring infrastructure can scale to the next order of magnitude
  • Dig deep to resolve complex issues
  • Continuously improve automation to reduce manual work
  • Improve internal tooling and developer experience
  • Contribute to incident response, postmortems, and development of best practices around system reliability and scalability

Skills

Distributed Systems
Performance Optimization
System Reliability
Automation
Incident Response
Postmortems
Observability
Scalability
Database Systems
Storage Systems

OpenAI

Develops safe and beneficial AI technologies

About OpenAI

OpenAI develops and deploys artificial intelligence technologies aimed at benefiting humanity. The company creates advanced AI models capable of performing various tasks, such as automating processes and enhancing creativity. Products such as Sora, which generates videos from text descriptions, showcase the versatility of its AI applications. Unlike many competitors, OpenAI operates under a capped-profit model, which limits the profits it can make and ensures that excess earnings are redistributed to maximize the social benefits of AI. This commitment to safety and ethical considerations is central to its mission of ensuring that artificial general intelligence (AGI) serves all of humanity.

Headquarters: San Francisco, California
Year Founded: 2015
Total Funding: $18,433.2M
Company Stage: Late-stage VC
Industries: AI & Machine Learning
Employees: 1,001-5,000

Benefits

Health insurance
Dental and vision insurance
Flexible spending account for healthcare and dependent care
Mental healthcare service
Fertility treatment coverage
401(k) with generous matching
20-week paid parental leave
Life insurance (complimentary)
AD&D insurance (complimentary)
Short-term/long-term disability insurance (complimentary)
Optional buy-up life insurance
Flexible work hours and unlimited paid time off (we encourage 4+ weeks per year)
Annual learning & development stipend
Regular team happy hours and outings
Daily catered lunch and dinner
Travel to domestic conferences

Risks

Elon Musk's legal battle may pose financial and reputational challenges for OpenAI.
Customizable ChatGPT personas could lead to privacy and ethical concerns.
Competitors like Anthropic raising capital may intensify market competition.

Differentiation

OpenAI's capped-profit model prioritizes ethical AI development over unlimited profit.
OpenAI's AI models, like Sora, offer unique video creation from text descriptions.
OpenAI's focus on AGI aims to create AI systems smarter than humans.

Upsides

OpenAI's $6.6 billion funding boosts its AI research and computational capacity.
Customizable ChatGPT personas enhance user engagement and satisfaction.
OpenAI's 'Operator' AI agent could revolutionize workforce automation by 2025.
