Software Engineer, Reliability at OpenAI

San Francisco, California, United States

Compensation: $255,000 – $490,000

Experience Level: Senior (5 to 8 years), Expert & Leadership (9+ years)

Job Type: Full Time

Visa: Unknown

Industries: Artificial Intelligence, Technology

Requirements

Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent work experience)
Proven experience as a reliability engineer or in a similar role at a fast-paced, rapidly scaling company
Strong proficiency in cloud infrastructure
Proficiency in programming/scripting languages
Experience with containerization technologies and container orchestration platforms like Kubernetes
Knowledge of IaC tools such as Terraform or CloudFormation
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
Experience with microservices architecture and service mesh technologies
Knowledge of security best practices in cloud environments
Enjoyment in seeking out and addressing bottlenecks and areas for performance improvement in systems
Ability to apply Infrastructure as Code (IaC) principles to automate infrastructure provisioning and configuration management
Experience collaborating with cross-functional teams to ensure reliability and scalability in design and development
Track record of accelerating engineering reliability by empowering engineers with excellent tooling and systems
Humble attitude, eagerness to help colleagues, and desire to do whatever it takes to make the team succeed
Ownership of problems end-to-end and willingness to pick up missing knowledge to get the job done

Responsibilities

Design and implement solutions to ensure the scalability of infrastructure to meet rapidly increasing demands
Collaborate with development teams to make the systems they design and operate more reliable
Implement and manage monitoring systems to proactively identify issues and anomalies in production environments
Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability
Implement fault-tolerant and resilient design patterns to minimize service disruptions
Build and maintain automation tools to streamline repetitive tasks and improve system reliability
Partner with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world
Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability

Skills

Key technologies and capabilities for this role

Scalability, Monitoring Systems, SLOs, SLIs, Fault-Tolerant Design, Infrastructure Reliability, Observability, SRE

Questions & Answers

Common questions about this position

What is the salary range for the Software Engineer, Reliability position?

The salary range is $255K - $490K.

Is this a remote position or does it require office work?

This information is not specified in the job description.

What skills are most important for this reliability engineering role?

Key skills include applying Infrastructure as Code (IaC) principles, collaborating with cross-functional teams, implementing monitoring systems, defining SLOs and SLIs, designing fault-tolerant systems, and building automation tools.

What is the work environment like at OpenAI for this role?

The role involves a deeply iterative, collaborative, fast-paced environment in which you work closely with cross-functional teams, including software engineers, product managers, data scientists, researchers, and designers. It places a strong emphasis on safety and reliability and includes participation in an on-call rotation.

What qualities make someone thrive in this Software Engineer, Reliability role?

Candidates thrive if they enjoy finding and addressing bottlenecks and opportunities for performance improvement, have a track record of accelerating engineering reliability through tooling, and bring a humble attitude, an eagerness to help colleagues, and a desire to do whatever it takes to make the team succeed.

OpenAI

Develops safe and beneficial AI technologies

About OpenAI

OpenAI develops and deploys artificial intelligence technologies aimed at benefiting humanity. The company creates advanced AI models capable of performing various tasks, such as automating processes and enhancing creativity. Products such as Sora, which lets users generate videos from text descriptions, showcase the versatility of its AI applications. Unlike many competitors, OpenAI operates under a capped-profit model, which limits the profits it can make and ensures that excess earnings are redistributed to maximize the social benefits of AI. This commitment to safety and ethical considerations is central to its mission of ensuring that artificial general intelligence (AGI) serves all of humanity.

Headquarters: San Francisco, California

Year Founded: 2015

Total Funding: $18,433.2M

Company Stage: Late-stage VC

Industries: AI & Machine Learning

Employees: 1,001-5,000