Operations Engineer, Fleet Reliability at CoreWeave

Livingston, New Jersey, United States

Compensation: Not Specified

Experience Level: Mid-level (3 to 4 years), Senior (5 to 8 years)

Job Type: Full Time

Visa: Unknown

Industries: AI Hyperscaler, Cloud Computing, Accelerated Computing, Data Centers

Requirements

Candidates must possess a strong understanding of Linux system administration and internals, along with the ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently.

Responsibilities

The Operations Engineer is responsible for the day-to-day provisioning, management, and uptime of CoreWeave’s server nodes. The role includes configuring and maintaining large-scale, high-performance supercomputing clusters, monitoring and analyzing system performance, taking remediation actions to maintain cloud health, documenting team processes, and participating in on-call rotations, including after-hours and weekend work.

Skills

Fleet Reliability

Operations

Provisioning

Server Management

Uptime

Configuration

Updates

Remote Troubleshooting

Supercomputing Clusters

Networking

Hardware Troubleshooting

Software Troubleshooting

System Performance Monitoring

GPU

CoreWeave

Cloud service for GPU-accelerated workloads

About CoreWeave

CoreWeave provides cloud computing services that focus on GPU-accelerated workloads, which are essential for tasks requiring high computational power. Its services cater to industries such as artificial intelligence, machine learning, visual effects rendering, and data processing. Clients can access powerful computing resources on a pay-as-you-go basis, allowing them to avoid the costs of purchasing expensive hardware. CoreWeave's infrastructure uses a bare-metal, serverless Kubernetes platform, which enhances performance while minimizing operational complexity for users. This setup lets clients choose from a variety of NVIDIA GPUs so they can balance performance against cost. The company's goal is to offer flexible and scalable computing solutions that meet the demands of diverse clients, from tech companies to film studios.

Headquarters: New York City, New York

Year Founded: 2017

Total Funding: $1,625.4M

Company Stage: SECONDARY

Industries: Enterprise Software, AI & Machine Learning

Employees: 501-1,000