Senior Site Reliability Engineer, Devices
Flock Safety · Full Time
Senior (5 to 8 years)
Candidates must possess a strong understanding of Linux system administration and internals, along with the ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently.
The Operations Engineer will be responsible for the day-to-day provisioning, management, and uptime of CoreWeave's server nodes; configuring and maintaining large-scale, high-performance supercomputing clusters; monitoring and analyzing system performance; taking remediation actions to maintain cloud health; documenting team processes; and participating in on-call rotations, including after-hours and weekend work.
Cloud service for GPU-accelerated workloads
CoreWeave provides cloud computing services focused on GPU-accelerated workloads, which are essential for tasks that require high computational power. Its services cater to industries such as artificial intelligence, machine learning, visual effects rendering, and data processing. Clients access powerful computing resources on a pay-as-you-go basis, which lets them avoid the cost of purchasing expensive hardware. CoreWeave's infrastructure runs on a bare-metal serverless Kubernetes platform, which improves performance while reducing operational complexity for users. Clients can choose from a variety of NVIDIA GPUs to match their computing needs and balance performance against cost. The company's goal is to offer flexible, scalable computing solutions that meet the demands of diverse clients, from tech companies to film studios.