Senior Site Reliability Engineer, Devices
Flock Safety
Full Time
Senior (5 to 8 years)
CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting-edge services powering the next wave of AI. Our technology provides enterprises and leading AI labs with the most performant, efficient, and resilient solutions for accelerated computing. Since 2017, CoreWeave has operated a growing footprint of data centers covering every region of the US and across Europe. CoreWeave was ranked as one of the TIME100 most influential companies of 2024. CoreWeave powers the creation and delivery of the intelligence that drives innovation.
The Fleet Reliability Operations team is responsible for the day-to-day provisioning, management, and uptime of CoreWeave’s ever-expanding fleet of server nodes. Playing a central role in CoreWeave’s growth strategy, this team is on the front line for configuration, updates, and remote troubleshooting of our highest tier of supercomputing clusters and their networking, delivery platforms, and tools dependencies. You will be in a daily battle with the forces of entropy to maximize the number of nodes CoreWeave can deliver to customers.
We are seeking curious, creative, and persistent problem solvers to join our Fleet Reliability Operations team and help us drive batches of server nodes through our provisioning and validation processes while efficiently and effectively troubleshooting node or cluster problems as they arise. The successful candidate will join a team of committed engineers working to deploy nodes as fast as they can be racked and turned on.
Minimum Qualifications:
Cloud service for GPU-accelerated workloads
CoreWeave provides cloud computing services that focus on GPU-accelerated workloads, which are essential for tasks requiring high computational power. Its services cater to industries such as artificial intelligence, machine learning, visual effects rendering, and data processing. Clients can access powerful computing resources on a pay-as-you-go basis, allowing them to avoid the costs of purchasing expensive hardware. CoreWeave's infrastructure uses a bare metal serverless Kubernetes platform, which enhances performance while minimizing operational complexity for users. This setup lets clients choose from a variety of NVIDIA GPUs so they can balance performance and cost effectively. The company's goal is to offer flexible and scalable computing solutions that meet the demands of diverse clients, from tech companies to film studios.