Senior Solutions Architect, HPC Systems Engineer
NVIDIA - Full Time
- Senior (5 to 8 years)
Candidates should have 5+ years of professional experience in Site Reliability Engineering, Linux systems engineering, or compute infrastructure roles, with strong proficiency in Linux kernel internals, including the scheduler, memory allocation, and driver subsystems. Experience with virtualization technologies such as KVM, Xen, QEMU, or VMware is required, along with familiarity with SmartNICs/DPUs (e.g., NVIDIA ConnectX-6/7, BlueField-3) and kernel bypass techniques. Expert-level skill in at least one programming language, such as C, Go, or Rust, is also required.
As a Senior Site Reliability Engineer, you will develop automation and observability tools to monitor Crusoe’s compute infrastructure, from the kernel to the orchestration layers, and you will support and scale the company’s virtualization stack, including KVM, QEMU, and other hypervisors. You will collaborate with Linux kernel and hardware teams to identify and resolve performance bottlenecks and driver issues and to optimize hardware offloads, with a focus on performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will also participate in root cause analysis of kernel crashes, hardware-software integration problems, and performance regressions; integrate hypervisor-level enhancements that improve guest VM reliability and workload isolation; and tune kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. You will work closely with platform teams to implement and validate support for emerging compute hardware.
Utilizes wasted energy for computing power
Crusoe Energy Systems Inc. provides digital infrastructure that uses wasted, stranded, or clean energy sources to power high-performance computing and artificial intelligence. The company serves clients in the technology and energy sectors with scalable computing solutions that aim to reduce greenhouse gas emissions and support the transition to cleaner energy. Crusoe converts excess natural gas and renewable energy into computing power, which allows it to maximize resource efficiency while minimizing environmental impact. Unlike many competitors, Crusoe specifically targets the intersection of energy and technology. It generates revenue by supplying computing resources, along with technical support, to enterprises that need significant computational power for applications like AI and machine learning.