Senior Software Engineer - Distributed Inference
NVIDIA | Full Time
Senior (5 to 8 years)
Candidates should have experience running ML inference at large scale with high throughput and low latency, and a solid understanding of distributed systems and the challenges of ML inference. Familiarity with deep learning and frameworks such as PyTorch is also required. Bonus qualifications include ML systems knowledge, experience with Ray, close work with the community on LLM engines such as vLLM or TensorRT-LLM, and contributions to deep learning frameworks or compilers.
The Distributed LLM Inference Engineer will iterate quickly with product teams to ship end-to-end solutions for batch and online inference at high scale. The role spans the stack: integrating Ray Data with LLM engines, delivering optimizations for cost-effective large-scale ML inference, contributing improvements to open-source projects such as vLLM, and implementing and extending best practices from the latest state of the art in the open-source and research communities.
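The main throughput lever in batch LLM inference is grouping prompts into micro-batches so each engine call amortizes its overhead across many sequences. A minimal stdlib-only sketch of that batching pattern, where `run_model` is a hypothetical stand-in for a real engine call (e.g. a vLLM `generate()`), not part of any library named in this posting:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def run_model(batch):
    # Hypothetical stand-in for an LLM engine call: here we just
    # "complete" each prompt by appending a marker.
    return [prompt + " -> <completion>" for prompt in batch]

def batched(items, size):
    # Yield fixed-size micro-batches from an iterable of prompts.
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def batch_inference(prompts, batch_size=4, workers=2):
    # Send micro-batches to the engine, with a small worker pool
    # to keep multiple engine replicas busy; map() preserves order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(run_model, batched(prompts, batch_size))
    return [out for batch in results for out in batch]

prompts = [f"prompt-{i}" for i in range(10)]
print(batch_inference(prompts))
```

In a real deployment the worker pool and batching would be handled by Ray Data (e.g. `map_batches` over a dataset of prompts), with the engine running on GPU workers; this sketch only illustrates the batching structure.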
Platform for scaling AI workloads
Anyscale provides a platform designed to scale and productionize artificial intelligence (AI) and machine learning (ML) workloads. Its core technology, Ray, is an open-source framework that helps developers manage and scale AI applications across various fields, including generative AI, large language models (LLMs), and computer vision. Ray allows companies to improve the performance, fault tolerance, and scalability of their AI systems, with some users reporting over 90% improvements in efficiency, latency, and cost-effectiveness. Anyscale primarily serves clients in the AI and ML sectors, including major companies like OpenAI and Ant Group, who rely on Ray for training large models. The company operates on a software-as-a-service (SaaS) model, charging clients a subscription fee for access to its managed Ray platform. Anyscale's goal is to empower organizations to effectively scale their AI workloads and optimize their operations.