4+ years of DevOps, AIOps, or infrastructure engineering experience, preferably with 2+ years in AI/ML environments
Hands-on experience with cloud-native services (AWS Bedrock/SageMaker, GCP Vertex AI, or Azure ML) and GPU infrastructure management
Strong skills in CI/CD tools (GitHub Actions, ArgoCD, Jenkins) and configuration management (Ansible, Helm, etc.)
Proficiency in scripting languages such as Python and Bash (Go or similar is a plus)
Experience with monitoring, logging, and alerting systems for AI/ML workloads
Deep understanding of Kubernetes and container lifecycle management
Ability to work with a high level of initiative, accuracy, and attention to detail
Ability to prioritize multiple assignments effectively and meet established deadlines
Ability to successfully, efficiently, and professionally interact with staff and customers
Excellent organizational skills and the ability to think critically about moderately to highly complex problems
Flexibility in meeting the business needs of the customer and the company
Ability to work creatively and independently with latitude and minimal supervision
Ability to utilize experience and judgment in accomplishing assigned goals
Experience navigating organizational structures
Responsibilities
Design and implement CI/CD pipelines for AI and ML model training, evaluation, and RAG system deployment (including LLMs, vector databases, embedding and reranking models, governance and observability systems, and guardrails)
Provision and manage AI infrastructure across cloud hyperscalers (AWS/GCP), using infrastructure-as-code tools (strong preference for Terraform)
Maintain containerized environments (Docker, Kubernetes) optimized for GPU workloads and distributed compute
Support vector database, feature store, and embedding store deployments (e.g., pgVector, Pinecone, Redis, Featureform, MongoDB Atlas)
Monitor and optimize performance, availability, and cost of AI workloads, using observability tools (e.g., Prometheus, Grafana, Datadog, or managed cloud offerings)
Collaborate with data scientists, AI/ML engineers, and other members of the platform team to ensure smooth transitions from experimentation to production
Implement security best practices including secrets management, model access control, data encryption, and audit logging for AI pipelines
Support the deployment and orchestration of agentic AI systems (LangChain, LangGraph, CrewAI, Copilot Studio, AgentSpace, etc.)