[Remote] Senior Cluster Site Reliability Engineer at The Voleon Group

Berkeley, California, United States

Compensation: Not Specified
Experience Level: Senior (5 to 8 years)
Job Type: Full Time
Visa: Unknown
Industries: Financial Technology, AI & Machine Learning

Requirements

  • 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
  • Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
  • Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
  • Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
  • Experience with cloud infrastructure (AWS or GCP)
  • Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
  • Experience with distributed storage technologies (Lustre, Ceph, S3)
  • Bachelor's degree in computer science or equivalent experience

Responsibilities

  • Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
  • Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
  • Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
  • Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
  • Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
  • Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability
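
For readers unfamiliar with uptime "measured in multiple nines", the arithmetic behind such a target can be sketched as follows. This is an illustrative sketch only, not Voleon's tooling: the function names, the 30-day window, and the example figures are all assumptions made for this example.

```python
def downtime_budget_minutes(nines: int, period_days: float = 30.0) -> float:
    """Allowed downtime, in minutes, for an availability target of `nines` nines.

    Three nines means 99.9% availability, i.e. 0.1% allowed unavailability.
    The 30-day default window is an assumption for illustration.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001
    return period_days * 24 * 60 * unavailability


def measured_availability(total_minutes: float, down_minutes: float) -> float:
    """Fraction of the window during which the cluster was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 1.0 - down_minutes / total_minutes


if __name__ == "__main__":
    # Print the downtime budget for two through five nines over 30 days.
    for n in range(2, 6):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes per 30 days")
```

Comparing `measured_availability` for a window against the target is one simple way to track whether an SLA was met; real tracking would draw the downtime figure from monitoring data rather than a hand-entered number.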

Skills

Key technologies and capabilities for this role

IaC, Automation, SRE, Monitoring, Telemetry, Problem-Solving, Cluster Management, High-Performance Computing (HPC)

Questions & Answers

Common questions about this position

What experience level is required for this Senior Cluster Site Reliability Engineer role?

The position requires 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead.

What technical skills and tools are needed for this position?

Candidates need knowledge of HPC/batch compute frameworks like Slurm, Kueue, AWS/GCP Batch, or ML systems like Kubeflow, MLflow, Horovod; ability to script in Python, Ruby, etc.; and familiarity with IaC tools like Terraform and Ansible.

Is this a remote position, or is there a location requirement?

This information is not specified in the job description.

What is the salary or compensation for this role?

This information is not specified in the job description.

What does the Cluster Operations team do, and how does this role fit in?

The Cluster Operations team triages and mitigates real-time operational issues, and as a Senior SRE, you will be an integral member solving day-to-day issues with urgency while engineering systemic improvements.

The Voleon Group

Investment management using machine learning algorithms

About The Voleon Group

Voleon focuses on investment management by utilizing machine learning to analyze financial market trends. The firm uses advanced statistical models to process large datasets and identify patterns that inform investment decisions, setting it apart from traditional methods that rely on human intuition. Voleon serves institutional clients and operates on a performance-based fee structure, aligning its interests with those of its clients. The company's goal is to provide data-driven insights that optimize investment returns while adapting to changing market conditions.

Headquarters: Berkeley, California
Year Founded: 2007
Company Stage: VENTURE_UNKNOWN
Industries: Quantitative Finance, Financial Services
Employees: 51-200

Benefits

Health Insurance
Dental Insurance
Vision Insurance
Life Insurance
Paid Vacation
Paid Sick Leave
401(k) Retirement Plan
401(k) Company Match

Risks

Competition from other quantitative hedge funds may erode Voleon's market share.
Regulatory scrutiny on AI use in finance could increase compliance costs for Voleon.
Data quality issues could lead to inaccurate predictions and financial losses for Voleon.

Differentiation

Voleon uses machine learning for data-driven financial market predictions.
The firm serves institutional clients with a focus on scalability and risk management.
Voleon's academic approach emphasizes intellectual rigor and continuous learning.

Upsides

Increased interest in ESG investing offers new opportunities for Voleon's models.
Alternative data sources enhance predictive models for quantitative hedge funds like Voleon.
Cloud computing enables efficient scaling of Voleon's data processing capabilities.
