[Remote] Senior Cluster Site Reliability Engineer at The Voleon Group

Berkeley, California, United States

Compensation: Not specified
Experience Level: N/A
Job Type: N/A
Visa: Not specified
Industries: N/A

Requirements

  • 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
  • Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
  • Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
  • Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
  • Experience with cloud infrastructure (AWS or GCP)
  • Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
  • Experience with distributed storage technologies (Lustre, Ceph, S3)
  • Bachelor's degree in computer science or equivalent experience

Responsibilities

  • Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
  • Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
  • Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
  • Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
  • Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
  • Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

The Voleon Group

Investment management using machine learning algorithms

About The Voleon Group

Voleon focuses on investment management by utilizing machine learning to analyze financial market trends. The firm uses advanced statistical models to process large datasets and identify patterns that inform investment decisions, setting it apart from traditional methods that rely on human intuition. Voleon serves institutional clients and operates on a performance-based fee structure, aligning its interests with those of its clients. The company's goal is to provide data-driven insights that optimize investment returns while adapting to changing market conditions.

Headquarters: Berkeley, California
Year Founded: 2007
Company Stage: VENTURE_UNKNOWN
Industries: Quantitative Finance, Financial Services
Employees: 51-200

Benefits

Health Insurance
Dental Insurance
Vision Insurance
Life Insurance
Paid Vacation
Paid Sick Leave
401(k) Retirement Plan
401(k) Company Match

Risks

Competition from other quantitative hedge funds may erode Voleon's market share.
Regulatory scrutiny on AI use in finance could increase compliance costs for Voleon.
Data quality issues could lead to inaccurate predictions and financial losses for Voleon.

Differentiation

Voleon uses machine learning for data-driven financial market predictions.
The firm serves institutional clients with a focus on scalability and risk management.
Voleon's academic approach emphasizes intellectual rigor and continuous learning.

Upsides

Increased interest in ESG investing offers new opportunities for Voleon's models.
Alternative data sources enhance predictive models for quantitative hedge funds like Voleon.
Cloud computing enables efficient scaling of Voleon's data processing capabilities.
