Technical Support Engineer at Alluxio

Beijing, Beijing, China

Compensation: Not Specified
Experience Level: Mid-level (3 to 4 years)
Job Type: Full Time
Visa: Unknown
Industries: Technology, AI/ML

Requirements

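  • Bachelor's degree or above in computer science or a related field.
  • 3+ years of operations or SRE experience with large-scale distributed systems.
  • Proficiency in the Linux operating system and networking principles (TCP/IP, DNS, load balancing).
  • Extensive experience with containerization and orchestration tools, especially Kubernetes.
  • Familiarity with at least one mainstream programming language (such as Python, Go, Java, or Shell) and the ability to write automation scripts.
  • Familiarity with the basic methodology of monitoring metrics and alert design, plus hands-on experience with observability tools (such as Prometheus, Grafana, or the ELK Stack).
  • Excellent troubleshooting skills, with the ability to analyze complex problems systematically.
  • Excellent communication and teamwork skills, with the ability to hold effective technical discussions in both Chinese and English with global teams and customers.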

Preferred qualifications

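  • Project delivery experience in a B2B business model; willingness to work on-site at customer locations, with prior on-site experience.
  • Project management experience, including requirements management, customer expectation management, and cross-team coordination.
  • Experience operating AI/ML infrastructure, with familiarity with mainstream AI frameworks (TensorFlow, PyTorch) and GPU resource management.
  • Experience operating or using the big data ecosystem (Hadoop, Spark, Presto/Trino).
  • Familiarity with domestic and international public cloud platforms (AWS, GCP, Azure, Alibaba Cloud, Tencent Cloud, Baidu Cloud, Volcano Engine, Huawei Cloud) and object storage (S3, GCS, OSS).
  • Familiarity with Java Virtual Machine (JVM) performance tuning.
  • Experience using or operating Alluxio, and experience contributing to open-source communities.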

Responsibilities

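  • Deployment and operations: Deploy, configure, and optimize Alluxio for AI/ML workloads in customers' hybrid-cloud or multi-cloud environments (such as Kubernetes and Hadoop YARN), and build and maintain highly available Alluxio clusters.
  • Monitoring and performance tuning: Analyze performance bottlenecks in interactions with AI frameworks (such as TensorFlow, PyTorch, Spark), and recommend tuning for Alluxio, the JVM, the network, and storage systems.
  • Complex issue resolution: Quickly diagnose, locate, and resolve problems in customers' production environments. For complex issues, bring in sales, product, and engineering, drive cross-department collaboration, deliver the final root cause analysis, push the fix through, and maintain customer satisfaction.
  • Customer support and collaboration: With customer success as the goal, work closely with customers' technical teams and with internal product and engineering teams to drive architecture and product improvements.
  • Knowledge capture and automation: Summarize lessons learned, write operations runbooks and best-practice documents, and develop automation tools and scripts to improve operational efficiency.
  • On-call support (24/7): Participate in the on-call rotation, resolve technical issues customers encounter while using the product, and safeguard the SLAs of core services.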

Skills

Key technologies and capabilities for this role

Kubernetes, Linux, Python, Go, Java, Shell, Prometheus, Grafana, ELK Stack, TensorFlow, PyTorch, Spark, Hadoop, JVM, AWS, GCP, Azure, TCP/IP, DNS

Questions & Answers

Common questions about this position

What are the required qualifications for this Technical Support Engineer role?

Candidates need a bachelor's degree or above in computer science or a related field, 3+ years of operations or SRE experience with large-scale distributed systems, proficiency in Linux and networking principles, extensive experience with containerization and orchestration tools (especially Kubernetes), familiarity with at least one programming language such as Python, Go, Java, or Shell for automation scripting, and experience with observability tools such as Prometheus, Grafana, or the ELK Stack. The posting also asks for strong troubleshooting skills and the ability to communicate technically in both Chinese and English.

Is this a remote position or does it require on-site work?

This information is not specified in the job description.

What is the compensation or salary for this role?

This information is not specified in the job description.

What does the on-call schedule look like for this position?

The role requires participation in a 24/7 on-call rotation to resolve technical issues encountered by customers and to safeguard core service SLAs.

What experiences make a candidate stand out for this role?

Priority is given to candidates with B2B project delivery and on-site experience, project management skills, AI/ML infrastructure operations experience with frameworks like TensorFlow or PyTorch, big data ecosystem experience with Hadoop, Spark, or Presto/Trino, public cloud and object storage knowledge, JVM performance tuning, and prior Alluxio usage or open-source contributions.

Alluxio

Data management solutions for AI workloads

About Alluxio

Alluxio.io focuses on optimizing data management for Artificial Intelligence (AI) and Machine Learning (ML) workloads. It offers two main products: Alluxio Enterprise Data and Alluxio Enterprise AI, which help businesses manage their data and AI tasks across various infrastructure setups. By providing a single interface, Alluxio simplifies the management of data silos, enhances performance, and reduces the complexity of handling different technology stacks. Its solutions can accelerate model training by 20 times and model serving by 10 times, while also maximizing the return on investment for infrastructure and achieving high GPU utilization. Alluxio's goal is to help businesses improve efficiency and performance in their AI and ML operations by eliminating data copies and enabling seamless data access.

Headquarters: San Mateo, California
Year Founded: 2015
Total Funding: $79.3M
Company Stage: Series C
Industries: Data & Analytics, AI & Machine Learning
Employees: 51-200

Benefits

Unlimited PTO
Amazing Medical, Dental, and Vision Plans
Commuter Benefits ($50+)
Everyday Catered Lunch and Dinner
Boba Thursday
Relaxation/Massage Room Onsite
Gym Access
Frequent Company Outings and Trips

Risks

Increased competition could erode Alluxio's market share in AI data management.
Rapid technological advancements may render Alluxio's offerings obsolete if not updated.
Reliance on partnerships poses risks if expected synergies are not achieved.

Differentiation

Alluxio offers a memory-centric architecture for data storage and management.
The company provides a single interface for managing data and AI workloads.
Alluxio's solutions accelerate model training by 20 times and serving by 10 times.

Upsides

Growing demand for AI-optimized data platforms boosts Alluxio's market potential.
Recognition as a top open source project enhances Alluxio's brand reputation.
Partnership with NetApp expands Alluxio's market reach and collaborative opportunities.
