Search for Senior Site Reliability Engineer roles in Santa Clara, CA, in the AI and HPC industry, requiring 6+ years of experience, with skills in Python and Kubernetes, and full-time schedule.

Okay, I've searched for Senior Site Reliability Engineer roles in Santa Clara, CA within the AI and HPC industry, requiring 6+ years of experience, with Python and Kubernetes skills, and full-time schedule. I have found 23 positions for you.

Can you help me to analyze the job description and company culture?

Sure. Here's the analysis:

Job Analysis:

The Senior Site Reliability Engineer (SRE) role for NVIDIA's AI Research Clusters is fundamentally about ensuring that extremely large-scale GPU-accelerated computing systems operate with high reliability, efficiency, and peak performance. Beyond routine operations, the role demands leadership in designing and evolving the infrastructure that powers cutting-edge AI research. The candidate will confront complex distributed-system challenges and will need to balance proactive automation with sustainable incident management in mission-critical environments. Success means not only reducing downtime and outages but also enabling AI researchers to maximize their productivity and innovate faster. Core responsibilities such as cluster design, performance tuning, real-time monitoring, and incident resolution show an expectation of deep technical expertise combined with strategic foresight. The SRE must work collaboratively, often bridging hardware, software, and research teams, which requires strong cross-functional communication skills. Technical proficiency with AI/HPC schedulers like Slurm, containerization, configuration management tools (e.g. Ansible, BCM), and scripting (Python, Bash) underpins the ability to implement scalable solutions. In addition, the preference for knowledge of GPU-specific technologies (CUDA, NCCL), HPC storage, and networking (InfiniBand, RDMA) reflects the specialized environment. This role demands a mature problem solver who is comfortable navigating ambiguous, high-pressure situations while relying on blameless postmortems and continuous improvement. The posting asks for 6+ years of experience designing and operating large-scale compute infrastructure, including operational experience with a cluster of at least 5K GPUs, which signals a requirement for seasoned judgment and autonomy. Ultimately, thriving here means embedding a culture of reliability into some of the most sophisticated AI infrastructure globally, driving innovation in both tooling and operational excellence.

Company Analysis:

NVIDIA stands as a visionary leader at the intersection of AI, high-performance computing, and datacenter acceleration, having transformed multiple computing industries over decades. The company’s market position as an innovation powerhouse means its teams are often pioneering technology frontiers rather than following established paths, creating a dynamic and intellectually rigorous environment. The culture emphasizes diversity, curiosity, risk-taking in a blameless setting, and a learning mindset—all elements that foster creativity and resilience in problem-solving. For a Senior SRE, this means the scope of influence is significant, with visibility into highly impactful projects and the potential to shape foundational infrastructure that accelerates AI research worldwide. Given the complexity and scale of NVIDIA’s GPU clusters, the role sits at a nexus between engineering, research, and operations, requiring collaboration across multiple technical domains and external partners. The company’s commitment to equal opportunity and a growth mindset suggests a supportive environment with mentorship, yet one that demands high autonomy and accountability. This hire appears to be a strategic investment to sustain and scale NVIDIA’s AI research infrastructure, helping maintain their cutting-edge advantage. Candidates should be ready to navigate rapid technological evolution, bring innovative solutions, and align closely with the company’s mission to amplify human intelligence—meaning technical mastery goes hand in hand with a passion for meaningful impact.

Mock up some interview questions based on the job description.

Absolutely. Here are some mock interview questions that could come up:

Tell me about a time when you had to troubleshoot a complex failure in a large-scale distributed GPU cluster. How did you approach diagnosing and resolving the issue?
Describe a project where you automated a significant operational workflow in HPC or AI infrastructure. What tools did you use and what was the impact?
Can you share an example of a blameless postmortem you participated in? How did the team leverage the findings to improve system reliability?
How do you prioritize reactive incident response versus proactive system improvements in a high-demand research environment?
Explain your experience with AI/HPC job schedulers like Slurm. How have you optimized job scheduling and resource allocation in GPU clusters?
What strategies do you use for performance tuning and capacity planning in large-scale AI compute environments?
How would you design a monitoring and alerting system for a multi-thousand GPU cluster running deep learning workloads?
Walk me through how you would integrate container technologies (Docker, Enroot) into an HPC infrastructure supporting AI workflows.
Imagine a scenario where a key cluster upgrade fails mid-deployment affecting production workloads. How would you handle the incident and communication?
Suppose you identify a systemic bottleneck in GPU cluster utilization that’s slowing down research experiments. What steps would you take to diagnose and resolve it?
Why are you interested in working at NVIDIA, particularly within the GPU AI/HPC Infrastructure team?
How do your personal and professional values align with NVIDIA’s focus on diversity, collaboration, and continuous learning?
Given NVIDIA’s leadership in AI and HPC innovation, how do you see your role contributing to the company’s mission of amplifying human imagination and intelligence?
What challenges do you anticipate in working across multiple technical teams and researchers, and how would you foster effective collaboration?
Do you have any questions about the team structure, tools, or current challenges faced by the GPU AI/HPC Infrastructure group?

Job Details

Senior Site Reliability Engineer

$184k/y-$425.5k/y

RemoteWorker US

3 months ago

Highlight

Hybrid, Full-time, 6+ years, Senior

Summary

NVIDIA is seeking a Senior Site Reliability Engineer - AI Research Clusters to join their team in Santa Clara, CA. The role involves designing and implementing GPU compute clusters to support AI research.

About the Role

As a member of the GPU AI/HPC Infrastructure team, you will lead the design and implementation of GPU compute clusters, optimize cluster operations for reliability and performance, and drive automation to enhance researcher productivity. Responsibilities include troubleshooting system failures, scaling systems sustainably, and supporting production systems.

About You

Required:

Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience, with at least 6 years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments, including operational experience with a cluster of at least 5K GPUs.
Deep understanding of GPU computing and AI infrastructure.
Experience with advanced AI/HPC job schedulers, ideally including familiarity with schedulers such as Slurm.
Solid experience with GPU clusters, and working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure-level applications such as Kubernetes, Terraform, and MySQL.
In-depth understanding of container technologies such as Docker and Enroot.
Experience programming in Python and Bash scripting.

Preferred:

Interest in crafting, analyzing and fixing large-scale distributed systems.
Familiarity with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
Familiarity with InfiniBand, including IPoIB and RDMA.
Experience with Cloud Deployment, BCM, Terraform.
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
Familiarity with deep learning frameworks like PyTorch and TensorFlow.
Multi-cloud experience.

Benefits

Highly competitive salaries with a base salary range of $184k–$425.5k.
Eligibility for equity and benefits.
Opportunity to work with some of the most brilliant and talented people in the world.
Commitment to fostering a diverse work environment and being an equal opportunity employer.

RemoteWorker US

The Home of Remote Workers in the United States

We understand that outstanding performance begins with outstanding hiring, and this approach sits at the head of everything we do. Our focus is on excellent customer service, and on delivering an effective solution that targets the best talent more efficiently.

The friendly recruitment experts behind Remote Worker have many years’ experience in matching candidates to their perfect job. We are specialists in bringing recruiters and job seekers together successfully!

We have a huge range of jobs on offer right now. Join our talent database and receive job alerts today!

Santa Clara, CA