CoreWeave is seeking a Staff Software Engineer, Observability, to join its team in Sunnyvale, CA. The role involves leading efforts to build, maintain, and optimize highly scalable, reliable, and secure observability systems.
About the Role
As a Staff Software Engineer, you will lead the Observability team responsible for deploying and maintaining critical infrastructure, including logging, tracing, and metrics platforms. Your key responsibilities will include mentoring engineers, scaling observability platforms, developing monitoring and alerting systems, advising on best practices, and managing production clusters.
About You
Required:
7+ years of experience in Software Engineering, Site Reliability Engineering, DevOps, or a related field.
Deep expertise across all observability pillars, using tools such as ClickHouse, Elastic, Loki, VictoriaMetrics, Prometheus, Thanos, and/or Grafana.
Expertise in Kubernetes, containerization, and microservices architectures.
Proven track record of leading incident management and post-mortem analysis.
Excellent problem-solving, analytical, and communication skills.
Preferred:
Experience running and scaling observability tools at a cloud provider.