THE ROLE:
We are seeking a highly experienced Principal II, Site Reliability Engineer (SRE) to lead the strategy and
execution of reliability engineering across Herbalife’s global platforms. This role focuses on building and
scaling resilient, observable systems, advancing multi-cloud operations, and embedding reliability and
automation practices across engineering teams. You will define standards, drive adoption of
modern infrastructure practices, and ensure that our services deliver performance, availability, and reliability
at scale.
HOW YOU WOULD CONTRIBUTE:
• Architect resilient platforms and tooling across Azure and GCP, using Kubernetes, serverless
technologies, and infrastructure as code.
• Drive observability and monitoring practices with Dynatrace, Splunk, and OpenTelemetry,
establishing metrics, tracing, alerting, and actionable dashboards.
• Design and implement GitOps workflows for consistent, auditable, and secure infrastructure and
application deployments.
• Lead infrastructure automation with Terraform and related tooling to enable scalable, self-service
provisioning and governance.
• Define and enforce SLOs, SLIs, and error budgets to measure and improve system reliability and
customer experience.
• Develop operational standards and runbooks for incident response, disaster recovery, and
performance management.
• Partner with application and infrastructure teams to ensure reliability, scalability, and cost-efficiency
are built into every layer of the stack.
• Mentor and influence engineering teams to adopt modern SRE practices and drive a culture of
operational excellence.
WHAT’S SPECIAL ABOUT THE TEAM:
The SRE team is evolving to expand its scope beyond traditional operations, embedding observability,
automation, and cloud-native practices across Herbalife’s platform. Our mission is to ensure production
systems are resilient, observable, and scalable, while enabling application teams to move quickly with
confidence in Azure, GCP, and hybrid environments.
SKILLS AND BACKGROUND REQUIRED TO BE SUCCESSFUL:
• 7+ years of engineering or SRE experience with modern distributed systems.
• Proficiency in at least one modern programming language (Python, Go, Java, etc.).
• Deep knowledge of observability and monitoring with Dynatrace, Splunk, and log/metrics pipelines.
• Strong hands-on experience with multi-cloud environments (Azure + GCP), Kubernetes, and
serverless platforms.
• Proven expertise with GitOps practices and Terraform (IaC) for automation, scalability, and
governance.
• Experience defining SLOs, SLIs, and error budgets and embedding them into production systems.
• Strong background in incident response, postmortems, and operational excellence.
• Ability to mentor, guide, and influence technical and business collaborators.
EDUCATION:
• Bachelor’s Degree in Computer Science, Engineering, or related field required.