Principal II, Observability Engineer (SRE)

Job Locations: US-CA-Torrance
ID: 2024-14765
Category: Global Technology Services
Position Type: Regular Full-Time

Overview

THE ROLE:

The Observability Principal II Engineer will work a hybrid schedule and must be onsite in Torrance as needed. This role is responsible for leading the design, implementation, and optimization of observability solutions across the organization, ensuring end-to-end transparency into application performance, system health, and user experience. The Observability Principal II Engineer will focus on monitoring, alerting, and logging frameworks, ensuring that teams have the tools and data they need to identify and resolve issues quickly and efficiently.

This role will drive the adoption of industry-leading observability platforms like Dynatrace, Splunk, and Prometheus, providing real-time insights into system behavior across hybrid and multi-cloud environments. The Observability Principal II Engineer will work closely with development, operations, and security teams to establish monitoring strategies that optimize performance, reliability, and customer experience.

 

DETAILED RESPONSIBILITIES/DUTIES:

  • Design and implement observability frameworks to provide full transparency into the performance and reliability of systems, applications, and infrastructure.
  • Lead the deployment and optimization of monitoring and observability tools, including Dynatrace, Splunk, Prometheus, Grafana, and other relevant technologies.
  • Collaborate with development and operations teams to build comprehensive monitoring and alerting systems that ensure real-time detection of issues.
  • Develop and maintain dashboards and reporting systems to monitor system health, performance metrics, and key indicators.
  • Ensure integration of observability solutions with CI/CD pipelines to provide feedback and insights throughout the deployment process.
  • Manage and refine alerting strategies to minimize false positives while ensuring rapid response to real incidents.
  • Perform root cause analysis using observability data to improve system resilience and prevent recurring issues.
  • Continuously evaluate and improve logging, tracing, and metric collection methodologies to ensure accurate data for diagnostics and optimization.
  • Drive the implementation of SLOs (Service Level Objectives) and SLIs (Service Level Indicators) to ensure the availability and performance of critical systems.
  • Provide guidance and mentorship to engineering teams on best practices for observability and monitoring.
  • Collaborate with security teams to ensure that observability data meets compliance and security standards, enabling fast detection of anomalies or threats.

Qualifications

SKILLS AND BACKGROUND REQUIRED TO BE SUCCESSFUL: 

  • Proven experience in designing and implementing observability solutions using tools like Dynatrace, Splunk, Prometheus, Grafana, or ELK Stack.
  • Deep understanding of monitoring, logging, and tracing practices in hybrid and multi-cloud environments (Azure, AWS, GCP).
  • Expertise in creating and optimizing dashboards, alerts, and reports for monitoring performance and system health.
  • Experience with log management and analysis tools such as Splunk or Elasticsearch for real-time data analysis and troubleshooting.
  • Proven understanding of distributed tracing methodologies (e.g., OpenTelemetry, Jaeger, Zipkin) to diagnose performance bottlenecks and improve system reliability.
  • Knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible to automate the deployment of monitoring and observability solutions.
  • Proficient in scripting and automation using Python, Bash, or Go for monitoring and alerting infrastructure.
  • Strong understanding of SLOs and SLIs to ensure reliability and performance objectives are met.
  • Ability to work in Agile and DevOps environments, ensuring seamless integration of observability into development workflows.

Experience:

  • 8+ years of experience in IT, with a focus on monitoring, observability, or performance engineering.
  • Extensive experience with observability tools like Dynatrace and Splunk, including setup, customization, and optimization for large-scale environments.
  • Proficiency in building and maintaining complex dashboards, alerts, and automated monitoring systems in cloud-native and hybrid environments.
  • Hands-on experience with logging, metrics, and tracing frameworks, ensuring the end-to-end observability of systems.
  • Strong understanding of cloud infrastructure, including AWS, Azure, and GCP, and how to implement observability across cloud platforms.
  • Experience with monitoring containerized applications on Kubernetes using tools like Prometheus, ensuring performance at scale.
  • Proven ability to perform root cause analysis and performance tuning using observability data.

Certificates / Training Preferred:

  • Certifications in relevant observability tools such as Dynatrace Certified Associate, Splunk Core Certified Power User, or Prometheus certifications.
  • Cloud certifications like AWS Certified Solutions Architect, Azure Solutions Architect Expert, or Google Cloud Professional Cloud Architect.

Education:

  • Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.

US Benefits Statement

Herbalife offers a variety of benefits to eligible employees in the U.S. (limited to the 50 States and the District of Columbia), which include Group Health Programs, other Voluntary Benefit Programs, and Paid Time Off. Group Health Programs include Medical, Dental, Vision, Health Savings Account (HSA), Flexible Spending Accounts (FSA), Basic Life/AD&D, Short-Term and Long-Term Disability, and an Employee Assistance Program (EAP). Other Voluntary Benefit Programs include a 401(k) plan, Wellness Incentive Program, Employee Stock Purchase Plan (ESPP), Supplemental Life/Critical Illness/Hospitalization/Accident Insurance, and Pet Insurance. Paid Time Off includes Company-observed U.S. Holidays, Floating Holidays, Vacation, Sick Time, a Volunteer Program, Paid Maternity and Paternity Leave, Bereavement Leave, Personal Leave, and time off for voting.
