THE ROLE:
The Observability Principal II Engineer will work a hybrid schedule, with the requirement to be onsite in Torrance as needed. This role is responsible for leading the design, implementation, and optimization of observability solutions across the organization, ensuring end-to-end visibility into application performance, system health, and user experience. The Observability Principal II Engineer will focus on monitoring, alerting, and logging frameworks, ensuring that teams have the tools and data needed to identify and resolve issues quickly and efficiently.
This role will drive the adoption of industry-leading observability platforms like Dynatrace, Splunk, and Prometheus, providing real-time insights into system behavior across hybrid and multi-cloud environments. The Observability Principal II Engineer will work closely with development, operations, and security teams to establish monitoring strategies that optimize performance, reliability, and customer experience.
DETAILED RESPONSIBILITIES/DUTIES:
- Design and implement observability frameworks to provide full visibility into the performance and reliability of systems, applications, and infrastructure.
- Lead the deployment and optimization of monitoring and observability tools, including Dynatrace, Splunk, Prometheus, Grafana, and other relevant technologies.
- Collaborate with development and operations teams to build comprehensive monitoring and alerting systems that ensure real-time detection of issues.
- Develop and maintain dashboards and reporting systems to monitor system health, performance metrics, and key indicators.
- Ensure integration of observability solutions with CI/CD pipelines to provide feedback and insights throughout the deployment process.
- Manage and refine alerting strategies to minimize false positives while ensuring rapid response to real incidents.
- Perform root cause analysis using observability data to improve system resilience and prevent recurring issues.
- Continuously evaluate and improve logging, tracing, and metric collection methodologies to ensure accurate data for diagnostics and optimization.
- Drive the implementation of SLOs (Service Level Objectives) and SLIs (Service Level Indicators) to ensure the availability and performance of critical systems.
- Provide guidance and mentorship to engineering teams on best practices for observability and monitoring.
- Collaborate with security teams to ensure that observability data meets compliance and security standards, enabling fast detection of anomalies or threats.