Principal Engineer - PSRE
Location
Bangalore | India
Job description
Team Summary
We are looking for an experienced Principal Engineer to implement a new monitoring tool for the firm. The ideal candidate will have a strong background in SRE principles and practices, along with strong knowledge of and experience in maintaining monitoring frameworks for large-scale organizations. The Engineer will be responsible for evaluating monitoring tools, understanding the scale of Arcesium, proposing a cost-effective and reliable monitoring framework, and managing the system end to end. The SRE team is responsible for monitoring the stability and availability of mission-critical production systems, managing incidents for quicker resolution, and establishing BAU. The team also builds tools and infrastructure used by all development teams to assist in monitoring and troubleshooting.
What You'll Do
- Design, develop, and implement scalable and reliable monitoring solutions for distributed systems at scale.
- Define and implement monitoring requirements in collaboration with cross-functional teams.
- Lead the development of monitoring architectures and strategies.
- Integrate monitoring tools into existing infrastructure.
- Maintain and support monitoring systems.
- Demonstrate strong technical breadth/depth, driving innovation, evaluating new technologies, and deciphering the technical vision for engineering teams.
- Own key contributions to technical design and architecture decisions, considering trade-offs of choices, managing risk, making decisions independently where appropriate, and presenting reasoned options for decision making by others.
- Lead the way by writing exemplary code, documentation, and RFCs.
- Identify, propose, develop, deploy, and own R&D projects in accordance with the technical vision and needs of the team, turning problem statements into solutions, and operating independently as needed.
What You'll Need
- 10+ years of experience in SRE or a related field.
- Proven experience in designing, developing, and implementing monitoring solutions.
- Deep understanding of monitoring technologies and tools, including Prometheus, Grafana, Loki, and Tempo.
- Experience with cloud-based monitoring systems, such as New Relic, Datadog, and Grafana Cloud.
- Experience with log analysis tools, such as Splunk, Logstash, Fluent, and Sumo Logic.
- Experience implementing distributed tracing using OpenTelemetry and Jaeger.
- Strong understanding of SRE principles and practices.
- Experience with incident response and management.
- Reliability: Exposure to Chaos Engineering and other reliability practices, including disaster recovery, is good to have.
- Experience with cloud computing platforms such as AWS.
- Experience with Kubernetes.
- Experience in Agile practices (Scrum).
- Excellent analytical, problem-solving, and troubleshooting skills.
- Excellent communication and presentation skills.
- Experience managing and mentoring engineers.
- Ability to work independently and as part of a team.
Arcesium and its affiliates do not discriminate in employment matters on the basis of race, color, religion, gender, gender identity, pregnancy, national origin, age, military service eligibility, veteran status, sexual orientation, marital status, disability, or any other category protected by law. Note that for us, this is more than just a legal boilerplate. We are genuinely committed to these principles, which form an important part of our corporate culture, and are eager to hear from extraordinarily well qualified individuals having a wide range of backgrounds and personal characteristics.
Arcesium's Personal Data Privacy Notice for Candidates is linked at the bottom of this page.