JobNob

Principal Engineer, Platform Reliability Engineering


Arcesium


Location

Hyderabad | India


Job description

We are looking for an experienced Principal Engineer to implement a new monitoring tool for the firm. The ideal candidate will have a strong background in SRE principles and practices, along with strong knowledge of and experience in maintaining monitoring frameworks for large-scale organizations. The engineer will be responsible for evaluating monitoring tools, understanding the scale of Arcesium, proposing a cost-effective and reliable monitoring framework, and managing the system end to end. The SRE team is responsible for monitoring the stability and availability of mission-critical production systems, managing incidents for quicker resolution, and establishing BAU. The team also builds tools and infrastructure used by all development teams to assist in monitoring and troubleshooting. This position is for HYD/BLR.

What You'll Do

Design, develop, and implement scalable and reliable monitoring solutions for distributed systems at scale.
Define and implement monitoring requirements in collaboration with cross-functional teams.
Lead the development of monitoring architectures and strategies.
Integrate monitoring tools into existing infrastructure.
Maintain and support monitoring systems.
Demonstrate strong technical breadth/depth, driving innovation, evaluating new technologies, and deciphering the technical vision for engineering teams.
Own key contributions to technical design and architecture decisions, considering trade-offs of choices, managing risk, making decisions independently where appropriate, and presenting reasoned options for decision making by others.
Lead the way by writing exemplary code, documentation, and RFCs.
Identify, propose, develop, deploy, and own R&D projects in accordance with the technical vision and needs of the team, turning problem statements into solutions, and operating independently as needed.

What You'll Need

10+ years of experience in SRE or a related field.
Proven experience in designing, developing, and implementing monitoring solutions.
Deep understanding of monitoring technologies and tools, including Prometheus, Grafana, Loki, and Tempo.
Experience with cloud-based monitoring systems, such as New Relic, Datadog, and Grafana Cloud.
Experience with log analysis tools, such as Splunk, Logstash, Fluent, and Sumo Logic.
Experience implementing distributed tracing with OpenTelemetry and Jaeger.
Strong understanding of SRE principles and practices.
Experience with incident response and management.
Exposure to chaos engineering and other reliability practices, including disaster recovery, is good to have.
Experience with cloud computing platforms such as AWS.
Experience with Kubernetes.
Experience with Agile practices (Scrum).
Excellent analytical, problem-solving, and troubleshooting skills.
Excellent communication and presentation skills.
Experience managing and mentoring engineers.
Ability to work independently and as part of a team.

The Company offers excellent benefits, an informal and collegial working environment, and an attractive compensation package.

Members of the Arcesium Company Group do not discriminate in employment matters on the basis of sex, race, color, caste, creed, religion, pregnancy, national origin, age, military service eligibility, veteran status, sexual orientation, marital status, disability, or any other protected class.

