HNM Solutions
Location
Puducherry | India
Job description
Site Reliability Engineer - Lead
Minimum of 12 years of experience in IT, with at least 8 years in monitoring.
 The ideal candidate should have a strong background in software engineering, 
 monitoring, and operations, with a focus on ensuring the reliability, performance, and 
 scalability of our web applications. 
 Skills: 
• Strong understanding of modern single-page web applications built with 
 Angular/React, Node.js, etc., and of mobile applications. 
• Deep knowledge of monitoring and observability tools (e.g., Dynatrace, 
 Prometheus, Grafana, ELK stack, Datadog, AppDynamics, New Relic, 
 etc.) 
• Familiarity with configuration management tools (Ansible, Puppet, etc.) 
 and shell scripting 
• AWS Cloud: VPC, subnets, network access control lists, security groups, 
 EC2 instances, S3 buckets, IAM, Route 53, Lambda. 
• Experience with containerization and virtualization technologies such as Docker, Kubernetes, and virtual machines. 
• Strong knowledge of SRE principles and their application to implementing 
 monitoring. 
 Responsibilities: 
1. Monitoring and Alerting: 
• Implement and manage monitoring solutions to track the health and 
 performance of services. 
• Proactively monitor application stability. 
• Set up alerting and automated responses to minimize downtime. 
• Perform root cause analysis and manage incidents for issue resolution. 
• Monitor system performance, identify bottlenecks, and collaborate on 
 optimizations. 
2. Service Reliability: 
• Ensure the reliability and availability of our web applications by setting 
 and meeting Service Level Objectives (SLOs). 
• Collaborate with development teams to improve the overall reliability 
 of applications and services. 
3. Automation: 
• Develop and maintain automation scripts and tools for repetitive 
 operational tasks. 
4. Product Continuous Improvement: 
• Maintain open communication with the Product Owner for product 
 alignment. 
• Ensure SRE tasks align with the product's strategic goals. 
• Participate in backlog refinement meetings to prioritize SRE-related 
 work items. 
• Identify, document, and communicate defects and improvement 
 opportunities. 
5. Capacity Planning: 
• Conduct capacity planning to ensure that systems can handle expected 
 loads. 
• Analyze data and predict future resource requirements, scaling systems 
 as needed. 
6. Incident Response: 
• Participate in an on-call rotation to respond to incidents and outages 
 promptly. 
• Follow incident management procedures and conduct post-incident 
 reviews. 
7. Change Management: 
• Assess risks associated with changes to the production environment. 
• Coordinate and execute deployments, ensuring rollback plans are in 
 place. 
8. Performance Analysis: 
• Analyze performance bottlenecks and work on optimizing systems for 
 efficiency and cost-effectiveness. 
9. Documentation: 
• Maintain comprehensive documentation for systems, processes, and 
 procedures. 
10. Collaboration: 
• Work closely with cross-functional teams, including development, 
 operations, and security, to achieve common goals. 
• Foster a culture of reliability within the organization. 
11. Other: 
• Execute releases and contribute to the deployment process. 
• Provide on-call support.