JobNob

Your Career. Our Passion.

Site Reliability Engineer


HNM Solutions


Location

Puducherry | India


Job description

Site Reliability Engineer (Lead): minimum of 12 years of experience in IT, with at least 8 years in
monitoring.
The ideal candidate should have a strong background in software engineering,
monitoring, and operations, with a focus on ensuring the reliability, performance, and
scalability of our web applications.
Skills
• Strong understanding of modern single-page web applications (Angular/React,
Node.js, etc.) and mobile applications.
• Deep knowledge of monitoring and observability tools (e.g., Dynatrace,
Prometheus, Grafana, ELK stack, Datadog, AppDynamics, New Relic,
etc.)
• Familiarity with configuration management tools (Ansible, Puppet, etc.)
and shell scripting
• AWS Cloud: VPC, subnets, network access control lists, security groups,
EC2 instances, S3 buckets, IAM, Route 53, Lambda.
• Experience with containerization and virtualization tools such as Docker,
Kubernetes, and virtual machines.
• Strong knowledge of SRE principles and their application to implementing
monitoring.
Responsibilities:
1. Monitoring and Alerting:
• Implement and manage monitoring solutions to track the health and
performance of services.
• Proactively monitor application stability.
• Set up alerting and automated responses to minimize downtime.
• Perform root cause analysis and manage incidents for issue resolution.
• Monitor system performance, identify bottlenecks, and collaborate on
optimizations.
2. Service Reliability:
• Ensure the reliability and availability of our web applications by setting
and meeting Service Level Objectives (SLOs).
• Collaborate with development teams to improve the overall reliability
of applications and services.
3. Automation:
• Develop and maintain automation scripts and tools for repetitive
operational tasks.
4. Product Continuous Improvement:
• Maintain open communication with the Product Owner for product
alignment.
• Ensure SRE tasks align with the product's strategic goals.
• Participate in backlog refinement meetings to prioritize SRE-related
work items.
• Identify, document, and communicate defects and improvement
opportunities.
5. Capacity Planning:
• Conduct capacity planning to ensure that systems can handle expected
loads.
• Analyze data and predict future resource requirements, scaling systems
as needed.
6. Incident Response:
• Participate in an on-call rotation to respond to incidents and outages
promptly.
• Follow incident management procedures and conduct post-incident
reviews.
7. Change Management:
• Assess risks associated with changes to the production environment.
• Coordinate and execute deployments, ensuring rollback plans are in
place.
8. Performance Analysis:
• Analyze performance bottlenecks and work on optimizing systems for
efficiency and cost-effectiveness.
9. Documentation:
• Maintain comprehensive documentation for systems, processes, and
procedures.
10. Collaboration:
• Work closely with cross-functional teams, including development,
operations, and security, to achieve common goals.
• Foster a culture of reliability within the organization.
11. Other:
• Execute releases and contribute to the deployment process.
• Provide on-call support.