Site Reliability Engineer

HNM Solutions

Location

Puducherry | India

Job description

Site Reliability Engineer (Lead): minimum of 12 years of experience in IT, with at least 8 years in
monitoring.
The ideal candidate should have a strong background in software engineering,
monitoring, and operations, with a focus on ensuring the reliability, performance, and
scalability of our web applications.
Skills:
• Strong understanding of modern single-page web applications
(Angular/React, Node.js, etc.) and mobile applications.
• Deep knowledge of monitoring and observability tools (e.g., Dynatrace,
Prometheus, Grafana, ELK stack, Datadog, AppDynamics, New Relic,
etc.)
• Familiarity with configuration management tools (Ansible, Puppet, etc.)
and shell scripting
• AWS Cloud: VPC, subnets, network access control lists, security groups,
EC2 instances, S3 buckets, IAM, Route 53, Lambda.
• Experience with containerization and virtualization technologies such as
Docker, Kubernetes, and virtual machines.
• Strong knowledge of SRE principles and their application to implementing
monitoring.
Responsibilities:
1. Monitoring and Alerting:
• Implement and manage monitoring solutions to track the health and
performance of services.
• Proactively monitor application stability.
• Set up alerting and automated responses to minimize downtime.
• Perform root cause analysis and manage incidents for issue resolution.
• Monitor system performance, identify bottlenecks, and collaborate on
optimizations.
2. Service Reliability:
• Ensure the reliability and availability of our web applications by setting
and meeting Service Level Objectives (SLOs).
• Collaborate with development teams to improve the overall reliability
of applications and services.
3. Automation:
• Develop and maintain automation scripts and tools for repetitive
operational tasks.
4. Product Continuous Improvement:
• Maintain open communication with the Product Owner for product
alignment.
• Ensure SRE tasks align with the product's strategic goals.
• Participate in backlog refinement meetings to prioritize SRE-related
work items.
• Identify, document, and communicate defects and improvement
opportunities.
5. Capacity Planning:
• Conduct capacity planning to ensure that systems can handle expected
loads.
• Analyze data and predict future resource requirements, scaling systems
as needed.
6. Incident Response:
• Participate in an on-call rotation to respond to incidents and outages
promptly.
• Follow incident management procedures and conduct post-incident
reviews.
7. Change Management:
• Assess risks associated with changes to the production environment.
• Coordinate and execute deployments, ensuring rollback plans are in
place.
8. Performance Analysis:
• Analyze performance bottlenecks and work on optimizing systems for
efficiency and cost-effectiveness.
9. Documentation:
• Maintain comprehensive documentation for systems, processes, and
procedures.
10. Collaboration:
• Work closely with cross-functional teams, including development,
operations, and security, to achieve common goals.
• Foster a culture of reliability within the organization.
11. Other:
• Execute releases and contribute to the deployment process.
• Provide on-call support.

JobNob HQ Address

1 E Broad St
Ste 130 - 1252
Bethlehem, PA 18018-5934
United States
