Tech Mahindra (formerly Mahindra Satyam)
Location
Secunderabad | India
Job description
Shift Timing- General
Exp range- 8+ years
Band- U4
Location- Hyderabad, hybrid
Primary Responsibilities
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SREs ensure that managed service offerings and customer deployments have reliability and uptime appropriate to users' needs and a fast rate of improvement, while monitoring and validating capacity and performance. The role focuses on reliability, scalability, and the development of automation to manage repetitive tasks at scale.
Knowledge & Skills
- In-depth knowledge of SRE practices and concepts such as SLA, SLO, SLI, error budgets, toil elimination, and post-mortems.
- Experience with Terraform is mandatory.
- Should have experience with monitoring and observability tools: Prometheus, Grafana, Elasticsearch, Logstash, Kibana, Splunk, Dynatrace, GCP operations suite, Azure Application Insights, or any log analytics tool.
- Should have an understanding and working knowledge of APM tools such as AppDynamics or Datadog, preferably AppDynamics.
- Should have experience with Infrastructure as Code: Terraform, Ansible, etc.
- Should have experience working with cloud-native applications and managing them effectively on GCP or Azure.
- Should have experience creating pipelines in CI/CD tools such as GitHub Actions, Azure DevOps, and Jenkins, preferably Scripted Pipelines.
- Should have knowledge of version control tools like Git, Bitbucket etc.
- Good to have: knowledge of a scripting language such as PowerShell, Python, or Bash.
- Responsible for ensuring the availability, performance, and scalability of a website or application.
- Knowledge of containerization and orchestration: Docker, Kubernetes, Docker Compose, and writing Dockerfiles.
- Involved in capacity planning and performance tuning to ensure that the site can handle increased traffic without issue.
- Work closely with developers to identify and fix potential issues before they cause problems for users.
- Deep understanding of how distributed systems work to be able to troubleshoot and optimize them.
- Deep understanding of how different types of databases work to be able to effectively troubleshoot any issues that may arise.
- Ability to communicate clearly and concisely about system alerts or outages to other members of your team.
- Please note: beyond the requirements in this job description, the customer is looking for a candidate who can mature their SRE practice across the division and who is comfortable being a champion and leader in the SRE space.