Site Reliability Engineer

NCR Atleos

Location

Secunderabad | India

Job description

We are looking for a Site Reliability Engineer (SRE), initially focused on production AppOps, who can manage scalable systems using automation best practices that improve reliability and velocity, and who can enable monitoring of the operational health of services throughout their lifecycle, including metrics collection, aggregation, and visualization.
As a member of the SRE team, you will support NCR's Financial Services business unit, product, and technology teams to improve the design and operation of systems, focusing on making them scalable, reliable, and efficient while ensuring production performance and high availability of products and services that reside primarily in the cloud.
You will influence the development and implementation of reliable production systems and services to address emerging business needs (such as cloud-based SaaS).
SREs pride themselves on the resiliency and stability of production systems, yet they are equally committed to facilitating innovation and operational improvement by applying software engineering practices to operations.
You will make our products easier to adopt and use by making improvements to the product, tools, processes, and documentation.

Responsibilities:

You will maintain and scale production services and servers for complex and high-throughput cloud services.
You will bridge and own the union between development, quality, security, and operations.
You will improve scalability, service reliability, capacity, and performance.
You will write automation code for provisioning and operating infrastructure at a massive scale.
You are not just an operator; you're an experienced software engineer focused on application reliability and scalability.
You will initiate and contribute to the continuous improvement of our software delivery processes and practices in a multi-location, multidisciplinary team to empower and accelerate product development.
You will use automation extensively to design, configure, manage, and monitor systems in support of our product development teams.
You will participate in disaster recovery planning and execution.
You will be responsible for maintaining and patching servers that support SaaS products. This includes Windows and Linux servers running in private data centers and/or on cloud PaaS providers (Azure).

You'll work together with all teams to ship our code to production using Continuous Integration / Continuous Deployment (CI/CD) and AppSec tooling.

You will collaborate with development teams and use intuition, experience, and understanding to create SLIs, SLOs, and SLAs.
You will provide timely assistance and remediation solutions during critical situations and production incidents to help resolve service problems (you will be on call for periods of time).

You will develop monitoring architecture, implement monitoring agents, build dashboards, and manage escalations and alerts.
You will participate in incident management and drive root cause analysis (RCA) and risk management processes.
You will participate in a rotating on-call schedule during off-hours where you may periodically need to remote into systems if a production outage occurs.

IDEAL TECHNICAL AND PROFESSIONAL SKILLS:

BS degree in Computer Science or a related technical field, or 5 years of prior relevant experience.
Extensive experience in a DevOps / SRE role with demonstrable experience in deploying and managing large-scale production environments in Azure, AWS, GCP, and multi-data-center environments.
Experience developing and debugging code (e.g., one or more of the following: Ansible, Python, Shell, Perl, Golang, JavaScript, Java, C, C++, .NET).
2+ years deploying and supporting high-traffic, scalable web applications/services.
2+ years with Azure, GCP, or AWS.
2+ years with Docker, Kubernetes, and an early version of OpenShift.
Experience with Linux, shell scripting, PKI and TLS/SSL, networking, firewalls, load balancers, and backups.
Experience in designing, analyzing, and running large-scale distributed systems.
Experience in securely hosting and troubleshooting public-facing services in Azure, AWS, or GCP.
Experience with orchestration, automation, and configuration management tools like Git, Fabric, and Ansible (or Puppet, Chef, Terraform, Helm, or related technology).
Excellent analysis, debugging, root-cause identification, and troubleshooting skills.
Experience with Kubernetes, system virtualization, on-prem and/or hybrid cloud computing, cloud identity and security systems, cloud monitoring and logging, and/or local/cloud storage.
Experience with one or more CI/CD and related tools, such as Azure DevOps, Jenkins, GitHub Actions, Artifactory, Harness, or Cloud Build.
Experience with application disaster recovery, migration, roll-back plans, expansion, routine deployments, and system upgrades.
Experience with log management, including monitoring, aggregation, alerting, and graphing (e.g., Nagios XI, Prometheus, ELK, Sensu, Stackdriver, or TICK stacks).
Bonus points for experience with Kafka, Elasticsearch, or Cassandra.
Extra bonus points for cloud certifications and exposure to Harness.
