JobNob

Your Career. Our Passion.

Sr Reliability/Observability Center of Practice Engineer


Stellent IT LLC


Location

Hartford, CT | United States


Job description

Lead SRE Engineer
Location: Hartford, CT (Hybrid)
Phone + Skype

Job Description:

Lead SREs to define Observability Processes/Center of Excellence for other SREs across our business.

Strong communication skills are the top priority, with technical and process skills a slightly lower priority.

Key Responsibilities:

Looking for a process-oriented SRE to set up observability and monitoring metrics for each line of business (LOB) in support of cloud systems.

Lead transformations end to end, from development through deployment (blue/green, canary).

Recommend SLOs, SLIs, and SLAs. Set up holistic, open-source processes from beginning to end: IaC/IaaS, automation, DevOps, observability, and CI/CD pipelines; use metrics and create dashboards.

Champion migration to open-source platforms to establish standards.

Agile: manage backlogs and backlog refinement, metrics, and golden signals.

The ideal candidate should have a strong background in SRE and IT operations, as well as proficiency in various programming languages. Position requires a strong technical understanding of complex IT environments, cloud, and evolving technologies.

Responsibilities:

Influence and design architecture, infrastructure, standards and methods for large-scale cloud systems

Engage in and improve the software development life cycle through CI/CD; improve the build-to-deployment process to establish greater reliability and a sustainable release process.

Oversee release gating; establish deployment metrics (DORA).

Monitor and develop SLOs and SLIs through the customer user journey; advise on SLAs; establish error budgets.

Observability and custom monitoring tool integrations; introduce telemetry to support SLOs

Automate system scalability and continually work to improve system resiliency, performance, and efficiency; recommend design changes to improve the reliability of HA systems.

Deploy software through highly available deployments: rolling, blue-green, or canary.

Provide mentorship to reliability engineering squads under a consistent framework for the Development, Testing and Alerting processes

Practice sustainable incident response through blameless RCA and postmortems

Advise on performance testing and capacity planning

Communicate proactively with colleagues and formally present work product outcomes and risk analysis to product team and management.

Follow the Agile/Scrum working methodologies

Establish dashboarding for monitoring capabilities and metrics.

Qualifications:

8+ years of relevant technical experience

BS degree in Engineering, Computer Science, or equivalent practical experience

Expertise designing, analyzing, and troubleshooting large-scale distributed systems.

Experience in implementing Infrastructure as code

Experience building software and maintaining systems in a highly secure, regulated or compliant industry

Experience in monitoring infrastructure and application service level objectives to ensure functional and performance objectives are met.

Experience in implementing service dashboards for monitoring, objectives, and metrics

Experience developing and/or administering software in AWS cloud infrastructure

System administration skills, including automation and orchestration of environments using Terraform or CloudFormation and configuration management

3-5 years of experience in languages such as Python, Ruby, Bash, PowerShell

Experience with container orchestration tools and container management (Docker, Kubernetes, etc.)

Proficiency with continuous integration and continuous delivery tooling and practices

Must have exceptional communication skills (written, oral, presentation and facilitation)

Solid understanding of AWS, DevSecOps practices, SAFe Agile methodologies:

Knowledge of Amazon Web Services, including but not limited to EC2, S3, ECS, RDS, CloudWatch, SNS, CloudTrail, SQS, Service Catalog.

Expertise with cloud platforms like AWS and microservices architecture
Familiarity with enterprise software solutions such as GitHub, Jenkins, Nexus, Ansible, Jira, Rally, etc.
Observability and monitoring tools and metrics: Dynatrace, Splunk, Nagios, CloudWatch, ELK, Grafana, Prometheus, etc.
Familiarity with programming languages (Python, Lambda, Go)
Experience in Infrastructure as Code (IaC) using CloudFormation & Terraform templates, YAML files, build specifications
Solid understanding of technologies that support the services offered for cloud applications.

Nishi Dixit, Technical Recruiter

Stellent IT

Phone:

Email: Nishi

Gtalk: Nishi


Job tags

Contract work

