Senior Systems And Infrastructure Engineer
Location
Bangalore | India
Job description
- Demonstrates up-to-date expertise and applies this to the development, execution, and improvement of action plans by providing expert advice and guidance to others in the application of information and best practices; supporting and aligning efforts to meet customer and business needs; and building commitment for perspectives and rationales
- Provides and supports the implementation of business solutions by building relationships and partnerships with key stakeholders; identifying business needs; determining and carrying out necessary processes and practices; monitoring progress and results; recognizing and capitalizing on improvement opportunities; and adapting to competing demands, organizational changes, and new responsibilities
- Models compliance with company policies and procedures and supports company mission, values, and standards of ethics and integrity by incorporating these into the development and implementation of business plans; using the Open Door Policy; and demonstrating and assisting others with how to apply these in executing business processes and practices
What you'll do:
- As a Senior Site Reliability Operations Engineer within the Global Technology Platforms (GTP) CCC team, you will work with other CCC, TDO, SRE, DevOps and Engineering practitioners to proactively maintain the mission-critical infrastructure, cloud platforms, micro-services, tools, and processes that ensure the highest levels of availability and reliability across our Global Technology platforms
- You're right for the job if you are comfortable leading our major incident response as part of a technical team of engineers laser-focused on restoring service across complex distributed systems
- You'll excel if you have enthusiasm for digging deep, and a flair for sharp technical communication, prioritization and organization
- You will work directly with our SRE, Engineering and DevOps teams to support our next-generation, always-up, cloud-based e-commerce platforms
- The CCC Senior Site Reliability Operations Engineer is responsible for proactively monitoring, detecting and resolving site issues before they affect customers or availability
- Technically, you will understand the full end-to-end stack and use this knowledge to detect errors and failures and take corrective action to mitigate them
- During a major incident, you will draw on your technical skills and knowledge to triage and troubleshoot, differentiating between symptom and cause, to help resolve impacting issues and restore service
- Your ability to continuously challenge yourself and develop a strong network within your peer group will see you excel in this role
- Our goal is to protect the customer experience and deliver outstanding levels of availability
To do so, you will need strong skills in the following areas:
- xMatters workflow integration with scalability, resiliency and performance
- Expert level understanding of incident management processes and procedures.
- Calm under pressure when participating in major incident response.
- Deep technical understanding of core infrastructure, cloud services, platforms and micro-services.
- Ability to understand and capture key data from logs at an expert level.
- Ability to understand traffic flows and key dependencies between services.
- Ability to triage effectively: detect and distinguish symptom from cause.
- Detect and quantify impact.
- Expert-level troubleshooting skills using a diverse set of tools and methods.
- Analyze trends to proactively prevent incidents.
- Focus on immediate restoration vs root cause.
- Research and recommend alternative actions for incident resolution, and develop procedures and documentation to support them.
- Create and maintain procedural documentation.
- Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
- Absorb knowledge and understand complex distributed systems, and share this knowledge with your peer group and beyond.
- Build tools to improve visibility, proactively detect issues and restore system availability.
- Develop automation and self-healing with DevOps, Engineering and SRE partners.
- Strong focus on collecting and inferring metrics.
- Clear communication skills.
- Ability to contribute to multiple incidents at any given time.
- Analyze systems and make recommendations to prevent possible problems. Take the lead on issue resolution activities using knowledge of complex, company-wide systems.
- Scripting and software development to automate and help enhance existing solutions.
- Experience owning, developing and evangelizing a product.
- Ability to gather requirements and build solutions into a product.
- Evangelize operational excellence.
Additional responsibilities may include:
- Actively provide data for and participate in root cause analysis.
- Define CCC onboarding processes and ensure they are adhered to when accepting new systems into service.
- Share knowledge globally between CCC teams.
- Analyze systems and make recommendations to prevent possible incidents.
- Strive for continuous improvement and make recommendations based on CCC process.
- Act as a technical focal point for the CCC team.
- Other duties and responsibilities as assigned.
Qualifications:
- 7+ years of experience in enterprise application development and API integrations with Java and React/JavaScript.
- Experience building and scaling distributed, highly available systems
- Experience developing applications for a cloud environment such as Google Cloud Platform or Microsoft Azure
- Experience with frameworks/tools such as Git, xMatters workflow integration, ServiceNow integration, etc.
- Comfortable building metrics, monitoring, and alerting for micro-services
- 4+ years of experience in an infrastructure, systems, engineering or development environment delivering operational excellence for highly complex distributed systems.