As a Technology Services Engineer III on the Operational Excellence Org team, you'll spend your days driving production incidents with RCA/COE, finding tactical solutions to unblock business impacts, collaborating with cross-functional teams, and providing us with the insights necessary to think beyond the status quo.
You'll oversee small project teams consisting of other engineers who will look to you as a trusted advisor and a subject matter expert who provides guidance and the right tools to get the job done.
About the Team:
As a member of the Cart & Checkout Prod Eng group, you'll be responsible for driving site/customer-impacting production incidents, identifying tactical and long-term solutions, and working with cross-functional teams.
You'll independently handle high-impact, critical software/systems monitoring issues, troubleshoot business and production issues, and develop automations/tools that lead to operational excellence.
As a member of the team, you'll be able to say that you work for the world's largest retailer and contribute to the development of best-in-class methodologies that impacted perception and drastically changed business as we know it.
What You'll Do:
Supports Java full-stack backend application system components in a massively scalable, high-performance, multi-tenant, international eCommerce platform with multiple microservices deployed in a cloud environment, root-causing every reactive/proactive production issue.
Leads and participates in medium- to large-scale, complex, cross-functional projects.
Partners with architects and development leads to come up with high-level designs that accelerate the omni customer experience, recommending out-of-the-box engineering best practices.
Proactively identifies areas to drive automation/speed/innovation.
Troubleshoots business and production issues by gathering information (for example, issue, impact, criticality, possible root cause); performing root cause analysis to reduce future issues; engaging support teams to assist in the resolution of issues; developing solutions; driving the development of an action plan; performing actions as designated in the plan; interpreting the results to determine further action; and completing online documentation.
Provides support to the business by responding to user questions, concerns, and issues (for example, technical feasibility, implementation strategies); researching and identifying needed solutions; determining implementation designs; providing guidance regarding implications of new and enhanced systems; identifying short- and long-term solutions; and directing users to appropriate contacts for issues outside of the associate's domain.
Assists in providing guidance to small groups of 5 to 6 engineers, including offshore associates, for assigned engineering projects by providing pertinent documents, directions, examples, and timelines.
Demonstrates up-to-date expertise and applies this to the development, execution, and improvement of action plans by providing expert advice and guidance to others in the application of information and best practices; supporting and aligning efforts to meet customer and business needs; and building commitment for perspectives and rationales.
Models compliance with company policies and procedures and supports company mission, values, and standards of ethics and integrity by incorporating these into the development and implementation/support of business plans; using the Open Door Policy; and demonstrating and assisting others with how to apply these in executing business processes and practices.
Provides and supports the implementation of business solutions by building relationships and partnerships with key stakeholders; identifying business needs; determining and carrying out necessary processes and practices; monitoring progress and results; recognizing and capitalizing on improvement opportunities; and adapting to competing demands, organizational changes, and new responsibilities.
What You'll Bring:
Expertise in creating Java utilities and Python scripts for operational success.
Strong analytical thinking, troubleshooting, and problem-solving skills in a complex ecosystem.
Minimum of 2 years of experience with observability tools, specifically Grafana and Splunk (both are required), with demonstrated skills in building monitoring, alerting, dashboards, and processes.
Minimum of 2 years of experience running large-scale, customer-facing applications.
2+ years of hands-on experience implementing, supporting, and using the tools and services required for on-prem and cloud DevOps best practices. This includes, but is not limited to:
Application services (Tomcat, Spring Boot)
Source code control systems (Subversion, Git variants)
Build systems (Jenkins, GitLab)
Containerization tools (Docker)
Orchestration and environment management tools (Kubernetes)
Monitoring tools (Splunk, Grafana, ELK)
Programming or scripting languages (Java, Python, Bash)
Automated testing tools (Selenium, TestComplete)
Hands-on experience debugging production issues, managing escalations and SLAs, and striving to bring down MTTD and MTTR on production issues.
Ability to work with application developers to identify root causes and find tactical and permanent solutions to critical business- and customer-impacting issues.
Ability to take a methodical approach to troubleshooting complex problems.
Experience troubleshooting production incidents on an eCommerce retail site and identifying root causes, working with all business and engineering stakeholders to bring incidents to closure with action items and defined timelines for tactical and permanent fixes.
Hands-on experience with both public (Azure/GCP) and private clouds, including planning and driving efficiencies.
Good to have: knowledge of machine learning models.
Ability to automate repetitive issues, create SOPs for them, and build knowledge base articles.