Principal Service Reliability Engineer

NetSuite

Location

Secunderabad | India

Job description

Oracle, the world leader in Enterprise Cloud, is hiring the most creative technologists in the industry as we continue to add customer-centric, premier, groundbreaking, secure, hyper-scale based solutions throughout all levels of the cloud stack. Oracle's cloud eco-system is the only complete business cloud platform on the planet, with market leading and business redefining solutions spanning SaaS, DaaS, PaaS and IaaS. Oracle's Cloud applications, such as Enterprise Resource Management, Customer Relationship Management, Human Capital Management, and Supply Chain Management are used by thousands of customers across the globe and are the broadest, most innovative in the industry, providing businesses with adaptive intelligence, standardized business processes and competitive advantage at low cost.

As part of the market-leading ERP Cloud, Oracle ERP Cloud Operations offers a broad suite of modules and capabilities designed to empower modern finance and deliver customer success with streamlined processes, increased efficiency, and improved business decisions.

The ERP Cloud Operations team is looking for hardworking, innovative, high-caliber, team-oriented superstars who want to be a major part of a dynamic revolution in the development of modern cloud-based business applications. We are seeking highly capable, best-in-the-world developers, architects, and technical leaders at the very top of the industry in terms of skills, capabilities, and proven delivery, who seek out and implement imaginative and strategic, yet practical, solutions: people who calmly take measured and necessary risks while putting customers first.

Key Tasks and Responsibilities

. Service Ownership - You will be a part of the SRE team, whose mission is the shared full-stack ownership of a collection of services with our Service Development and Operations SRE partners.

. Ownership Scope - You will understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of the production services you own. In partnership with your Service Development and Operations SRE partners, you will be responsible for ensuring that services are designed and delivered to be critically important, with a focus on monitoring, telemetry, security, resiliency, scale, and performance.

. Service Design - You will partner with the SRE Architect, Service Development, and Operations SRE teams in defining and implementing improvements in service architecture, both current and future.

o You will be an authority at articulating the technical characteristics of your services and the dependencies between services, and will guide Service Development teams to engineer and add SRE capabilities to the Oracle SaaS/ERP service portfolio.
o You will participate in feature design reviews to ensure Monitoring, Telemetry, Reliability, Automation, and Runtime Debuggability are represented as first-class, design-time priorities.
o You will provide technical leadership in defining software engineering patterns, practices, and coding standards focused on increasing the reliability and resilience of Oracle SaaS/ERP services. You will deliver strong work artifacts (reusable components, plug-ins, blueprints, sample code, scripts, tooling, etc.) to streamline adoption by Service Development.
. Operations Engineering - You will understand and be able to communicate the scale, capacity, security, and performance attributes and requirements of the services you own. You are an authority, able to understand and communicate every characteristic of your service stack, such as:
o Degradation and behavior under load of the services and their dependencies.
o End-to-end tuning needs, optimizing resource utilization, as load patterns fluctuate.
o Instrumentation and metrics that clearly describe the service behaviors.
o Scaling requirements and patterns.
o Resiliency and recoverability, ensuring that backup / restore and disaster recovery capabilities are implemented, tested and maintained.

. Technical Experts - You are the ultimate customer concern point for complex or critical issues that have not yet been documented as SOPs for Level 1 staff. You will usually get called in during major incidents as an SME, when the source of a problem is unclear. You will have the deep understanding of service topology and dependencies required to solve issues and define mitigations.

. Incident Response - You will be the primary author of technical content for both customer and internal communication used throughout the incident response process, e.g. postmortem/root cause analysis, end-to-end repair item definition, fixes in production.

. Automation - You will have a clear understanding of automation and orchestration principles, and will be eager to automate wherever and whenever the possibility arises, while simultaneously eliminating technical debt. Automation must be a part of your DNA.

. Prevention - Using data-driven incident findings, you will work on solutions that will ultimately prevent the incident/problem from arising ever again, and interim solutions to more quickly resolve the problem next time.

Skills and Qualifications

. Minimum of 5 years of software development, with demonstrated knowledge of professional software engineering standard methodologies for the full software development process, including coding standards, code reviews, source control, build and release processes, continuous deployment, and test suite development and maintenance.

. Experience deploying and running large-scale online systems built on Cloud platforms such as Oracle Cloud, AWS, Azure, Google Cloud Platform, and/or OpenStack.

. Experience designing and implementing solutions for platform and application layer telemetry, monitoring, scalability, performance, and reliability.

. Experience coordinating resources across teams with varied strengths to restore service and maintain SLAs. ITIL certification is preferred.

. Excellent written and verbal technical communications with technical and non-technical peers, customers, and at times, executive leadership.

. Proven success in contributing in a collaborative, team-oriented environment, with the ability to establish and cultivate relationships between multiple teams and navigate dependencies.

. 3+ years of experience:
o Working in systems and network administration, application security, DevOps, and/or Site Reliability Engineering.
o Hands-on with web protocols and Linux/Unix tools and architecture, from kernel to shell, file systems, and client-server protocols.
o Using C#, PowerShell/Shell script, ASP.NET/MVC, JavaScript, TypeScript, React, or T-SQL.
o Maintaining and analyzing large-scale distributed services.
o Building automated tools in Python, Java, GoLang, and/or Ruby.

. Experience with monitoring and alerting using technologies like Prometheus, Sensu, Nagios, Kafka, Wavefront, BigPanda, DataDog, and/or PagerDuty.

. Experience designing, implementing, and deploying Docker, Kubernetes, and serverless (Lambda) solutions.

. Experience with Oracle Linux, Red Hat Linux, Ubuntu, CentOS, CoreOS, and/or Amazon Linux.

. Experience with one or more orchestration and deployment tools, e.g. CloudFormation, Terraform, Ansible, Packer, and/or Chef.

. Experience with one or more CI tools: Jenkins, TeamCity, Bamboo, Artifactory.

. Experience with configuration management systems such as Ansible, Chef, or Puppet.

. Experience with Agile software development practices.

. Knowledge of testing methodologies, the testing pyramid (e.g., Unit, Integration, UI, E2E), testing frameworks, and testing automation tools like QTP, OATS, and Selenium.

. Determined to keep moving things forward even in the face of ambiguity and imperfect knowledge (resilient to hazards of 'analysis paralysis').

. BS in Computer Science or a related field and 7 years of relevant experience.
