Western Digital's High-Performance Computing environments are key to bringing new storage solutions to market.
As a Senior High-Performance Computing (HPC) engineer in the IT Infrastructure team, you will be at the heart of Western Digital's engineering and product development process, delivering the IT HPC infrastructure and services that empower engineering teams to develop new storage technologies and deliver high-quality products to market quickly.
As a member of the HPC as a Service (HPCaaS) team, you will be responsible for establishing and executing strategic objectives focused on improving the utilization of compute resources while meeting or exceeding customer service level agreements for job prioritization, job concurrency, and job throughput in our EDA compute clusters.
This includes leading architectural innovation and pathfinding efforts to create and implement Western Digital's next-generation Grid computing environment.
As a member of the team, you will be expected not only to deliver on technical requirements and solutions but also to present your solutions to senior management.
Responsibilities include, but are not limited to, working as an individual contributor, a team member, and a technical team lead to explore, define, and pilot new solutions with little supervision.
You will develop solutions, scripts, and/or processes to automate the management of services and tools as required.
In this role, you will collaborate closely with EDA and hardware design team stakeholders to define and deliver workload efficiency improvements in Western Digital's EDA HPC infrastructure globally.
What You'll be doing:
Support multi-site, high-performance compute infrastructure and services for the global engineering product development organizations
Design, create, deliver, and support the deployment of Ansible automation within HPC and Unix environments
Identify and propose solutions and new services for the distributed ASIC and GPU computing clusters
Perform troubleshooting and root cause analysis of HPC clusters and file system related issues
Develop and maintain documentation for all aspects of the HPC infrastructure
Improve root cause analysis and corrective action for problems large and small: identify patterns and propose ways to automate repetitive tasks
Recommend and implement solutions to improve the performance of workloads
Support a diverse Engineering Design Automation environment
Bachelor's degree in computer science or equivalent experience
10+ years of Linux systems administration experience, specifically in managing or supporting Red Hat and/or CentOS Linux in production environments