Excelon Solutions
Location
Us, 50250 | France
Job description
Title: Site Reliability Engineer with Machine Learning
Location: Austin, TX (Day One Onsite)
Job type: Full-Time
Good experience in SRE with MLOps and ML workflows, along with strong scripting skills, is required.
Job description:
The ideal candidate has experience with Kubernetes, machine learning workflows (preferably Amazon SageMaker), Python scripting, and Rubix. The candidate should also have experience supporting Jupyter Notebooks as an SRE.
The successful candidate will have several years of experience supporting a large enterprise system with at least 10 different upstream and downstream systems, including identifying issues from Splunk logs.
Technically sound in AWS, Kubernetes, Python, and basic SQL. MLOps knowledge, such as MLflow, is a plus.
Answering and fixing support issues for DatalaLab.
Implementing and maintaining Infrastructure as Code and build pipelines.
Taking measures to minimize on-call incidents.
Conducting post-incident reviews.
Documenting issue resolutions and previously undocumented knowledge.
Working with dev teams to ensure that new features meet reliability and performance goals.
Ability to work with geographically distributed teams in India and SCV.
Excellent problem-solving skills and sound judgment about when to engage other team members.