JobNob

Your Career. Our Passion.

Site Reliability Engineer With Machine Learning


Excelon Solutions


Location

Us, 50250 | France


Job description

Title: Site Reliability Engineer with Machine Learning

Location: Austin, TX (onsite from day one)

Job type: Full-time

Solid SRE experience with MLOps and ML workflows, plus strong scripting skills, is required.

The ideal candidate has experience with Kubernetes, machine learning workflows (preferably Amazon SageMaker), Python scripting, and Rubix. The candidate should also have experience with Jupyter Notebooks as an SRE.

The successful candidate will have several years of experience supporting a large enterprise system with at least 10 different upstream and downstream systems, including identifying issues from Splunk logs.

Technically sound in AWS, Kubernetes, and Python, with basic SQL. MLOps knowledge, such as MLflow, is a plus.

Answer and fix support issues for DatalaLab.

Implement and maintain infrastructure as code and build pipelines.

Take measures to minimize on-call incidents.

Conduct post-incident reviews.

Document issue resolutions and previously undocumented knowledge.

Work with dev teams to ensure that new features meet reliability and performance goals.

Ability to work with geographically distributed teams in India and SCV.

Excellent problem-solving skills and decision-making skills about when to engage other team members.

