EXL IT service management
Job description
Title: Data & ML Ops SME
Experience: 9+ years
Location: Gurgaon, India (Hybrid)
Job Summary:
EXL is seeking a skilled and experienced Data and ML Ops Engineer to join our team. This role is critical in ensuring the seamless deployment, monitoring, and maintenance of data pipelines and machine learning models in a production environment. The successful candidate will be responsible for bridging the gap between data engineering and data science to ensure efficient, reliable, and scalable data and ML operations.
Responsibilities:
- Data Pipeline Deployment: Collaborate with data engineers and data scientists to deploy data pipelines and machine learning models into production environments.
- Model Versioning: Implement version control for machine learning models to ensure reproducibility and traceability.
- Continuous Integration/Continuous Deployment (CI/CD): Implement and manage CI/CD pipelines for data and ML deployment, and set up configuration-based automated CI/CD pipelines.
- Orchestration: Integrate data pipelines/products with orchestration tools such as Airflow.
- Monitoring and Alerting: Implement and automate monitoring and alerting systems to proactively detect and address issues in data pipelines, machine learning models, and infrastructure.
- Automation: Develop automation scripts and tools for managing data and ML operations, including data ingestion, feature engineering, model training, and deployment.
- Scalability: Ensure that data and ML pipelines can scale with the growing volume of data and usage demands.
- Performance Optimization: Identify and address performance bottlenecks in data and ML pipelines.
- Security: Implement security best practices to protect sensitive data and machine learning models.
- Collaboration: Collaborate with data engineers, data scientists, and IT teams to resolve technical issues and optimize data and ML workflows.
- Documentation: Maintain detailed documentation for data and ML operations processes, configurations, and updates.
- Work with multiple clients to identify Ops opportunities, whether gaps or new scope, and design solutions for them.
- Set up Ops frameworks, processes, and teams from scratch.
- Research new tools, technologies, and trends in the market; create PoVs on where they fit in the Operations scope and how they support client architectures.
- Support business development through RFP responses, building accelerators, Ops products, etc.
Preferred Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Science, or a related field.
- Proven experience in data & machine learning operations, or a related role.
- Must have led an Operations & support team in the past.
- Awareness of L1 to L4 Ops support model, responsibilities & activities involved in each level
- Good knowledge of data engineering tools and frameworks (e.g., Apache Spark, Kafka, Airflow, Hadoop, Hive, Greenplum).
- Scripting and programming skills in languages such as Python and SQL.
- Strong experience with Unix.
- Proven experience in end-to-end automation of monitoring, alerting, and data quality (DQ) frameworks.
- Proficiency in CI/CD tools like Git, Jenkins, Concourse, Ansible, Chef, Puppet, etc.
- Hands-on experience with monitoring & alerting tools like Prometheus, Grafana, Datadog, Nagios, Monte Carlo, New Relic, Dynatrace, etc.
- Working knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes) and integrating them with the CI/CD pipeline.
- Working knowledge of integrating Unit & Integration test cases with the CI/CD pipelines
- Understanding of machine learning concepts and frameworks (e.g., TensorFlow, PyTorch).
- Hands-on experience with MLOps tools like MLflow, TorchServe, Kubeflow, Vertex AI, SageMaker, etc.
- Experience with at least one cloud platform (e.g., AWS, Azure, GCP), along with related Ops tools like CloudWatch, Cloud Monitoring, Cloud Deploy, CodePipeline, Azure DevOps, etc.
- Working knowledge of cloud infrastructure automation tools like Terraform, CloudFormation, Resource Manager, etc.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills.
Good to have:
- Working knowledge of building configuration-based CI/CD pipelines
- Working knowledge of Hadoop commands to manage edge nodes.
- Working knowledge of kubectl to monitor K8s pods
- Experience in setting up Data or MLOps teams from scratch
- Experience in an on-prem to cloud migration project.