Job Req Number: 88247
Time Type: Full Time
Job title: IT Specialist – OpenStack Site Reliability Engineer (SRE)

Purpose of the job/Overall responsibility:
A Site Reliability Engineer (SRE) is responsible for maintaining the reliability of infrastructure environments, ensuring that software applications run smoothly without causing errors after deployment and new changes. The SRE combines software engineering and systems administration to ensure the scalability, performance, and reliability of large-scale, cloud-based applications and infrastructure.

Success criteria/KPI:
Uptime percentage of OpenStack services.
Mean Time to Recovery (MTTR) for incidents.
Performance metrics (e.g., response time, throughput).
Security vulnerabilities identified and mitigated.
Successful backup and recovery tests.
Documentation completeness and accuracy.
Stakeholder satisfaction with communication and collaboration.
Number of successful change implementations.

Key Tasks:

Automation and Infrastructure as Code (IaC): Develop and maintain automation scripts for deployment, configuration, and management of OpenStack components. Use tools like Ansible to manage infrastructure as code. Implement CI/CD pipelines to automate the deployment and testing of OpenStack updates and configurations.

System Reliability and Availability: Implement and maintain monitoring systems to ensure the health and performance of the OpenStack environment. Quickly respond to and resolve incidents to minimize downtime and service disruptions.

Performance Optimization: Continuously monitor and optimize the performance of OpenStack services. Forecast resource needs and plan for scaling the infrastructure to meet demand. Conduct load testing to identify bottlenecks and optimize system performance.

Security and Compliance: Implement and enforce security best practices to protect the OpenStack environment. Regularly scan for and mitigate security vulnerabilities.

Backup and Disaster Recovery: Develop and implement backup strategies to protect data and ensure quick recovery in case of failures. Create and maintain disaster recovery plans to minimize downtime and data loss. Regularly test backup and disaster recovery processes to ensure their effectiveness.

Documentation and Knowledge Sharing: Maintain up-to-date documentation for all OpenStack configurations, processes, and procedures. Share knowledge and best practices with the team and other stakeholders.

Collaboration and Communication: Work closely with development, operations, and other teams to ensure smooth integration and operation of OpenStack. Communicate effectively with stakeholders about system status, incidents, and planned maintenance. Gather and incorporate feedback from users and stakeholders to improve the OpenStack environment.

Continuous Improvement: Conduct post-mortem analyses after incidents to identify root causes and implement preventive measures. Evaluate and integrate new technologies and tools to improve the OpenStack environment.