**Responsibilities**:
- Build infrastructure as code using Terraform.
- Create and enable Kubernetes clusters (GCP / AWS / Azure / RKE).
- Manage and performance-tune either databases (NiFi, Elasticsearch) or streaming data pipelines (Kafka).
- Create and manage CI/CD pipelines, configuration, and automation tools for infrastructure provisioning.
- Write and maintain runbooks for knowledge-driven automated processes and bots.
- Perform capacity planning based on performance, usage, and utilization statistics.
- Ensure system availability and business continuity by implementing redundant servers/services.
- Manage after-hours infrastructure updates and maintenance.
- Proactively research and propose the use of new concepts, processes, technologies, and tools.
- Proactively monitor, diagnose, and resolve issues in a 24x7 multi-cloud environment (OpenStack) as part of an on-call rotation. Analyze failures and support software engineers in debugging production issues across microservices and distributed platforms.
- Follow SRE best practices and procedures.
**Experience Required For You To Be Successful**:
- Laser focus and the ability to design infrastructure solutions for scalability, reliability, high availability, performance, software maintainability, and operational excellence
- The ability to "fix the plane while in flight" (not just support greenfield solutions)
- The ability to prioritize existing technical and infrastructure debt, and the experience to build and execute a plan to pay it off
**Required skills**:
- Delivering reliable operations for web-scale infrastructure for a global market at high release velocity
- Over 5 years of proven experience with at least one of the following languages: Go, Python, or Java.
- Experience with Kafka, Kubernetes, NiFi, Elasticsearch, MongoDB, Vertica, ZooKeeper, and IaC (Terraform).
- 2+ years of industry experience in managing infrastructure in large enterprises.
- 2 years of Linux administration in a large-scale SaaS environment.
- 2+ years maintaining production systems on AWS, OpenStack, Azure, and/or GCP.
- 2+ years of experience managing Kubernetes in a large-scale production environment.
- Strong familiarity with running and optimizing relational and NoSQL databases.
- 2 years using infrastructure-as-code software (e.g., Terraform, AWS CloudFormation, Google Cloud Deployment Manager).
- 2 years of experience with continuous integration practices and tools (Jenkins).
- Experience with monitoring solutions such as Prometheus, Grafana, and ELK.