COCUS - Lead Site Reliability Engineer

COCUS:

COCUS Prosource is all about People! We are proud to deliver skilled services and products developed by great talent with the attitude and ambition to work on innovative IT solutions.

Emotions are part of us. We encourage everyone to be who they truly are in a collaborative, informal, transparent, and open environment. That is why we take our partnerships seriously: as a specialized Talent Acquisition partner, we support recruitment for companies that share our People-first mindset!

What you will be doing:

As a Lead Site Reliability Engineer, you will be responsible for defining and driving reliability engineering practices across a global technology landscape. Working closely with Platform Engineering, Product Teams, Security, and Service Operations, you will help ensure that business-critical services remain reliable, scalable, observable, and resilient.

You will play a key role in establishing SRE standards, promoting automation, reducing operational complexity, and continuously improving platform reliability across the organization.

Define and drive the Site Reliability Engineering strategy across global platforms and services
Establish and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budget frameworks
Drive reliability, resilience, and observability best practices across cloud-based and distributed systems
Lead technical coordination during major incidents and facilitate post-incident reviews and root cause analysis
Promote automation initiatives to reduce operational toil and improve engineering efficiency
Collaborate with Platform Engineering, Product Teams, Security, and Operations teams to improve service reliability
Define and monitor reliability metrics such as availability, latency, capacity, MTTR, and error budgets
Champion observability practices including logging, monitoring, metrics, and distributed tracing
Support the continuous evolution of reliability standards, governance models, and engineering practices
Mentor and support the development of Site Reliability Engineers across global teams
Contribute to the design and implementation of scalable, resilient, and highly available cloud architectures.

What we are looking for:

Several years of experience working with cloud infrastructure, distributed systems, or platform engineering environments
Several years of experience in Site Reliability Engineering, Reliability Engineering, DevOps, Platform Engineering, or similar senior technical roles
Strong understanding of SRE principles, reliability engineering practices, and operational excellence frameworks
Proven experience defining and implementing SLO, SLI, and Error Budget strategies
Strong experience with cloud platforms, preferably Microsoft Azure
Experience with observability platforms and monitoring solutions covering logs, metrics, and distributed tracing
Hands-on experience with automation and scripting using technologies such as Python, PowerShell, Bash, or similar
Experience working with Infrastructure as Code tools such as Terraform
Strong understanding of CI/CD pipelines, deployment strategies, and release reliability practices
Experience leading major incident response and conducting structured postmortem analysis
Strong stakeholder management and cross-functional collaboration skills
Excellent communication, analytical, and problem-solving abilities
Fluent in written and spoken English
Degree in Computer Science, Engineering, Information Technology, or a related field.

What will be a plus:

Experience working in global, multi-region, follow-the-sun operational models
Knowledge of AWS and/or Google Cloud Platform (GCP)
Experience with enterprise-scale observability platforms such as Datadog, Dynatrace, New Relic, Grafana, Prometheus, or similar
Experience building reliability frameworks in large-scale organizations
Knowledge of platform engineering practices and internal developer platforms
Relevant certifications in Azure, Cloud, DevOps, or Site Reliability Engineering
Previous experience mentoring or leading globally distributed engineering teams.

What we can offer you:

Permanent contract with long-term career prospects
Competitive salary aligned with your experience and expertise
Annual target bonus based on performance
Daily meal allowance
Health insurance and life insurance
Access to an Employee Assistance Program for well-being and support
A flexible work model, including the possibility to work remotely.
