Lead Site Reliability Engineer

  • Permanent
  • Prosource
  • Portugal

COCUS:

COCUS Prosource is all about People! We are proud to deliver skilled services and products developed by great talent with the attitude and ambition to work on innovative IT solutions.

Emotions are part of us, and we encourage everyone to be who they truly are in a collaborative, informal, transparent, and open environment. That is why we take our partnerships seriously: as a specialized Talent Acquisition partner, we support recruitment for companies that share our People-first mindset!

What you will be doing:

As a Lead Site Reliability Engineer, you will be responsible for defining and driving reliability engineering practices across a global technology landscape. Working closely with Platform Engineering, Product Teams, Security, and Service Operations, you will help ensure that business-critical services remain reliable, scalable, observable, and resilient. You will play a key role in establishing SRE standards, promoting automation, reducing operational complexity, and continuously improving platform reliability across the organization.

  • Define and drive the Site Reliability Engineering strategy across global platforms and services
  • Establish and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budget frameworks
  • Drive reliability, resilience, and observability best practices across cloud-based and distributed systems
  • Lead technical coordination during major incidents and facilitate post-incident reviews and root cause analysis
  • Promote automation initiatives to reduce operational toil and improve engineering efficiency
  • Collaborate with Platform Engineering, Product Teams, Security, and Operations teams to improve service reliability
  • Define and monitor reliability metrics such as availability, latency, capacity, MTTR, and error budgets
  • Champion observability practices including logging, monitoring, metrics, and distributed tracing
  • Support the continuous evolution of reliability standards, governance models, and engineering practices
  • Mentor and support the development of Site Reliability Engineers across global teams
  • Contribute to the design and implementation of scalable, resilient, and highly available cloud architectures.

What we are looking for:

  • Several years of experience working with cloud infrastructure, distributed systems, or platform engineering environments
  • Several years of experience in Site Reliability Engineering, Reliability Engineering, DevOps, Platform Engineering, or similar senior technical roles
  • Strong understanding of SRE principles, reliability engineering practices, and operational excellence frameworks
  • Proven experience defining and implementing SLO, SLI, and Error Budget strategies
  • Strong experience with cloud platforms, preferably Microsoft Azure
  • Experience with observability platforms and monitoring solutions covering logs, metrics, and distributed tracing
  • Hands-on experience with automation and scripting using technologies such as Python, PowerShell, Bash, or similar
  • Experience working with Infrastructure as Code tools such as Terraform
  • Strong understanding of CI/CD pipelines, deployment strategies, and release reliability practices
  • Experience leading major incident response and conducting structured postmortem analysis
  • Strong stakeholder management and cross-functional collaboration skills
  • Excellent communication, analytical, and problem-solving abilities
  • Fluent in written and spoken English
  • Degree in Computer Science, Engineering, Information Technology, or a related field.

What will be a plus:

  • Experience working in global, multi-region, follow-the-sun operational models
  • Knowledge of AWS and/or Google Cloud Platform (GCP)
  • Experience with enterprise-scale observability platforms such as Datadog, Dynatrace, New Relic, Grafana, Prometheus, or similar
  • Experience building reliability frameworks in large-scale organizations
  • Knowledge of platform engineering practices and internal developer platforms
  • Relevant certifications in Azure, Cloud, DevOps, or Site Reliability Engineering
  • Previous experience mentoring or leading globally distributed engineering teams.

What we can offer you:

  • Permanent contract with long-term career prospects
  • Competitive salary aligned with your experience and expertise
  • Annual target bonus based on performance
  • Daily meal allowance
  • Health insurance and life insurance
  • Access to an Employee Assistance Program for well-being and support
  • A flexible work model, including the possibility to work remotely.