As a Site Reliability Engineer, you'll design, build, deploy and maintain the infrastructure that enables critical cyber activity. Working in a highly technical environment, you'll ensure the systems others depend on are secure, reliable and fit for purpose. Using a blend of cutting-edge and bespoke technologies, including internally developed systems you won't find elsewhere, you'll solve complex, mission-critical challenges.
This is an engineering-led role where you'll play a key part in delivering and managing infrastructure in fully code-driven environments. Using tools such as Kubernetes, infrastructure as code and automation frameworks, you'll develop and manage production environments that support operational activity. You'll oversee deployments end to end, manage upgrades and changes, and keep systems stable and available so that service continues without disruption.
Monitoring is central to the role. You'll track system health, performance and behaviour using logs and metrics, identifying risks and weaknesses before they impact service. When issues arise, you'll investigate root causes and contribute to improvements that strengthen reliability and prevent recurrence.
Day to day, you'll work closely with engineers, developers and researchers to deploy new capabilities and improve existing systems. You'll collaborate with monitoring teams to respond to system concerns and ensure environments remain secure and resilient. Alongside this, you'll help improve how systems are managed, refining processes and strengthening reliability practices as systems evolve.
Operating within a small, specialist team in a wider delivery environment, you'll have the autonomy to influence how production systems are managed and delivered. This is a hands-on engineering role focused on ensuring systems are robust, consistently available and capable of supporting critical operational activity.