The opportunity for U
We’re now on the lookout for an SRE Engineer. You’ll be joining a global, diverse team working with cross-functional stakeholders. This is a permanent, full-time opportunity based in London.
Responsibilities
The type of person suitable for this role
- Ability to work on multiple tasks in parallel
- Problem solver
- Excellent communicator
- Desire to improve things
What skills will you need?
o Kubernetes and application troubleshooting
o Application deployment GitOps / ArgoCD
o K8s and application logging (Loki / Fluent Bit)
o Service Mesh (Linkerd preferred)
o Ingress Config / Troubleshooting (AWS LB Controller / Nginx)
o Autoscaling configuration (Karpenter)
o Certificate management (cert-manager)
o EKS
o RDS, DMS, RDS Proxy
o AWS Backup
o API Gateway
o RabbitMQ
o AWS Transfer Family (SFTP / SFTP Connector)
o AWS NGFW, TGW, PrivateLink
o AppStream
o Lambda – Python
o IAM
o Kinesis
o DynamoDB
o Troubleshooting defects
o Helm / ArgoCD
o Grafana, Prometheus, Loki, CloudWatch configuration/dashboard creation
o Git / CodeDeploy / CodePipeline
What U will do
- Platform Operations:
- Managing and optimising our infrastructure to ensure high availability and system reliability.
- Delivering 24/7 support via an on-call rotation for out-of-hours issues.
- Infrastructure Automation Expertise:
- Experience with the AWS cloud platform, including designing, deploying, and maintaining scalable infrastructure.
U will be someone with:
- Strong knowledge of container orchestration tools like Kubernetes and Docker.
- Familiarity with deploying Infrastructure as Code (IaC) with Terraform and CloudFormation.
- Chaos Engineering Proficiency:
- Understanding of how to implement resilience testing strategies.
- Experience using chaos engineering tools such as AWS Fault Injection Service, Gremlin, Chaos Monkey, or LitmusChaos to design and execute fault injection experiments.
- Knowledge of modern chaos engineering trends, such as adaptive resilience testing or AI-driven fault detection.
- Monitoring and Observability:
- Experience with monitoring and observability tools (e.g., Prometheus, ADOT, Grafana, Datadog, New Relic, Elastic Stack).
- Strong understanding of instrumenting infrastructure with metrics, logging, and tracing.
- Automation and Scripting:
- Proficiency in scripting and automation languages (e.g., Python, Go, Shell, Ruby, or Java).
- Demonstrated ability to automate infrastructure and operational processes.
- Incident Management and Root Cause Analysis:
- Participating in incident response processes, including triage, mitigation, and communication.
- Familiarity with incident management tools like PagerDuty or Opsgenie.
- Responding to production incidents, troubleshooting issues across the full stack, and ensuring minimal downtime by driving root cause analysis and applying long-term fixes.
- Conducting blameless post-mortems to identify root causes and derive actionable insights, ensuring continuous improvement.
- Developing playbooks for common incidents to reduce Mean Time to Resolution (MTTR).
- Resilience and Scalability Design:
- Understanding of system design principles, scalability, and high-availability architectures.
- Practical experience with load testing and performance benchmarking tools (e.g., JMeter, Locust, k6).
- Designing and testing disaster recovery (DR) strategies to ensure minimal downtime and data integrity during failures.
If you’re looking for a role where you can be part of exciting innovation, we want to hear from you! Apply now, or connect with us if you would like to hear more about our future career opportunities.