About us:
We are an international IT and Fulfillment Services company with offices in the US, UK, and Portugal. The Virtual Forge works with organisations to create digital and technology platforms that drive transformation, develop capabilities, and build businesses around the world.
We are looking for a seasoned Senior Site Reliability Engineer (SRE) / Infrastructure Performance Specialist who doesn’t just treat operational symptoms with quick reboots, but instinctively digs into the underlying architectural substrate to permanently fix multi-tiered cloud application issues.
Role Overview:
In this role, you will be responsible for bringing order, stability, and robust resource hygiene to a complex, high-stakes core banking environment deployed on Azure Kubernetes Service (AKS). You will lead the transition from a reactive firefighting posture to a proactive, durable operational baseline—ensuring that complex data flows and multi-tiered applications run reliably.
Duties and Responsibilities:
- Kubernetes Resource Governance & Architecture
  - Enforce Pod-Level Hygiene: Mandate and configure explicit CPU, memory, and ephemeral storage requests and limits across all microservices to completely eliminate resource over-commitment and node-level evictions.
  - Autoscaling Safeguards: Manage Horizontal Pod Autoscaler (HPA) configurations carefully, ensuring that high-risk, single-replica overrides (e.g., min=max=1) are banned on critical web UI and front-office tiers.
  - Cluster Readiness Management: Implement and maintain robust application readiness probes to properly gate traffic post-restart, ensuring clients never hit un-initialized or un-ready application pods.
- Database & Application Performance Tuning
  - Deadlock Mitigation: Analyze, trace, and remediate persistent database performance bottlenecks, specifically focusing on Azure SQL Hyperscale deadlocks, aborted transactions, and high data I/O contention.
  - Application Memory Optimization: Partner with product engineering teams to investigate runtime memory growth issues and database working-table bloat (such as queue/composition table growth), replacing manual table-clearing workarounds with permanent programmatic fixes.
- Advanced Observability & Alert Engineering
  - De-noise the Paging Layer: Take ownership of an inherently noisy alerting system generating hundreds of daily threshold alerts, implementing aggressive deduplication and alert-noise reduction.
  - Actionable Telemetry Curation: Configure custom Dynatrace and enterprise APM alert rules that page specifically on genuine availability-impacting events, such as edge gateway HTTP 5xx errors or customer-facing tiers dropping below minimum replica thresholds.
- Incident Response & Root Cause Analysis (RCA)
  - Durable Problem Management: Drive root cause investigations directly inside ticketing systems (ITSM/TSR), ensuring that incidents are tracked to structural resolution and never closed prematurely based purely on temporary "service restored" statuses.
  - Telemetry-Driven Diagnostics: Utilize deep-dive APM tools to map application dependency chains, preventing misattributed performance blame across distinct system layers.
Essential Skills:
- Bachelor's degree in Engineering, Computer Science, or a similar field.
- Excellent problem-solving skills and attention to detail
- Ability to communicate effectively with both technical and non-technical stakeholders
- Flexibility in working hours to support multiple time zones
- Deep engineering experience with Azure Kubernetes Service (AKS), node configuration, admission controllers, and cluster autoscaling mechanisms.
- Strong background in performance tuning of relational enterprise databases, specifically Azure SQL Hyperscale or similar high-volume engines under heavy transaction-retention stress.
- Advanced proficiency with Dynatrace (or equivalent top-tier APM suites like AppDynamics/New Relic) for transaction tracing, metric-event alerts, and PagerDuty routing.
- Direct familiarity with the Temenos product stack—specifically Transact (T24 core), Triple'A Plus (TAP / Financial Server), and Temenos Data Source (TDS)—is highly advantageous.
What you need to do:
- Send your CV
- State your gross salary expectation and availability
Do you have concerns about facing discrimination based on factors such as age, gender, race, religion, disability, sexual orientation, or gender identity? Rest assured, at The Virtual Forge, we are committed to being an inclusive employer that values diversity in our workforce!
At The Virtual Forge, we have a dedicated recruitment team that works globally to fill all our recruitment needs. Therefore, we do not need responses from recruitment companies. Thank you for understanding.
Pay: £43,602.62-£112,893.39 per year
Work Location: Remote