DevOps Engineer Role (Multi-Cloud: GCP primary, Azure/AWS nice to have)
You will shape and evolve a DevOps toolchain that enables reliable product delivery across a mixed estate: GCP (VMs and Cloud Run), on-prem, and multi-cloud patterns. You will work closely with delivery teams to design repeatable, secure, scalable deployment strategies, improve operational performance, and reduce manual effort through automation and infrastructure as code.
You will also support the deployment and operation of AI-enabled platforms, including LLM-based services, RAG applications, vector database integrations, and AI observability tooling to monitor performance, reliability, latency, cost, and quality across production AI workloads.
About AWTG
AWTG is a global technology partner delivering secure, scalable, mission-critical SaaS platforms that help organisations innovate with confidence. Established in 2006 and headquartered in London, AWTG brings deep experience across telecoms, smart cities, Industry 4.0, cloud, data, AI, and digital governance, supporting public and private sector clients worldwide.
Quality, security, and operational excellence are central to how we work. Our services align with recognised international standards and best practice, supported by certifications including ISO/IEC 27001, ISO/IEC 20000-1, ISO/IEC 42001, ISO 9001, and Cyber Essentials Plus, with independent CREST-accredited penetration testing.
We operate as a full lifecycle partner, covering advisory, architecture and design, rollout and integration, and long-term support and operations, with proven delivery at enterprise scale. Our platforms are engineered for performance, automation, and insight, incorporating AI-powered analytics, LLM-enabled services, RAG-based knowledge systems, and multi-cloud architectures, supported by robust programme governance and auditable controls.
What you will do
- Design and implement a coherent DevOps toolchain that enables safe, repeatable delivery across GCP (VMs/Cloud Run), on-prem, and other cloud platforms (Azure/AWS).
- Build and improve CI/CD pipelines, such as GitHub Actions or GitLab CI, focusing on deployment repeatability, speed, and risk reduction.
- Establish and maintain infrastructure as code, including Terraform and related patterns, reducing manual configuration and improving consistency.
- Support the deployment and operation of LLM-based applications, including APIs, inference services, orchestration layers, prompt/configuration management, and supporting cloud services.
- Deploy and support RAG-based systems, including document ingestion pipelines, embedding workflows, vector databases, retrieval services, and integration with LLM application layers.
- Implement AI deployment observability practices, monitoring key indicators such as model/API latency, token usage, cost, error rates, retrieval performance, hallucination risk signals, user feedback, and end-to-end request tracing.
- Improve reliability and availability through proactive capacity planning, performance tuning, and resilience patterns, such as rollback strategies and blue/green or canary deployments, where appropriate.
- Strengthen security posture by embedding security controls into designs and pipelines, including IAM, secrets, least privilege, supply-chain controls, and auditability.
- Lead incident investigation and fault resolution; improve operational maturity through runbooks, post-incident reviews, and preventative actions.
- Partner with engineering, AI/ML, and product teams to plan and design large groups of user stories, translating requirements into delivery and operational work.
- Drive development process optimisation with teams, identifying improvement opportunities and helping implement pragmatic changes.
- Implement and evolve observability practices, including metrics, logs, and traces, using Prometheus/Grafana and cloud-native equivalents to reduce MTTR and improve SLO performance.
- Support systems design and integration across services, coordinating integration builds and supporting integration testing activities.
- Develop and maintain scripts/tools of medium-to-high complexity to automate build, release, environment management, AI deployment workflows, and operational tasks.
- Mentor and coach junior engineers through pairing, reviews, standards, and knowledge sharing, without line management responsibility.
What you will bring – must-have
- Hands-on DevOps experience delivering secure, reliable services in production environments.
- Strong GCP experience, including VM and serverless compute, IAM, networking, monitoring, and operational tooling, with the ability to design for scale and availability.
- Experience deploying or supporting AI/LLM-enabled applications in production or pre-production environments.
- Understanding of RAG deployment patterns, including document ingestion, embeddings, vector databases, retrieval APIs, and integration with LLM services.
- Experience implementing observability for AI-enabled systems, including logs, metrics, traces, latency monitoring, error monitoring, usage tracking, and cost visibility.
- Proven CI/CD capability, using GitHub Actions and/or GitLab CI, including secure pipeline design and automated release strategies.
- Strong Infrastructure as Code experience, especially Terraform, plus experience migrating away from manual/console-heavy estates.
- Solid Linux and networking fundamentals, including hybrid connectivity considerations between cloud and on-prem.
- A practical information security engineering mindset: least-privilege IAM, secrets management, secure build/release, and audit-ready change controls.
- Observability experience using Prometheus/Grafana and/or cloud-native monitoring to drive actionable operational insight.
- Strong troubleshooting and service support ability: diagnosing incidents, fixing faults, and improving stability through prevention and automation.
- Ability to design and review systems with medium risk, impact, and complexity, selecting appropriate standards, methods, and tools.
- Strong scripting/programming ability, such as Python, Go, or Bash, with disciplined testing and documentation.
- Collaborative ways of working: able to translate requirements into delivery plans, work across teams, and represent user needs in technical decisions.
Nice to have
- Experience operating or deploying workloads across container and orchestration platforms, such as Kubernetes, as well as serverless and VM estates.
- Experience with LLMOps, MLOps, or AI platform operations, including model gateway patterns, evaluation pipelines, prompt/config deployment, and AI service monitoring.
- Experience with vector databases or retrieval platforms such as Pinecone, Redis, Vertex AI Search, Azure AI Search, or similar.
- Experience supporting AI platforms or services such as Vertex AI, Azure AI Foundry, OpenAI/Azure OpenAI, or comparable LLM providers.
- Experience with policy-as-code and security tooling, such as OPA, SAST/DAST, container scanning, and dependency scanning.
- Experience with release strategies such as canary, blue/green, and other progressive delivery patterns.
- Cost optimisation experience across cloud platforms, including a FinOps mindset, budgeting/forecasting, right-sizing, workload efficiency, and AI/LLM usage cost control.
- Experience building internal developer platforms, golden paths, templates, or paved-road approaches.
- Familiarity with service management practices, including runbooks, SLAs/SLOs, and incident/problem/change processes.
- Experience coordinating cross-system integration builds and supporting integration testing at scale.
- Experience designing for regulated environments or formal assurance frameworks.
- Experience working with data pipelines, search/retrieval systems, or knowledge-base platforms used by AI applications.