// Senior Staff Software Engineer

Rami AlGhanmi.

AI infrastructure · Distributed systems · Crusoe

I design and operate the cloud-native orchestration platforms that hide the infrastructure complexity underneath GPU-intensive AI workloads. Two decades across platform engineering, distributed systems, and DevOps, solving the problems no one else wants to own. If I do my job right, you have no idea I exist.

01 What I work on — focus areas
01 · ai/hpc

AI & HPC infrastructure

Managed services for GPU-intensive training and inference — NVIDIA and AMD accelerators on Kubernetes, Slurm scheduling, and heterogeneous compute at scale.

02 · distributed systems

Distributed systems at scale

Architecting the data, compute, and control planes that run production — and keeping them running when they're under load.

03 · operations

Operating at production scale

On-call, incidents, capacity, observability — the operational discipline that turns a working system into a reliable one.

04 · devops

DevOps & GitOps

Self-service CI/CD, infrastructure-as-code, container delivery, and the deployment automation behind two decades of production systems.

02 Selected work — last decade

Crusoe

AI & HPC managed services

Building managed infrastructure for GPU-intensive AI training and inference workloads — Kubernetes orchestration across NVIDIA and AMD accelerators, Slurm scheduling, and the operational tooling around them.

k8s · slurm · nvidia · amd · gpu-scheduling

Workday

Operational DataLake migration to AWS

Two-phase migration architecture — DataSync for transfer, EMR Spark for transformation — that decoupled copy from logic, allowing thorough validation and reuse downstream. Delivered without disrupting production workloads.

aws-datasync · emr-spark · hadoop · aws · k8s · devops

Workday

Unified observability across the fleet

Architected and operated the EKS-based telemetry platform that replaced a sprawl of per-system tooling — single pane of glass across every environment, lower capex, and the operational signal engineers actually trusted.

eks · telemetry · multi-cloud · aws · k8s · devops

Symantec

First cloud-native product to production

Led the production deployment of Symantec Endpoint Protection Cloud — the company's first SaaS product — and built the self-service CI/CD pipeline behind it from scratch.

ci/cd · saas · public-cloud · aws · openstack · k8s · devops
03 Trajectory — twenty years, five chapters
Years | Where | What
2025 – now | Crusoe (current) | AI & HPC infrastructure. Managed services for GPU workloads: NVIDIA and AMD accelerators on Kubernetes at scale.
2019 – 2024 | Workday | Distributed infrastructure, DevOps tooling, and fleet-wide observability. DataLake migration to AWS. Kubernetes platform for public-cloud delivery with zero-downtime deploys.
2014 – 2019 | Symantec | Cloud security. First SaaS product to production. Established in-house DevOps practice: self-service CI/CD, IaC, and microservice containerization with Docker & Kubernetes.
2008 – 2014 | USC · NASA JPL | MS & PhD coursework. Earth-science data systems at JPL. Built a git-based assignment-delivery and grading pipeline as TA; early DevOps instincts.
2004 – 2008 | KFUPM | BS, Computer Engineering. Hardware-software fundamentals.
04 About — in brief

I've spent two decades building infrastructure that has to work — Kubernetes platforms, data pipelines, observability systems, and the operational discipline behind all of them. Currently at Crusoe, building managed AI/HPC infrastructure for GPU workloads. Previously at Workday (distributed systems and DevOps) and Symantec (cloud security).

I care about systems that are operable, not just designed. The interesting work is in the failure modes, the migrations, the incidents — the parts that don't make it into architecture diagrams. Patient with detail, allergic to drama, comfortable on the bridge when production is on fire.

Stack

kubernetes · slurm · aws · terraform · argocd · ansible · linux · go · python · helm · prometheus · spark · hadoop

// elsewhere

Connect.

Open to conversations about AI infrastructure, platform engineering, and hard problems at scale.