Professional Summary

Staff Site Reliability Engineer with 10+ years building large-scale CI/CD platforms and Kubernetes infrastructure. Lead SRE for Datadog's CI infrastructure processing 13M+ builds monthly, saving $7M annually. Expert in platform engineering, multi-region Kubernetes (5,000+ nodes), incident command, and cost optimization. Core Incident Commander leading enterprise-wide outage response.

Technical Skills

Core

Kubernetes, Docker, AWS, GCP, Python, Go, Java, Terraform, GitLab, Jenkins, CI/CD Platforms

Distributed Systems

Apache Kafka, gRPC, Service Mesh (Istio, Envoy), Event-Driven Architecture, Microservices

Observability

Prometheus, Grafana, Datadog, ELK Stack, SLI/SLO/SLA, Incident Management, On-Call Operations

Security

DevSecOps, Vault, PCI, SOC 2, FedRAMP, HIPAA, Policy-as-Code

Certifications

CKAD - Linux Foundation, 2020 | AWS Solutions Architect - Professional | AWS Solutions Architect - Associate | AWS Developer - Associate

Professional Experience

Staff Software Engineer, CI Infrastructure

Datadog

Remote, US

June 2025 - Present

  • Engineer and evolve custom CI platform in Go/Python processing 13M+ builds/month, implementing distributed task scheduling, smart caching layers, and real-time build analytics for 1,000+ engineers across 100+ teams
  • Architect and implement multi-tenancy framework in Go using namespace isolation, resource quotas, and pod security policies—enabling secure build isolation across 100+ teams while maintaining 99.95% platform availability and preventing resource contention
  • Continuously optimize CI infrastructure costs through intelligent workload placement algorithms, spot instance orchestration, and resource right-sizing automation—maintaining $7M annual savings while scaling platform to handle 30% growth in build volume
  • Design and implement caching improvements and build optimization strategies reducing cold start times by 40%, including Docker layer optimization, artifact reuse across pipelines, and predictive cache warming based on historical build patterns
  • Lead technical design reviews for 15+ RFCs covering CI infrastructure improvements, platform evolution, and architectural decisions—providing code-level feedback and prototyping proof-of-concepts to validate feasibility before team implementation
  • Drive consensus across 30+ engineering teams on CI strategies, monorepo architecture, and repository patterns
  • Serve as Core Incident Commander for enterprise-wide severe outages (since 2023), debugging production issues across distributed systems, implementing automated remediation, and leading post-incident improvements

Senior Software Engineer, CI Infrastructure

Datadog

Remote, US

August 2022 - June 2025

  • Built and maintained CI/CD infrastructure processing 13M builds/month, achieving 99.95%+ uptime SLA through automated failover and graceful degradation strategies
  • Reduced annual CI infrastructure costs from $10M to $3M (70% reduction, $7M savings) through intelligent node selection, resource optimization, and automated scaling strategies
  • Designed and built custom enterprise CI system with task engine framework enabling reusable pipelines and smart dependency detection—foundation for company-wide standard
  • Reduced pipeline execution time from 70 minutes to 7-12 minutes (up to 90% reduction) through persistent runner framework, Docker image warm caching, and build impact analysis integration
  • Architected build impact analysis service analyzing code changes to determine affected dependencies—eliminating unnecessary builds and reducing infrastructure waste by 60%+
  • Optimized Gitaly cluster configuration to handle 10,000+ commits/day across 20GB+ monorepo, reducing Git clone times from 15+ minutes to <2 minutes through custom checkout strategies
  • Engineered persistent runner framework with intelligent caching for monorepo-scale Git operations, resolving checkout performance bottlenecks
  • Implemented Vertical Pod Autoscaler (VPA) for automatic resource sizing across CI workloads, eliminating manual tuning overhead and optimizing cluster utilization
  • Became Core Incident Commander in 2023, training Incident Commander team members through incident response playbooks, simulation exercises, and best practices for high-pressure incident management

Staff Site Reliability Engineer

VMware

Remote, TX

November 2020 - August 2022

  • Led VMware's largest SaaS Kubernetes platform: 5,000+ nodes, 100+ clusters, 99.99%+ uptime for mission-critical workloads
  • Built custom Kubernetes operators in Go automating cluster lifecycle—reduced manual operations from 40 hrs/week to <12 hrs/week
  • Deployed global Istio service mesh across multi-cloud with zero-trust networking, mTLS, and circuit breaking for 300+ microservices
  • Maintained PCI, HIPAA, FedRAMP compliance through automated policy enforcement (OPA) and infrastructure-as-code validation

Senior Site Reliability Engineer

Toyota Connected

Plano, TX

October 2019 - November 2020

  • Built enterprise Kubernetes platform on AWS for 80+ teams, improving availability from 99.5% to 99.9% and reducing costs by 40%
  • Deployed global ELK cluster with Kafka processing 3TB/day, achieving 80% cost reduction vs commercial alternatives
  • Created self-service developer platform reducing provisioning time from 3 days to 15 minutes

Site Reliability Engineer Manager

Capital One

Plano, TX

February 2016 - October 2019

  • As technical lead (60% individual contributor, 40% leadership), managed Kubernetes platform architecture spanning AWS and GCP and supporting 500+ microservices
  • Led cloud migration of 100+ microservices from on-premise to AWS/Kubernetes, completing 3 months ahead of schedule
  • Reduced incident MTTR from 2.1 hours to 52 minutes through automated runbooks and enhanced observability

Senior Software Engineer

Pariveda Solutions

Plano, TX

August 2014 - February 2016

  • Architected automated hybrid cloud solutions using AWS, Chef, and Jenkins for enterprise clients
  • Built RESTful API services for cross-platform mobile applications using Java and Spring Framework

Education

University of Texas at Dallas — B.S. Computer Science, Cum Laude, GPA: 3.92

2013

Open Source & Leadership

Kubernetes 1.21 Bug Triage - CNCF Release Team Member (2021) | Core Incident Commander - Enterprise outage response leader (2023-Present) | Technical Portfolio: github.com/desponda