Daniel Esponda - Staff Site Reliability Engineer

Professional Summary

Staff Site Reliability Engineer with 10+ years building large-scale CI/CD platforms and Kubernetes infrastructure. Lead SRE for Datadog's CI infrastructure processing 13M+ builds monthly, saving $7M annually. Expert in platform engineering, multi-region Kubernetes (5,000+ nodes), incident command, and cost optimization. Core Incident Commander leading enterprise-wide outage response.

Technical Skills

Core

Kubernetes, Docker, AWS, GCP, Python, Go, Java, Terraform, GitLab, Jenkins, CI/CD Platforms

Distributed Systems

Apache Kafka, gRPC, Service Mesh (Istio, Envoy), Event-Driven Architecture, Microservices

Observability

Prometheus, Grafana, Datadog, ELK Stack, SLI/SLO/SLA, Incident Management, On-Call Operations

Security

DevSecOps, Vault, PCI, SOC 2, FedRAMP, HIPAA, Policy-as-Code

Certifications

CKAD - Linux Foundation, 2020 | AWS Solutions Architect - Professional | AWS Solutions Architect - Associate | AWS Developer - Associate

Professional Experience

Staff Software Engineer, CI Infrastructure

Datadog

Remote, US

June 2025 - Present

Engineer and evolve custom CI platform in Go/Python processing 13M+ builds/month—implementing advanced features including distributed task scheduling, smart caching layers, and real-time build analytics serving 1,000+ engineers across 100+ teams
Architect and implement multi-tenancy framework in Go using namespace isolation, resource quotas, and pod security policies—enabling secure build isolation across 100+ teams while maintaining 99.95% platform availability and preventing resource contention
Continuously optimize CI infrastructure costs through intelligent workload placement algorithms, spot instance orchestration, and resource right-sizing automation—maintaining $7M annual savings while scaling platform to handle 30% growth in build volume
Design and implement caching improvements and build optimization strategies reducing cold start times by 40%, including Docker layer optimization, artifact reuse across pipelines, and predictive cache warming based on historical build patterns
Lead technical design reviews for 15+ RFCs covering CI infrastructure improvements, platform evolution, and architectural decisions—providing code-level feedback and prototyping proof-of-concepts to validate feasibility before team implementation
Drive consensus across 30+ engineering teams on CI strategies, monorepo architecture, and repository patterns
Serve as Core Incident Commander for enterprise-wide severe outages (since 2023), debugging production issues across distributed systems, implementing automated remediation, and leading post-incident improvements

Senior Software Engineer, CI Infrastructure

Datadog

Remote, US

August 2022 - June 2025

Built and maintained CI/CD infrastructure processing 13M builds/month, achieving 99.95%+ uptime SLA through automated failover and graceful degradation strategies
Reduced annual CI infrastructure costs from $10M to $3M (70% reduction, $7M savings) through intelligent node selection, resource optimization, and automated scaling strategies
Designed and built custom enterprise CI system with task engine framework enabling reusable pipelines and smart dependency detection—foundation for company-wide standard
Reduced pipeline execution time from 70 minutes to 7-12 minutes (up to a 90% reduction) through persistent runner framework, Docker image warm caching, and build impact analysis integration
Architected build impact analysis service analyzing code changes to determine affected dependencies—eliminating unnecessary builds and reducing infrastructure waste by 60%+
Optimized Gitaly cluster configuration to handle 10,000+ commits/day across 20GB+ monorepo, reducing Git clone times from 15+ minutes to <2 minutes through custom checkout strategies
Engineered persistent runner framework with intelligent caching for extreme-scale Git operations, solving checkout performance challenges for massive monorepo
Implemented Vertical Pod Autoscaler (VPA) for automatic resource sizing across CI workloads, eliminating manual tuning overhead and optimizing cluster utilization
Became Core Incident Commander in 2023; trained IC team members through incident response playbooks, simulation exercises, and best practices for high-pressure incident management

Staff Site Reliability Engineer

VMware

Remote, TX

November 2020 - August 2022

Led VMware's largest SaaS Kubernetes platform: 5,000+ nodes, 100+ clusters, 99.99%+ uptime for mission-critical workloads
Built custom Kubernetes operators in Go automating cluster lifecycle—reduced manual operations from 40 hrs/week to <12 hrs/week
Deployed global Istio service mesh across multi-cloud with zero-trust networking, mTLS, and circuit breaking for 300+ microservices
Maintained PCI, HIPAA, FedRAMP compliance through automated policy enforcement (OPA) and infrastructure-as-code validation

Senior Site Reliability Engineer

Toyota Connected

Plano, TX

October 2019 - November 2020

Built enterprise Kubernetes platform on AWS for 80+ teams, improving availability from 99.5% to 99.9% and reducing costs by 40%
Deployed global ELK cluster with Kafka processing 3TB/day, achieving 80% cost reduction vs commercial alternatives
Created self-service developer platform reducing provisioning time from 3 days to 15 minutes

Site Reliability Engineer Manager

Capital One

Plano, TX

February 2016 - October 2019

Managed Kubernetes platform architecture across AWS and GCP as technical lead (60% IC, 40% leadership), supporting 500+ microservices
Led cloud migration of 100+ microservices from on-premise to AWS/Kubernetes, completing 3 months ahead of schedule
Reduced incident MTTR from 2.1 hours to 52 minutes through automated runbooks and enhanced observability

Senior Software Engineer

Pariveda Solutions

Plano, TX

August 2014 - February 2016

Architected automated hybrid cloud solutions using AWS, Chef, and Jenkins for enterprise clients
Built RESTful API services for cross-platform mobile applications using Java and Spring Framework

Education

University of Texas at Dallas — B.S. Computer Science, Cum Laude, GPA: 3.92

2013

Open Source & Leadership

Kubernetes 1.21 Bug Triage - CNCF Release Team Member (2021) | Core Incident Commander - Enterprise outage response leader (2023-Present) | Technical Portfolio: github.com/desponda