Site Reliability Engineer

NOW HIRING
Location: PA - King of Prussia
Area: Store Management Careers
Category: Information Technology - Quality IT

From Aisle to Algorithm and for All Life’s Moments, at David’s Bridal we empower our customers and our employees to stay true to their dreams and find the one, whether that means the event or wedding dress that matches a personal style or the career that is a perfect fit. Join a company that dominates its category, selling 1 out of every 3 products in it, and takes care of its customers with one of the highest customer service scores in retail!

If you are passionately enthusiastic, endlessly curious, and customer obsessed, say “I do” and apply today!



ROLE SUMMARY

The Site Reliability Engineer (SRE) is accountable for the availability, scalability, and performance of the David's Bridal cloud compute platform across AWS and Azure, as well as the on-premises infrastructure spanning Hyper-V, VMware, Linux systems, and core network services (Active Directory, DNS, DHCP). This role owns SLO definition, incident response, capacity planning, and automation for production workloads supporting our retail and ecommerce platforms during peak wedding and holiday demand. The SRE also ensures the reliability and lifecycle of on-premises compute and identity infrastructure, maintaining seamless integration between datacenter environments and cloud platforms. The SRE partners with Platform Engineering, Security, and Application teams to drive service health, reduce toil, and embed reliability practices across the SDLC.

CORE COMPETENCIES

Cloud Compute (AWS): EC2, ECS, EKS, Lambda, Auto Scaling Groups, ALB / NLB, EBS, CloudWatch, Systems Manager

Cloud Compute (Azure): Virtual Machines, AKS, Azure Functions, VM Scale Sets, App Gateway, Azure Monitor, Update Manager

On-Premises Virtualization: Hyper-V (Windows Server), VMware vSphere / ESXi — host provisioning, VM lifecycle, HA/DR, cluster management

Linux Administration: RHEL, CentOS, Ubuntu — server hardening, patching, performance tuning, shell scripting

Directory & Identity Services: Active Directory (AD DS) — OU design, GPO, group management, user lifecycle, AD Connect / Entra ID sync

DNS & DHCP: Windows Server DNS, DHCP failover, scope management, relay configuration, split-brain DNS, zone administration

Reliability Engineering: SLI / SLO / error budget design, chaos engineering, blameless postmortems, production readiness reviews

Containers and Orchestration: Kubernetes (EKS, AKS), Helm, service mesh, container runtime hardening, image supply chain

Observability: Splunk, Datadog, distributed tracing, synthetic monitoring, Zabbix

Automation and IaC: GitHub Actions, Azure DevOps, Terraform, Bicep, Python, Bash, PowerShell

Networking and Security: VPC, Transit Gateway, Direct Connect, Azure VWAN, ExpressRoute, IAM, Entra ID, Zero Trust, Vault, Meraki

Incident Response: On-call leadership, runbook authoring, executive communication during major incidents

KEY RESPONSIBILITIES

Cloud Compute Reliability | AWS and Azure

  • Own production reliability for compute workloads on AWS (EC2, ECS, EKS, Lambda) and Azure (VMs, AKS, Functions) supporting davidsbridal.com and core retail systems.
  • Define and operate SLIs, SLOs, and error budgets for tier 1 services; partner with product and engineering on reliability targets and release gates.
  • Lead capacity planning, autoscaling design, and rightsizing across multiple AWS regions and Azure subscriptions, with explicit readiness for peak bridal and holiday traffic.
  • Drive cloud compute cost optimization, including Reserved Instance and Savings Plan strategy, Azure Reservations, spot adoption, and workload placement decisions.

On-Premises Infrastructure | Hyper-V, VMware, and Linux

  • Own the operational health of on-premises virtualization platforms including Hyper-V clusters (Windows Server) and VMware vSphere / ESXi environments.
  • Manage VM lifecycle end-to-end — provisioning, rightsizing, snapshotting, replication, and decommission — across Hyper-V and VMware hosts.
  • Maintain Hyper-V and VMware host patching, firmware updates, and capacity management; coordinate with hardware vendors and the datacenter team for physical infrastructure.
  • Administer Linux servers (RHEL, CentOS, Ubuntu) including OS patching, performance monitoring, cron job management, and service hardening.
  • Design and maintain HA and DR configurations for on-premises compute — Hyper-V Replica, VMware vSphere HA/DRS, and backup integration with Rubrik or equivalent.
  • Drive hybrid cloud integration between on-premises infrastructure and cloud platforms, including ExpressRoute / Direct Connect connectivity, Azure Arc-managed on-premises VMs, and workload migration planning.

Active Directory, DNS & DHCP

  • Own the health and operational integrity of Active Directory (AD DS) across all DBI domains — OU structure, Group Policy, site topology, replication health, and AD Connect / Entra ID synchronization.
  • Manage AD user and group lifecycle in coordination with HR and IT operations — provisioning, modification, offboarding, and access governance aligned with identity management policy.
  • Administer Windows Server DNS infrastructure including forward/reverse zones, DNS replication, conditional forwarders, and split-brain DNS for internal and store-facing domains (dbistores.com, dbi.com).
  • Manage DHCP infrastructure across corporate and store environments — scope configuration, failover partnerships (DHCP02/DHCP03), relay/helper IP updates, lease duration management, and exclusion ranges.
  • Investigate and resolve DNS/DHCP incidents including OFFER=0 conditions, scope exhaustion, relay misconfiguration, and cross-site replication failures.
  • Maintain DHCP relay configurations on all routers and switches; coordinate with the network team on helper-address updates during infrastructure changes (office moves, new VLANs, server decommissions).
  • Document and enforce AD change management practices — no production AD changes on Fridays, script exclusions for service accounts and distribution groups, and OU-level protections for store and disabled accounts.

Incident Response and On Call

  • Serve as incident commander for production events across cloud and on-premises environments; lead triage, mitigation, customer impact assessment, and executive communications.
  • Build and maintain runbooks, automated remediation, and self-healing patterns for high-frequency failure modes across AWS, Azure, Hyper-V, VMware, and AD/DNS/DHCP infrastructure.
  • Lead blameless postmortems and drive corrective actions to closure; track recurring themes and feed them into the reliability roadmap.
  • Participate in a 24x7 on-call rotation and continuously reduce on-call burden through automation and toil tracking.

Observability and Monitoring

  • Instrument compute workloads with metrics, logs, traces, and synthetic checks across AWS, Azure, and on-premises using Datadog, Splunk, and Zabbix.
  • Define golden signals and dashboards for tier 1 services; ensure alerts are actionable, owned, and tied to documented SLOs.
  • Extend observability coverage to on-premises infrastructure — Hyper-V host health, VMware cluster utilization, AD replication status, DNS query latency, and DHCP lease utilization.
  • Partner with application teams to embed observability into new services from day one and during production readiness reviews.

Automation, IaC, and Toil Reduction

  • Build and maintain Terraform modules and Bicep templates for AWS and Azure compute, networking, and Kubernetes infrastructure.
  • Automate on-premises operations including Hyper-V VM provisioning via PowerShell/DSC, VMware tasks via PowerCLI, AD bulk operations, DHCP scope management, and DNS zone updates.
  • Operate GitHub Actions and Azure DevOps pipelines for compute platform changes, including policy as code and drift detection.
  • Automate routine work including patching, AMI and image baking, certificate rotation, scaling events, and access provisioning across cloud and on-premises environments.
  • Champion shift-left reliability by integrating reliability and security checks into CI/CD pipelines.

Cross-Functional Leadership

  • Partner with Security, Platform Engineering, and Application teams on production readiness reviews, change management, and architecture decisions.
  • Mentor engineers on SRE principles, observability best practices, and cloud and on-premises compute patterns.
  • Influence architecture toward scalable, resilient, and cost-effective compute patterns across the David's Bridal technology portfolio — cloud, hybrid, and on-premises.

REQUIRED QUALIFICATIONS

  • 7+ years of SRE, DevOps, or production engineering experience operating production workloads on AWS and/or Azure; multi-cloud and hybrid experience strongly preferred.
  • Deep operational expertise with AWS EC2, ECS or EKS, and Lambda, or with Azure VMs, AKS, and Functions, in a regulated or high-traffic ecommerce environment.
  • Hands-on experience administering Hyper-V and/or VMware vSphere environments including clustering, HA, VM lifecycle, and storage integration.
  • Demonstrated experience with Linux server administration (RHEL, CentOS, or Ubuntu) including patching, performance tuning, and shell scripting.
  • Strong Active Directory expertise — OU design, GPO management, user/group lifecycle, replication troubleshooting, and Entra ID / AD Connect synchronization.
  • Practical experience managing Windows Server DNS and DHCP in multi-site enterprise environments, including DHCP failover, relay configuration, and split-brain DNS.
  • Strong scripting skills in Python, Bash, and PowerShell; PowerShell DSC and PowerCLI experience is a plus.
  • Production experience with Terraform or Bicep, and CI/CD with GitHub Actions or Azure DevOps.
  • Hands-on Kubernetes operations experience on EKS, AKS, or self-managed clusters.
  • Demonstrated experience defining and operating against SLOs and error budgets.
  • Strong observability background with Datadog, Splunk, Zabbix, Prometheus, or equivalent platforms.
  • Experience leading incident response and authoring blameless postmortems.

PREFERRED QUALIFICATIONS

  • Retail or ecommerce platform experience, especially during peak events such as Black Friday, Cyber Monday, and bridal season.
  • Experience with Azure Arc for managing on-premises and multi-cloud servers from a unified control plane.
  • Familiarity with Shopify Plus, headless commerce, CDN edge platforms, and Tealium or equivalent CDP.
  • Chaos engineering practice using AWS FIS, Azure Chaos Studio, or Gremlin.
  • FinOps practice and cloud cost optimization at enterprise scale.
  • Experience with cross-cloud networking (Transit Gateway, Azure VWAN, ExpressRoute, Direct Connect) and Meraki SD-WAN.
  • Experience with SIEM platforms (Microsoft Sentinel, Gravwell, Splunk) in a hybrid cloud / on-premises environment.

EDUCATION AND CERTIFICATIONS

  • Bachelor's degree in Computer Science, Information Systems, Engineering, or equivalent professional experience.
  • Preferred: AWS Certified Solutions Architect (Associate or Professional); AWS Certified DevOps Engineer.
  • Preferred: Microsoft Certified: Azure Administrator Associate (AZ-104); Microsoft Certified: Azure Solutions Architect Expert (AZ-305).
  • Preferred: Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD).
  • Preferred: Microsoft Certified: Identity and Access Administrator (SC-300) or equivalent Active Directory / Entra ID certification.
  • Preferred: VMware Certified Professional (VCP-DCV) or Microsoft Hyper-V certification.


Now that we’ve popped the question, please say “I do”.

 

Full Time Opportunity – A comprehensive benefits package is available.

  • Rewarding Environment and Competitive Pay
  • Generous Dream Maker Discount After First Pay Period
  • Referral Incentive Program
  • Dayforce Wallet – Get Paid Early!
  • Health/Dental/Vision Insurance
  • 401K Program
  • Paid Vacation, Wellness Days & Holidays, including your Birthday off!
  • Pet Benefits

Love wins when love is for Everyone!

Our mission at David’s Bridal is to embrace the ideas of Diversity, Equity, and Inclusion. It is our goal to build a workforce that is as representative as the customers we serve. We vow to create a culture where all forms of diversity are celebrated and seen as valuable. 

 

David’s Bridal encourages applications from all qualified candidates. David’s Bridal has a great record of accommodating persons with disabilities. Contact Human Resources at humanresources@dbi.com or 610.943.5048 if you need accommodation at any stage of the application process or want more information on our accommodation policies.

 

Policy: Candidate Use of AI in Live Interviews

We conduct interviews to evaluate each candidate’s own knowledge, judgment, and communication. During any live interview (virtual or in-person), candidates must not use real-time generative AI tools to compose or feed their answers. Candidates may use assistive technologies (e.g., screen readers, live captions) and may request reasonable accommodation in advance.

 

Disclaimer: The preceding job description has been designed to highlight the general nature and level of work performed by employees within this classification.  It is not designed to contain or be interpreted as a comprehensive description of all duties, responsibilities and qualifications required of employees assigned to this job.  Actual duties and responsibilities will vary. The standard base pay range for this role is posted at a minimum and maximum rate.

 

The starting rate of pay offered will vary based on factors including, but not limited to, position offered, location, training, and/or experience, and internal equity. This base pay range is specific to the state this role is posted in and may not be applicable to other locations. At David’s Bridal, it is rare for an individual to be hired at the high end of the range in their role, and compensation decisions are dependent upon the details and circumstances of each position and candidate.