Skip to content

chore(infra): add in-deploy CloudFormation drift detection #41

@mikanmarusan

Description

@mikanmarusan

Problem

Production stack transitmikanmarusan can silently drift from template.yml when AWS Console edits, hot-fixes, or AWS-side automation modify managed resources. A subsequent sam deploy either clobbers those out-of-band changes without warning, or fails mid-deploy after partial application — the WebACL incident behind PR #39 was an instance of this class.

Decision

Add an in-deploy CloudFormation drift detection step to .github/workflows/deploy-production.yml. The step polls aws cloudformation detect-stack-drift for the stack and hard-fails the deploy when StackDriftStatus != IN_SYNC. Rationale (full discussion at #41 (comment)):

  • Drift only matters at deploy time — the moment CFN reconciles. Inline gate is co-located with the action it protects.
  • No cron, no auto-opened follow-up Issues, no per-resource exception list. The scheduled-cron + auto-Issue design rejected for being over-engineered for a single-operator personal repo.
  • The originating WebACL incident is already structurally fixed twice over (template declares WebACLId since fix(infra): preserve CloudFront WebACL attachment in SAM template #39; workflow has a secret-vs-live ARN comparison). This Issue catches the open-ended class of future drifts, not that specific incident.

Scope

  • One new step Detect CloudFormation drift in the deploy job of deploy-production.yml, between SAM Build and Verify WEB_ACL_ARN_PROD secret matches live distribution.
  • Polls describe-stack-drift-detection-status with bounded backoff (≤ ~5 min).
  • On DETECTION_FAILED::error:: with reason → exit 1.
  • On DETECTION_COMPLETE + non-IN_SYNC::error:: + drifted resources table + remediation hint → exit 1.
  • Applies to both dry-run=true and dry-run=false paths.
  • Add ::add-mask::$WEB_ACL_ARN_PROD as the first step of the deploy job so describe-stack-resource-drifts output cannot leak the ARN through public workflow logs.

Out of scope

  • Scheduled drift detection workflow.
  • Auto-remediation.
  • Per-resource allowlist of acceptable drift (will be added as a follow-up Issue only if the known CloudFront::Distribution tag-drift false positive surfaces in practice — see aws-cloudformation/cloudformation-coverage-roadmap#901).
  • IAM least-privilege scoping of the deploy role (separate Issue chore(infra): scope down gh-actions-deploy-prod role to least privilege #52).

IAM

The existing OIDC role gh-actions-deploy-prod has AWSCloudFormationFullAccess, which covers cloudformation:DetectStackDrift, DescribeStackDriftDetectionStatus, and DescribeStackResourceDrifts. No IAM change required for this Issue.

Pre-merge gate

This PR can only be merged when the production stack transitmikanmarusan is currently IN_SYNC. If it is not, the first production deploy after merge will fail at the new step and all subsequent deploys will be blocked until drift is reconciled. Verify before merge:

DETECTION_ID=$(aws cloudformation detect-stack-drift --stack-name transitmikanmarusan --query 'StackDriftDetectionId' --output text)
aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id "$DETECTION_ID"

Repeat until DetectionStatus=DETECTION_COMPLETE and StackDriftStatus=IN_SYNC.

Acceptance

  • Deploy to Production with dry-run=true succeeds end-to-end; new step prints Stack is IN_SYNC. and downstream changeset describe runs green.
  • Deploy to Production with dry-run=false succeeds end-to-end; new step prints Stack is IN_SYNC. and full deploy + frontend sync + invalidation + both health checks pass.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions