You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Production stack transitmikanmarusan can silently drift from template.yml when AWS Console edits, hot-fixes, or AWS-side automation modify managed resources. A subsequent sam deploy either clobbers those out-of-band changes without warning, or fails mid-deploy after partial application — the WebACL incident behind PR #39 was an instance of this class.
Decision
Add an in-deploy CloudFormation drift detection step to .github/workflows/deploy-production.yml. The step polls aws cloudformation detect-stack-drift for the stack and hard-fails the deploy when StackDriftStatus != IN_SYNC. Rationale (full discussion at #41 (comment)):
Drift only matters at deploy time — the moment CFN reconciles. Inline gate is co-located with the action it protects.
No cron, no auto-opened follow-up Issues, no per-resource exception list. The scheduled-cron + auto-Issue design rejected for being over-engineered for a single-operator personal repo.
The originating WebACL incident is already structurally fixed twice over (template declares WebACLId since fix(infra): preserve CloudFront WebACL attachment in SAM template #39; workflow has a secret-vs-live ARN comparison). This Issue catches the open-ended class of future drifts, not that specific incident.
Scope
One new step Detect CloudFormation drift in the deploy job of deploy-production.yml, between SAM Build and Verify WEB_ACL_ARN_PROD secret matches live distribution.
Polls describe-stack-drift-detection-status with bounded backoff (≤ ~5 min).
On DETECTION_FAILED → ::error:: with reason → exit 1.
On DETECTION_COMPLETE + non-IN_SYNC → ::error:: + drifted resources table + remediation hint → exit 1.
Applies to both dry-run=true and dry-run=false paths.
Add ::add-mask::$WEB_ACL_ARN_PROD as the first step of the deploy job so describe-stack-resource-drifts output cannot leak the ARN through public workflow logs.
Out of scope
Scheduled drift detection workflow.
Auto-remediation.
Per-resource allowlist of acceptable drift (will be added as a follow-up Issue only if the known CloudFront::Distribution tag-drift false positive surfaces in practice — see aws-cloudformation/cloudformation-coverage-roadmap#901).
The existing OIDC role gh-actions-deploy-prod has AWSCloudFormationFullAccess, which covers cloudformation:DetectStackDrift, DescribeStackDriftDetectionStatus, and DescribeStackResourceDrifts. No IAM change required for this Issue.
Pre-merge gate
This PR can only be merged when the production stack transitmikanmarusan is currently IN_SYNC. If it is not, the first production deploy after merge will fail at the new step and all subsequent deploys will be blocked until drift is reconciled. Verify before merge:
Repeat until DetectionStatus=DETECTION_COMPLETE and StackDriftStatus=IN_SYNC.
Acceptance
Deploy to Production with dry-run=true succeeds end-to-end; new step prints Stack is IN_SYNC. and downstream changeset describe runs green.
Deploy to Production with dry-run=false succeeds end-to-end; new step prints Stack is IN_SYNC. and full deploy + frontend sync + invalidation + both health checks pass.
Problem
Production stack
transitmikanmarusancan silently drift fromtemplate.ymlwhen AWS Console edits, hot-fixes, or AWS-side automation modify managed resources. A subsequentsam deployeither clobbers those out-of-band changes without warning, or fails mid-deploy after partial application — the WebACL incident behind PR #39 was an instance of this class.Decision
Add an in-deploy CloudFormation drift detection step to
.github/workflows/deploy-production.yml. The step pollsaws cloudformation detect-stack-driftfor the stack and hard-fails the deploy whenStackDriftStatus != IN_SYNC. Rationale (full discussion at #41 (comment)):WebACLIdsince fix(infra): preserve CloudFront WebACL attachment in SAM template #39; workflow has asecret-vs-liveARN comparison). This Issue catches the open-ended class of future drifts, not that specific incident.Scope
Detect CloudFormation driftin thedeployjob ofdeploy-production.yml, betweenSAM BuildandVerify WEB_ACL_ARN_PROD secret matches live distribution.describe-stack-drift-detection-statuswith bounded backoff (≤ ~5 min).DETECTION_FAILED→::error::with reason →exit 1.DETECTION_COMPLETE+ non-IN_SYNC→::error::+ drifted resources table + remediation hint →exit 1.dry-run=trueanddry-run=falsepaths.::add-mask::$WEB_ACL_ARN_PRODas the first step of thedeployjob sodescribe-stack-resource-driftsoutput cannot leak the ARN through public workflow logs.Out of scope
CloudFront::Distributiontag-drift false positive surfaces in practice — seeaws-cloudformation/cloudformation-coverage-roadmap#901).IAM
The existing OIDC role
gh-actions-deploy-prodhasAWSCloudFormationFullAccess, which coverscloudformation:DetectStackDrift,DescribeStackDriftDetectionStatus, andDescribeStackResourceDrifts. No IAM change required for this Issue.Pre-merge gate
This PR can only be merged when the production stack
transitmikanmarusanis currentlyIN_SYNC. If it is not, the first production deploy after merge will fail at the new step and all subsequent deploys will be blocked until drift is reconciled. Verify before merge:Repeat until
DetectionStatus=DETECTION_COMPLETEandStackDriftStatus=IN_SYNC.Acceptance
Deploy to Productionwithdry-run=truesucceeds end-to-end; new step printsStack is IN_SYNC.and downstream changeset describe runs green.Deploy to Productionwithdry-run=falsesucceeds end-to-end; new step printsStack is IN_SYNC.and full deploy + frontend sync + invalidation + both health checks pass.