
P0: Reconcile retained DataOps intake table blocking deploy #60

Description

@alexeygrigorev

P0: Unblock DataOps V1 deploy role EventBridge permissions

Status: pending
Tags: bug, infra, work-engine, backend, testing, P0, human
Depends on: None
Blocks: successful deployment of dataops-v1 by the Deploy DataOps V1 Lambda workflow

Scope

Fix the current DataOps V1 deployment blocker: GitHub Actions OIDC successfully assumes arn:aws:iam::817685572750:role/dataops-github-actions-deploy and all repo/app checks pass, but CloudFormation then rolls back while creating the work-engine daily EventBridge schedule.

Observed latest failure:

  • Workflow: Deploy DataOps V1 Lambda
  • Run: https://github.com/DataTalksClub/dataops/actions/runs/28319977186
  • Stack: dataops-v1
  • Region/account: eu-west-1 / 817685572750
  • Failing logical resource: WorkEngineFunctionDailyWorkEngineCron
  • Failing AWS resource type: AWS::Events::Rule
  • Assumed role: arn:aws:sts::817685572750:assumed-role/dataops-github-actions-deploy/dataops-v1-deploy
  • Missing permission: events:DescribeRule
  • Example resource ARN: arn:aws:events:eu-west-1:817685572750:rule/dataops-v1-WorkEngineFunctionDailyWorkEngineCron-*

This issue has two distinct work streams:

  1. Repo-side verification/fix, owned by an agent: verify that the deploy-role template in this repo grants only the required EventBridge permissions on the DataOps stack rule namespace. If the repo template has drift or is incomplete, make the smallest repo change needed.
  2. Live AWS stack update, owned by a credentialed human/On-Call gate: apply the deploy-role stack/policy update to AWS and rerun the deployment. This requires real AWS permissions and must not be attempted with ad hoc local credentials by a normal implementation agent.

The repo-side source of truth is currently lambda-functions/template.github-actions-dataops.yaml. At grooming time, it already contains Sid: EventBridgeDataOpsRules with events:DescribeRule, events:PutRule, events:PutTargets, events:RemoveTargets, events:DeleteRule, tagging/list-target actions, and resource scope arn:${AWS::Partition}:events:${FullDocsRegion}:${AWS::AccountId}:rule/${FullDocsStackName}-*. The likely remaining blocker is that the live dataops-v1-github-actions deploy-role stack has not been updated from that template.

Preserve least privilege throughout:

  • Do not broaden GitHub OIDC trust beyond repo:DataTalksClub/dataops:ref:refs/heads/main.
  • Do not add administrator, wildcard-account, wildcard-region, or all-EventBridge permissions.
  • Keep EventBridge scope limited to ${FullDocsRegion}, the current account, and rule/${FullDocsStackName}-* unless a narrower CloudFormation-compatible ARN pattern is proven to work.
  • Keep Lambda, DynamoDB, Logs, Secrets Manager, S3, IAM, and CloudFormation permissions within their existing DataOps stack namespaces.
  • Do not create unmanaged EventBridge rules, IAM policies, roles, or one-off AWS resources outside CloudFormation.
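
The rule-namespace constraint in the list above can be sanity-checked locally before touching AWS. The following is a minimal sketch, assuming the resolved pattern arn:aws:events:eu-west-1:817685572750:rule/dataops-v1-* from this issue; the helper name arn_in_scope is hypothetical and is not part of the repo. Shell glob matching only approximates IAM wildcard matching, so treat this as a quick check and not as a policy evaluation.

```shell
# Hypothetical helper (not in the repo): succeeds if the given EventBridge rule
# ARN falls inside the DataOps V1 rule namespace named in this issue.
# Assumption: shell globbing is close enough to IAM's "*" wildcard for this check.
arn_in_scope() {
  case "$1" in
    arn:aws:events:eu-west-1:817685572750:rule/dataops-v1-*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, arn_in_scope "arn:aws:events:eu-west-1:817685572750:rule/dataops-v1-WorkEngineFunctionDailyWorkEngineCron-test" succeeds, while the same rule name in another region or account fails.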

Affected Files and Stacks

Repo-side files to inspect or minimally change:

  • lambda-functions/template.github-actions-dataops.yaml
  • .github/workflows/deploy-dataops-v1.yml
  • lambda-functions/template.full.yaml
  • lambda-functions/samconfig.toml if stack names or deploy config need confirmation

Live AWS resources/stacks, human-gated:

  • Deploy-role stack: dataops-v1-github-actions in eu-west-1
  • Deploy role: arn:aws:iam::817685572750:role/dataops-github-actions-deploy
  • App stack: dataops-v1 in eu-west-1
  • EventBridge rule namespace: arn:aws:events:eu-west-1:817685572750:rule/dataops-v1-*
  • Work-engine scheduled rule from SAM: WorkEngineFunctionDailyWorkEngineCron

Acceptance Criteria

  • Repo-side deploy-role template contains an EventBridge statement that allows CloudFormation/SAM to manage the dataops-v1-* EventBridge rules required by WorkEngineFunctionDailyWorkEngineCron, including at minimum events:DescribeRule, events:PutRule, events:PutTargets, events:RemoveTargets, and events:DeleteRule.
  • Repo-side IAM scope remains least-privilege: EventBridge resources are limited to the DataOps stack rule namespace in eu-west-1; no admin policy, events:*, all-resource EventBridge write scope, or broadened GitHub OIDC trust is introduced.
  • Repo-side changes, if any, are limited to deployment/IAM configuration and related tests/docs for this blocker. No changes are made to ../dtc-operations, ../datatasks, or ../podcast-assistant.
  • The deploy workflow still assumes only arn:aws:iam::817685572750:role/dataops-github-actions-deploy through GitHub Actions OIDC with id-token: write only on the deploy job.
  • lambda-functions/template.full.yaml still declares the work-engine schedule through SAM/CloudFormation rather than creating the EventBridge rule at runtime.
  • Repo-side verification commands pass, including SAM/CloudFormation validation and focused checks that the GitHub Actions deploy-role template contains the expected scoped EventBridge statement.
  • [HUMAN] A credentialed AWS operator deploys or updates dataops-v1-github-actions from lambda-functions/template.github-actions-dataops.yaml using CloudFormation, preserving the parameter values for owner/repo/branch, deploy role name, stack name, region, and SAM artifact bucket.
  • [HUMAN] After the live deploy-role stack update, AWS policy inspection or simulation confirms dataops-github-actions-deploy allows the required EventBridge actions on arn:aws:events:eu-west-1:817685572750:rule/dataops-v1-* and does not allow broader EventBridge write scope.
  • [HUMAN] The Deploy DataOps V1 Lambda workflow is rerun from main; the Deploy DataOps v1 stack step reaches stack update success instead of UPDATE_ROLLBACK_COMPLETE on WorkEngineFunctionDailyWorkEngineCron.
  • [HUMAN] The deployed smoke tests in the workflow pass, including Smoke test deployed docs full app and Smoke test private work engine Lambda.

Test Scenarios

Scenario: Repo template is already correct

Given: lambda-functions/template.github-actions-dataops.yaml contains the scoped EventBridgeDataOpsRules statement for ${FullDocsStackName}-*.
When: the agent validates the template and compares it with the failing deploy workflow surface.
Then: no unnecessary repo code change is made; the issue is handed to the credentialed AWS gate to apply the deploy-role stack update.

Scenario: Repo template drift is found

Given: the repo deploy-role template is missing a required EventBridge action or has a resource pattern that does not cover SAM-generated dataops-v1-WorkEngineFunctionDailyWorkEngineCron-* rules.
When: the Software Engineer updates the template.
Then: the change grants only the missing least-privilege EventBridge permissions and does not broaden OIDC trust or unrelated AWS service permissions.

Scenario: Live deploy-role stack is stale

Given: the repo template is correct but the live dataops-v1-github-actions stack policy does not include the EventBridge statement.
When: a credentialed AWS operator deploys lambda-functions/template.github-actions-dataops.yaml to dataops-v1-github-actions.
Then: the deploy role can describe/create/update/remove DataOps V1 EventBridge rules and the application stack can create WorkEngineFunctionDailyWorkEngineCron through CloudFormation.

Scenario: Main deployment is rerun

Given: the live deploy role has the scoped EventBridge permissions.
When: Deploy DataOps V1 Lambda is rerun on main.
Then: checks pass, sam deploy --config-env full-sandbox updates dataops-v1 successfully, deployed docs full app smoke passes, and private work-engine Lambda health smoke returns {"status":"ok"}.

Verification Commands

Agent-verifiable repo checks:

rg -n 'EventBridgeDataOpsRules|events:DescribeRule|events:PutRule|events:PutTargets|rule/\$\{FullDocsStackName\}-\*' lambda-functions/template.github-actions-dataops.yaml
rg -n "id-token: write|AWS_ROLE_ARN|configure-aws-credentials|sam deploy" .github/workflows/deploy-dataops-v1.yml
rg -n "DailyWorkEngineCron|Type: Schedule|Schedule: cron" lambda-functions/template.full.yaml
(cd lambda-functions && sam validate --template-file template.full.yaml)

Recommended full repo workflow before any repo-side commit if a template/workflow change is made:

uv run --project lambda-functions --extra search --with pytest python -m pytest tests/docs_app
npm --prefix work-engine test
npm --prefix work-engine run typecheck
npm --prefix work-engine run build
(cd lambda-functions && sam build --config-env full-sandbox)

Human/AWS-gated inspection before live update:

aws iam get-role-policy \
  --role-name dataops-github-actions-deploy \
  --policy-name dataops-v1-deploy \
  --region eu-west-1

aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::817685572750:role/dataops-github-actions-deploy \
  --action-names events:DescribeRule events:PutRule events:PutTargets events:RemoveTargets events:DeleteRule \
  --resource-arns arn:aws:events:eu-west-1:817685572750:rule/dataops-v1-WorkEngineFunctionDailyWorkEngineCron-test \
  --region eu-west-1
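
The simulation output can be reduced to a pass/fail result instead of being read by eye. This is a minimal sketch, assuming the simulation is run with --query 'EvaluationResults[].[EvalActionName,EvalDecision]' --output text so that each result is one tab-separated line; the helper names all_allowed and none_allowed are hypothetical. The second helper is meant for a rule ARN outside dataops-v1-*, where no action should be allowed.

```shell
# Hypothetical helpers (not in the repo). Both read lines of the form
#   <action><TAB><decision>
# on stdin, as produced by:
#   aws iam simulate-principal-policy ... \
#     --query 'EvaluationResults[].[EvalActionName,EvalDecision]' --output text

# Succeeds only if at least one result was read and every decision is "allowed".
all_allowed() {
  seen=0
  while IFS="$(printf '\t')" read -r action decision; do
    seen=1
    if [ "$decision" != "allowed" ]; then
      echo "not allowed: $action ($decision)" >&2
      return 1
    fi
  done
  [ "$seen" -eq 1 ]
}

# Succeeds only if no decision is "allowed" (use with an out-of-scope rule ARN).
none_allowed() {
  while IFS="$(printf '\t')" read -r action decision; do
    if [ "$decision" = "allowed" ]; then
      echo "unexpectedly allowed: $action" >&2
      return 1
    fi
  done
  return 0
}
```

Piping the in-scope simulation into all_allowed and an out-of-scope simulation into none_allowed covers both halves of the least-privilege check: the required actions work on the DataOps rule namespace, and they do not work elsewhere.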

Human/AWS-gated deploy-role stack update:

aws cloudformation deploy \
  --stack-name dataops-v1-github-actions \
  --template-file lambda-functions/template.github-actions-dataops.yaml \
  --region eu-west-1 \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    GitHubOwner=DataTalksClub \
    GitHubRepo=dataops \
    GitHubBranch=main \
    DeployRoleName=dataops-github-actions-deploy \
    FullDocsStackName=dataops-v1 \
    FullDocsRegion=eu-west-1 \
    SamArtifactBucketName=aws-sam-cli-managed-default-samclisourcebucket-dgwncwijmnpd

Human/GitHub-gated deploy rerun after the role stack update:

gh workflow run deploy-dataops-v1.yml --repo DataTalksClub/dataops --ref main
gh run list --repo DataTalksClub/dataops --workflow deploy-dataops-v1.yml --limit 1
gh run watch --repo DataTalksClub/dataops <run-id>

Human Gates

  • [HUMAN] AWS credentials are required to inspect the live IAM inline policy, run IAM simulation, and deploy/update dataops-v1-github-actions.
  • [HUMAN] The operator must review the CloudFormation change set before executing it and confirm that it contains no changes beyond the GitHubActionsDeployRole inline policy update.
  • [HUMAN] Rerunning the deployment is a real GitHub Actions/AWS write path and must be monitored by On-Call after it starts.
  • [HUMAN] If the deploy-role stack update would replace or delete dataops-github-actions-deploy, stop and escalate instead of proceeding.
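
The change set review in the gates above can be partly mechanized. This is a minimal sketch, assuming the operator first creates the change set without executing it (aws cloudformation deploy accepts --no-execute-changeset) and then lists its changes with aws cloudformation describe-change-set using --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType,Replacement]' --output text. The helper name changeset_is_safe is hypothetical, and the check is deliberately strict: anything other than an in-place Modify of the expected logical resource is treated as a reason to stop.

```shell
# Hypothetical helper (not in the repo). Reads tab-separated lines of the form
#   <Action><TAB><LogicalResourceId><TAB><ResourceType><TAB><Replacement>
# on stdin and succeeds only if every change is an in-place Modify of the
# expected logical resource. Add, Remove, Replacement=True, and
# Replacement=Conditional all fail, so the operator stops and escalates.
# An empty change list (nothing to deploy) is treated as safe.
changeset_is_safe() {
  expected="$1"
  while IFS="$(printf '\t')" read -r action logical_id resource_type replacement; do
    if [ "$action" != "Modify" ] ||
       [ "$logical_id" != "$expected" ] ||
       [ "$replacement" != "False" ]; then
      echo "unexpected change: $action $logical_id ($resource_type, replacement=$replacement)" >&2
      return 1
    fi
  done
  return 0
}
```

A typical call would pipe the describe-change-set output into changeset_is_safe GitHubActionsDeployRole. The script does not replace reading the change set; it only catches the obvious replace/delete cases early.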

Dependencies

  • Access to GitHub Actions in DataTalksClub/dataops for workflow rerun and monitoring.
  • Credentialed AWS access to account 817685572750 in eu-west-1 with permission to inspect/update CloudFormation stack dataops-v1-github-actions, inspect IAM role dataops-github-actions-deploy, and rerun/observe dataops-v1 deployment behavior.
  • Current live stack should be in a stable state before rerun. The latest observed failed run rolled back to UPDATE_ROLLBACK_COMPLETE.
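
The stable-state precondition can be checked before the rerun. This is a minimal sketch, assuming the status string comes from aws cloudformation describe-stacks --stack-name dataops-v1 --region eu-west-1 --query 'Stacks[0].StackStatus' --output text; the helper name stack_ready_for_update is hypothetical. UPDATE_ROLLBACK_COMPLETE, the state left by the latest failed run, counts as ready because CloudFormation accepts updates from it.

```shell
# Hypothetical helper (not in the repo): succeeds if the given CloudFormation
# stack status is a settled state from which a stack update can start.
# In-progress and failed states, and ROLLBACK_COMPLETE (a failed create that
# can only be deleted), are rejected.
stack_ready_for_update() {
  case "$1" in
    CREATE_COMPLETE|UPDATE_COMPLETE|UPDATE_ROLLBACK_COMPLETE|IMPORT_COMPLETE|IMPORT_ROLLBACK_COMPLETE)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

If the check fails, the rerun should wait until the stack settles or until On-Call has resolved the failed state.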

Out of Scope

  • Changing the work-engine daily cron behavior, schedule expression, task-generation logic, or portal UI.
  • Moving scheduled execution out of SAM/CloudFormation or creating EventBridge rules manually.
  • Broadening deployment permissions for unrelated stacks, branches, repositories, AWS accounts, or regions.
  • Modifying source repositories outside dataops.
  • Rotating production secrets, changing Basic Auth, changing GitHub tokens, or editing runtime data in DynamoDB.
  • Closing older feature issues that merely exposed this deploy blocker; use comments/refs there if needed.
