How are you doing metrics? #160451

Hi all, I have since deployed ARC and managed to get the self-hosted runners up and running. We currently use Grafana internally, so I'm looking to see how I could use it with the runner deployment metrics. Any help appreciated!

Replies: 14 comments 28 replies

---

You can use Prometheus to scrape metrics from actions-runner-controller, then connect Prometheus to Grafana. The current version exposes the following metrics:

- horizontalrunnerautoscaler_spec_min_replicas
- horizontalrunnerautoscaler_spec_max_replicas
- horizontalrunnerautoscaler_status_desired_replicas

In the next release we will also have the following metrics (PR already merged):

- horizontalrunnerautoscaler_replicas_desired
- horizontalrunnerautoscaler_runners
- horizontalrunnerautoscaler_runners_registered
- horizontalrunnerautoscaler_runners_busy
- horizontalrunnerautoscaler_terminating_busy
- horizontalrunnerautoscaler_necessary_replicas
- horizontalrunnerautoscaler_workflow_runs_comp…

Due to some constraints we were using only the PercentBusy metric and not the webhook metrics, so we created a Grafana dashboard to track the autoscaling metrics.
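Once those metrics are in Prometheus, a PercentBusy-style Grafana panel can be sketched with queries like these. This is only a sketch assuming the post-upgrade metric names listed above; the `horizontalrunnerautoscaler` grouping label is an assumption, so check the labels on your actual series first:

```promql
# Fraction of registered runners that are busy (a PercentBusy-style panel)
sum(horizontalrunnerautoscaler_runners_busy)
  / sum(horizontalrunnerautoscaler_runners_registered)

# Desired vs. actual runners, per autoscaler
sum by (horizontalrunnerautoscaler) (horizontalrunnerautoscaler_replicas_desired)
sum by (horizontalrunnerautoscaler) (horizontalrunnerautoscaler_runners)
```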

---

@debugger24 that's great, thanks for the help! Did you have to do any additional config to get this working? I installed the Prometheus stack.

---

In my case, the config I used was, with the charts from https://prometheus-community.github.io/helm-charts/ and https://actions-runner-controller.github.io/actions-runner-controller:

```yaml
metrics:
  serviceAnnotations: {}
  serviceMonitor: false
  serviceMonitorLabels: {}
  port: 8443
  proxy:
    enabled: false
    image:
      repository: quay.io/brancz/kube-rbac-proxy
      tag: v0.13.0
```

I disabled the kube-rbac-proxy. If you use the Prometheus Operator, you can enable `serviceMonitor` instead.
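If you do run the Prometheus Operator, the ServiceMonitor route looks roughly like this in the ARC chart values. A sketch only: the `release` label here is an assumption and must match whatever your Prometheus's `serviceMonitorSelector` actually expects:

```yaml
metrics:
  serviceMonitor: true
  serviceMonitorLabels:
    # Assumed: must match your Prometheus Operator's serviceMonitorSelector
    release: kube-prometheus-stack
  port: 8443
  proxy:
    enabled: false
```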

---

Strangely, when trying to hit the endpoint I am getting connection refused. However, when I enable port forwarding it works.
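For anyone else hitting this, connection refused usually means you're reaching for the metrics port from outside the cluster network, which is exactly what `kubectl port-forward` works around. A sketch of that check; the namespace and service name are assumptions that depend on your Helm release, and the scheme is plain HTTP only if kube-rbac-proxy is disabled (use `https` with `-k` otherwise):

```
# Forward the controller's metrics port to localhost
# (namespace and service name below are assumptions; adjust to your install)
kubectl -n actions-runner-system port-forward \
  svc/actions-runner-controller-metrics-service 8443:8443

# In a second terminal, scrape it
curl -s http://localhost:8443/metrics | grep horizontalrunnerautoscaler
```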

---

Please, I need some help getting all of the metrics provided by ARC that were added in the new release (actions-runner-controller 0.22.0). I upgraded my controllers to the new version but am still seeing only the following metrics. We use CDK to deploy the Helm resources in EKS, and my configuration is as follows. Chart version: actions-runner-controller-0.22.0. Can someone please help me figure this out?

---

Running chart version 0.22.0, I also can't see any of the new metrics. I haven't upgraded; I've just added some values to my Prometheus (kube-prometheus) and ARC Helm charts. The pods are running, but I can't see those new metrics on the metrics endpoint.

---

This issue is also open from another user: actions/actions-runner-controller#2799. I'm having a similar issue: I have ARC updated to the latest release, and the CRDs are updated as well, but I only get 2 metrics. This is my configuration.

---

Hi, I am using image version v0.27.5, and though I have jobs in the queue, I don't see these metrics reflecting that. I am looking for a metric for how many jobs are in the queue, so I can set up an alert if the queue gets too big.
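Assuming the gha-runner-scale-set listener metrics are being scraped, a queue-depth expression could be sketched like this. The metric names `gha_assigned_jobs` and `gha_running_jobs` are what I'd expect the listener to expose, but verify them against your own `/metrics` output before building an alert on this:

```promql
# Approximate queue depth: jobs acquired by the scale set minus jobs running
sum(gha_assigned_jobs) - sum(gha_running_jobs)
```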

---

I can also see only two metrics with everything on the latest version, one of them being horizontalrunnerautoscaler_spec_max_replicas. Any ideas?

---

Anyone looking for a dashboard that works with the new metrics should check out this blog: https://www.kenmuse.com/blog/enabling-github-arc-metrics/

I have found that the cardinality of the gha_* metrics can be insane because they include runner_name (the pod name) and runner_id (the job id), essentially making one series per data point, which prevents even basic aggregation. To clean up your metrics, try using a ServiceMonitor configured like this. You'll need two: one for the controller and one for the listener (the bulk of the metrics). This is the one for the listener:
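A minimal sketch of a listener-side ServiceMonitor along those lines, assuming the Prometheus Operator CRDs, the default `arc-systems` namespace, and a Service in front of the listener pods; the resource name, selector label, and port name are all assumptions to adapt. The `metricRelabelings` step dropping `runner_name` and `runner_id` is the part that tames the cardinality:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arc-listener
  namespace: arc-systems
spec:
  selector:
    matchLabels:
      # Assumed label on the Service fronting the listener pods;
      # check your own Service's labels before using this.
      app.kubernetes.io/component: runner-scale-set-listener
  endpoints:
    - port: metrics
      metricRelabelings:
        # Drop the per-pod / per-job labels that explode cardinality
        - action: labeldrop
          regex: runner_name|runner_id
```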

---

I noticed that @kenmuse had updated the sample Grafana dashboard and docs (thanks Ken!) at https://github.com/actions/actions-runner-controller/tree/master/docs/gha-runner-scale-set-controller/samples/grafana-dashboard , but I'm running into a roadblock with one of the labels on the metrics. I'm assuming that there's some other scrape config that is adding it to the listener metrics? Anyone have any ideas here?

---

@nocturne1 - Good question. I'll try to add some additional documentation to cover that.

TL;DR - You're spot on: it is coming from a scrape config. It's how many systems have configured the scrape config for Prometheus; it has a relabelling step that scrapes the labels from one or more pods and includes those details.

Longer version: that value originates from the Prometheus scrape config. While many default configurations already have a job that scrapes labels for all of the pods, here is a config example that's a bit more tuned to ARC:

```yaml
scrape_configs:
  - job_name: arc-metrics
    honor_labels: true
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - arc-systems
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```

The listeners (and runner pods) carry Kubernetes labels, and `kubernetes_sd_configs` lets Prometheus discover those details through the Kubernetes REST APIs. In the example above, the `labelmap` action then copies every discovered `__meta_kubernetes_pod_label_*` entry onto the scraped series, with the prefix stripped off. Hope that helps! You may also want to simplify the names provided for the pod name/namespace. You may find it helpful to review the Prometheus relabelling documentation for deeper details.
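To make the `labelmap` step concrete, here is a small Python sketch of what that relabelling rule does mechanically. The pod label names in the example dictionary are hypothetical, purely for illustration; the regex is the one from the config above, and like Prometheus's relabelling it is treated as fully anchored:

```python
import re

# Mimic Prometheus's `action: labelmap` relabelling step: for every label
# whose name matches the (fully anchored) regex, add a new label named
# after the first capture group, keeping the original value.
LABELMAP_REGEX = re.compile(r"__meta_kubernetes_pod_label_(.+)")

def apply_labelmap(discovered_labels):
    result = dict(discovered_labels)
    for name, value in discovered_labels.items():
        match = LABELMAP_REGEX.fullmatch(name)
        if match:
            result[match.group(1)] = value
    return result

# Hypothetical discovery metadata for an ARC listener pod (label names
# here are illustrative, not taken from a real cluster):
meta = {
    "__meta_kubernetes_pod_label_app_kubernetes_io_name": "gha-rs-listener",
    "__meta_kubernetes_pod_label_scale_set_name": "arc-runner-set",
    "__address__": "10.0.0.5:8080",
}

relabelled = apply_labelmap(meta)
print(relabelled["app_kubernetes_io_name"])  # gha-rs-listener
```

This is why the label shows up on the listener series even though the listener itself never emits it: the scrape pipeline attaches it from pod metadata.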

---

@kenmuse Thank you for the explanation, and for your invaluable contributions in providing a dashboard starter for the community. In principle I actually agree with the team's decision not to render the CoreOS CRDs in the ARC charts: they're not a standard, and I strongly dislike having to install CoreOS CRDs when I don't run Prometheus. Unfortunately, the CoreOS monitoring resources are a de-facto standard amongst community charts, and I'm equally irked by having to bring my own scrape config to these charts. I've used pod annotations to expose these metrics to my monitoring solution by way of pod discovery (via the prometheus.io annotations). I'm wondering if either:

My scrape config:

```yaml
relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - target_label: job
    replacement: monitoring/annotations-discovery
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
    target_label: job
    regex: (.+)
    replacement: ${1}
```

As an aside: I don't think any of this is your/GitHub's responsibility, and I think the k8s core team should look at a solution that would let us support these flows without installing the CoreOS CRDs, but I'm sure they have other things to prioritise.
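For reference, the prometheus.io annotation convention mentioned above looks roughly like this on a pod template. The port and path values are assumptions matching a listener serving metrics on 8080, and whether your ARC chart lets you set pod annotations on the listener is something to verify for your chart version:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```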