resource_group: add client-side observability metrics for Controller #10494

Open
JmPotato wants to merge 2 commits into tikv:master from JmPotato:resource-control/client-core-metrics

@JmPotato (Member) commented Mar 26, 2026

What problem does this PR solve?

Issue Number: ref #10488

What is changed and how does it work?

This is Phase 1 of the Resource Control observability improvements. It adds client-side metrics to the Resource Group Controller (client/resource_group/controller/) for better visibility into token consumption, limiter state, and throttle behavior.

Changes:

A1 — Token consumption histogram always-on + RU type breakdown

  • Remove the enableControllerTraceLog guard on TokenConsumedHistogram so token consumption is always observable in production
  • Add new consume_by_type counter with rru/wru labels for per-type breakdown
  • Original histogram label signature unchanged (backward compatible with existing Grafana panels)
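
As a rough illustration of the per-type breakdown in A1, the sketch below models a counter keyed by resource group and RU type using only the standard library. The `consumption` struct, the `typeCounters` type, and the `observeConsumption` signature are assumptions made for this example; the PR itself uses Prometheus metrics and its helper may look different.

```go
package main

import (
	"fmt"
	"sync"
)

// consumption is a hypothetical stand-in for the RU delta of one request.
type consumption struct {
	RRU float64 // read request units
	WRU float64 // write request units
}

// typeCounters models a counter vector keyed by (resource group, RU type).
// It is a stdlib-only stand-in for a Prometheus counter vector.
type typeCounters struct {
	mu     sync.Mutex
	values map[[2]string]float64
}

func newTypeCounters() *typeCounters {
	return &typeCounters{values: make(map[[2]string]float64)}
}

// add increments one labeled counter. Non-positive amounts are skipped
// because a counter must never decrease.
func (c *typeCounters) add(group, ruType string, v float64) {
	if v <= 0 {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[[2]string{group, ruType}] += v
}

// get returns the current value of one labeled counter.
func (c *typeCounters) get(group, ruType string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[[2]string{group, ruType}]
}

// observeConsumption records one delta under the "rru" and "wru" labels.
func observeConsumption(c *typeCounters, group string, d consumption) {
	c.add(group, "rru", d.RRU)
	c.add(group, "wru", d.WRU)
}

func main() {
	c := newTypeCounters()
	observeConsumption(c, "default", consumption{RRU: 2.5, WRU: 1})
	observeConsumption(c, "default", consumption{RRU: 0.5})
	fmt.Println("rru:", c.get("default", "rru"), "wru:", c.get("default", "wru"))
}
```

Keeping the per-type counter separate from the histogram is what lets the histogram's label signature stay unchanged.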

A2 — Limiter state gauges

  • resource_manager_client_resource_group_token_balance — current available token balance
  • resource_manager_client_resource_group_fill_rate — current effective fill rate (RU/s)
  • resource_manager_client_resource_group_burst_limit — current burst limit
  • Add GetFillRate() exported method to Limiter
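
For readers unfamiliar with the limiter, here is a minimal sketch of how read-only accessors such as `GetFillRate()` can sit behind a `sync.RWMutex` so that a metrics tick can read state without taking the write lock. The `limiter` struct, its fields, and `Reconfigure` are simplified assumptions for illustration, not the actual Limiter in `limiter.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// limiter is a simplified, hypothetical stand-in for the controller's Limiter.
type limiter struct {
	mu       sync.RWMutex
	fillRate float64 // refill speed in RU per second
	burst    int64   // burst limit
	tokens   float64 // currently available tokens
}

// Reconfigure mutates limiter state, so it takes the write lock.
func (l *limiter) Reconfigure(fillRate float64, burst int64, tokens float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.fillRate, l.burst, l.tokens = fillRate, burst, tokens
}

// GetFillRate only reads state, so a read lock is sufficient.
func (l *limiter) GetFillRate() float64 {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.fillRate
}

// GetBurst returns the burst limit under a read lock.
func (l *limiter) GetBurst() int64 {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.burst
}

// AvailableTokens returns the stored balance under a read lock.
// This sketch does not advance the balance with elapsed time.
func (l *limiter) AvailableTokens() float64 {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.tokens
}

func main() {
	l := &limiter{}
	l.Reconfigure(500, 1000, 750)
	fmt.Printf("fill_rate=%.0f burst=%d token_balance=%.0f\n",
		l.GetFillRate(), l.GetBurst(), l.AvailableTokens())
}
```

The three values printed at the end correspond to what the fill_rate, burst_limit, and token_balance gauges would report for one group.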

A3 — Average RU consumption rate gauge

  • resource_manager_client_resource_group_avg_ru_per_sec — EMA estimated consumption rate
  • Combined with fill_rate, operators can detect when consumption exceeds the refill rate before throttling occurs
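
To make the A3 signal concrete, the following sketch shows a plain exponential moving average and the comparison against the fill rate. The smoothing factor `alpha`, the sample values, and both function names are assumptions for illustration; the PR does not state how the controller's EMA is parameterized.

```go
package main

import "fmt"

// updateEMA folds one sample into an exponential moving average.
// alpha in (0, 1] is a hypothetical smoothing factor; larger values
// make the average react faster to new samples.
func updateEMA(prev, sample, alpha float64) float64 {
	return alpha*sample + (1-alpha)*prev
}

// consumptionExceedsRefill reports whether the smoothed consumption rate
// is above the fill rate, which is the early-warning condition described above.
func consumptionExceedsRefill(avgRUPerSec, fillRate float64) bool {
	return avgRUPerSec > fillRate
}

func main() {
	const alpha = 0.5 // assumed value, chosen only for this example
	fillRate := 120.0 // RU/s
	avg := 100.0      // starting estimate in RU/s
	for _, sample := range []float64{110, 150, 200} {
		avg = updateEMA(avg, sample, alpha)
		fmt.Printf("avg_ru_per_sec=%.2f exceeds_fill_rate=%t\n",
			avg, consumptionExceedsRefill(avg, fillRate))
	}
}
```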

A5 — Throttled state gauge

  • resource_manager_client_resource_group_throttled — 1 if in trickle mode, 0 if normal

All new gauge metrics are set in the 1 Hz updateRunState() tick (not on the request hot path). Only the histogram observation (~200 ns) runs on the request path.
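
The split between the periodic tick and the request path can be sketched as follows. This is a stdlib-only model: the `gauge` type, `groupState`, `groupGauges`, and the rule that sets `trickle` are all hypothetical, and the ticker interval is shortened so the demo finishes quickly.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
	"time"
)

// gauge is a minimal stand-in for a Prometheus gauge: a float64 stored atomically.
type gauge struct{ bits atomic.Uint64 }

func (g *gauge) Set(v float64) { g.bits.Store(math.Float64bits(v)) }
func (g *gauge) Get() float64  { return math.Float64frombits(g.bits.Load()) }

// groupState is a hypothetical snapshot of one resource group's limiter.
type groupState struct {
	tokens   float64
	fillRate float64
	trickle  bool // true while the group is in trickle mode
}

// groupGauges holds the per-group gauges that the tick refreshes.
type groupGauges struct {
	tokenBalance gauge
	fillRate     gauge
	throttled    gauge
}

// refresh copies a state snapshot into the gauges. It is meant to be called
// from the periodic tick, never from the request path.
func (m *groupGauges) refresh(s groupState) {
	m.tokenBalance.Set(s.tokens)
	m.fillRate.Set(s.fillRate)
	if s.trickle {
		m.throttled.Set(1)
	} else {
		m.throttled.Set(0)
	}
}

func main() {
	var m groupGauges
	state := groupState{tokens: 20, fillRate: 100}
	// The PR uses a 1 Hz tick; 10ms is used here only to keep the demo short.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		state.tokens -= 10
		state.trickle = state.tokens <= 0 // made-up rule for the demo
		m.refresh(state)
	}
	fmt.Println("token_balance:", m.tokenBalance.Get(), "throttled:", m.throttled.Get())
}
```

Because the gauges only change once per tick, their cost is independent of request volume.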

Check List

Tests

  • Unit test
  • Existing tests verified (pre-existing failures confirmed on master)

Summary by CodeRabbit

  • New Features
    • Extended monitoring with per-type resource consumption counters for read/write RU.
    • Added real-time limiter observability: token balance, fill rate, burst capacity, and throttling status.
    • Introduced average RU-per-second tracking for consumption visibility.
    • Consumption is now recorded consistently for both acquisition and response paths.
    • Automatically removes related metrics when a resource group is deleted.

@ti-chi-bot (bot, Contributor) commented Mar 26, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot bot added the dco-signoff: yes (indicates the PR's author has signed the DCO) and do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels) labels Mar 26, 2026

@ti-chi-bot (bot, Contributor) commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign qiuyesuifeng for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Mar 26, 2026

@coderabbitai (bot) commented Mar 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Added Prometheus metrics and instrumentation for resource-group token consumption and limiter state: per-RU-type counters, consumption histogram, limiter gauges (token balance, fill rate, burst), avg RU/sec, throttled gauge, registration/cleanup, always-on consumption observation, and Limiter.GetFillRate.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Metrics Definition: client/resource_group/controller/metrics/metrics.go | Added resource_group subsystem, renamed label constant to typeLabel, introduced exported metrics: TokenConsumedByTypeCounter, TokenBalanceGauge, FillRateGauge, BurstLimitGauge, AvgRUPerSecGauge, ThrottledGauge; registered them in InitAndRegisterMetrics. |
| Controller Integration: client/resource_group/controller/group_controller.go | Initialized new metrics with labels; added observeConsumption(delta) helper; always observe token consumption (histogram + per-type counters) during acquisition and response handling; update limiter gauges (tokenBalance, fillRate, burstLimit, avgRUPerSec, throttled) on run-state refresh. |
| Limiter API & Locking: client/resource_group/controller/limiter.go | Changed limiter mutex to sync.RWMutex for read accessors; converted IsLowTokens, GetBurst, AvailableTokens to use read lock; added exported GetFillRate() float64. |
| Global Cleanup: client/resource_group/controller/global_controller.go | Extend resource-group cleanup to delete newly added per-group metric label values (histogram, per-type counters for rru/wru, and all new gauges). |
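
The cleanup row can be illustrated with a small stdlib-only model of labeled gauges. In the Prometheus Go client, metric vectors expose `DeleteLabelValues` for this purpose; the `gaugeVec` type and `cleanupGroup` helper below are hypothetical stand-ins that only mimic that behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// gaugeVec is a hypothetical stand-in for a metric vector with a single
// resource_group label.
type gaugeVec struct {
	mu     sync.Mutex
	series map[string]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{series: make(map[string]float64)}
}

// Set creates or updates the series for one resource group.
func (v *gaugeVec) Set(group string, val float64) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.series[group] = val
}

// Delete removes the series for one resource group and reports whether it existed.
func (v *gaugeVec) Delete(group string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	_, ok := v.series[group]
	delete(v.series, group)
	return ok
}

// Len returns the number of live series.
func (v *gaugeVec) Len() int {
	v.mu.Lock()
	defer v.mu.Unlock()
	return len(v.series)
}

// cleanupGroup drops the group's series from every vector and returns how many
// series were removed, so a deleted group does not keep exporting stale values.
func cleanupGroup(group string, vecs ...*gaugeVec) int {
	removed := 0
	for _, v := range vecs {
		if v.Delete(group) {
			removed++
		}
	}
	return removed
}

func main() {
	tokenBalance, throttled := newGaugeVec(), newGaugeVec()
	tokenBalance.Set("rg1", 42)
	throttled.Set("rg1", 1)
	tokenBalance.Set("rg2", 7)
	fmt.Println("removed:", cleanupGroup("rg1", tokenBalance, throttled))
	fmt.Println("remaining token_balance series:", tokenBalance.Len())
}
```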

Sequence Diagram(s)

sequenceDiagram
  participant Controller as Controller\n(group_controller.go)
  participant Limiter as Limiter\n(limiter.go)
  participant Metrics as Prometheus\n(metrics/metrics.go)

  Controller->>Limiter: request tokens / compute delta
  Limiter-->>Controller: acquisition result
  Controller->>Metrics: observe consumeTokenHistogram(amount)
  Controller->>Metrics: increment TokenConsumedByTypeCounter{resource_group,type}
  Controller->>Limiter: read state (isThrottled, GetFillRate)
  Limiter-->>Controller: fillRate / state
  Controller->>Metrics: set TokenBalanceGauge, FillRateGauge, BurstLimitGauge, AvgRUPerSecGauge, ThrottledGauge

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

release-note-none, type/development

Suggested reviewers

  • rleungx
  • disksing
  • nolouch

Poem

🐇 I hopped through tokens, gauges in my paw,
Counters and fill-rates now hum like a law,
RRU, WRU — each counted in rhyme,
Throttles and averages ticking in time,
Hop — metrics bloom, observability for all.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly describes the main change: adding client-side observability metrics to the Resource Group Controller. |
| Description check | ✅ Passed | The description includes issue reference, clear explanation of changes (A1-A5), implementation details, and confirms unit tests are included. |


JmPotato force-pushed the resource-control/client-core-metrics branch 3 times, most recently from 76ed193 to bf1961d, on March 26, 2026 04:56

Add new metrics to improve observability of the Resource Group Controller
on the client side. This is the first phase of the observability
improvements tracked in tikv#10488.

Changes:
- Remove the `enableControllerTraceLog` guard on `TokenConsumedHistogram`
  so token consumption is always observable in production
- Add `consume_by_type` counter with RRU/WRU breakdown
- Add limiter state gauges: `token_balance`, `fill_rate`, `burst_limit`
- Add `avg_ru_per_sec` gauge (EMA consumption rate estimate)
- Add `throttled` gauge (1 = trickle mode, 0 = normal)
- Add `GetFillRate()` exported method to Limiter

All new gauge metrics are observed in the 1Hz `updateRunState()` tick,
not on the request hot path. The existing `TokenConsumedHistogram` label
signature is unchanged to preserve Grafana panel compatibility.

Issue Number: ref tikv#10488

Signed-off-by: JmPotato <github@ipotato.me>

JmPotato force-pushed the resource-control/client-core-metrics branch from bf1961d to 6cc02ff on March 26, 2026 05:02

Add a new "Resource Control" row to the PD Grafana dashboard with 5
panels that visualize the client-side metrics introduced in this PR:

1. Consumption Rate vs Fill Rate — overlays avg_ru_per_sec and fill_rate
   to show whether consumption outpaces refill (the key throttle signal)
2. Token Balance — current available tokens per group, decline indicates
   depletion
3. RU Consumption by Type — stacked RRU/WRU rate breakdown
4. Token Consumption per Request — P50/P99/P999 histogram quantiles
5. Throttled Resource Groups — binary throttle state per group

All panels use the standard dashboard label filters
(k8s_cluster, tidb_cluster) and the existing datasource variable.

Issue Number: ref tikv#10488

Signed-off-by: JmPotato <github@ipotato.me>

ti-chi-bot bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Mar 26, 2026

Labels

  • dco-signoff: yes (indicates the PR's author has signed the DCO)
  • do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels)
  • size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files)
