resource_group: add client-side observability metrics for Controller #10494

JmPotato wants to merge 2 commits into tikv:master.
Walkthrough

Added Prometheus metrics and instrumentation for resource-group token consumption and limiter state: per-RU-type counters, a consumption histogram, limiter gauges (token balance, fill rate, burst), an avg RU/sec gauge, a throttled gauge, metric registration/cleanup, always-on consumption observation, and `Limiter.GetFillRate`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Controller as Controller (group_controller.go)
    participant Limiter as Limiter (limiter.go)
    participant Metrics as Prometheus (metrics/metrics.go)
    Controller->>Limiter: request tokens / compute delta
    Limiter-->>Controller: acquisition result
    Controller->>Metrics: observe consumeTokenHistogram(amount)
    Controller->>Metrics: increment TokenConsumedByTypeCounter{resource_group,type}
    Controller->>Limiter: read state (isThrottled, GetFillRate)
    Limiter-->>Controller: fillRate / state
    Controller->>Metrics: set TokenBalanceGauge, FillRateGauge, BurstLimitGauge, AvgRUPerSecGauge, ThrottledGauge
```
Force-pushed from 76ed193 to bf1961d. Commit message:
Add new metrics to improve observability of the Resource Group Controller on the client side. This is the first phase of the observability improvements tracked in tikv#10488.

Changes:
- Remove the `enableControllerTraceLog` guard on `TokenConsumedHistogram` so token consumption is always observable in production
- Add `consume_by_type` counter with RRU/WRU breakdown
- Add limiter state gauges: `token_balance`, `fill_rate`, `burst_limit`
- Add `avg_ru_per_sec` gauge (EMA consumption rate estimate)
- Add `throttled` gauge (1 = trickle mode, 0 = normal)
- Add `GetFillRate()` exported method to `Limiter`

All new gauge metrics are observed in the 1Hz `updateRunState()` tick, not on the request hot path. The existing `TokenConsumedHistogram` label signature is unchanged to preserve Grafana panel compatibility.

Issue Number: ref tikv#10488

Signed-off-by: JmPotato <github@ipotato.me>
Force-pushed from bf1961d to 6cc02ff. Commit message:
Add a new "Resource Control" row to the PD Grafana dashboard with 5 panels that visualize the client-side metrics introduced in this PR:

1. Consumption Rate vs Fill Rate — overlays `avg_ru_per_sec` and `fill_rate` to show whether consumption outpaces refill (the key throttle signal)
2. Token Balance — current available tokens per group; a decline indicates depletion
3. RU Consumption by Type — stacked RRU/WRU rate breakdown
4. Token Consumption per Request — P50/P99/P999 histogram quantiles
5. Throttled Resource Groups — binary throttle state per group

All panels use the standard dashboard label filters (`k8s_cluster`, `tidb_cluster`) and the existing datasource variable.

Issue Number: ref tikv#10488

Signed-off-by: JmPotato <github@ipotato.me>
What problem does this PR solve?
Issue Number: ref #10488
What is changed and how does it work?
This is Phase 1 of the Resource Control observability improvements. It adds client-side metrics to the Resource Group Controller (`client/resource_group/controller/`) for better visibility into token consumption, limiter state, and throttle behavior.

Changes:
A1 — Token consumption histogram always-on + RU type breakdown
- Remove the `enableControllerTraceLog` guard on `TokenConsumedHistogram` so token consumption is always observable in production
- Add a `consume_by_type` counter with `rru`/`wru` labels for a per-type breakdown
A2 — Limiter state gauges

- `resource_manager_client_resource_group_token_balance` — current available token balance
- `resource_manager_client_resource_group_fill_rate` — current effective fill rate (RU/s)
- `resource_manager_client_resource_group_burst_limit` — current burst limit
- Add a `GetFillRate()` exported method to `Limiter`
A3 — Average RU consumption rate gauge

- `resource_manager_client_resource_group_avg_ru_per_sec` — EMA-estimated consumption rate
- Overlaid with `fill_rate`, operators can detect `consumption > refill` before throttling occurs
A5 — Throttled state gauge

- `resource_manager_client_resource_group_throttled` — 1 if in trickle mode, 0 if normal

All new gauge metrics are observed in the 1Hz `updateRunState()` tick (not on the request hot path). Only the histogram observation (~200ns) runs on the request path.

Check List
Tests