Conversation
- initial scaffolding
- working nutanix.health.up metric
- working nutanix.cluster.count metric
- work in progress
- work in progress: assert tags in unit tests
- tests cleanup
- cleanup
- health check
- collecting basic cluster metrics
- collecting cluster stats and basic node metrics
- remove upgrade status tag and lint
- fix basic auth + add integration tests
- refactor and cleanup
- collecting node stats metrics
- lint
- add cluster namespace to cluster metrics
- rename metrics, remove unit suffixes
- use host instead of nodes as much as possible
- collecting basic vm metrics
- fix query param typo
- collecting vm stats metric
- add missing argument required for VmStats and passing integration tests
- add metadata.csv
- fix typo in test name
- update integration tests to stop checking for values
- add nutanix overview dashboard
- update manifest description and classifier tags
- update manifest metric to check for
- little cleanup
- remove unused dependency from pyproject
- set default min_collection_interval to 120s
- update dashboard with more units and improvements
- update dashboard description
- report host metrics and vm metrics with their corresponding hostname
- report external host tags for hosts and vms
- switch to list all vm stats endpoint for better rate limit - update metadata.csv with new metrics
- add ntnx_type:host and ntnx_type:vm as tags
- add cluster_name and host_name tags to all hosts and vm metrics + fix integration tests
- improve metrics descriptions in metadata.csv
- update dashboard
- add compact legend to all cluster/host/vm widgets for better ux
- fix stats sampling interval to match the min_collection_interval
- add support for pagination
- add page_limit parameter for pagination size limit
- update fixtures and tests for the new paginated requests
- rename paginated methods to start with list instead of get
- add support for retry logic to handle PC rate limiting
- add process signatures
- update nutanix process signatures
- fix error deleting page and limit params
- fix manifest.json extra comma in process_signatures
- collect events
- add bash script to record fixtures
- Fix log message for error collecting vm metrics
- refactor pagination method and improve logging
- ddev validate ci --sync
- update dashboard and add new nutanix logos
- add debug logs for HTTP requests and payloads
- add support for port in pc_ip
- swap nutanix.vm.hypervisor.memory_usage_ppm with nutanix.vm.memory.usage_ppm for more accurate VM memory usage
- improve logging: reduce HTTP logging noise to only rate limits and error responses
- fix validate dashboards
- bump python version to 3.13 and min base check version
- fix typo in min base check
- fix Mock() has no len error in test_retry.py
- wip
- add collect_events property
- change remaining references to nutanix.vm.hypervisor.memory_usage_ppm to nutanix.vm.memory.usage_ppm to fix VM memory usage widgets
- add support for tasks collection, update fixtures
- add ntnx_type tag to events and tasks
- small cleanup
- dashboard: change all bytes in binary to bytes in decimal
- cleanup and small refactor
- make events and tasks match implementation, fix handling of start_time, improve tests and small refactor
- split check.py into modules, fix integration tests
- improve error messages for non 2xx http error responses
- add missing dd license headers to some files
- rename health_check_score metric
- improve metric names batch 1
- improve cluster and host metric names
- improve vm metric names
- split unit tests into multiple files
- add support for audits collection
- cleanup and improving tests setup
- improve duplication logic tests in events, audits and tasks
- add support for alerts collection
- use alerts v4.2 API that supports filtering by creationTime
- sync all API calls to use the same time window (start, end)
- add extra filtering to avoid events/tasks/audits/alerts duplicates
- fallback to alerts v4.0 API if v4.2 is not available
- fix self.last__x_collection_time fields to be the max timestamp: fixes duplicates
- persist information about v4.2 API in the persistence cache
- wip: host and vm stats not working?
- improve vm stats collection by cluster, improve info logs and debug logs
- improve type hints and method comments
- add support for capacity metrics
- add nutanix tag to all entities
- report node status metric
- ddev validate models and config
- add collect_tasks and collect_audits properties for nutanix
- add filter properties for alerts
- add filter by severity and type for alerts
- add filter events by type
- add filter tasks by status
- add resource filters support for infra resources and activity resources
- improve resource_filters
- cleanup
- fetch and cache categories
- attach categories as tags with option to add ntnx_ suffix
- improve categories collection/attachment, improve tests, update all fixtures
- improve categories collection and testing
- add owner to manifest.json
- remove duplicate self.last_audit_collection_time assignment
- fix alert messages parameter rendering
- add more tests
- reduce info logs, improve info log summary, and change rest of logs to debug
- improve audits timestamp tracking, improve logging, code cleanup
- improve resource_filters logging, log error messages
- fix integration tests + add support for fake docker server testing
- fix nutanix wheel version
- reset teleport change
- ddev validate ci --sync
- fix license headers
- fix more license headers
- fix one more license header
- ddev validate labeler --sync
- Fix labeler config
- reduce audits.json size
- reduce audits.json to 50KB
- reduce alerts.json to 20 items
- replace bash script for recording fixtures with python implementation
- update resource_filters description
- add starting check info log and add comment about sampling interval
- improve categories tests around default behavior, remove duplicate record_fixtures.py
- update resource_filters to match default desired category behavior
- add collect_subtasks properties, use persistent cache to track last collected items correctly before filtering
- Apply suggestions from code review Co-authored-by: dkirov-dd <166512750+dkirov-dd@users.noreply.github.com>
- cleanup rate limit retry implementation
- add nutanix.api.rate_limited metric for visibility
- fix port configuration log message test
- fix rate limit tests and update with the new metric
- improve retry_limit implementation
- update README.md
- add missing prefix_category_tags property
- add tests for prefix_category_tags property
- address david review: always add ntnx_is_agent_vm tag
- document categories in the README
- add example in the README for collecting categories
- fix dashboard validation
- fix readme validate
- address review: raise ConfigurationError when it's not set
- fix ntnx_is_agent_vm tag in tests
- add missing spec.yaml properties
- update powerConsumptionInstantWatt name for consistency
- set creates_events to true in manifest.json
- fix retry_on_rate_limit behavior on non 429 responses
- fix changelog file name
- remove unnecessary file
- fix retry_on_rate_limit behavior and add type hints
- address sarah review: add log message when skipping a resource
- address sarah review: add test all metrics + fix missing metric in metadata.csv
- update fixtures with new prism_central url and new VM in OFF state
- report nutanix.vm.status metric
- address review: by default only collect VMs with powerState ON
- collect only ON VMs even if other VM resource_filters are set that are not powerState
- fix imports and paths for record_fixtures.py
- add note about duplicate hostname issue
- sync config with the new VM collection comment
- update README to explain that a single agent can monitor a prism central environment
- fix license headers
- refactor activity_monitor by reducing code duplication and extracting a shared method
- refactor infrastructure_monitor
- refactor resource_filters
- refactor check.py / simplifying
- add support for rendering audit messages
- add support for rendering nutanix event messages
- remove all X_id tags
- add caches for activity entities
- update test fixtures and adjust tests to work with the new data
- add support for displaying affected alerts in tasks
- refactor: reduce code duplication in activity_monitor
- remove debugging code
- improve readability of activity_monitor.py code related to filtering
- code cleanup + reduce code duplication
- split collect_cluster_metrics into isolated collection phases to enable partial data collection on failures
- improve error isolation for host processing to allow non-blocking errors when a single host fails
- replace log.exception with log.error for more user friendly log messages for known errors
- make resource_filters property always required and remove its default values
- switch super() call to python3 style
- address some code smells
- update page_limit default value from 50 to 100 to reduce API calls
- Revert accidental version modification for dev Co-authored-by: Sarah Witt <sarah.witt@datadoghq.com>
- add support for batch_vm_collection by default to avoid rate limits
- fix batch collection mode to process all vms regardless of the batch mode
- improve testing vms filtering in both vm collection modes

---------

Co-authored-by: dkirov-dd <166512750+dkirov-dd@users.noreply.github.com>
Co-authored-by: Sarah Witt <sarah.witt@datadoghq.com>
(cherry picked from commit ae8e33e)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1d62c32ac
```python
del params["$filter"]
return self.check._get_paginated_request_data("api/monitoring/v4.0/serviceability/alerts", params=params)
```
Keep alert time filtering when falling back to API v4.0
When alerts_v42_supported is false, this removes the server-side $filter and fetches the full alert history; _collect only applies client-side time filtering from last_alert_collection_time, which is empty on first run. In Prism versions that only support v4.0, a fresh check instance will therefore emit all historical alerts instead of only the current collection window, which can create a large event burst and unexpected ingestion.
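One way to address this, sketched below under the assumption that the check already computes a collection window `(start, end)` for each run (the helper name `filter_alerts_client_side` and the dict shape are illustrative, not the integration's actual API): seed the client-side cutoff with the window start whenever `last_alert_collection_time` is empty, so a fresh instance on a v4.0-only Prism Central only emits alerts from the current window.

```python
from datetime import datetime, timedelta, timezone


def filter_alerts_client_side(alerts, last_collection_time, window_start):
    """Client-side time filter for the v4.0 fallback (no server-side $filter).

    On a fresh check instance last_collection_time is None; fall back to the
    current collection window's start instead of emitting the full history.
    """
    cutoff = last_collection_time or window_start
    return [alert for alert in alerts if alert["creationTime"] > cutoff]


now = datetime.now(timezone.utc)
window_start = now - timedelta(minutes=2)
alerts = [
    {"id": "historical", "creationTime": now - timedelta(days=30)},
    {"id": "current", "creationTime": now - timedelta(seconds=30)},
]
# First run: only the alert inside the current window survives.
fresh = filter_alerts_client_side(alerts, None, window_start)
```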
```python
max_retries: int = self.instance.get('pc_max_retries', 3)
base_backoff: int = self.instance.get('pc_base_backoff_seconds', 1)
max_backoff: int = self.instance.get('pc_max_backoff_seconds', 30)
attempts = max(1, max_retries)
```
Treat pc_max_retries as retries, not total attempts
This computes total attempts as max(1, max_retries), so pc_max_retries: 1 results in no retry after the first 429 response. Because the option is documented as retry attempts, users configuring low retry counts get fewer retries than intended and may fail requests prematurely under brief rate limiting.
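A minimal sketch of the suggested semantics (the function and parameter names mirror the snippet's config options but the helper itself is hypothetical): treat `pc_max_retries` as retries *after* the initial attempt, so `attempts = max_retries + 1`, with capped exponential backoff between 429 responses.

```python
import random
import time


def request_with_retries(do_request, max_retries=3, base_backoff=1, max_backoff=30):
    """Retry on HTTP 429; max_retries counts retries, not total attempts."""
    attempts = max_retries + 1  # one initial try plus max_retries retries
    for attempt in range(attempts):
        response = do_request()
        if response.status_code != 429:
            return response
        if attempt < attempts - 1:  # no sleep after the final attempt
            # exponential backoff with a little jitter, capped at max_backoff
            delay = min(max_backoff, base_backoff * (2 ** attempt))
            time.sleep(delay + random.uniform(0, 0.1 * delay))
    return response
```

With these semantics, `pc_max_retries: 1` still yields one retry after the first 429 instead of failing immediately.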
```python
if self.pc_ip and ":" in self.pc_ip:
    host, _, port = self.pc_ip.rpartition(":")
    if port.isdigit():
```
Avoid parsing bare IPv6 host segments as a port
The inline port parsing splits on the last : and treats a numeric suffix as a port, which misparses bare IPv6 addresses (for example 2001:db8::1 becomes host 2001:db8: and port 1). That produces an invalid base URL and breaks connectivity for valid IPv6 pc_ip values.
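One safe alternative, assuming `pc_ip` follows the usual convention of bracketed IPv6 when a port is present (`[2001:db8::1]:9440`): delegate to `urllib.parse.urlsplit`, whose `port` property raises `ValueError` for a bare IPv6 address rather than misreading its last segment as a port. The helper name is illustrative, not the check's actual code.

```python
from urllib.parse import urlsplit


def split_host_port(pc_ip):
    """Split an optional port from pc_ip without misparsing bare IPv6 hosts.

    urlsplit handles bracketed IPv6 ('[2001:db8::1]:9440') correctly; for a
    bare IPv6 address such as '2001:db8::1' its port property raises
    ValueError, which we treat as "no port configured".
    """
    parsed = urlsplit(f"//{pc_ip}")
    try:
        return parsed.hostname, parsed.port
    except ValueError:
        # e.g. bare IPv6: the trailing ':1' segment is not a valid port
        return pc_ip, None
```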
Added DOCS-13578 to track review |
What does this PR do?
Motivation
Review checklist (to be filled by reviewers)
- Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- To backport this PR, add the `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged.