Skip to content

Comments

[Tests][UserEvents] Add diagnostic logging#124678

Open
mdh1418 wants to merge 10 commits intomainfrom
copilot/add-diagnostic-logging
Open

[Tests][UserEvents] Add diagnostic logging#124678
mdh1418 wants to merge 10 commits intomainfrom
copilot/add-diagnostic-logging

Conversation

@mdh1418
Copy link
Member

@mdh1418 mdh1418 commented Feb 20, 2026

Investigating #123442

Copilot AI and others added 6 commits February 20, 2026 20:33
- Add --log-mode verbose and --log-filter * to record-trace invocation
- Set DOTNET_DiagnosticServerDiag=1 on tracee process for runtime logging
- Add fprintf diagnostic logging to ds-server.c (IPC command handling)
- Add fprintf diagnostic logging to ep-session-provider.c (tracepoint registration)
- Add fprintf diagnostic logging to ep-session.c (tracepoint write)

Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
…/arm64

Disable all CI pipeline jobs in runtime.yml except those needed to run
CoreCLR runtime tests on linux-x64 and linux-arm64 for investigating
flaky userevents tests (issue #123442).

Kept jobs:
- EvaluatePaths stage
- Libraries_CheckedCoreCLR build for linux_arm64
- Libraries_CheckedCoreCLR build for linux_x64
- CoreCLR test build (build-test-job.yml with CoreClrTestBuildHost)
- CoreCLR test execution (run-test-job.yml for linux_x64 and linux_arm64)

All other ~63 job blocks are disabled with condition: false.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per the one-collect commandline.rs API, --log-mode accepts 'disabled',
'console', or 'file', and --log-filter accepts Rust tracing EnvFilter
strings like 'trace', 'debug', 'info'.

Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Change traceValidator signature to Func<int, EventPipeEventSource, bool>
to pass the tracee process ID. Each scenario validator now filters events
by ProcessID and logs a count of events from other processes.

Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
@mdh1418 mdh1418 requested a review from noahfalk as a code owner February 20, 2026 21:30
Copilot AI review requested due to automatic review settings February 20, 2026 21:30
@mdh1418 mdh1418 requested a review from lateralusX as a code owner February 20, 2026 21:30
@mdh1418 mdh1418 added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it and removed area-Tracing-coreclr labels Feb 20, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds additional diagnostic logging to help investigate flakiness in the src/tests/tracing/userevents functional tests (issue #123442), including filtering/annotating expected events by the tracee PID and enabling extra runtime diagnostics during test runs.

Changes:

  • Pass the tracee PID into per-scenario trace validators and ignore events from other processes in the trace validation.
  • Enable extra diagnostic output in the tracee runtime and in record-trace invocation to aid debugging.
  • Add temporary diagnostic logging in native EventPipe/DiagnosticServer code paths, and temporarily alter/disable large parts of the Azure Pipelines runtime matrix for investigation.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/tests/tracing/userevents/multithread/multithread.cs Filters validation to tracee PID and improves failure diagnostics.
src/tests/tracing/userevents/managedevent/managedevent.cs Filters validation to tracee PID and improves failure diagnostics.
src/tests/tracing/userevents/custommetadata/custommetadata.cs Filters validation to tracee PID and improves failure diagnostics.
src/tests/tracing/userevents/basic/basic.cs Filters validation to tracee PID and improves failure diagnostics.
src/tests/tracing/userevents/activity/activity.cs Filters validation to tracee PID and improves failure diagnostics.
src/tests/tracing/userevents/common/UserEventsTestRunner.cs Updates validator signature, adds record-trace logging args, and enables runtime diag env var for tracee.
src/native/eventpipe/ep-session.c Adds diagnostic logging around user-events tracepoint write path.
src/native/eventpipe/ep-session-provider.c Adds diagnostic logging around tracepoint registration.
src/native/eventpipe/ds-server.c Adds diagnostic logging around diagnostic server IPC handling.
eng/pipelines/runtime.yml Temporarily disables/narrows many CI legs for investigation.

Comment on lines +85 to +86
# TEMPORARY: disabled for userevents diagnostic investigation (https://github.com/dotnet/runtime/issues/123442)
condition: false
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change sets the job condition to false, effectively disabling this pipeline leg for all PRs/rolling builds. Even if intended for short-term investigation, it will prevent CI from validating unrelated changes in this pipeline and is very risky to merge. Please gate this behind a dedicated variable/parameter (e.g., a temporary feature flag) or revert before merging.

Copilot uses AI. Check for mistakes.
Comment on lines +371 to 372
# TEMPORARY: narrowed to linux_x64 only for userevents diagnostic investigation (https://github.com/dotnet/runtime/issues/123442), was also windows_x64
- linux_x64
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These platform matrix changes remove previously covered OSes (e.g., Windows/macOS) for this leg. That materially reduces test coverage and can hide regressions unrelated to the userevents investigation. Please avoid narrowing platforms in the main pipeline, or gate it behind a temporary variable so normal PR validation coverage is preserved.

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +165
traceeStartInfo.Environment["DOTNET_DiagnosticServerDiag"] = "1";

Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOTNET_DiagnosticServerDiag is being unconditionally enabled for every tracee run. This can significantly increase stderr output and potentially affect timing/behavior of the scenario under test, making results harder to interpret once the investigation is done. Consider enabling it only when a separate opt-in env var/command-line switch is set (or only under Helix/CI) so local/normal runs stay quiet by default.

Suggested change
traceeStartInfo.Environment["DOTNET_DiagnosticServerDiag"] = "1";
// This is opt-in locally via DOTNET_UserEvents_EnableDiagnosticServerDiag=1, and defaults to on under Helix.
bool enableDiagnosticServerDiag =
string.Equals(Environment.GetEnvironmentVariable("DOTNET_UserEvents_EnableDiagnosticServerDiag"), "1", StringComparison.Ordinal) ||
!string.IsNullOrEmpty(Environment.GetEnvironmentVariable("HELIX_WORKITEM_ROOT"));
if (enableDiagnosticServerDiag)
{
traceeStartInfo.Environment["DOTNET_DiagnosticServerDiag"] = "1";
}

Copilot uses AI. Check for mistakes.
Comment on lines +34 to +40
static int ep_sess_diag_enabled = -1; // -1 = uninitialized
static inline int ep_sess_diag_is_enabled (void) {
if (ep_sess_diag_enabled == -1) {
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
ep_sess_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0;
}
return ep_sess_diag_enabled;
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ep_sess_diag_enabled is lazily initialized without any synchronization. session_tracepoint_write_event can be called from multiple threads, so concurrent first-use can cause a data race/UB in C. Consider using an atomic/once-init pattern (e.g., store the flag in a volatile int64_t and initialize via ep_rt_atomic_compare_exchange_int64_t, or always compute from getenv without caching) to make this thread-safe.

Suggested change
static int ep_sess_diag_enabled = -1; // -1 = uninitialized
static inline int ep_sess_diag_is_enabled (void) {
if (ep_sess_diag_enabled == -1) {
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
ep_sess_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0;
}
return ep_sess_diag_enabled;
static inline int ep_sess_diag_is_enabled (void) {
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
return (val != NULL && val[0] == '1') ? 1 : 0;

Copilot uses AI. Check for mistakes.
Comment on lines +895 to +899
if (bytes_written == -1) {
EP_SESS_DIAG_LOG ("tracepoint_write_event: writev failed for provider='%s', event_id=%u, fd=%d, errno=%d (%s).\n", ep_provider_get_provider_name (provider), (unsigned)ep_event_get_event_id (ep_event), session->user_events_data_fd, errno, strerror (errno));
} else {
EP_SESS_DIAG_LOG ("tracepoint_write_event: writev success for provider='%s', event_id=%u, bytes_written=%zd.\n", ep_provider_get_provider_name (provider), (unsigned)ep_event_get_event_id (ep_event), bytes_written);
}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code calls strerror(errno) but ep-session.c doesn't include the header that declares strerror (typically <string.h>). With warnings-as-errors this can fail the build due to an implicit declaration. Please include the appropriate header (and/or follow the repo’s existing include pattern for strerror).

Copilot uses AI. Check for mistakes.
Comment on lines +20 to +26
static int ep_sp_diag_enabled = -1; // -1 = uninitialized
static inline int ep_sp_diag_is_enabled (void) {
if (ep_sp_diag_enabled == -1) {
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
ep_sp_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0;
}
return ep_sp_diag_enabled;
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ep_sp_diag_enabled is lazily initialized without synchronization. If this file’s logging is reachable from multiple threads during session setup/teardown, the first-use initialization can race and is undefined behavior in C. Please use an atomic/once-init pattern (or avoid caching) so the env-var check is thread-safe.

Suggested change
static int ep_sp_diag_enabled = -1; // -1 = uninitialized
static inline int ep_sp_diag_is_enabled (void) {
if (ep_sp_diag_enabled == -1) {
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
ep_sp_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0;
}
return ep_sp_diag_enabled;
static inline int ep_sp_diag_is_enabled (void) {
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
return (val != NULL && val[0] == '1') ? 1 : 0;

Copilot uses AI. Check for mistakes.
Comment on lines +20 to +26
static int ds_diag_enabled = -1; // -1 = uninitialized
static inline int ds_diag_is_enabled (void) {
if (ds_diag_enabled == -1) {
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
ds_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0;
}
return ds_diag_enabled;
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ds_diag_enabled is lazily initialized without synchronization. If multiple threads can reach this logging (e.g., server loop interactions), the first-use initialization can race and is undefined behavior in C. Please switch to an atomic/once-init pattern (or avoid caching) to make the env-var check thread-safe.

Suggested change
static int ds_diag_enabled = -1; // -1 = uninitialized
static inline int ds_diag_is_enabled (void) {
if (ds_diag_enabled == -1) {
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
ds_diag_enabled = (val != NULL && val[0] == '1') ? 1 : 0;
}
return ds_diag_enabled;
static inline int
ds_diag_is_enabled (void)
{
const char *val = getenv ("DOTNET_DiagnosticServerDiag");
return (val != NULL && val[0] == '1') ? 1 : 0;

Copilot uses AI. Check for mistakes.
The disabled osx_arm64 test run job depends on a Libraries_CheckedCoreCLR
build for osx_arm64 existing. When osx_arm64 was removed from the build
platforms list, Azure Pipelines YAML validation fails because the
dependency cannot be resolved (even though the test job has condition: false).

Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Remove osx_arm64 from the Libraries_CheckedCoreCLR build platforms
(only linux-x64 and linux-arm64 needed for userevents investigation).
Also remove the osx_arm64 test job block entirely since setting
condition: false does not skip dependency validation in Azure Pipelines
YAML - the build job must still exist for the dependency to resolve.

Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 20, 2026 21:47
@mdh1418 mdh1418 review requested due to automatic review settings February 20, 2026 21:47
- Remove CLRTestTargetUnsupported from userevents Directory.Build.props
  (keep Mono exclusion only)
- Disable PR triggers for runtime-linker-tests, hardware-intrinsics,
  global-build (dev-innerloop), and runtime-diagnostics pipelines
- All browser-wasm jobs in runtime.yml already had condition: false

Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 21, 2026 00:27
@mdh1418 mdh1418 review requested due to automatic review settings February 21, 2026 00:27
The diagnostic logging at line 896 calls strerror(errno) but the file
did not include <string.h> which declares strerror. This could cause
an implicit function declaration warning/error depending on compiler
settings.

Co-authored-by: mdh1418 <16830051+mdh1418@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 21, 2026 02:34
@mdh1418 mdh1418 review requested due to automatic review settings February 21, 2026 02:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-Tracing-coreclr NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants