Skip to content

fix: introduce level-triggered waitable flag to trigger batcher...#1558

Merged
supervacuus merged 15 commits intomasterfrom
fix/introduce_level-triggered_wait_flag
Mar 6, 2026
Merged

fix: introduce level-triggered waitable flag to trigger batcher...#1558
supervacuus merged 15 commits intomasterfrom
fix/introduce_level-triggered_wait_flag

Conversation

@supervacuus
Copy link
Collaborator

@supervacuus supervacuus marked this pull request as ready for review March 6, 2026 11:14
@supervacuus
Copy link
Collaborator Author

Two things I changed/introduced here that might be of interest:

  • I currently don't use WaitOnAddress on Xbox because I don't know if the synchronization lib is part of the devkit (you can always enable it later if you want to).
  • I now let sentry-crash inherit the platform libs from above, because otherwise we have to introduce dependencies for the core twice (this was triggered by the E2E test for Windows). Let me know @mujacica if that would be a problem.

…meaningless than before...

we are not only bounded by the last flush, but since we now trigger with almost immediate wake on essentially any count in the buffer... we can get requests as high as the number of items we send. The important part is the total count and that one has been stable throughout.
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

… could still be running after the example process terminated due to the crash
@supervacuus supervacuus merged commit a42aca5 into master Mar 6, 2026
50 checks passed
@supervacuus supervacuus deleted the fix/introduce_level-triggered_wait_flag branch March 6, 2026 15:43
jpnurmi added a commit that referenced this pull request Mar 6, 2026
The two-phase shutdown (begin/wait) was designed to signal both the logs
and metrics threads in parallel before joining either. With #1558
replacing the condvar with a level-triggered waitable flag, thread wake
latency drops to sub-millisecond (futex/WaitOnAddress), making the
parallel signaling unnecessary.

Merging into a single function also fixes a race where a concurrent
sentry_init() could replace g_batcher between the signal and join steps,
causing the new batcher to be shut down instead of the old one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi added a commit that referenced this pull request Mar 9, 2026
…unting (#1556)

* test(logs): add stress test for batcher reinit hang

Spawns 8 producer threads continuously logging while the main thread
does 10 cycles of sentry_init/sentry_close. Reproduces a condvar
corruption hang in the batcher under TSan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(batcher): replace static global with dynamic allocation and refcounting

The static g_batcher persisted across init/close cycles, causing condvar
corruption when sentry__batcher_startup re-initialized the condvar while
old threads still used it. Dynamic allocation with refcounting solves
both the condvar corruption and the lifetime problem.

Introduces sentry_batcher_ref_t with acquire/release/swap API that
encapsulates a spinlock to make concurrent access from producer threads
and shutdown safe.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(batcher): release old batcher on swap in startup

When the previous startup failed to spawn its thread, shutdown_begin
returns false and shutdown_wait is never called. The old batcher stays
in g_batcher and leaks when startup swaps it out without releasing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(batcher): NULL-guard batcher in shutdown_wait

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update CHANGELOG.md

* fix(batcher): drain buffer items in release to prevent leaks

A producer that acquired a ref before shutdown could enqueue items after
the final flush. These items were leaked when the batcher was freed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(batcher): use SENTRY_MAKE macro for allocation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(batcher): destroy condvar before freeing batcher memory

Add sentry__cond_destroy for all platforms (pthread_cond_destroy on
POSIX, CloseHandle on pre-Vista Windows, no-op on Vista+) and call it
in sentry__batcher_release. Without this, each SDK re-initialization
leaks kernel handles on pre-Vista Windows and violates POSIX which
requires pthread_cond_destroy before freeing memory containing a
pthread_cond_t.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "fix(batcher): destroy condvar before freeing batcher memory"

This reverts commit 467bfb5ecb2c4528e1048dab1c467203fdc29cdf.

* refactor(batcher): merge two-phase shutdown into single function

The two-phase shutdown (begin/wait) was designed to signal both the logs
and metrics threads in parallel before joining either. With #1558
replacing the condvar with a level-triggered waitable flag, thread wake
latency drops to sub-millisecond (futex/WaitOnAddress), making the
parallel signaling unnecessary.

Merging into a single function also fixes a race where a concurrent
sentry_init() could replace g_batcher between the signal and join steps,
causing the new batcher to be shut down instead of the old one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(metrics): add reinit stress test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: speed up reinit stress tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(batcher): use lock-free peek in crash-safe flush path

The crash-safe flush functions called sentry__batcher_acquire, which
spins on a spinlock. If the crash occurs on the thread holding that
lock, the signal handler deadlocks. Replace with a new lock-free
sentry__batcher_peek that reads ref->ptr via atomic load without
taking the spinlock or bumping the refcount (safe because the process
is dying).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(batcher): use pointer-width atomics in batcher_peek for Windows

On Windows x64, `long` is 32-bit but pointers are 64-bit. Use
InterlockedCompareExchangePointer on Windows and __atomic_load
elsewhere to avoid truncating the pointer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs(batcher): add header comments for ref management functions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(batcher): fix race between force flush and reinit

force_flush_begin and force_flush_wait each acquired the batcher
independently. A concurrent sentry_init could swap the batcher between
the two calls, causing wait to flush the new empty batcher while the
original data is lost.

Fix by returning an opaque token (the acquired batcher ref) from begin
and passing it to wait, ensuring both operate on the same instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update src/sentry_logs.c

Co-authored-by: Mischan Toosarani-Hausberger <mischan@abovevacant.com>

* Update sentry__metrics_startup

* docs(batcher): clarify ownership semantics of options refs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Mischan Toosarani-Hausberger <mischan@abovevacant.com>
jpnurmi added a commit to getsentry/sentry-dotnet that referenced this pull request Mar 11, 2026
Fixes:
```
sentry-native.lib(sentry_batcher.obj) : error LNK2019: unresolved external symbol WaitOnAddress referenced in function batcher_thread_func
sentry-native.lib(sentry_batcher.obj) : error LNK2019: unresolved external symbol WakeByAddressSingle referenced in function sentry__batcher_enqueue
```

See also:
- getsentry/sentry-native#1558
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Structured Logs: follow up - replace cond_wake with level-triggered flag

2 participants