[Profiler] workaround for stack overflow during nested signal handling (#177) #8162

Open
korniltsev-grafanista wants to merge 1 commit into DataDog:master from grafana:kk/mask-sigprof

@korniltsev-grafanista korniltsev-grafanista commented Feb 5, 2026

The alternate signal stack (altstack) overflows when coreclr's SIGSEGV handler is interrupted by a SIGPROF signal.

Here is the trace from the kernel (obtained with this vibecoded eBPF tracer: https://github.com/korniltsev-grafanista/signalsnoop):

rsp: 0x00007ee895e36b30
get_sigframe for tgid=974490 tid=974548 (.NET TP Worker), ret=0x7ee895e35f38
x64_setup_rt_frame failed for tgid=974490 tid=974548 (.NET TP Worker), sig=27, ret=-14
7ee895e36000-7ee895e39000 rw-p 00000000 00:00 0 // altstack
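
The addresses in this summary can be checked with a little arithmetic. The sketch below is only an illustration: the constant and function names are made up for this comment, and the values are copied from the trace above. On x86-64 Linux, sig=27 is SIGPROF and ret=-14 is -EFAULT, which is consistent with the kernel failing to write a signal frame that would start below the altstack mapping.

```c
#include <stdint.h>

/* Values copied from the trace; the names are illustrative only. */
#define ALTSTACK_LOW   UINT64_C(0x00007ee895e36000) /* start of the altstack mapping */
#define ALTSTACK_HIGH  UINT64_C(0x00007ee895e39000) /* end of the altstack mapping */
#define INTERRUPTED_SP UINT64_C(0x00007ee895e36b30) /* rsp when SIGPROF arrived */
#define SIGFRAME_ADDR  UINT64_C(0x00007ee895e35f38) /* address returned by get_sigframe */

/* Total size of the altstack mapping in bytes. */
uint64_t altstack_size(void) { return ALTSTACK_HIGH - ALTSTACK_LOW; }

/* Bytes still free below the interrupted stack pointer (the stack grows down). */
uint64_t bytes_free(void) { return INTERRUPTED_SP - ALTSTACK_LOW; }

/* Distance between the interrupted stack pointer and the new signal frame. */
uint64_t bytes_wanted(void) { return INTERRUPTED_SP - SIGFRAME_ADDR; }

/* How far below the mapping the new frame would start; 0 if it fits. */
uint64_t bytes_past_low_end(void) {
    return SIGFRAME_ADDR < ALTSTACK_LOW ? ALTSTACK_LOW - SIGFRAME_ADDR : 0;
}
```

With these values the mapping is 12288 bytes, 2864 bytes were still free below rsp, placing the SIGPROF frame needed 3064 bytes, and the frame would have started 200 bytes below the mapping.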
Details
get_sigframe for tgid=974490 tid=974548 (.NET TP Worker), ret=0x7ee895e35f38
        x64_setup_rt_frame
        arch_do_signal_or_restart
        irqentry_exit_to_user_mode
        asm_sysvec_apic_timer_interrupt
    Userspace registers:
        rip: 0x00007f2a70c16874  rsp: 0x00007ee895e36b30  flags: 0x0000000000000202
        rax: 0x00007ee895e37800  rbx: 0x00007ee87fbfdd68  rcx:   0x00007f2a70c16972
        rdx: 0x00007ee895e36b70  rsi: 0x0000000000000000  rdi:   0x00007ee895e36b90
        rbp: 0x00007ee895e377e0  r8:  0x00007ee895e37800  r9:    0x00007ee87418b629
        r10: 0x00000000000000f5  r11: 0x00007f2a714c1000  r12:   0x00007ee87fbfdd70
        r13: 0x00007ee895e36b90  r14: 0x00007ee895e38480  r15:   0x00007ee87fbfde00

x64_setup_rt_frame failed for tgid=974490 tid=974548 (.NET TP Worker), sig=27, ret=-14
    sa_flags: 0x14000004 (SA_RESTORER=true)
        arch_do_signal_or_restart
        irqentry_exit_to_user_mode
        asm_sysvec_apic_timer_interrupt
    Userspace registers:
        rip: 0x00007f2a70c16874  rsp: 0x00007ee895e36b30  flags: 0x0000000000000202
        rax: 0x00007ee895e37800  rbx: 0x00007ee87fbfdd68  rcx:   0x00007f2a70c16972
        rdx: 0x00007ee895e36b70  rsi: 0x0000000000000000  rdi:   0x00007ee895e36b90
        rbp: 0x00007ee895e377e0  r8:  0x00007ee895e37800  r9:    0x00007ee87418b629
        r10: 0x00000000000000f5  r11: 0x00007f2a714c1000  r12:   0x00007ee87fbfdd70
        r13: 0x00007ee895e36b90  r14: 0x00007ee895e38480  r15:   0x00007ee87fbfde00
    Stack probe:
        [sp+0] 0x00007ee895e36b30: 0x00007f2a70be4ad0
        [sp-128] 0x00007ee895e36ab0: 0x0000000000000000
        [sp-568] 0x00007ee895e368f8: 0x0000000000000000
        [sp-700] 0x00007ee895e36874: 0x0000000000000000

signal_setup_done failed for tgid=974490 tid=974548 (.NET TP Worker), sig=27, ret=1
        signal_setup_done
        arch_do_signal_or_restart
        irqentry_exit_to_user_mode
        asm_sysvec_apic_timer_interrupt
    Userspace registers:
        rip: 0x00007f2a70c16874  rsp: 0x00007ee895e36b30  flags: 0x0000000000000202
        rax: 0x00007ee895e37800  rbx: 0x00007ee87fbfdd68  rcx:   0x00007f2a70c16972
        rdx: 0x00007ee895e36b70  rsi: 0x0000000000000000  rdi:   0x00007ee895e36b90
        rbp: 0x00007ee895e377e0  r8:  0x00007ee895e37800  r9:    0x00007ee87418b629
        r10: 0x00000000000000f5  r11: 0x00007f2a714c1000  r12:   0x00007ee87fbfdd70
        r13: 0x00007ee895e36b90  r14: 0x00007ee895e38480  r15:   0x00007ee87fbfde00
    Stack probe:
        [sp+0] 0x00007ee895e36b30: 0x00007f2a70be4ad0
        [sp-128] 0x00007ee895e36ab0: 0x0000000000000000
        [sp-568] 0x00007ee895e368f8: 0x0000000000000000
        [sp-700] 0x00007ee895e36874: 0x0000000000000000

vfs_coredump for tgid=974490 tid=974548 (.NET TP Worker)
        vfs_coredump
        get_signal
        arch_do_signal_or_restart
        irqentry_exit_to_user_mode
        asm_sysvec_apic_timer_interrupt
    Userspace registers:
        rip: 0x00007f2a70c16874  rsp: 0x00007ee895e36b30  flags: 0x0000000000000202
        rax: 0x00007ee895e37800  rbx: 0x00007ee87fbfdd68  rcx:   0x00007f2a70c16972
        rdx: 0x00007ee895e36b70  rsi: 0x0000000000000000  rdi:   0x00007ee895e36b90
        rbp: 0x00007ee895e377e0  r8:  0x00007ee895e37800  r9:    0x00007ee87418b629
        r10: 0x00000000000000f5  r11: 0x00007f2a714c1000  r12:   0x00007ee87fbfdd70
        r13: 0x00007ee895e36b90  r14: 0x00007ee895e38480  r15:   0x00007ee87fbfde00
    Stack probe:
        [sp+0] 0x00007ee895e36b30: 0x00007f2a70be4ad0
        [sp-128] 0x00007ee895e36ab0: 0x0000000000000000
        [sp-568] 0x00007ee895e368f8: 0x0000000000000000
        [sp-700] 0x00007ee895e36874: 0x0000000000000000
    Process maps:
    
    
7ee895e36000-7ee895e39000 rw-p 00000000 00:00 0   <-- rsp=0x7ee895e36b30, rbp=0x7ee895e377e0, rax=0x7ee895e37800, rdx=0x7ee895e36b70, rdi=0x7ee895e36b90, r8=0x7ee895e37800, r13=0x7ee895e36b90, r14=0x7ee895e38480
7ee895e35f38


This PR adds wrappers for sigaction and pthread_sigmask.

The sigaction wrapper masks SIGPROF when the signal being registered is SIGSEGV, which prevents the stack overflow when SIGPROF is delivered during SIGSEGV handling.
The pthread_sigmask wrapper also unblocks SIGPROF whenever SIGSEGV is being unblocked: the runtime unblocks SIGSEGV itself, so SIGPROF must be unblocked as well to let the profiler keep working and receiving SIGPROF signals.
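
For readers who want to see the idea in code, here is a minimal sketch of the two wrappers. It is not the code from this PR: the names `wrapped_sigaction` and `wrapped_pthread_sigmask` are hypothetical, it assumes that "masking" means adding SIGPROF to `sa_mask`, it only handles the `SIG_UNBLOCK` case, and it does not show how calls from the runtime are redirected to the wrappers.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical wrapper: when a SIGSEGV handler is installed, add SIGPROF
 * to its sa_mask so SIGPROF stays blocked while that handler runs. */
int wrapped_sigaction(int signum, const struct sigaction *act,
                      struct sigaction *oldact) {
    if (act != NULL && signum == SIGSEGV) {
        struct sigaction patched = *act;      /* never modify the caller's struct */
        sigaddset(&patched.sa_mask, SIGPROF);
        return sigaction(signum, &patched, oldact);
    }
    return sigaction(signum, act, oldact);    /* everything else passes through */
}

/* Hypothetical wrapper: when the caller unblocks SIGSEGV, unblock SIGPROF
 * too, so SIGPROF can be delivered again. */
int wrapped_pthread_sigmask(int how, const sigset_t *set, sigset_t *oldset) {
    if (set != NULL && how == SIG_UNBLOCK && sigismember(set, SIGSEGV) == 1) {
        sigset_t patched = *set;
        sigaddset(&patched, SIGPROF);
        return pthread_sigmask(how, &patched, oldset);
    }
    return pthread_sigmask(how, set, oldset);
}
```

Copying the caller's `struct sigaction` and `sigset_t` before patching keeps the wrappers invisible to the caller, which still sees its own arguments unchanged.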

This is the code I used to make the problem easier to reproduce; it triggers a null dereference in managed code in a busy loop: https://github.com/grafana/pyroscope-dotnet/pull/177/files#diff-75c68efd0775648bc723a9557e3cc83a294c79852dc5dc2c733e8bb80e15a9fcR18

Disclaimer: the problem was initially reported by a Grafana user (using the fork https://github.com/grafana/pyroscope-dotnet). We then confirmed the same problem with the original (non-fork) dd-trace-dotnet v0.34.0.
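
Separately from the managed repro linked above, the nesting itself can be observed in a few lines of native code. The sketch below is a simplified stand-in, not the repro from the PR: it uses `raise` instead of a real fault and a real profiling timer, it does not use an altstack, and the function name `sigprof_nests_inside_sigsegv` is made up for illustration. It reports whether a SIGPROF handler ran while the SIGSEGV handler was still active, with and without SIGPROF in the SIGSEGV handler's `sa_mask`.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t inside_segv_handler = 0;
static volatile sig_atomic_t sigprof_was_nested = 0;

static void on_sigprof(int signo) {
    (void)signo;
    if (inside_segv_handler) sigprof_was_nested = 1;  /* ran on top of SIGSEGV */
}

static void on_sigsegv(int signo) {
    (void)signo;
    inside_segv_handler = 1;
    raise(SIGPROF);          /* stands in for a profiler timer tick */
    inside_segv_handler = 0;
}

/* Returns 1 if SIGPROF was handled inside the SIGSEGV handler,
 * 0 if it was deferred until that handler returned, -1 on setup error. */
int sigprof_nests_inside_sigsegv(int mask_sigprof) {
    struct sigaction prof_action, segv_action;

    memset(&prof_action, 0, sizeof prof_action);
    prof_action.sa_handler = on_sigprof;
    sigemptyset(&prof_action.sa_mask);

    memset(&segv_action, 0, sizeof segv_action);
    segv_action.sa_handler = on_sigsegv;
    sigemptyset(&segv_action.sa_mask);
    if (mask_sigprof) sigaddset(&segv_action.sa_mask, SIGPROF);

    if (sigaction(SIGPROF, &prof_action, NULL) != 0) return -1;
    if (sigaction(SIGSEGV, &segv_action, NULL) != 0) return -1;

    sigprof_was_nested = 0;
    raise(SIGSEGV);          /* handler returns normally: no real fault occurred */
    return sigprof_was_nested;
}
```

Calling `sigprof_nests_inside_sigsegv(0)` shows the nested delivery, while `sigprof_nests_inside_sigsegv(1)` shows SIGPROF being held back until the SIGSEGV handler has returned.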

Added an NPE class to the Rideshare app to make the problem easier to reproduce. // Copilot: ignore this code, review the wrappers
@korniltsev-grafanista korniltsev-grafanista requested a review from a team as a code owner February 5, 2026 11:04
@gleocadie
Collaborator

👋 @korniltsev-grafanista, thanks for your contribution.
Do you want to add a test to our integration tests, or should I do it?

@gleocadie gleocadie changed the title from "workaround for stack overflow during nested signal handling (#177)" to "[Profiler] workaround for stack overflow during nested signal handling (#177)" Feb 5, 2026
@gleocadie gleocadie added the area:profiler Issues related to the continous-profiler label Feb 5, 2026
@korniltsev-grafanista
Author

Please do it; I don't really have capacity at the moment.
