Dead lettering tests #101524

Closed
carlossanlop opened this issue Apr 25, 2024 · 12 comments
Labels
area-Infrastructure, Known Build Error

Comments

@carlossanlop
Member

carlossanlop commented Apr 25, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/public/public%20Team/_build/results?buildId=654910
Build error leg or test failing: browser-wasm windows Release LibraryTests_Smoke_AOT

Error Message

{
  "ErrorMessage" : "If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.",
  "BuildRetry" : false,
  "ExcludeConsoleLog" : false
}
If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.

What this means:

- All attempts to retry execution of this work item were unable to complete.  This can be both for infrastructure reasons (problems within Azure) or issues with the work item (for instance, causing a machine to reboot unexpectedly or killing the Helix client on the machine will force a retry).
- No further work will be done for this specific work item, and its exit code is set to an artificial -1 (since it did not complete, there is no real exit code).
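
The retry-then-dead-letter behaviour described above can be sketched as a small model. This is only an illustration, not Helix's actual implementation: the retry limit (`MAX_ATTEMPTS`), the helper names, and the use of `RuntimeError` to stand for an interrupted attempt are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

DEAD_LETTER_EXIT_CODE = -1  # artificial exit code described in the message above
MAX_ATTEMPTS = 3            # assumption: the real retry limit is not stated here


@dataclass
class WorkItemResult:
    exit_code: int
    attempts: int
    dead_lettered: bool


def run_with_retries(attempt: Callable[[], int],
                     max_attempts: int = MAX_ATTEMPTS) -> WorkItemResult:
    """Run `attempt` until it completes; dead-letter if no attempt completes.

    `attempt` returns the work item's real exit code when it completes and
    raises RuntimeError when it is interrupted (reboot, client killed, etc.).
    """
    for n in range(1, max_attempts + 1):
        try:
            return WorkItemResult(exit_code=attempt(), attempts=n, dead_lettered=False)
        except RuntimeError:
            continue  # the attempt never completed, so it is retried
    # Every attempt was interrupted: no real exit code exists.
    return WorkItemResult(DEAD_LETTER_EXIT_CODE, max_attempts, True)


def always_reboots() -> int:
    # Hypothetical work item that takes the machine down on every attempt.
    raise RuntimeError("machine rebooted mid-run")


def fails_normally() -> int:
    # In this model, a run that completes with a failing exit code is not retried.
    return 1


print(run_with_retries(always_reboots))
print(run_with_retries(fails_normally))
```

In this model the distinction is between a run that completes with a bad exit code, which is reported as-is, and a run that never completes, which ends up with the artificial -1.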

Common causes:

- Disabled queue (end-of-life Helix queues are automatically forwarded to deadletter and will fail instantly)
- Unhealthy Helix Client machine(s)
- Queue has been backed up heavily by a large amount of work and was manually purged by the engineering team
- Azure issues (e.g. Service Bus is overloaded)
- Malformed payloads; if Helix cannot download and unzip all payloads successfully, work will retry until dead-lettered.

For follow up:

- Check if your Helix Queue is still enabled, either via the metadata you see by browsing to https://helix.dot.net/api/info/queues?api-version=2019-06-17 or recent emails from the .NET Engineering Infrastructure team.
- Check that all work item payloads are accessible using a browser.
- If you are sending to a non-disabled queue and find this error repeatedly occurring, please contact the dnceng team.
- If a single, specific work item dead letters and others do not, consider local debugging; it may be causing spontaneous reboot (or triggering one intentionally).
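
The first follow-up step (checking whether a queue is still enabled) could be scripted roughly as below. This is a sketch under assumptions: the URL is the one quoted above, but the response shape (a JSON list of objects with `QueueId` and `IsAvailable` keys) is assumed rather than confirmed here, and the sample entries are hypothetical stand-ins for real data.

```python
import json
import urllib.request

QUEUES_URL = "https://helix.dot.net/api/info/queues?api-version=2019-06-17"


def fetch_queues(url: str = QUEUES_URL) -> list:
    """Download the queue metadata list (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def queue_status(queues: list, queue_id: str) -> str:
    """Return 'enabled', 'disabled', or 'not found' for `queue_id`.

    Assumes each entry is a dict with "QueueId" and "IsAvailable" keys;
    inspect the real response and adjust the key names if they differ.
    """
    wanted = queue_id.lower()
    for entry in queues:
        if str(entry.get("QueueId", "")).lower() == wanted:
            return "enabled" if entry.get("IsAvailable") else "disabled"
    return "not found"


# Hypothetical sample data standing in for the real response.
sample = [
    {"QueueId": "windows.amd64.server2022.open.rt", "IsAvailable": True},
    {"QueueId": "example.retired.queue", "IsAvailable": False},
]

for name in ("windows.amd64.server2022.open.rt", "example.retired.queue", "no.such.queue"):
    print(name, "->", queue_status(sample, name))
```

To run it against live data, replace `sample` with `fetch_queues()`.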

Report

Build Definition Test Pull Request
710640 dotnet/runtime System.Reflection.Tests.WorkItemExecution #103595
710441 dotnet/runtime Workloads-Wasm.Build.NativeRebuild.Tests.OptimizationFlagChangeTests.WorkItemExecution
710444 dotnet/runtime System.Threading.Tasks.Parallel.Tests.WorkItemExecution
710229 dotnet/runtime System.Security.Cryptography.Csp.Tests.WorkItemExecution #103560
709965 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
709966 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
709954 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
709957 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
709335 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
709334 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
709332 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
709333 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
709068 dotnet/runtime System.Threading.Tasks.Dataflow.Tests.WorkItemExecution #100041
708944 dotnet/runtime System.Runtime.Tests.WorkItemExecution #103416
708854 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
708846 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
708853 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
708851 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
708463 dotnet/runtime System.Globalization.CalendarsWithConfigSwitch.Tests.WorkItemExecution
708201 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
708194 dotnet/runtime JIT.Generics.WorkItemExecution #103484
708055 dotnet/runtime System.ComponentModel.TypeConverter.Tests.WorkItemExecution #102927
708065 dotnet/runtime System.Linq.Queryable.Tests.WorkItemExecution #103479
708015 dotnet/runtime System.Runtime.Serialization.BinaryFormat.Tests.WorkItemExecution #103370
708020 dotnet/runtime System.IO.Tests.WorkItemExecution #103337
708039 dotnet/runtime System.CodeDom.Tests.WorkItemExecution #103456
707938 dotnet/runtime Workloads-Wasm.Build.Tests.SatelliteAssembliesTests.WorkItemExecution #98494
706459 dotnet/runtime iOS.Simulator.LibraryMode.Test.WorkItemExecution #102748
707775 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
707765 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
707769 dotnet/runtime System.Diagnostics.TraceSource.Config.Tests.WorkItemExecution
707764 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
703066 dotnet/runtime System.Runtime.Tests.WorkItemExecution #102670
707322 dotnet/runtime System.IO.Tests.WorkItemExecution #103442
707058 dotnet/runtime System.Diagnostics.Tools.Tests.WorkItemExecution
707052 dotnet/runtime Regression_3.WorkItemExecution
707057 dotnet/runtime System.Runtime.Extensions.Tests.WorkItemExecution
706980 dotnet/runtime System.IO.Tests.WorkItemExecution
706568 dotnet/runtime System.Composition.Convention.Tests.WorkItemExecution #103411
706525 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
706528 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
706532 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
706522 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
706429 dotnet/runtime tvOS.Device.Aot.Test.WorkItemExecution #103232
706163 dotnet/runtime Regression_3.WorkItemExecution
706180 dotnet/runtime System.Diagnostics.TextWriterTraceListener.Tests.WorkItemExecution #103393
706187 dotnet/runtime System.Collections.Specialized.Tests.WorkItemExecution #103364
706167 dotnet/runtime System.Diagnostics.FileVersionInfo.Tests.WorkItemExecution
705894 dotnet/runtime Microsoft.Extensions.Diagnostics.Tests.WorkItemExecution #103379
705827 dotnet/runtime System.IO.Tests.WorkItemExecution #103366
2472316 dotnet-runtime System.Dynamic.Runtime.Tests.WorkItemExecution #40292
2471911 dotnet-runtime release.Partition0.WorkItemExecution
2472143 dotnet-runtime release.Partition0.WorkItemExecution
705295 dotnet/runtime System.IO.Tests.WorkItemExecution
2471843 dotnet-runtime System.Formats.Cbor.Tests.WorkItemExecution #40292
705034 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
705035 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
705022 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
705026 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
704278 dotnet/runtime System.Collections.Specialized.Tests.WorkItemExecution #99747
704203 dotnet/runtime Regression_3.WorkItemExecution
704267 dotnet/runtime System.Globalization.Tests.WorkItemExecution
704117 dotnet/runtime System.Data.Odbc.Tests.WorkItemExecution
703927 dotnet/runtime System.Diagnostics.Tools.Tests.WorkItemExecution
703923 dotnet/runtime Regression_3.WorkItemExecution
703814 dotnet/runtime System.Runtime.Serialization.Schema.Tests.WorkItemExecution #103283
703591 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
703592 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
703588 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
703582 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
703614 dotnet/runtime Invariant.Tests.WorkItemExecution #100030
703283 dotnet/runtime Regression_3.WorkItemExecution
703219 dotnet/runtime iOS.Simulator.LibraryMode.Test.WorkItemExecution
703291 dotnet/runtime System.IO.Tests.WorkItemExecution #103266
703287 dotnet/runtime System.Diagnostics.DiagnosticSource.Switches.Tests.WorkItemExecution
703222 dotnet/runtime System.Globalization.Calendars.Tests.WorkItemExecution
703069 dotnet/runtime JIT_Math.WorkItemExecution
703205 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
703121 dotnet/runtime Regression_3.WorkItemExecution
703077 dotnet/runtime System.Globalization.Tests.WorkItemExecution
702874 dotnet/runtime iOS.Simulator.LibraryMode.Test.WorkItemExecution
702836 dotnet/runtime System.Console.Tests.WorkItemExecution
702918 dotnet/runtime Regression_3.WorkItemExecution
703011 dotnet/runtime System.Formats.Tar.Tests.WorkItemExecution
702973 dotnet/runtime System.IO.Compression.Tests.WorkItemExecution
702832 dotnet/runtime Regression_3.WorkItemExecution
702967 dotnet/runtime System.IO.Compression.Tests.WorkItemExecution
702934 dotnet/runtime System.Collections.Specialized.Tests.WorkItemExecution #103138
702711 dotnet/runtime Regression_3.WorkItemExecution
702715 dotnet/runtime System.Console.Tests.WorkItemExecution
2470077 dotnet-runtime System.Formats.Tar.Manual.Tests.WorkItemExecution #40222
702615 dotnet/runtime System.Diagnostics.TextWriterTraceListener.Tests.WorkItemExecution #103159
2470076 dotnet-runtime System.IO.FileSystem.Manual.Tests.WorkItemExecution #40221
702578 dotnet/runtime tvOS.Device.Aot.Test.WorkItemExecution #103226
702185 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
702184 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
702186 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
702182 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
701895 dotnet/runtime ComInterfaceGenerator.Unit.Tests.WorkItemExecution
701694 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
Displaying 100 of 338 results
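
To see which work items dominate a report like the one above, the rows can be tallied with a few lines of Python. This is a rough sketch that assumes each row is whitespace-separated as `build definition test [#PR]`; the function names are made up for the example, and the sample contains only a handful of rows copied from the table.

```python
from collections import Counter
from typing import Optional

SUFFIX = ".WorkItemExecution"


def parse_row(line: str) -> Optional[dict]:
    """Parse one report row; return None for headers or malformed lines."""
    parts = line.split()
    if len(parts) < 3 or not parts[0].isdigit():
        return None
    test = parts[2]
    if test.endswith(SUFFIX):
        test = test[: -len(SUFFIX)]
    pr = parts[3] if len(parts) > 3 and parts[3].startswith("#") else None
    return {"build": int(parts[0]), "definition": parts[1], "test": test, "pr": pr}


def count_tests(lines) -> Counter:
    """Count how often each test name appears among the parsed rows."""
    rows = (parse_row(line) for line in lines)
    return Counter(row["test"] for row in rows if row is not None)


sample_rows = [
    "Build Definition Test Pull Request",
    "710640 dotnet/runtime System.Reflection.Tests.WorkItemExecution #103595",
    "709965 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution",
    "709966 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution",
    "709334 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution",
]

for test, hits in count_tests(sample_rows).most_common():
    print(hits, test)
```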

Summary

24-Hour Hit Count: 8
7-Day Hit Count: 77
1-Month Count: 338
@carlossanlop carlossanlop added the arch-wasm, os-windows, wasm-aot-test, Known Build Error, and os-browser labels Apr 25, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label label Apr 25, 2024
Contributor

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot added the untriaged label Apr 25, 2024
@carlossanlop
Member Author

In the same PR's build where the above dead-letter failure was found, there's another run for a similar queue, but it is Release. It does not dead-letter immediately: it manages to print a couple of lines, then dies:

Console log: 'WasmTestOnChrome-System.Runtime.Tests' from job 2736decc-83a1-4c32-9e0a-ef543e0d26f3 (windows.amd64.server2022.open.rt) using docker image mcr.microsoft.com/dotnet-buildtools/prereqs:windowsservercore-ltsc2022-helix-webassembly on a000NF0
running %HELIX_CORRELATION_PAYLOAD%\scripts\be8b1ad5c1e9498d89709f26e508c549\execute.cmd in C:\h\w\B0E6099F\w\A1EF089A\e max 3600 seconds

^ It just dies after printing the second line.

I do not want to open a KnownBuildError issue for this specific failure, as its signature would end up grouping unrelated failures. I am concerned that people are going to get blocked on getting their PRs merged because they cannot bypass the merge on green restriction. For example, this PR is already blocked and I will only be able to merge it if I JIT elevate myself: #101498
@JulieLeeMSFT @hoyosjs @jkoritzinsky
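
The grouping concern can be illustrated with a small sketch. It assumes that a known-build-error signature is compared to a console log with a plain substring check, which is a simplification for the example; the log strings below are hypothetical, not taken from real runs.

```python
def matches(error_message: str, console_log: str) -> bool:
    """Assumed matching model: the signature is a substring of the log."""
    return error_message in console_log


# Hypothetical console logs. All of them start the same way, because the
# "running ... max 3600 seconds" line is printed before any test output.
logs = {
    "died_after_two_lines": (
        "Console log: 'ExampleWorkItem' from job <job-id>\n"
        "running execute.cmd in <workdir> max 3600 seconds\n"
    ),
    "healthy_run": (
        "Console log: 'ExampleWorkItem' from job <job-id>\n"
        "running execute.cmd in <workdir> max 3600 seconds\n"
        "Tests run: 10, Failures: 0\n"
    ),
    "unrelated_failure": (
        "Console log: 'OtherWorkItem' from job <job-id>\n"
        "running execute.cmd in <workdir> max 3600 seconds\n"
        "System.OutOfMemoryException\n"
    ),
}

# The only text available in the truncated log is generic preamble.
signature = "max 3600 seconds"

grouped = [name for name, log in logs.items() if matches(signature, log)]
print(grouped)
```

Because the truncated log contains nothing except the generic preamble, any signature built from it also matches healthy runs and unrelated failures.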

@jkotas
Member

jkotas commented Apr 25, 2024

I am concerned that people are going to get blocked on getting their PRs merged because they cannot bypass the merge on green restriction. For example, this PR is already blocked and I will only be able to merge it if I JIT elevate myself:

People should be able to use https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#bypassing-build-analysis . Have you tried that before JIT elevating?

@carlossanlop
Member Author

People should be able to use https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#bypassing-build-analysis . Have you tried that before JIT elevating?

I was not aware of that. Thanks for sharing! I'll try it next time.

@agocke
Member

agocke commented Apr 25, 2024

What does dead-lettering mean in this context? What is the case where this fails?

@vcsjones vcsjones removed the needs-area-label label Apr 25, 2024
@lewing
Member

lewing commented Apr 25, 2024

iirc Dead Lettering is usually infrastructure related, is that correct @steveisok ?

We often see queues fall over around branch time

If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.

What this means:

- All attempts to retry execution of this work item were unable to complete.  This can be both for infrastructure reasons (problems within Azure) or issues with the work item (for instance, causing a machine to reboot unexpectedly or killing the Helix client on the machine will force a retry).
- No further work will be done for this specific work item, and its exit code is set to an artificial -1 (since it did not complete, there is no real exit code).

Common causes:

- Disabled queue (end-of-life Helix queues are automatically forwarded to deadletter and will fail instantly)
- Unhealthy Helix Client machine(s)
- Queue has been backed up heavily by a large amount of work and was manually purged by the engineering team
- Azure issues (e.g. Service Bus is overloaded)
- Malformed payloads; if Helix cannot download and unzip all payloads successfully, work will retry until dead-lettered.

For follow up:

- Check if your Helix Queue is still enabled, either via the metadata you see by browsing to https://helix.dot.net/api/info/queues?api-version=2019-06-17 or recent emails from the .NET Engineering Infrastructure team.
- Check that all work item payloads are accessible using a browser.
- If you are sending to a non-disabled queue and find this error repeatedly occurring, please contact the dnceng team.
- If a single, specific work item dead letters and others do not, consider local debugging; it may be causing spontaneous reboot (or triggering one intentionally).

@steveisok
Member

iirc Dead Lettering is usually infrastructure related, is that correct @steveisok ?

We often see queues fall over around branch time

Correct. @ilyas1974 is queue dead lettering manually driven or is there some automation involved?

Contributor

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

@lewing lewing changed the title from "Dead lettering in wasm smoke AOT tests" to "Dead lettering tests [disproportionately wasm]" Apr 26, 2024
@lewing lewing removed the arch-wasm label Apr 26, 2024
@ilyas1974

For your dead-lettering question, the answer is yes: it's both a manual and an automated process. We manually dead-letter a queue so that the change takes effect immediately. We then add the dead-letter information to the Helix configuration, so that it persists whenever we make changes to Helix.

@agocke
Member

agocke commented Apr 26, 2024

Ok, so is the expectation that tests should be re-run when dead-lettering happens? That, basically, the run was invalid?

@lewing lewing changed the title from "Dead lettering tests [disproportionately wasm]" to "Dead lettering tests" May 2, 2024
@lewing lewing removed the wasm-aot-test and os-browser labels May 2, 2024
@lewing
Member

lewing commented May 2, 2024

I've removed the wasm references in the labels and title bits because wasm is no longer dominating the failures in any way (with the exception of preview4 which has known problems that are fixed in main)

@lewing lewing removed the os-windows label May 2, 2024
@agocke
Member

agocke commented Jun 18, 2024

Bypassing these tests doesn't seem appropriate. I'm closing the issue.

@agocke agocke closed this as not planned Jun 18, 2024
@dotnet-policy-service dotnet-policy-service bot removed the untriaged label Jun 18, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 18, 2024

8 participants