There are no more endpoints available from the endpoint mapper. #3844
Comments
dotnet/aspnetcore#57416 but for everybody |
We suspect dotnet/sdk#42870 is another manifestation of the same problem |
This seems to be an occurrence of the System.Security.Cryptography.X509Certificates test failures (dotnet/runtime#70320; also dotnet/runtime#20708, dotnet/runtime#74838, and dotnet/runtime#83226), which we encounter occasionally as a known build error. This is an external issue with Windows that occurs when the "CNG Key Isolation" Windows service has crashed or stopped. The only known way to resolve it is to reboot the affected machine, as the underlying service may die and fail to restart when the host is severely resource starved. |
@jeffhandley this is failing more than 30 times a day in CI right now. Who can do that, and how? The existing systems aren't working: there are multiple open issues across multiple repos and an active FR thread. |
I opened 4 random logs out of curiosity and they ran on 3 different machines, so it seems unlikely that this is machine-specific. The only thing they had in common was that they used "Windows-10-10.0.20348-SP0"; possibly there is something special about that build? Could someone from the infra team check whether there were any recent updates before it started happening more frequently (i.e. the Aug 13/14 update)? It may be true that once we hit this it keeps failing until a reboot, but something is definitely causing it to stop working in the first place. What I see in common about the stack trace is:
I'm also seeing some of those tests call CreateSelfSigned from a static constructor, so it's possible that's related (it might be worth trying to move that logic out of the static constructor in ASP.NET to see if that changes anything; my theory is that something could have been partially initialized). So to me it sounds like one of the following is most likely:
|
Is it possible that we aren't disposing something (that CreateSelfSigned in the static constructor) and that thing accumulates resources in this external service? If it's happening more often we might be able to get a repro more easily. |
When we've seen this in the past, it has meant that the "CNG Key Isolation Service" service has exited (crashed, probably). In the past I have tried quite a lot to cause one of these failures to happen using .NET, so I could see how we can stop encountering them. I've never been able to do so. The OS team responsible for it says they only ever see these errors on test machines, and that their recommended answer is to just pave the machine and watch the problem go away. The best short-term fix I can think of is to just be aggressively rebooting the test machines... assuming they're not aggressively rebooted/reimaged/spun-up-from-nothing already. If I could remember the right way to suggest making that inquiry to the engineering first responders, I'd say so here... |
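As a rough illustration of how a test machine could detect the stopped service before running certificate tests, here is a minimal Python sketch. It assumes the service's short name is `KeyIso` and that `sc query` prints a `STATE : <number> <NAME>` line; the helper names `parse_sc_state` and `key_isolation_running` are hypothetical and are not part of Helix or any tooling mentioned in this thread.

```python
import re
import subprocess


def parse_sc_state(sc_output):
    """Return the state name (e.g. 'RUNNING') from `sc query` output, or None."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def key_isolation_running(service_name="KeyIso"):
    """Query the service on the local Windows machine (assumed service name)."""
    result = subprocess.run(
        ["sc", "query", service_name],
        capture_output=True,
        text=True,
        check=False,  # a missing or stopped service should not raise here
    )
    return parse_sc_state(result.stdout) == "RUNNING"


# Illustrative sample of the output shape, not captured from a real machine.
SAMPLE_OUTPUT = """
SERVICE_NAME: KeyIso
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 1  STOPPED
"""

if __name__ == "__main__":
    print(parse_sc_state(SAMPLE_OUTPUT))
```

A check like this would only flag a machine for reboot or removal; it does not explain why the service exited.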
Uptime for machines managed by Helix depends on the queue demand. The machines are kept alive until the load diminishes, then the scaleset scales down by removing the oldest machines first. Out of curiosity, I pulled some data for the last 7 days for windows.amd64.vs2022.pre.open (chosen randomly) on machine uptime. I thought it would be interesting to drop here. |
@garath Am I reading the data correctly that the longest that particular machine went between reboots was 177 minutes, with a p95 uptime less than 60 minutes on any given day this past week? If so, that does invalidate the assertion that rebooting will address it. @bartonjs Are there folks you've talked to before from Windows that we could engage on this reoccurrence? |
Not any single machine, this is the 50th, 75th and 95th percentile for the "uptime" of all machines ever allocated to that queue on a given day. |
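To make the distinction concrete, the sketch below computes queue-wide uptime percentiles and the maximum for one day. The uptime values are made up for illustration (they are not the Helix data discussed here), and the nearest-rank percentile definition is an assumption; the actual query may use a different method.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical uptimes in minutes: one entry per machine allocated to the queue that day.
uptimes_minutes = [12, 18, 25, 31, 40, 44, 52, 58, 95, 177]

summary = {f"p{p}": percentile(uptimes_minutes, p) for p in (50, 75, 95)}
summary["max"] = max(uptimes_minutes)

if __name__ == "__main__":
    print(summary)
```

The percentiles describe the whole population of machines for the day, so the maximum has to be checked separately to rule out a long-lived outlier.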
OK; thanks, @garath. I also checked the max in case there were machines outside the 95th percentile that could have had long uptimes and be the culprits. But the max I saw was 177 minutes; does that sound right to you that the longest any machine went without a reboot was 177 minutes? |
Huh, I had it in my head that the average Helix machine had an uptime more like several days (and the lowest-indexed ones in the scale... cluster... thing... having an uptime of "since patches were last installed"). I've reached out to Windows to see if it's a known thing with known workarounds (or potentially a thing they'd want to look at). |
This could be true for the very, very busy queues. I also believe it was more true a few years ago before we created our own custom autoscaler service. I'm happy to pull data if there are any questions. Just let me know what you'd like to see. |
Looks like that's exactly correct. Below are the top 10 longest uptimes for the last seven days in the Windows.Amd64.VS2022.Pre.Open queue.
|
The exception on aspnetcore builds is slightly different. I doubt it will help, but in case someone is looking for it:
|
Build
https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=781694
Build leg reported
Microsoft.NET.Build.Containers.UnitTests.RegistryTests.InsecureRegistry
Pull Request
https://github.com/dotnet/sdk.git/pull/42858
Known issue core information
Fill out the known issue JSON section by following the step by step documentation on how to create a known issue
@dotnet/dnceng
Release Note Category
Release Note Description
Additional information about the issue reported
No response
Known issue validation
Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=781694
Error message validated:
[There are no more endpoints available from the endpoint mapper.]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 8/21/2024 1:53:12 AM UTC
Report
Summary