-
Notifications
You must be signed in to change notification settings - Fork 10.4k
Prepare for strict coherency #23227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prepare for strict coherency #23227
Conversation
Strict coherency means that if your repo declares a dependency with a CPD attribute, the CPD parent must have a direct dependency on that dependency. This means for aspnetcore that it should add a dependency on System.Resources.Extensions because aspnetcore-tooling needs it. Reasons for this approach over the current approach: - This eliminates some ambiguous tie-breaking scenarios that are very problematic and dangerous in servicing - Those tie breaking scenarios require the use of BAR to break the ties. If the DB data were to be lost then the tie-breaking would do unexpected things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bizarre to need packages unused in this repo but 🆗
It's basically for the reasons listed above. Without this, we have to do complex tie-breaking which can be wrong and is fragile. |
@dougbu Any idea on these last two legs? This change is basically a no-op. |
@mmitche that looks like a provider failure. Our runner for Helix jobs should handle timeouts before the jobs timeout, making it unlikely to be a problem on our side. I reran the failed jobs in hope I'm correct. It's possible this relates to earlier failures @HaoK and @MattGal were discussing in the FR Teams room today. Not sure… |
I haven't made the FR issue yet but I'm tracking https://portal.microsofticm.com/imp/v3/incidents/details/195705642/home , timeouts being wrong will be this symptom. (EDIT: THis is not the problem) |
@dougbu I looked at the two failing legs from attempt 1: One is https://github.com/dotnet/core-eng/issues/10224 ... basically when Service Bus rolls out they break us. Ongoing thing for me. The helix test failures aren't related to anything I'm currently tracking but if you have a reason to be concerned I can look after I get off the current bridge. |
Ah, this is strange but the Helix failure initially looked like an infrastructure issue. Now I think the .NET SDK has changed its layout or is no longer internally consistent. Any suggestions @ViktorHofer, @dsplaisted or @wli3❔ (I'm including @ViktorHofer because dotnet/runtime#28894 is one of the few places I see exit code
|
And, no, I don't understand why the errors don't happen in our regular Windows build jobs. Must be something specific to what's on Windows.10.Amd64.Open agents. |
It's a pretty boring Server Windows 10 sku, generally if something's unusual it's going to be related to previous work on the same machine as we simply can't afford a clean machine per work item. If you see it keep reproing, ping me and i'll investigate. It may be worth making sure nothing modifies the contents of the correlation payload ( C:\h\w\A52B08F5\p ) since that's only unzipped once and then it's on the work items to not modify it in a "bad" way. |
@MattGal thanks❕
Right now /fyi @HaoK and @Pilchie because you're often the first to notice Helix infrastructure problems. @mmitche the rerun passed so you're all set. |
@bozturkMSFT for the install script issue |
It appears that the folder @dougbu Checking if something attempts to delete the folder might help. It may be that one of the files is still in use while deleting. |
Hmm, I don't know but thought the Helix workspace was cleaned up automatically. @MattGal❔ If not, I guess we could add an extra cleanup step to our runtests.cmd|sh scripts. We're also moving to use the built-in SDK installation with #23585. Does that change anything about cleanup between Helix jobs❔ |
There's a couple things to unpack here:
All that said, I'll take a look real quick and see if there's a story in the logs. It might still be possible via leaking a dotnet.exe just right that a machine tries to clean this correlation folder, and ends up with a dirty folder. |
@dougbu This is actually something I've been dealing with for some time, and have somewhat of a mitigation for it. The underlying problem is some ways away from being fixed. Switching to not running this install script (or installing to %TEMP%) should unblock you and is the preferable way to work around this. In the end you're hitting the problem described in https://github.com/dotnet/core-eng/issues/10224 . This is caused whenever ServiceBus "upgrades" their service fabric service which for unknown reasons they've been doing a lot more of lately. This causes them to abandon the lock on a given work item (stored currently in memory only) and give the work to a different Helix machine while they're upgrading. As a result, we reboot the machine and try again. At the same time, machines running your work item run 4 other jobs' work items (note we cache the previous 3) so we know that your correlation payload was attempted to be cleaned up (and in fact I found no evidence that it actually failed to do so) and the work item ran on a machine that had previously run Microsoft.AspNetCore.Diagnostics.HealthChecks.Tests--net5.0 from the same job. Until durable locking exists, I've merged a change (should roll out Weds) that detects this situation and ... just lets the work item that is running finish. There will be some weird duplications caused by this but which I believe we've already hardened against years ago. Regardless of how long this takes (and I'll be eagerly awaiting this vital feature), stopping running the SDK installer as part of work items and just bringing the SDK as a payload is the way forward. |
@HaoK this isn't the direction you've been moving with #23585 because we still need to install runtimes on the Helix agents. I'm also wondering if we need to include the |
Right, waiting for the helix changes to flow and we should be able to switch to the built in SDK support. We do still install the runtimes, if the helix SDK wants to add built in ability similar to the SDK support we'd obviously love to switch to that too :). RunTests shouldn't matter as it doesn't do anything to the correlation payloads |
note that under normal circumstances, modifying and even leaking file handles to the payload directory is unadvised but won't hurt anything. Wednesday's workaround might make it stop more or less entirely for you, but please do still avoid modifying the correlation payloads in general whenever possible. |
Not sure what you mean by not getting what we want, prebuilding the dotnet folder image at build time rather than having the work items install the runtime, sounds like we would end up building a lot of permutations of helix payloads, while we could certainly do that, why is that better than what we do today? |
The build agent already has the SDK and all necessary runtimes for Windows installed -- including the just-built
I'm hoping the main downsides are (a) the tested SDK isn't exactly what we would build for the target OS and (b) we upload the SDK and runtimes everywhere instead of installing them only in specific runs e.g. for the templates tests i.e. the correlation payload gets larger.
Because, as @MattGal has made clear above, we're changing the correlation payload folders on the Helix agents and that's unsupported. If we do #23585, I'm not seeing a real improvement in this area. Yes, it may improve the SDK installation leftovers but doesn't help w/ the runtimes. I agree the ideal would be the Helix SDK supporting runtime and SDK installation simultaneously (at different versions). But, without that feature, it's not clear the Helix SDK feature helps much. |
The biggest issue is we don't build our helix payloads on different agents, we only build on windows and publish to all queues, and then on once on linux arm for the arm64 queues. I guess if the .dotnet folder is truly cross plat that will work? |
If the main concern is just about modifying the correlation payloads, we don't do that on non windows since its not allowed, and we install to a different folder in the workitem root, we intentionally weren't doing that on windows to take advantage of installing it once. We can install the runtime into the workitem root just like the shell scripts if that's the main concern. |
I'm not sure what you mean in the later sentences because this conversation is about our changes in correlation folders on Windows agents. |
I'm saying if we just don't want to do modify the correlation folders, we can just install it somewhere else rather than there, that's what the .sh files do. |
@HaoK that sounds good to me. Now that I think about it more, using the Helix SDK for the .NET SDK part of it won't hurt and may improve our |
At least on Helix machines, TMP pollution should be solved; you now have a TMP directory per job, which gets automatically cleaned up as long as you resolve paths using TMP/TEMP env vars. |
Excellent. Even so, using the Helix SDK if it installs outside the correlation folders at least reduces our infrastructure duplication. Any other upsides I'm forgetting❔ |
Just to be clear, we don't control where the helix sdk installs the dotnet SDK, I'm just talking about where we install our dotnet runtime |
dotnet SDK is a first class feature from the helix sdk, the whole idea is we should try to use it as is... right? |
If it can be used, great. The lack of support for installing additional runtimes is a pain but not horrendous. |
Strict coherency means that if your repo declares a dependency with a CPD attribute, the CPD parent must have a direct dependency on that dependency.
This means for aspnetcore that it should add a dependency on System.Resources.Extensions because aspnetcore-tooling needs it.
Reasons for this approach over the current approach: