Prepare for strict coherency #23227

mmitche · 2020-06-22T18:53:43Z

Strict coherency means that if your repo declares a dependency with a CPD attribute, the CPD parent must have a direct dependency on that dependency.
This means for aspnetcore that it should add a dependency on System.Resources.Extensions because aspnetcore-tooling needs it.

Reasons for this approach over the current approach:

This eliminates some ambiguous tie-breaking scenarios that are very problematic and dangerous in servicing
Those tie breaking scenarios require the use of BAR to break the ties. If the DB data were to be lost then the tie-breaking would do unexpected things.

Strict coherency means that if your repo declares a dependency with a CPD attribute, the CPD parent must have a direct dependency on that dependency. This means for aspnetcore that it should add a dependency on System.Resources.Extensions because aspnetcore-tooling needs it. Reasons for this approach over the current approach: - This eliminates some ambiguous tie-breaking scenarios that are very problematic and dangerous in servicing - Those tie breaking scenarios require the use of BAR to break the ties. If the DB data were to be lost then the tie-breaking would do unexpected things.

dougbu

Bizarre to need packages unused in this repo but 🆗

mmitche · 2020-06-22T19:05:33Z

Bizarre to need packages unused in this repo but 🆗

It's basically for the reasons listed above. Without this, we have to do complex tie-breaking which can be wrong and is fragile.

…trict

mmitche · 2020-07-08T20:54:06Z

@dougbu Any idea on these last two legs? This change is basically a no-op.

dougbu · 2020-07-08T22:34:16Z

@mmitche that looks like a provider failure. Our runner for Helix jobs should handle timeouts before the jobs timeout, making it unlikely to be a problem on our side. I reran the failed jobs in hope I'm correct.

It's possible this relates to earlier failures @HaoK and @MattGal were discussing in the FR Teams room today. Not sure…

MattGal · 2020-07-08T23:01:51Z

@mmitche that looks like a provider failure. Our runner for Helix jobs should handle timeouts before the jobs timeout, making it unlikely to be a problem on our side. I reran the failed jobs in hope I'm correct.

It's possible this relates to earlier failures @HaoK and @MattGal were discussing in the FR Teams room today. Not sure…

I haven't made the FR issue yet but I'm tracking https://portal.microsofticm.com/imp/v3/incidents/details/195705642/home , timeouts being wrong will be this symptom. (EDIT: THis is not the problem)

MattGal · 2020-07-08T23:12:25Z

@dougbu I looked at the two failing legs from attempt 1:

One is https://github.com/dotnet/core-eng/issues/10224 ... basically when Service Bus rolls out they break us. Ongoing thing for me.

The helix test failures aren't related to anything I'm currently tracking but if you have a reason to be concerned I can look after I get off the current bridge.

dougbu · 2020-07-08T23:32:21Z

Ah, this is strange but the Helix failure initially looked like an infrastructure issue. Now I think the .NET SDK has changed its layout or is no longer internally consistent. Any suggestions @ViktorHofer, @dsplaisted or @wli3❔ (I'm including @ViktorHofer because dotnet/runtime#28894 is one of the few places I see exit code -2147450735 mentioned.)

GET https://dot.net/v1/dotnet-install.ps1
Download of 'https://dot.net/v1/dotnet-install.ps1' complete, saved to C:\h\w\A52B08F5\w\C07B0A70\e\dotnet-install.ps1...
Download of dotnet-install.ps1 complete...
Installing SDK...& C:\h\w\A52B08F5\w\C07B0A70\e\dotnet-install.ps1 -Architecture x64 -Version 5.0.100-preview.7.20330.3 -InstallDir C:\h\w\A52B08F5\p\sdk\x64
dotnet-install: .NET Core SDK version 5.0.100-preview.7.20330.3 is already installed.
dotnet-install: Adding to current process PATH: "C:\h\w\A52B08F5\p\sdk\x64\". Note: This change will not be visible if PowerShell was run as a child process.
Installing Runtime...& C:\h\w\A52B08F5\w\C07B0A70\e\dotnet-install.ps1 -Architecture x64 -Runtime dotnet -Version 5.0.0-preview.8.20357.14 -InstallDir C:\h\w\A52B08F5\p\sdk\x64
dotnet-install: Downloading link: https://dotnetcli.azureedge.net/dotnet/Runtime/5.0.0-preview.8.20357.14/dotnet-runtime-5.0.0-preview.8.20357.14-win-x64.zip
dotnet-install: Extracting zip from https://dotnetcli.azureedge.net/dotnet/Runtime/5.0.0-preview.8.20357.14/dotnet-runtime-5.0.0-preview.8.20357.14-win-x64.zip
dotnet-install: Installation finished
InstallDotNet.ps1 complete...
"Restore: dotnet restore RunTests\RunTests.csproj --source https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json --ignore-failed-sources..."
Found .NET SDK, but did not find dotnet.dll at [C:\h\w\A52B08F5\p\sdk\x64\sdk\5.0.100-preview.7.20330.3\dotnet.dll]
"Running tests: dotnet run --project RunTests\RunTests.csproj -- --target IIS.FunctionalTests.dll --sdk 5.0.100-preview.7.20330.3 --runtime 5.0.0-preview.8.20357.14 --queue Windows.10.Amd64.Open --arch x64 --quarantined false --ef 5.0.0-preview.8.20357.7 --aspnetruntime Microsoft.AspNetCore.App.Runtime.win-x64.5.0.0-ci.nupkg --aspnetref Microsoft.AspNetCore.App.Ref.5.0.0-ci.nupkg --helixTimeout 01:00:00..."
Found .NET SDK, but did not find dotnet.dll at [C:\h\w\A52B08F5\p\sdk\x64\sdk\5.0.100-preview.7.20330.3\dotnet.dll]
neq was unexpected at this time.
"Finished running tests: exit_code=-2147450735"

dougbu · 2020-07-08T23:34:02Z

And, no, I don't understand why the errors don't happen in our regular Windows build jobs. Must be something specific to what's on Windows.10.Amd64.Open agents.

MattGal · 2020-07-09T00:39:40Z

And, no, I don't understand why the errors don't happen in our regular Windows build jobs. Must be something specific to what's on Windows.10.Amd64.Open agents.

It's a pretty boring Server Windows 10 sku, generally if something's unusual it's going to be related to previous work on the same machine as we simply can't afford a clean machine per work item.

If you see it keep reproing, ping me and i'll investigate. It may be worth making sure nothing modifies the contents of the correlation payload ( C:\h\w\A52B08F5\p ) since that's only unzipped once and then it's on the work items to not modify it in a "bad" way.

dougbu · 2020-07-09T01:53:01Z

@MattGal thanks❕

If you see it keep reproing, ping me

Right now runfo search-helix --definition=aspnet --count=50 --value='did not find dotnet.dll' --pr only matched the validation build for this PR. So, a bizarre one-off failure at this point.

/fyi @HaoK and @Pilchie because you're often the first to notice Helix infrastructure problems.

@mmitche the rerun passed so you're all set.

wli3 · 2020-07-09T05:14:08Z

@bozturkMSFT for the install script issue

bekir-ozturk · 2020-07-10T09:46:11Z

It appears that the folder C:\h\w\A52B08F5\p\sdk\x64\sdk\5.0.100-preview.7.20330.3\ exists, but the file C:\h\w\A52B08F5\p\sdk\x64\sdk\5.0.100-preview.7.20330.3\dotnet.dll doesn't. Since the folder is there, the install script assumes that the sdk is installed and doesn't attempt to reinstall it.

@dougbu Checking if something attempts to delete the folder might help. It may be that one of the files is still in use while deleting.

dougbu · 2020-07-10T17:18:19Z

Hmm, I don't know but thought the Helix workspace was cleaned up automatically. @MattGal❔

If not, I guess we could add an extra cleanup step to our runtests.cmd|sh scripts.

We're also moving to use the built-in SDK installation with #23585. Does that change anything about cleanup between Helix jobs❔

MattGal · 2020-07-13T15:29:00Z

Hmm, I don't know but thought the Helix workspace was cleaned up automatically. @MattGal❔

There's a couple things to unpack here:

All payloads, and (recently added) the redirected Temp folder are cleaned up automatically for Helix work items.
From the above path, C:\h\w\A52B08F5, as a folder, will be deleted entirely as we see new work on a given machine. We preserve the most recently seen previous three correlation payload folders on Helix machines to avoid excessive network usage.
C:\h\w\A52B08F5\p\ is the "correlation" payload directory. We're ensure that every file you include with your work item as a correlation payload (that is, shared amongst every work item) is unzipped and has the correct size exactly once. After this time, if a work item deletes files inside it, it will stay this way for subsequent work items. It is not advised that correlation payloads change, though in non-docker scenarios we don't block users from doing so.
There are established mechanisms I've discussed with @HaoK (and I think he's used) that make it so you don't have to run any install scripts, opting instead for bringing along your SDK as a correlation payload.

All that said, I'll take a look real quick and see if there's a story in the logs. It might still be possible via leaking a dotnet.exe just right that a machine tries to clean this correlation folder, and ends up with a dirty folder.

MattGal · 2020-07-13T15:47:44Z

@dougbu This is actually something I've been dealing with for some time, and have somewhat of a mitigation for it. The underlying problem is some ways away from being fixed. Switching to not running this install script (or installing to %TEMP%) should unblock you and is the preferable way to work around this.

In the end you're hitting the problem described in https://github.com/dotnet/core-eng/issues/10224 . This is caused whenever ServiceBus "upgrades" their service fabric service which for unknown reasons they've been doing a lot more of lately. This causes them to abandon the lock on a given work item (stored currently in memory only) and give the work to a different Helix machine while they're upgrading. As a result, we reboot the machine and try again.

At the same time, machines running your work item run 4 other jobs' work items (note we cache the previous 3) so we know that your correlation payload was attempted to be cleaned up (and in fact I found no evidence that it actually failed to do so) and the work item ran on a machine that had previously run Microsoft.AspNetCore.Diagnostics.HealthChecks.Tests--net5.0 from the same job.

Until durable locking exists, I've merged a change (should roll out Weds) that detects this situation and ... just lets the work item that is running finish. There will be some weird duplications caused by this but which I believe we've already hardened against years ago. Regardless of how long this takes (and I'll be eagerly awaiting this vital feature), stopping running the SDK installer as part of work items and just bringing the SDK as a payload is the way forward.

dougbu · 2020-07-13T16:51:34Z

stopping running the SDK installer as part of work items and just bringing the SDK as a payload is the way forward.

@HaoK this isn't the direction you've been moving with #23585 because we still need to install runtimes on the Helix agents. I'm also wondering if we need to include the RunTests bin/ folder in the correlation payload to avoid these problems. Thoughts❔

HaoK · 2020-07-13T18:02:24Z

Right, waiting for the helix changes to flow and we should be able to switch to the built in SDK support. We do still install the runtimes, if the helix SDK wants to add built in ability similar to the SDK support we'd obviously love to switch to that too :). RunTests shouldn't matter as it doesn't do anything to the correlation payloads

MattGal · 2020-07-13T18:26:28Z

note that under normal circumstances, modifying and even leaking file handles to the payload directory is unadvised but won't hurt anything. Wednesday's workaround might make it stop more or less entirely for you, but please do still avoid modifying the correlation payloads in general whenever possible.

dougbu · 2020-07-13T22:35:30Z

@HaoK why are we using the Helix SDK to install the .NET SDK if it doesn't get us where we want. Put another way, why not copy the .dotnet/ folder into the correlation payload (assuming that can be done reliably while running dotnet msbuild)❔ That sounds better to me than the half-step in #23585.

HaoK · 2020-07-14T02:55:55Z

Not sure what you mean by not getting what we want, prebuilding the dotnet folder image at build time rather than having the work items install the runtime, sounds like we would end up building a lot of permutations of helix payloads, while we could certainly do that, why is that better than what we do today?

dougbu · 2020-07-14T03:30:22Z

@HaoK

prebuilding the dotnet folder image at build time

The build agent already has the SDK and all necessary runtimes for Windows installed -- including the just-built Microsoft.AspNetCore.App.* bits. That should be a superset of what would be needed in the Helix jobs i.e. it just adds native files. The change @MattGal and I suggest is a matter of including the existing $(RepoRoot).dotnet/ folder in the correlation payload.

end up building a lot of permutations of helix payloads

I'm hoping the main downsides are (a) the tested SDK isn't exactly what we would build for the target OS and (b) we upload the SDK and runtimes everywhere instead of installing them only in specific runs e.g. for the templates tests i.e. the correlation payload gets larger.

why is that better than what we do today?

Because, as @MattGal has made clear above, we're changing the correlation payload folders on the Helix agents and that's unsupported. If we do #23585, I'm not seeing a real improvement in this area. Yes, it may improve the SDK installation leftovers but doesn't help w/ the runtimes.

I agree the ideal would be the Helix SDK supporting runtime and SDK installation simultaneously (at different versions). But, without that feature, it's not clear the Helix SDK feature helps much.

HaoK · 2020-07-14T03:57:15Z

The biggest issue is we don't build our helix payloads on different agents, we only build on windows and publish to all queues, and then on once on linux arm for the arm64 queues. I guess if the .dotnet folder is truly cross plat that will work?

HaoK · 2020-07-14T05:39:12Z

If the main concern is just about modifying the correlation payloads, we don't do that on non windows since its not allowed, and we install to a different folder in the workitem root, we intentionally weren't doing that on windows to take advantage of installing it once. We can install the runtime into the workitem root just like the shell scripts if that's the main concern.

dougbu · 2020-07-14T05:45:58Z

The .dotnet/ folder isn't quite "truly cross plat" but it should work. This is an expansion of the current reuse of Windows Microsoft.AspNetCore.App.* bits in non-Windows tests.
Yup, that's my main concern. First comply with the Helix requirements, then (maybe) focus on tests being true-to-life.

I'm not sure what you mean in the later sentences because this conversation is about our changes in correlation folders on Windows agents.

HaoK · 2020-07-14T05:48:25Z

I'm saying if we just don't want to do modify the correlation folders, we can just install it somewhere else rather than there, that's what the .sh files do.

dougbu · 2020-07-14T17:20:25Z

@HaoK that sounds good to me.

Now that I think about it more, using the Helix SDK for the .NET SDK part of it won't hurt and may improve our $TMP pollution issue. But, … only if the Helix SDK does its installation somewhere other than the correlation folders.

MattGal · 2020-07-14T17:26:03Z

@HaoK that sounds good to me.

Now that I think about it more, using the Helix SDK for the .NET SDK part of it won't hurt and may improve our $TMP pollution issue. But, … only if the Helix SDK does its installation somewhere other than the correlation folders.

At least on Helix machines, TMP pollution should be solved; you now have a TMP directory per job, which gets automatically cleaned up as long as you resolve paths using TMP/TEMP env vars.

dougbu · 2020-07-14T17:33:41Z

TMP pollution should be solved

Excellent. Even so, using the Helix SDK if it installs outside the correlation folders at least reduces our infrastructure duplication. Any other upsides I'm forgetting❔

HaoK · 2020-07-14T17:38:52Z

Just to be clear, we don't control where the helix sdk installs the dotnet SDK, I'm just talking about where we install our dotnet runtime

HaoK · 2020-07-14T17:39:41Z

dotnet SDK is a first class feature from the helix sdk, the whole idea is we should try to use it as is... right?

dougbu · 2020-07-14T17:42:58Z

If it can be used, great. The lack of support for installing additional runtimes is a pain but not horrendous.

mmitche requested a review from dougbu as a code owner June 22, 2020 18:53

ghost added the area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework label Jun 22, 2020

dougbu approved these changes Jun 22, 2020

View reviewed changes

mmitche added 5 commits June 22, 2020 16:13

Fixup property name

1d55051

Fixup property name

3571753

Merge remote-tracking branch 'upstream/master' into prepare-for-cpd-s…

98b639f

…trict

Update version details xml

9ed9d48

Merge branch 'master' into prepare-for-cpd-strict

7d06caa

mmitche merged commit 8c75a8e into dotnet:master Jul 9, 2020

github-actions bot locked and limited conversation to collaborators Dec 8, 2023

Prepare for strict coherency #23227

Prepare for strict coherency #23227

Uh oh!

Conversation

mmitche commented Jun 22, 2020

Uh oh!

dougbu left a comment

Choose a reason for hiding this comment

Uh oh!

mmitche commented Jun 22, 2020

Uh oh!

mmitche commented Jul 8, 2020

Uh oh!

dougbu commented Jul 8, 2020

Uh oh!

MattGal commented Jul 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MattGal commented Jul 8, 2020

Uh oh!

dougbu commented Jul 8, 2020

Uh oh!

dougbu commented Jul 8, 2020

Uh oh!

MattGal commented Jul 9, 2020

Uh oh!

dougbu commented Jul 9, 2020

Uh oh!

wli3 commented Jul 9, 2020

Uh oh!

bekir-ozturk commented Jul 10, 2020

Uh oh!

dougbu commented Jul 10, 2020

Uh oh!

MattGal commented Jul 13, 2020

Uh oh!

MattGal commented Jul 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dougbu commented Jul 13, 2020

Uh oh!

HaoK commented Jul 13, 2020

Uh oh!

MattGal commented Jul 13, 2020

Uh oh!

dougbu commented Jul 13, 2020

Uh oh!

HaoK commented Jul 14, 2020

Uh oh!

dougbu commented Jul 14, 2020

Uh oh!

HaoK commented Jul 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HaoK commented Jul 14, 2020

Uh oh!

dougbu commented Jul 14, 2020

Uh oh!

HaoK commented Jul 14, 2020

Uh oh!

dougbu commented Jul 14, 2020

Uh oh!

MattGal commented Jul 14, 2020

Uh oh!

dougbu commented Jul 14, 2020

Uh oh!

HaoK commented Jul 14, 2020

Uh oh!

HaoK commented Jul 14, 2020

Uh oh!

dougbu commented Jul 14, 2020

Uh oh!

Uh oh!

MattGal commented Jul 8, 2020 •

edited

Loading

MattGal commented Jul 13, 2020 •

edited

Loading

HaoK commented Jul 14, 2020 •

edited

Loading