
Prepare for strict coherency #23227


Merged · 6 commits · Jul 9, 2020

Conversation

@mmitche (Member) commented Jun 22, 2020

Strict coherency means that if your repo declares a dependency with a CPD (CoherentParentDependency) attribute, the CPD parent must have a direct dependency on that dependency.
This means for aspnetcore that it should add a dependency on System.Resources.Extensions because aspnetcore-tooling needs it (a hypothetical sketch follows the list below).

Reasons for this approach over the current approach:

  • It eliminates some ambiguous tie-breaking scenarios that are very problematic and dangerous in servicing.
  • Those tie-breaking scenarios require the use of BAR (the Build Asset Registry) to break the ties. If the BAR database were lost, the tie-breaking would do unexpected things.
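
To make the rule concrete, here is a hypothetical eng/Version.Details.xml sketch ("CPD" abbreviates the CoherentParentDependency attribute; package names, versions, and SHAs below are illustrative, not this PR's actual contents):

  <!-- aspnetcore-tooling/eng/Version.Details.xml (hypothetical): the CPD attribute
       pins System.Resources.Extensions to whatever version the parent package's
       build consumed. -->
  <Dependency Name="System.Resources.Extensions" Version="5.0.0-preview.8.20357.7"
              CoherentParentDependency="Microsoft.AspNetCore.App.Runtime.win-x64">
    <Uri>https://github.com/dotnet/runtime</Uri>
    <Sha>...</Sha>
  </Dependency>

  <!-- aspnetcore/eng/Version.Details.xml (hypothetical): under strict coherency the
       CPD parent's repo must declare the dependency directly, even if unused, so the
       coherent version can be read straight from the parent's own dependency set. -->
  <Dependency Name="System.Resources.Extensions" Version="5.0.0-preview.8.20357.7">
    <Uri>https://github.com/dotnet/runtime</Uri>
    <Sha>...</Sha>
  </Dependency>

With the second fragment present, resolving the coherent version is a direct lookup instead of an ambiguous tie-break through BAR.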

@mmitche mmitche requested a review from dougbu as a code owner June 22, 2020 18:53
@ghost ghost added the area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework label Jun 22, 2020
@dougbu (Contributor) left a comment

Bizarre to need packages unused in this repo but 🆗

@mmitche (Member, Author) commented Jun 22, 2020

Bizarre to need packages unused in this repo but 🆗

It's basically for the reasons listed above. Without this, we have to do complex tie-breaking which can be wrong and is fragile.

@mmitche (Member, Author) commented Jul 8, 2020

@dougbu Any idea on these last two legs? This change is basically a no-op.

@dougbu (Contributor) commented Jul 8, 2020

@mmitche that looks like a provider failure. Our runner for Helix jobs should handle timeouts before the jobs themselves time out, making it unlikely to be a problem on our side. I reran the failed jobs in the hope I'm correct.

It's possible this relates to earlier failures @HaoK and @MattGal were discussing in the FR Teams room today. Not sure…

@MattGal (Member) commented Jul 8, 2020

@mmitche that looks like a provider failure. Our runner for Helix jobs should handle timeouts before the jobs themselves time out, making it unlikely to be a problem on our side. I reran the failed jobs in the hope I'm correct.

It's possible this relates to earlier failures @HaoK and @MattGal were discussing in the FR Teams room today. Not sure…

I haven't made the FR issue yet, but I'm tracking https://portal.microsofticm.com/imp/v3/incidents/details/195705642/home ; timeouts being wrong would have this symptom. (EDIT: This is not the problem.)

@MattGal (Member) commented Jul 8, 2020

@dougbu I looked at the two failing legs from attempt 1:

One is https://github.com/dotnet/core-eng/issues/10224 ... basically, when Service Bus rolls out, they break us. An ongoing thing for me.

The Helix test failures aren't related to anything I'm currently tracking, but if you have a reason to be concerned I can look after I get off the current bridge.

@dougbu (Contributor) commented Jul 8, 2020

Ah, this is strange but the Helix failure initially looked like an infrastructure issue. Now I think the .NET SDK has changed its layout or is no longer internally consistent. Any suggestions @ViktorHofer, @dsplaisted or @wli3❔ (I'm including @ViktorHofer because dotnet/runtime#28894 is one of the few places I see exit code -2147450735 mentioned.)

GET https://dot.net/v1/dotnet-install.ps1
Download of 'https://dot.net/v1/dotnet-install.ps1' complete, saved to C:\h\w\A52B08F5\w\C07B0A70\e\dotnet-install.ps1...
Download of dotnet-install.ps1 complete...
Installing SDK...& C:\h\w\A52B08F5\w\C07B0A70\e\dotnet-install.ps1 -Architecture x64 -Version 5.0.100-preview.7.20330.3 -InstallDir C:\h\w\A52B08F5\p\sdk\x64
dotnet-install: .NET Core SDK version 5.0.100-preview.7.20330.3 is already installed.
dotnet-install: Adding to current process PATH: "C:\h\w\A52B08F5\p\sdk\x64\". Note: This change will not be visible if PowerShell was run as a child process.
Installing Runtime...& C:\h\w\A52B08F5\w\C07B0A70\e\dotnet-install.ps1 -Architecture x64 -Runtime dotnet -Version 5.0.0-preview.8.20357.14 -InstallDir C:\h\w\A52B08F5\p\sdk\x64
dotnet-install: Downloading link: https://dotnetcli.azureedge.net/dotnet/Runtime/5.0.0-preview.8.20357.14/dotnet-runtime-5.0.0-preview.8.20357.14-win-x64.zip
dotnet-install: Extracting zip from https://dotnetcli.azureedge.net/dotnet/Runtime/5.0.0-preview.8.20357.14/dotnet-runtime-5.0.0-preview.8.20357.14-win-x64.zip
dotnet-install: Installation finished
InstallDotNet.ps1 complete...
"Restore: dotnet restore RunTests\RunTests.csproj --source https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json --ignore-failed-sources..."
Found .NET SDK, but did not find dotnet.dll at [C:\h\w\A52B08F5\p\sdk\x64\sdk\5.0.100-preview.7.20330.3\dotnet.dll]
"Running tests: dotnet run --project RunTests\RunTests.csproj -- --target IIS.FunctionalTests.dll --sdk 5.0.100-preview.7.20330.3 --runtime 5.0.0-preview.8.20357.14 --queue Windows.10.Amd64.Open --arch x64 --quarantined false --ef 5.0.0-preview.8.20357.7 --aspnetruntime Microsoft.AspNetCore.App.Runtime.win-x64.5.0.0-ci.nupkg --aspnetref Microsoft.AspNetCore.App.Ref.5.0.0-ci.nupkg --helixTimeout 01:00:00..."
Found .NET SDK, but did not find dotnet.dll at [C:\h\w\A52B08F5\p\sdk\x64\sdk\5.0.100-preview.7.20330.3\dotnet.dll]
neq was unexpected at this time.
"Finished running tests: exit_code=-2147450735"

@dougbu (Contributor) commented Jul 8, 2020

And, no, I don't understand why the errors don't happen in our regular Windows build jobs. Must be something specific to what's on Windows.10.Amd64.Open agents.

@MattGal (Member) commented Jul 9, 2020

And, no, I don't understand why the errors don't happen in our regular Windows build jobs. Must be something specific to what's on Windows.10.Amd64.Open agents.

It's a pretty boring Server Windows 10 SKU. Generally, if something's unusual, it's going to be related to previous work on the same machine, as we simply can't afford a clean machine per work item.

If you see it keep reproing, ping me and I'll investigate. It may be worth making sure nothing modifies the contents of the correlation payload (C:\h\w\A52B08F5\p), since that's only unzipped once; after that, it's on the work items not to modify it in a "bad" way.

@dougbu (Contributor) commented Jul 9, 2020

@MattGal thanks❕

If you see it keep reproing, ping me

Right now runfo search-helix --definition=aspnet --count=50 --value='did not find dotnet.dll' --pr only matched the validation build for this PR. So, a bizarre one-off failure at this point.

/fyi @HaoK and @Pilchie because you're often the first to notice Helix infrastructure problems.


@mmitche the rerun passed so you're all set.

@wli3 commented Jul 9, 2020

@bozturkMSFT for the install script issue

@mmitche mmitche merged commit 8c75a8e into dotnet:master Jul 9, 2020
@bekir-ozturk (Contributor) commented

It appears that the folder C:\h\w\A52B08F5\p\sdk\x64\sdk\5.0.100-preview.7.20330.3\ exists, but the file C:\h\w\A52B08F5\p\sdk\x64\sdk\5.0.100-preview.7.20330.3\dotnet.dll doesn't. Since the folder is there, the install script assumes that the SDK is installed and doesn't attempt to reinstall it.

@dougbu Checking whether something attempts to delete the folder might help. It may be that one of the files is still in use during deletion.

@dougbu (Contributor) commented Jul 10, 2020

Hmm, I don't know but thought the Helix workspace was cleaned up automatically. @MattGal

If not, I guess we could add an extra cleanup step to our runtests.cmd|sh scripts.

We're also moving to use the built-in SDK installation with #23585. Does that change anything about cleanup between Helix jobs❔

@MattGal (Member) commented Jul 13, 2020

Hmm, I don't know but thought the Helix workspace was cleaned up automatically. @MattGal

There are a couple of things to unpack here:

  • All payloads, and (recently added) the redirected Temp folder, are cleaned up automatically for Helix work items.
  • From the above path, C:\h\w\A52B08F5 as a folder will be deleted entirely as we see new work on a given machine. We preserve the three most recently seen correlation payload folders on Helix machines to avoid excessive network usage.
  • C:\h\w\A52B08F5\p\ is the "correlation" payload directory. We ensure that every file you include with your work item as a correlation payload (that is, shared amongst every work item) is unzipped, with the correct size, exactly once. After that, if a work item deletes files inside it, the directory stays that way for subsequent work items. Changing correlation payloads is not advised, though in non-Docker scenarios we don't block users from doing so.
  • There are established mechanisms I've discussed with @HaoK (and I think he's used) that let you avoid running any install scripts, opting instead to bring along your SDK as a correlation payload.

All that said, I'll take a look real quick and see if there's a story in the logs. It might still be possible, by leaking a dotnet.exe in just the right way, for a machine to try to clean this correlation folder and end up with a dirty one.

@MattGal (Member) commented Jul 13, 2020

@dougbu This is actually something I've been dealing with for some time, and have somewhat of a mitigation for it. The underlying problem is some ways away from being fixed. Switching to not running this install script (or installing to %TEMP%) should unblock you and is the preferable way to work around this.

In the end you're hitting the problem described in https://github.com/dotnet/core-eng/issues/10224 . This is caused whenever Service Bus "upgrades" their Service Fabric service, which for unknown reasons they've been doing a lot more of lately. This causes them to abandon the lock on a given work item (currently stored in memory only) and give the work to a different Helix machine while they're upgrading. As a result, we reboot the machine and try again.

At the same time, machines running your work item also run 4 other jobs' work items (note we cache the previous 3), so we know your correlation payload's cleanup was attempted (and in fact I found no evidence that it actually failed), and the work item ran on a machine that had previously run Microsoft.AspNetCore.Diagnostics.HealthChecks.Tests--net5.0 from the same job.

Until durable locking exists, I've merged a change (should roll out Weds) that detects this situation and ... just lets the work item that is running finish. There will be some weird duplications caused by this, but I believe we already hardened against those years ago. Regardless of how long this takes (and I'll be eagerly awaiting this vital feature), stopping running the SDK installer as part of work items and just bringing the SDK as a payload is the way forward.
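
(For reference, a minimal sketch of what the built-in installation looks like in a Microsoft.DotNet.Helix.Sdk project, assuming its IncludeDotNetCli/DotNetCliPackageType/DotNetCliVersion properties; the SDK version is copied from the log above, and the work item name/command are illustrative:)

  <Project Sdk="Microsoft.DotNet.Helix.Sdk" DefaultTargets="Test">
    <PropertyGroup>
      <!-- Let the Helix SDK provision the .NET SDK on the agent instead of
           running dotnet-install.ps1 from inside the work item. -->
      <IncludeDotNetCli>true</IncludeDotNetCli>
      <DotNetCliPackageType>sdk</DotNetCliPackageType>
      <DotNetCliVersion>5.0.100-preview.7.20330.3</DotNetCliVersion>
    </PropertyGroup>
    <ItemGroup>
      <HelixWorkItem Include="IIS.FunctionalTests">
        <!-- The provisioned dotnet is made available to the work item, so
             nothing is written into the correlation payload directory. -->
        <Command>dotnet test IIS.FunctionalTests.dll</Command>
      </HelixWorkItem>
    </ItemGroup>
  </Project>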

@dougbu (Contributor) commented Jul 13, 2020

stopping running the SDK installer as part of work items and just bringing the SDK as a payload is the way forward.

@HaoK this isn't the direction you've been moving with #23585 because we still need to install runtimes on the Helix agents. I'm also wondering if we need to include the RunTests bin/ folder in the correlation payload to avoid these problems. Thoughts❔

@HaoK (Member) commented Jul 13, 2020

Right, we're waiting for the Helix changes to flow, and then we should be able to switch to the built-in SDK support. We do still install the runtimes; if the Helix SDK wants to add a built-in ability similar to the SDK support, we'd obviously love to switch to that too :). RunTests shouldn't matter, as it doesn't do anything to the correlation payloads.

@MattGal (Member) commented Jul 13, 2020

Note that under normal circumstances, modifying (and even leaking file handles to) the payload directory is inadvisable but won't hurt anything. Wednesday's workaround might make it stop more or less entirely for you, but please do still avoid modifying the correlation payloads in general whenever possible.

@dougbu (Contributor) commented Jul 13, 2020

@HaoK why are we using the Helix SDK to install the .NET SDK if it doesn't get us where we want? Put another way, why not copy the .dotnet/ folder into the correlation payload (assuming that can be done reliably while running dotnet msbuild)❔ That sounds better to me than the half-step in #23585.

@HaoK (Member) commented Jul 14, 2020

Not sure what you mean by not getting what we want. Prebuilding the dotnet folder image at build time, rather than having the work items install the runtime, sounds like we would end up building a lot of permutations of Helix payloads. While we could certainly do that, why is that better than what we do today?

@dougbu (Contributor) commented Jul 14, 2020

@HaoK

prebuilding the dotnet folder image at build time

The build agent already has the SDK and all necessary runtimes for Windows installed -- including the just-built Microsoft.AspNetCore.App.* bits. That should be a superset of what would be needed in the Helix jobs, i.e. it just adds native files. The change @MattGal and I suggest is a matter of including the existing $(RepoRoot).dotnet/ folder in the correlation payload.

end up building a lot of permutations of Helix payloads

I'm hoping the main downsides are (a) the tested SDK isn't exactly what we would build for the target OS, and (b) we upload the SDK and runtimes everywhere instead of installing them only in specific runs (e.g. for the templates tests), i.e. the correlation payload gets larger.

why is that better than what we do today?

Because, as @MattGal has made clear above, we're changing the correlation payload folders on the Helix agents, and that's unsupported. If we do #23585, I'm not seeing a real improvement in this area. Yes, it may improve the SDK installation leftovers, but it doesn't help w/ the runtimes.

I agree the ideal would be the Helix SDK supporting runtime and SDK installation simultaneously (at different versions). But, without that feature, it's not clear the Helix SDK feature helps much.
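
(A minimal sketch of this suggestion, assuming the Helix SDK's HelixCorrelationPayload item, its Destination metadata, and the HELIX_CORRELATION_PAYLOAD environment variable; paths and the work item are illustrative:)

  <ItemGroup>
    <!-- Ship the build agent's repo-local SDK + runtimes to Helix once per job;
         work items only read from it and install nothing themselves. -->
    <HelixCorrelationPayload Include="$(RepoRoot).dotnet" Destination="dotnet" />
    <HelixWorkItem Include="Some.Tests">
      <Command>%HELIX_CORRELATION_PAYLOAD%\dotnet\dotnet.exe test Some.Tests.dll</Command>
    </HelixWorkItem>
  </ItemGroup>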

@HaoK (Member) commented Jul 14, 2020

The biggest issue is we don't build our Helix payloads on different agents; we only build on Windows and publish to all queues, and then once on Linux ARM for the arm64 queues. I guess if the .dotnet folder is truly cross-plat, that will work?

@HaoK (Member) commented Jul 14, 2020

If the main concern is just about modifying the correlation payloads: we don't do that on non-Windows, since it's not allowed there; we install to a different folder in the work item root. We intentionally weren't doing that on Windows, to take advantage of installing only once. We can install the runtime into the work item root just like the shell scripts do, if that's the main concern.

@dougbu (Contributor) commented Jul 14, 2020

  1. The .dotnet/ folder isn't quite "truly cross-plat" but it should work. This is an expansion of the current reuse of Windows Microsoft.AspNetCore.App.* bits in non-Windows tests.
  2. Yup, that's my main concern. First comply with the Helix requirements, then (maybe) focus on tests being true-to-life.

I'm not sure what you mean in the last sentences, because this conversation is about our changes to correlation folders on Windows agents.

@HaoK (Member) commented Jul 14, 2020

I'm saying if we just don't want to modify the correlation folders, we can install it somewhere else instead; that's what the .sh files do.

@dougbu (Contributor) commented Jul 14, 2020

@HaoK that sounds good to me.

Now that I think about it more, using the Helix SDK for the .NET SDK part of it won't hurt and may improve our $TMP pollution issue. But, … only if the Helix SDK does its installation somewhere other than the correlation folders.

@MattGal (Member) commented Jul 14, 2020

@HaoK that sounds good to me.

Now that I think about it more, using the Helix SDK for the .NET SDK part of it won't hurt and may improve our $TMP pollution issue. But, … only if the Helix SDK does its installation somewhere other than the correlation folders.

At least on Helix machines, TMP pollution should be solved; you now have a TMP directory per job, which gets automatically cleaned up as long as you resolve paths using TMP/TEMP env vars.

@dougbu (Contributor) commented Jul 14, 2020

TMP pollution should be solved

Excellent. Even so, using the Helix SDK if it installs outside the correlation folders at least reduces our infrastructure duplication. Any other upsides I'm forgetting❔

@HaoK (Member) commented Jul 14, 2020

Just to be clear, we don't control where the Helix SDK installs the .NET SDK; I'm just talking about where we install our .NET runtime.

@HaoK (Member) commented Jul 14, 2020

The dotnet SDK is a first-class feature of the Helix SDK; the whole idea is we should try to use it as-is... right?

@dougbu (Contributor) commented Jul 14, 2020

If it can be used, great. The lack of support for installing additional runtimes is a pain but not horrendous.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2023