Use less stack for HttpResponseHeaders.CopyToFast #7724
Conversation
I don't think we should change the ordering of headers for HTTP responses. Seems like a fairly decent breaking change, especially in tests that verify response content in customer applications.
@jkotalik header ordering isn't meaningful, especially if you avoid re-ordering multiple instances of the same header. It is quite a bit of test churn, though.
Keeping the number of jit temps under 512 is advisable in perf-sensitive code. 512 is the maximum number of things the jit will model for liveness (which is a "dense" data flow problem that requires using bitvectors, so lots of jit time & memory). See dotnet/coreclr#13280 for some context. When you go beyond this number, the jit marks 512 temps as "tracked" temps and the remainder as "untracked" temps. It chooses the set of tracked temps greedily, using block-weighted aggregate appearances as a priority function. Untracked GC ref temps require prolog zeroing. In a method like this, where the average path length is fairly small (even though the method itself is quite large), the prolog costs can predominate.

There are a bunch of potential mitigation strategies in the JIT for this, but nothing that is being actively worked on right now. I wonder though -- if the JIT sees we're in the untracked regime, perhaps it should somewhat prioritize tracking GC refs over non-GC temps... hmm (alternatively: consider that an untracked GC ref incurs an implicit extra "prolog" weight from the zeroing that we don't account for -- perhaps we should make that explicit). The goal should really be to minimize the overall post-sort cost, and we don't quite do that right now. Let me play around with this idea for a bit.
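To make the prolog-zeroing point concrete, here is a minimal, hypothetical C# sketch (not code from the PR; the method and names are made up): a GC ref local that the JIT does not track must be zero-initialized in the prolog so the GC never sees a stale pointer, and a method with hundreds of such temps pays that cost on every call even when the executed path is short.

```csharp
using System;

class PrologZeroingSketch
{
    // If 's' ends up untracked (e.g. the method has more temps than the JIT's
    // 512-temp tracking budget), it gets zeroed in the prolog instead of being
    // proven dead by liveness analysis. One local is cheap; hundreds of them can
    // dominate a method whose common path is short.
    static string Describe(bool rare)
    {
        string s = null; // GC ref local; if untracked, it is prolog-zeroed
        if (rare)
        {
            s = "cold path assignment";
        }
        return s ?? "common path";
    }

    static void Main() => Console.WriteLine(Describe(false));
}
```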
Force-pushed from 9104076 to 3934001
@jkotalik changed to maintain ContentLength's header position, so no test churn.
At least in the before example, better temp sorting by the JIT would not help reduce prolog cost. Despite there being 585 temps, all the ref type temps and all but one of the byref type temps end up getting tracked. E.g. the

The bulk of the prolog zeroing thus comes from GC refs within structs. There are 118 struct temps, all marked as
A fair boost on the benchmarks (- before, + after):

Method | Type | Mean | Op/s |
------- |--------------------------- |---------:|-------------:|
- Output | LiveAspNet | 587.5 ns | 1,702,161.8 |
+ Output | LiveAspNet | 511.5 ns | 1,955,163.5 |
- Output | PlaintextChunked | 139.8 ns | 7,151,853.6 |
+ Output | PlaintextChunked | 86.4 ns | 11,574,316.8 |
- Output | PlaintextChunkedWithCookie | 276.3 ns | 3,618,669.2 |
+ Output | PlaintextChunkedWithCookie | 194.3 ns | 5,148,107.5 |
- Output | PlaintextWithCookie | 289.5 ns | 3,454,397.7 |
+ Output | PlaintextWithCookie | 181.2 ns | 5,520,292.8 |
- Output | TechEmpowerPlaintext | 175.4 ns | 5,700,681.7 |
+ Output | TechEmpowerPlaintext | 101.1 ns | 9,891,448.7 |
And as for why there are so many unpromoted structs in the before version -- perhaps no great mystery: we run into the too many temps throttling in the inliner, after 328 successful inlines. Similar to the kinds of things I'm tracking in dotnet/coreclr#22240.
Probably worth running TechEmpower scenarios with JIT inline ETL enabled to see if this is a common failure reason, and if so, where we might want to reduce method complexity. Am going to temporarily remove this check and let the inliner run free, just to see what happens.
Plaintext inlines: https://gist.github.com/benaadams/1e08f4fe89b3a07a64a915d388a63336
"FAILED: too many locals" seems limited to this method.
Inliner can handle it, but not the rest of the jit. Way more temps, lots of untracked, etc.
That looks like the only method in the core assembly that hits this limit. Simplifying it is the right thing to do. We may get the jit to the point where it can handle things like the before case better someday, but that day is a ways off.
Thanks for checking
Cleaned up the other switch jump tables to use named labels rather than just numbers, so it's easier to compare the label name to what it's outputting.
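Not the actual generated Kestrel code, but a small self-contained illustration of the named-label style described above; the method and label names (EmitHeader, HeaderConnection, HeaderDate) and header values are hypothetical.

```csharp
using System;

class NamedLabelSketch
{
    // The switch dispatches to labels named after the header they emit, which is
    // easier to cross-check against the emitted bytes than "case 12: goto label12;".
    static string EmitHeader(int index)
    {
        switch (index)
        {
            case 0: goto HeaderConnection;
            case 1: goto HeaderDate;
            default: return string.Empty;
        }
    HeaderConnection:
        return "Connection: keep-alive\r\n";
    HeaderDate:
        return "Date: Thu, 21 Feb 2019 00:00:00 GMT\r\n";
    }

    static void Main() => Console.Write(EmitHeader(0));
}
```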
Force-pushed from 3ff109b to 220a502
public string TestNotBit() => $"(_bits & {1L << Index}L) == 0";
public string SetBit() => $"_bits |= {1L << Index}L";
public string ClearBit() => $"_bits &= ~{1L << Index}L";
public string TestBit() => $"(_bits & {"0x" + (1L << Index).ToString("x")}L) != 0"; |
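For context, here is a standalone sketch of what these generator helpers expand to; the Index value of 11 is arbitrary and only used to show the emitted strings.

```csharp
using System;

class BitHelperSketch
{
    const int Index = 11; // arbitrary header index for illustration

    // Same helpers as in the generator above, lifted into a runnable example.
    static string TestNotBit() => $"(_bits & {1L << Index}L) == 0";
    static string SetBit()     => $"_bits |= {1L << Index}L";
    static string ClearBit()   => $"_bits &= ~{1L << Index}L";
    static string TestBit()    => $"(_bits & {"0x" + (1L << Index).ToString("x")}L) != 0";

    static void Main()
    {
        Console.WriteLine(SetBit());     // _bits |= 2048L
        Console.WriteLine(ClearBit());   // _bits &= ~2048L
        Console.WriteLine(TestNotBit()); // (_bits & 2048L) == 0
        Console.WriteLine(TestBit());    // (_bits & 0x800L) != 0  (hex form for the test)
    }
}
```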
👍
Changed to a loop; perf is very nice (it also has the ROS (ReadOnlySpan) opt etc., so it isn't like-for-like with the before-loop version). A rough sketch of the bit-loop idea follows the table.

Method | Type | Mean | Op/s |
------- |--------------------------- |----------:|-------------:|
Output | LiveAspNet | 462.28 ns | 2,163,172.9 |
Output | PlaintextChunked | 84.38 ns | 11,850,576.2 |
Output | PlaintextChunkedWithCookie | 179.53 ns | 5,570,065.5 |
Output | PlaintextWithCookie | 169.24 ns | 5,908,899.3 |
Output | TechEmpowerPlaintext | 86.15 ns | 11,607,037.7 |
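As referenced above, a rough, self-contained sketch of the bit-loop idea (not the PR's actual code, which dispatches through a switch to write pre-encoded header bytes; CopyHeaders and writeHeader are illustrative names): walk the set bits of the header bitfield and handle one header per set bit, instead of emitting a separate test-and-copy block for every known header.

```csharp
using System;
using System.Numerics;

class BitLoopSketch
{
    static void CopyHeaders(long bits, Action<int> writeHeader)
    {
        long tempBits = bits;
        while (tempBits != 0)
        {
            long lowestBit = tempBits & -tempBits;                        // isolate lowest set bit
            int index = BitOperations.TrailingZeroCount((ulong)tempBits); // bit position == header index
            writeHeader(index);
            tempBits ^= lowestBit;                                        // clear the handled bit
        }
    }

    // Bits 0, 5 and 7 set -> writes header #0, #5, #7
    static void Main() => CopyHeaders(0b1010_0001, i => Console.WriteLine($"write header #{i}"));
}
```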
AspNetCore-ci failed due to weird Code_check issues; raised an issue for it: #7839
Any changes in public APIs? Rebase and run
Don't think so
Will do
Force-pushed from e2ca205 to 80d514f
Review threads (outdated, since resolved) on:
src/Servers/Kestrel/Core/ref/Microsoft.AspNetCore.Server.Kestrel.Core.netcoreapp3.0.cs
src/Servers/Kestrel/Core/src/Internal/Http/HttpResponseTrailers.cs
Force-pushed from 5aace8e to 87182af
Thanks!
The Jit seems unhappy to optimize this method due to the number of locals and temps it creates (585 locals/temps, including all the new ReadOnlySpan<byte>s). This change slims the method by 9638 bytes of asm, from 13112 bytes down to 3474 bytes, and reduces the stack clearing from 416 bytes to 8 bytes.

It does this by using a single output point for the headers (other than the unusual ones, Raw and ContentLength) where a single ReadOnlySpan is created. To achieve this it is a little convoluted, using goto and switch... A simplified sketch of the shape is below.
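A simplified, hypothetical sketch of that shape (the header names, offsets, and the WriteKnownHeader/OutputHeader names are illustrative, not the PR's code): each switch case only selects the pre-encoded header-name bytes into one ReadOnlySpan<byte> local and then jumps to a single write site, so the method materializes one span temp rather than one per header.

```csharp
using System;
using System.Text;

class SingleOutputSketch
{
    // Pre-encoded "\r\nConnection: " (14 bytes) followed by "\r\nDate: " (8 bytes).
    static readonly byte[] s_headerBytes = Encoding.ASCII.GetBytes("\r\nConnection: \r\nDate: ");

    static void WriteKnownHeader(int index, string value)
    {
        ReadOnlySpan<byte> name;
        switch (index)
        {
            case 0:
                name = s_headerBytes.AsSpan(0, 14); // "\r\nConnection: "
                goto OutputHeader;
            case 1:
                name = s_headerBytes.AsSpan(14, 8); // "\r\nDate: "
                goto OutputHeader;
            default:
                return;
        }
    OutputHeader:
        // Single output point: every case funnels its span here, so only one
        // ReadOnlySpan<byte> local exists in the whole method.
        Console.Write(Encoding.ASCII.GetString(name));
        Console.WriteLine(value);
    }

    static void Main() => WriteKnownHeader(0, "keep-alive");
}
```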
Before
/cc @AndyAyersMS