[benchmark] Add ReplaceSubrange benchmark #25310

Merged
merged 10 commits into from
Aug 2, 2019
1 change: 1 addition & 0 deletions benchmark/CMakeLists.txt
@@ -164,6 +164,7 @@ set(SWIFT_BENCH_MODULES
single-source/StringInterpolation
single-source/StringMatch
single-source/StringRemoveDupes
single-source/StringReplaceSubrange
single-source/StringTests
single-source/StringWalk
single-source/Substring
116 changes: 116 additions & 0 deletions benchmark/single-source/StringReplaceSubrange.swift
@@ -0,0 +1,116 @@
//===--- StringReplaceSubrange.swift -------------------------------------------===//
//
// This source file is part of the Swift.org open source project
//
// Copyright (c) 2014 - 2019 Apple Inc. and the Swift project authors
// Licensed under Apache License v2.0 with Runtime Library Exception
//
// See https://swift.org/LICENSE.txt for license information
// See https://swift.org/CONTRIBUTORS.txt for the list of Swift project authors
//
//===----------------------------------------------------------------------===//

import TestsUtils

let tags: [BenchmarkCategory] = [.validation, .api, .String]

public let StringReplaceSubrange = [
BenchmarkInfo(
name: "Str.replaceSubrange.SmallLiteral.String",
runFunction: { replaceSubrange($0, "coffee", with: "t") },
tags: tags
Contributor Author

Could you explain the difference between the small literal string and the large managed string? Is it related to the small-string optimization, where a string of up to 15 ASCII characters is not allocated on the heap?

Contributor

Correct. See: _SmallString and _StringGuts for implementation details if you're interested.

Contributor Author (@keitaito, Jul 4, 2019)

Thanks for the links! I will take a look at them 🔍

Member

The small form accommodates up to 15 UTF-8 code units (not just ASCII).
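
To make the size limit concrete, here is a minimal sketch that counts UTF-8 code units for a few strings. The helper `fitsSmallFormBudget` is hypothetical and only compares against the 15-code-unit figure quoted in this thread; whether the standard library actually stores a given string in small form is an implementation detail that this code does not inspect.

```swift
// Hypothetical helper: checks only the 15-code-unit budget mentioned in the
// review. It does not look at how the standard library stores the string.
func fitsSmallFormBudget(_ s: String) -> Bool {
  return s.utf8.count <= 15
}

let ascii = "coffee"                                 // 6 code units
let emoji = "\u{1F600}\u{1F600}\u{1F600}"            // 3 scalars, 4 code units each
let large = "coffee\u{301}coffeecoffeecoffeecoffee"  // 6 + 2 + 24 code units

for s in [ascii, emoji, large] {
  print(s.utf8.count, fitsSmallFormBudget(s))
}
```

The point of the emoji case is that the budget is counted in code units, so a non-ASCII string with only three characters already uses most of it.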

Contributor Author

Thank you for clarifying the length of the small string, Michael 😄

),
BenchmarkInfo(
name: "Str.replaceSubrange.LargeManaged.String",
runFunction: { replaceSubrange($0, largeManagedString, with: "t") },
tags: tags,
setUpFunction: setupLargeManagedString
),
BenchmarkInfo(
name: "Str.replaceSubrange.SmallLiteral.Substr",
runFunction: { replaceSubrange($0, "coffee", with: getSubstring("t")) },
Contributor

We don't need an optimization barrier here (the call to getSubstring from TestsUtils).
A shorter way to get a Substring is to take the full subrange, like this: "t"[...].

For symmetry, I'd also extract "coffee" into a smallString constant, so that the benchmark definitions vary only like this:

  • runFunction: { replaceSubrange($0, largeString, with: "t") }
  • runFunction: { replaceSubrange($0, smallString, with: "t") }
  • runFunction: { replaceSubrange($0, largeString, with: "t"[...]) }
  • runFunction: { replaceSubrange($0, smallString, with: "t"[...]) }
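
As a small illustration of the `"t"[...]` spelling, the sketch below compares it with the `Substring` initializer. It only assumes the standard library; the constant names are made up for the example.

```swift
// Subscripting with the unbounded range `...` yields a Substring that
// covers the whole string.
let viaSubscript = "t"[...]
let viaInit = Substring("t")

print(type(of: viaSubscript))
print(viaSubscript == viaInit)
```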

tags: tags
),
BenchmarkInfo(
name: "Str.replaceSubrange.LargeManaged.Substr",
runFunction: { replaceSubrange($0, largeManagedString, with: getSubstring("t")) },
tags: tags,
setUpFunction: setupLargeManagedString
),
BenchmarkInfo(
name: "Str.replaceSubrange.SmallLiteral.ArrChar",
runFunction: { replaceSubrange($0, "coffee", with: getArrayCharacter(Array<Character>(["t"]))) },
tags: tags
),
BenchmarkInfo(
name: "Str.replaceSubrange.LargeManaged.ArrChar",
runFunction: { replaceSubrange($0, largeManagedString, with: getArrayCharacter(Array<Character>(["t"]))) },
tags: tags,
setUpFunction: setupLargeManagedString
),
BenchmarkInfo(
name: "Str.replaceSubrange.SmallLiteral.RepeatedChar",
runFunction: { replaceSubrange($0, "coffee", with: getRepeatedCharacter(repeatedCharacter)) },
tags: tags
Contributor Author

The benchmark name "Str.replaceSubrange.SmallLiteral.RepeatedChar" is longer than 40 characters, but I couldn't think of a better name that fits in 40. It could be something like "Str.replaceSubrange.LargeManagedRepChar", but I was concerned that "RepChar" is a bit hard to recognize as Repeated<Character>. @palimondo What do you think about this naming? Please let me know if you have any suggestions 🙂

Contributor

Looking at what @milseman writes in SR-8905:

replaceSubrange<C: Collection>(_:C)

  • Arguments of types String, Substring, Array<Character>, Repeated<Character>, etc

I'd say the naming convention calls for a base name of String.replaceSubrange that varies by argument type (String, Substring, ArrChar, RepChar) for the general case of large strings. Then we denote the special optimization for small strings with a simple .Small suffix, which gives these benchmarks:

  • String.replaceSubrange.String
  • String.replaceSubrange.Substring
  • String.replaceSubrange.ArrChar
  • String.replaceSubrange.RepChar
  • String.replaceSubrange.String.Small
  • String.replaceSubrange.Substring.Small
  • String.replaceSubrange.ArrChar.Small
  • String.replaceSubrange.RepChar.Small

The longest one is String.replaceSubrange.Substring.Small at 38 characters, under the 40-character limit.
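
A quick way to check the proposed names against the limit is to count them. This is only a sketch: the 40-character limit is taken from this thread, and the constants `proposedNames` and `nameLimit` are hypothetical.

```swift
// Names proposed in the review comment.
let proposedNames = [
  "String.replaceSubrange.String",
  "String.replaceSubrange.Substring",
  "String.replaceSubrange.ArrChar",
  "String.replaceSubrange.RepChar",
  "String.replaceSubrange.String.Small",
  "String.replaceSubrange.Substring.Small",
  "String.replaceSubrange.ArrChar.Small",
  "String.replaceSubrange.RepChar.Small",
]

// Limit quoted in the review discussion.
let nameLimit = 40

for name in proposedNames {
  print(name.count, name)
}

let tooLong = proposedNames.filter { $0.count > nameLimit }
print("over the limit:", tooLong)
```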

Contributor Author

That's an awesome naming idea. I will use them. Thanks for your suggestion!

),
BenchmarkInfo(
name: "Str.replaceSubrange.LargeManaged.RepeatedChar",
runFunction: { replaceSubrange($0, largeManagedString, with: getRepeatedCharacter(repeatedCharacter)) },
tags: tags,
setUpFunction: setupLargeManagedString
),
]

// MARK: - Privates for String

private var largeManagedString: String = {
return getString("coffee\u{301}coffeecoffeecoffeecoffee")
}()
Contributor Author

Michael suggested adding \u{301} to the test string when we paired. @milseman, would you mind reminding me of the reason? I don't remember it now 😅

Contributor

My guess is that it was to put the string in a particular normalization form. @milseman Do you also want to vary the benchmarks across different normalization forms? SR-8905 doesn't mention that…

Contributor Author

I was wondering whether the result with the string "coffeécoffeecoffeecoffeecoffee" would differ from the one with "coffeecoffeecoffeecoffeecoffee" 🤔 If there is a distinct difference, maybe we could add two benchmarks: one with the acute accent character and one without it?

Member

Grapheme segmentation is relevant to this benchmark, but normalization is not (since there's no comparison). The difference is that the precomposed representation is a single scalar per grapheme cluster, while the decomposed form is multi-scalar. The single-scalar one will hit our grapheme-breaking fast paths, while the multi-scalar one will call out to ICU. Alternatively, you could use other kinds of multi-scalar grapheme clusters, such as complex emoji. I just mentioned "\u{301}" because you can stick it after an "e" to get the same effect.
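
The difference between the two representations can be seen with plain standard library calls. The sketch below shows only the scalar counts; which grapheme-breaking path the implementation takes is described in the comment above and is not observable from this code.

```swift
let decomposed = "e\u{301}"  // "e" followed by COMBINING ACUTE ACCENT
let precomposed = "\u{E9}"   // LATIN SMALL LETTER E WITH ACUTE

// Both are a single Character (grapheme cluster) ...
print(decomposed.count, precomposed.count)
// ... but the decomposed form is made of two Unicode scalars.
print(decomposed.unicodeScalars.count, precomposed.unicodeScalars.count)
// String comparison treats them as equal (canonical equivalence).
print(decomposed == precomposed)
```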


private func setupLargeManagedString() {
_ = largeManagedString
}
Contributor Author

Should largeManagedString be initialized from a setUpFunction closure before it is used in the benchmarks?

Contributor

Given that replaceSubrange(_:_:with:) already does var copy = getString(string), this whole dance is unnecessary. You should declare this as a simple let largeString = "coffee\u{301}coffeecoffeecoffeecoffee" and drop all the setUpFunctions.

Contributor Author

That makes sense. Thanks for explaining!


// MARK: - Privates for Repeated<Character>

private let repeatedCharacter: Repeated<Character> = {
let character = Character("c")
return repeatElement(character, count: 1)
}()


@inline(never)
private func replaceSubrange(_ N: Int, _ string: String, with replacingString: String) {
var copy = getString(string)
let range = string.startIndex..<string.index(after: string.startIndex)
for _ in 0 ..< 5_000 * N {
copy.replaceSubrange(range, with: replacingString)
}
Contributor Author

What are the criteria for choosing a multiplier like 5000? Does it depend on the benchmark's run time?

Contributor

Correct. We are just trying to size the workload to run in 20–1000 μs, so that it is in a measurement sweet spot for our system.

Contributor Author

Thanks for explaining!

}

@inline(never)
private func replaceSubrange(_ N: Int, _ string: String, with replacingSubstring: Substring) {
Contributor

If I understand @milseman's intention from SR-8905 correctly, we are designing a benchmark for the generic replaceSubrange method that varies across the String, Substring, Array<Character> and Repeated<Character> specializations.

Therefore we should be able to define a single shared generic test function and vary the parameter in the runFunction closure in the BenchmarkInfo declarations.

For an example of such benchmarks, see the append variants in DataBenchmarks.

@milseman Any thoughts on keeping or dropping the @inline(never) annotation?

Member

You could refactor this to a generic @inline(__always) implementation function that takes a scale factor (because the String version should be much faster than the Repeated version) and does the loop and the replaceSubrange calls. You'd likely want to make an @inline(never) top-level function for each individual benchmark. It doesn't save you a whole lot.
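
A rough sketch of that shape follows, under several assumptions: `opaque` is a hypothetical stand-in for the getString barrier from TestsUtils so the snippet compiles on its own, the function names and scale factors are invented for illustration, and the functions return the mutated string only so the result can be inspected. The range is recomputed on each pass because mutation invalidates previously obtained indices; the code in this PR computes it once before the loop.

```swift
// Hypothetical stand-in for TestsUtils.getString (an identity function that
// is never inlined), so the sketch is self-contained.
@inline(never)
func opaque(_ s: String) -> String { return s }

let smallString = "coffee"
let largeString = "coffee\u{301}coffeecoffeecoffeecoffee"

// Shared generic body, marked @inline(__always) as suggested in the review.
@inline(__always)
func replaceFirstCharacter<C: Collection>(
  _ n: Int, scale: Int, in string: String, with replacement: C
) -> String where C.Element == Character {
  var copy = opaque(string)
  for _ in 0 ..< scale * n {
    // Recomputed every pass: replaceSubrange invalidates existing indices.
    let range = copy.startIndex ..< copy.index(after: copy.startIndex)
    copy.replaceSubrange(range, with: replacement)
  }
  return copy
}

// One non-inlined entry point per benchmark. Scale factors are placeholders.
@inline(never)
func runString(_ n: Int) -> String {
  return replaceFirstCharacter(n, scale: 5_000, in: largeString,
                               with: String("t"))
}

@inline(never)
func runSubstringSmall(_ n: Int) -> String {
  return replaceFirstCharacter(n, scale: 5_000, in: smallString,
                               with: Substring("t"))
}

@inline(never)
func runRepChar(_ n: Int) -> String {
  return replaceFirstCharacter(n, scale: 500, in: largeString,
                               with: repeatElement(Character("c"), count: 1))
}
```

In a real benchmark file each wrapper would be referenced from a BenchmarkInfo runFunction closure, and the per-type scale would be tuned from measurements rather than guessed.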

var copy = getString(string)
let range = string.startIndex..<string.index(after: string.startIndex)
for _ in 0 ..< 5_000 * N {
copy.replaceSubrange(range, with: replacingSubstring)
}
}

@inline(never)
private func replaceSubrange(_ N: Int, _ string: String, with replacingArrayCharacter: Array<Character>) {
var copy = getString(string)
let range = string.startIndex..<string.index(after: string.startIndex)
for _ in 0 ..< 5_000 * N {
copy.replaceSubrange(range, with: replacingArrayCharacter)
}
}

@inline(never)
private func replaceSubrange(_ N: Int, _ string: String, with replacingRepeatedCharacter: Repeated<Character>) {
var copy = getString(string)
let range = string.startIndex..<string.index(after: string.startIndex)
for _ in 0 ..< 5_000 * N {
copy.replaceSubrange(range, with: replacingRepeatedCharacter)
}
}
8 changes: 8 additions & 0 deletions benchmark/utils/TestsUtils.swift
@@ -322,3 +322,11 @@ public func getString(_ s: String) -> String { return s }
// The same for Substring.
@inline(never)
public func getSubstring(_ s: Substring) -> Substring { return s }

Contributor

I don't think we need to define any new optimization barrier functions...

// The same for Array<Character>.
@inline(never)
public func getArrayCharacter(_ a: Array<Character>) -> Array<Character> { return a }

// The same for Repeated<Character>.
@inline(never)
public func getRepeatedCharacter(_ r: Repeated<Character>) -> Repeated<Character> { return r }
2 changes: 2 additions & 0 deletions benchmark/utils/main.swift
@@ -159,6 +159,7 @@ import StringEnum
import StringInterpolation
import StringMatch
import StringRemoveDupes
import StringReplaceSubrange
import StringTests
import StringWalk
import Substring
@@ -339,6 +340,7 @@ registerBenchmark(StringInterpolationManySmallSegments)
registerBenchmark(StringMatch)
registerBenchmark(StringNormalization)
registerBenchmark(StringRemoveDupes)
registerBenchmark(StringReplaceSubrange)
registerBenchmark(StringTests)
registerBenchmark(StringWalk)
registerBenchmark(SubstringTest)