
Benchmarks: Dsl.scala vs Monix vs Cats Effect vs Scalaz Concurrent vs Scala Async vs Scala Continuation

杨博 (Yang Bo) edited this page May 8, 2018 · 26 revisions

We created some benchmarks to evaluate the computational performance of code generated by our compiler plug-in. In particular, we are interested in how our !-notation and other direct style DSLs affect performance in an effect system that supports both asynchronous and synchronous effects.

Although Dsl.scala provides keywords that adapt to monads and other effect systems (see domain.cats and domain.scalaz), the preferred effect system for Dsl.scala is Task, a type alias for a vanilla continuation-passing style function, defined as:

type !![Domain, Value] = (Value => Domain) => Domain
type TaskDomain = TailRec[Unit] !! Throwable
type Task[Value] = TaskDomain !! Value
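Expanding the aliases, a Task[Value] is a curried function that first takes a success continuation and then a failure handler. The following minimal sketch, using only the standard library, shows how such a task can be constructed and run; the helper names now and blockingAwait are ours for illustration, not part of Dsl.scala:

```scala
import scala.util.control.TailCalls.{TailRec, done}

object TaskSketch {
  type !![Domain, Value] = (Value => Domain) => Domain
  type TaskDomain = TailRec[Unit] !! Throwable
  type Task[Value] = TaskDomain !! Value

  // A task that immediately succeeds with a pure value:
  // it simply passes the value to the success continuation.
  def now[A](a: A): Task[A] = successContinuation => successContinuation(a)

  // Run a task by supplying a success continuation and a failure handler,
  // then force the resulting trampoline.
  def blockingAwait[A](task: Task[A]): A = {
    var result: Option[A] = None
    task { a => _ =>
      result = Some(a)
      done(())
    } { e =>
      throw e
    }.result
    result.get
  }
}
```

For example, `TaskSketch.blockingAwait(TaskSketch.now(42))` evaluates to `42`.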

Our benchmarks measured the performance of Task in Dsl.scala, along with other combinations of effect systems and direct style DSLs, listed in the following table:

The combinations of effect systems and direct style DSLs being benchmarked:

| Effect system | Direct style DSL |
| --- | --- |
| Dsl.scala Task, an alias of vanilla continuation-passing style functions | !-notation provided by Dsl.scala |
| Scala Future | Scala Async |
| Scala Continuation library | Scala Continuation compiler plug-in |
| Monix tasks | for comprehension |
| Cats Effect | for comprehension |
| Scalaz Concurrent | for comprehension |

The performance of recursive tasks in each effect system

The purpose of the first benchmark is to determine the performance of recursive functions in various effect systems, especially when a direct style DSL is used.

The performance baseline

In order to measure the performance impact of direct style DSLs, we first have to establish a performance baseline for each effect system. We created benchmarks for the most efficient implementation of a sum function in each effect system. These benchmarks perform the following computation:

  • Creating a List[X[Int]] of 1000 tasks, where X is the task data type of the effect system.
  • Performing recursive right-associated "binds" on each element to add the Int to an accumulator, finally producing an X[Int] as a task of the sum result.
  • Running the task and blocking to await the result.

Note that the tasks in the list are executed either in the current thread or in a thread pool. Each task returns a simple pure value, because we want to measure the overhead of the effect systems, not the tasks themselves.

The “bind” operation refers to the primitive sequencing operation of each effect system. For Monix tasks, Cats Effect, Scalaz Concurrent, and the Scala Continuations library, the “bind” operation is flatMap; for Dsl.scala, it is cpsApply, which may or may not be equivalent to flatMap, depending on the type of the current domain.
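To make “bind on a CPS function” concrete, the following simplified sketch (our own Cont alias, ignoring failure handling and trampolining) shows that sequencing two CPS tasks is nothing more than function application with a new callback:

```scala
object CpsBind {
  // Simplified continuation-passing style: a task is a function
  // from a continuation to Unit.
  type Cont[A] = (A => Unit) => Unit

  // "Bind" for vanilla CPS functions is plain function application:
  // run the first task with a continuation that starts the second.
  def bind[A, B](task: Cont[A])(f: A => Cont[B]): Cont[B] =
    continue => task(a => f(a)(continue))

  // A task that immediately passes a pure value to its continuation.
  def pure[A](a: A): Cont[A] = continue => continue(a)
}
```

For example, `CpsBind.bind(CpsBind.pure(1))(i => CpsBind.pure(i + 2))` is a `Cont[Int]` that passes `3` to its continuation.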

We use the !-notation to perform the cpsApply in Dsl.scala. The !-notation produces exactly the same Java bytecode as manually passing a callback function to cpsApply:

def loop(tasks: List[Task[Int]], accumulator: Int = 0)(callback: Int => TaskDomain): TaskDomain = {
  tasks match {
    case head :: tail =>
      // Expand to: implicitShift(head).cpsApply(i => loop(tail, i + accumulator)(callback))
      loop(tail, !head + accumulator)(callback)
    case Nil =>
      callback(accumulator)
  }
}

Direct style DSLs for the other effect systems are deliberately not used here; raw flatMap calls are used instead, to avoid any performance degradation. The following code block shows the benchmark code for Scala Futures. The code for the other effect systems is similar.

def loop(tasks: List[Future[Int]], accumulator: Int = 0): Future[Int] = {
  tasks match {
    case head :: tail =>
      head.flatMap { i =>
        loop(tail, i + accumulator)
      }
    case Nil =>
      Future.successful(accumulator)
  }
}

The benchmark result is shown below (a larger score is better):

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| RawSum.cats | thread-pool | 1000 | 799.072 | ± 3.094 |
| RawSum.cats | current-thread | 1000 | 26932.907 | ± 845.715 |
| RawSum.dsl | thread-pool | 1000 | 729.947 | ± 4.359 |
| RawSum.dsl | current-thread | 1000 | 31161.171 | ± 589.935 |
| RawSum.future | thread-pool | 1000 | 575.403 | ± 3.567 |
| RawSum.future | current-thread | 1000 | 876.377 | ± 8.525 |
| RawSum.monix | thread-pool | 1000 | 743.340 | ± 11.314 |
| RawSum.monix | current-thread | 1000 | 55421.452 | ± 251.530 |
| RawSum.scalaContinuation | thread-pool | 1000 | 808.671 | ± 3.917 |
| RawSum.scalaContinuation | current-thread | 1000 | 17391.684 | ± 385.138 |
| RawSum.scalaz | thread-pool | 1000 | 722.743 | ± 11.234 |
| RawSum.scalaz | current-thread | 1000 | 15895.606 | ± 235.992 |

The benchmark result of sum for performance baseline

The Task alias of continuation-passing style function used with Dsl.scala is quite fast: Dsl.scala, Monix, and Cats Effect take the top three positions for tasks running in the current thread, and all score within a few percent of the best result for tasks running in a thread pool.

The performance impact of direct style DSLs

In this section, we present the performance impact of introducing different syntax notations. For vanilla CPS functions, we added one more !-notation to avoid manually passing the callback as in the previous benchmark. For the other effect systems, we refactored the previous sum benchmarks to use Scala Async, Scala Continuation’s @cps annotations, and for comprehensions, respectively:

// Left associated sum in Dsl.scala
def loop(tasks: List[Task[Int]]): Task[Int] = _ {
  tasks match {
    case head :: tail =>
      !head + !loop(tail)
    case Nil =>
      0
  }
}

// Right associated sum in Dsl.scala
def loop(tasks: List[Task[Int]], accumulator: Int = 0): Task[Int] = _ {
  tasks match {
    case head :: tail =>
      !loop(tail, !head + accumulator)
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scala Future
def loop(tasks: List[Future[Int]]): Future[Int] = async {
  tasks match {
    case head :: tail =>
      await(head) + await(loop(tail))
    case Nil =>
      0
  }
}

// Right associated sum in Scala Future
def loop(tasks: List[Future[Int]], accumulator: Int = 0): Future[Int] = async {
  tasks match {
    case head :: tail =>
      await(loop(tail, await(head) + accumulator))
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scala Continuation
def loop(tasks: List[() => Int @suspendable]): Int @suspendable = {
  tasks match {
    case head :: tail =>
      head() + loop(tail)
    case Nil =>
      0
  }
}

// Right associated sum in Scala Continuation
def loop(tasks: List[() => Int @suspendable], accumulator: Int = 0): Int @suspendable = {
  tasks match {
    case head :: tail =>
      loop(tail, head() + accumulator)
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scalaz Concurrent
def loop(tasks: List[Task[Int]]): Task[Int] = {
  tasks match {
    case head :: tail =>
      for {
        i <- head
        accumulator <- loop(tail)
      } yield i + accumulator
    case Nil =>
      Task(0)
  }
}

// Right associated sum in Scalaz Concurrent
def loop(tasks: List[Task[Int]], accumulator: Int = 0): Task[Int] = {
  tasks match {
    case head :: tail =>
      for {
        i <- head
        r <- loop(tail, i + accumulator)
      } yield r
    case Nil =>
      Task.now(accumulator)
  }
}

Note that the reduced sum can be implemented with either left-associated or right-associated recursion. The above code contains benchmarks for both cases. The benchmark results are shown below:

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| LeftAssociatedSum.cats | thread-pool | 1000 | 707.940 | ± 10.497 |
| LeftAssociatedSum.cats | current-thread | 1000 | 16165.442 | ± 298.072 |
| LeftAssociatedSum.dsl | thread-pool | 1000 | 729.122 | ± 7.492 |
| LeftAssociatedSum.dsl | current-thread | 1000 | 19856.493 | ± 386.225 |
| LeftAssociatedSum.future | thread-pool | 1000 | 339.415 | ± 1.486 |
| LeftAssociatedSum.future | current-thread | 1000 | 410.785 | ± 1.535 |
| LeftAssociatedSum.monix | thread-pool | 1000 | 742.836 | ± 9.904 |
| LeftAssociatedSum.monix | current-thread | 1000 | 19976.847 | ± 84.222 |
| LeftAssociatedSum.scalaContinuation | thread-pool | 1000 | 657.721 | ± 9.453 |
| LeftAssociatedSum.scalaContinuation | current-thread | 1000 | 15103.883 | ± 255.780 |
| LeftAssociatedSum.scalaz | thread-pool | 1000 | 670.725 | ± 8.957 |
| LeftAssociatedSum.scalaz | current-thread | 1000 | 5113.980 | ± 110.272 |

The benchmark result of left-associated sum in direct style DSLs

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| RightAssociatedSum.cats | thread-pool | 1000 | 708.441 | ± 9.201 |
| RightAssociatedSum.cats | current-thread | 1000 | 15971.331 | ± 315.063 |
| RightAssociatedSum.dsl | thread-pool | 1000 | 758.152 | ± 4.600 |
| RightAssociatedSum.dsl | current-thread | 1000 | 22393.280 | ± 677.752 |
| RightAssociatedSum.future | thread-pool | 1000 | 338.471 | ± 2.188 |
| RightAssociatedSum.future | current-thread | 1000 | 405.866 | ± 2.843 |
| RightAssociatedSum.monix | thread-pool | 1000 | 736.533 | ± 10.856 |
| RightAssociatedSum.monix | current-thread | 1000 | 21687.351 | ± 107.249 |
| RightAssociatedSum.scalaContinuation | thread-pool | 1000 | 654.749 | ± 7.983 |
| RightAssociatedSum.scalaContinuation | current-thread | 1000 | 12080.619 | ± 274.878 |
| RightAssociatedSum.scalaz | thread-pool | 1000 | 676.180 | ± 7.705 |
| RightAssociatedSum.scalaz | current-thread | 1000 | 7911.779 | ± 79.296 |

The benchmark result of right-associated sum in direct style DSLs

The result demonstrates that the !-notation provided by Dsl.scala is faster than all the other direct style DSLs in the right-associated sum benchmark. The Dsl.scala version of sum consumes a constant amount of memory during the loop, because we implemented tail-call detection in our CPS-transform compiler plug-in, and the Dsl interpreter for Task uses a trampoline technique. On the other hand, Monix tasks, Cats Effect, and Scalaz Concurrent showed significant performance degradation, because they cost O(n) memory due to the map call generated by the for comprehension, even though those effect systems also have built-in trampolines. In general, the performance of recursive monadic binds in a for comprehension suffers from this inefficient map.
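The extra map comes from desugaring: a for comprehension always ends in a map, even when the yielded value is returned unchanged. A sketch with Scala Futures (the object and method names are ours for illustration) shows where it appears:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object ForComprehensionCost {
  // The right-associated step
  //   for { i <- head; r <- rest(i) } yield r
  // desugars to a flatMap followed by a map over the identity function,
  // allocating one extra map frame per list element:
  def desugared(head: Future[Int], rest: Int => Future[Int]): Future[Int] =
    head.flatMap(i => rest(i).map(r => r))

  // A hand-written bind needs no trailing map at all:
  def handWritten(head: Future[Int], rest: Int => Future[Int]): Future[Int] =
    head.flatMap(rest)
}
```

Both functions compute the same result; only the desugared version pays for the redundant map on every recursive step.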

The performance of collection manipulation in effect systems

The previous sum benchmarks measured the performance of manually written loops, but usually we use higher-order functions to manipulate collections. We want to know how those higher-order functions can be expressed in direct style DSLs, and how performance is affected by doing so.

In this section, we will present the benchmark result for computing the Cartesian product of lists.

The performance baseline

As in the sum benchmarks, we created benchmarks to maximize the performance of the Cartesian product. Our benchmarks build the Cartesian product with traverseM for Scala Future, Cats Effect, Scalaz Concurrent, and Monix tasks. The following code block shows the benchmark code for Scala Future.

import scala.async.Async.{async, await}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scalaz.std.list._
import scalaz.std.scalaFuture._
import scalaz.syntax.all._

def cellTask(taskX: Future[Int], taskY: Future[Int]): Future[List[Int]] = async {
  List(await(taskX), await(taskY))
}

def listTask(rows: List[Future[Int]], columns: List[Future[Int]]): Future[List[Int]] = {
  rows.traverseM { taskX =>
    columns.traverseM { taskY =>
      cellTask(taskX, taskY)
    }
  }
}

Scala Async or a for comprehension is used in the element-wise task cellTask, but the collection manipulation in listTask is kept as manually written higher-order function calls, because neither Scala Async nor for comprehensions support traverseM.
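For readers unfamiliar with traverseM: in Scalaz, xs.traverseM(f) behaves like traversing with f and then flattening the nested results. A rough equivalent for Scala Futures, written as our own helper specialized to List (not the Scalaz implementation itself):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object TraverseMSketch {
  // Scalaz's xs.traverseM(f) is, for this Future/List case, equivalent to
  // traversing the list with f and then flattening the List[List[B]].
  def traverseM[A, B](xs: List[A])(f: A => Future[List[B]]): Future[List[B]] =
    Future.traverse(xs)(f).map(_.flatten)
}
```

Nesting two such calls, as listTask does, flattens a list of rows of cells into the final Cartesian product.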

The benchmark for Dsl.scala is entirely written in !-notation:

def cellTask(taskX: Task[Int], taskY: Task[Int]): Task[List[Int]] = _ {
  List(!taskX, !taskY)
}

def listTask(rows: List[Task[Int]], columns: List[Task[Int]]): Task[List[Int]] = {
  cellTask(!Each(rows), !Each(columns))
}

The Each keyword is available here because it is adaptive: it can be used not only in the List[_] domain, but also in any (_ !! Coll[_]) domain, as long as Coll is a Scala collection type that supports the CanBuildFrom type class.

We didn’t benchmark Scala Continuation here, because the higher-order functions for List do not work with Scala Continuation.

The benchmark result is shown below:

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| RawCartesianProduct.cats | thread-pool | 50 | 136.415 | ± 1.939 |
| RawCartesianProduct.cats | current-thread | 50 | 1346.874 | ± 7.475 |
| RawCartesianProduct.dsl | thread-pool | 50 | 140.098 | ± 2.062 |
| RawCartesianProduct.dsl | current-thread | 50 | 1580.876 | ± 27.513 |
| RawCartesianProduct.future | thread-pool | 50 | 100.340 | ± 1.894 |
| RawCartesianProduct.future | current-thread | 50 | 93.678 | ± 1.829 |
| RawCartesianProduct.monix | thread-pool | 50 | 142.071 | ± 1.299 |
| RawCartesianProduct.monix | current-thread | 50 | 1750.869 | ± 18.365 |
| RawCartesianProduct.scalaz | thread-pool | 50 | 78.588 | ± 0.623 |
| RawCartesianProduct.scalaz | current-thread | 50 | 357.357 | ± 2.102 |

The benchmark result of Cartesian product for performance baseline

Monix tasks, Cats Effect, and the vanilla CPS functions created from Dsl.scala are still the top three scoring effect systems.

The performance of collection manipulation in direct style DSLs

We then refactored the benchmarks into direct style DSLs. The following code block shows the version for Scala Future, written with the ListT monad transformer provided by Scalaz. The benchmarks for Monix tasks and Scalaz Concurrent are rewritten in a similar style.

import _root_.scalaz.syntax.all._
import _root_.scalaz.ListT
import _root_.scalaz.std.scalaFuture._

def listTask(rows: List[Future[Int]], columns: List[Future[Int]]): Future[List[Int]] = {
  for {
    taskX <- ListT(Future.successful(rows))
    taskY <- ListT(Future.successful(columns))
    x <- taskX.liftM[ListT]
    y <- taskY.liftM[ListT]
    r <- ListT(Future.successful(List(x, y)))
  } yield r
}.run

With the help of the ListT monad transformer, we are able to merge cellTask and listTask into one function in a direct style for comprehension, avoiding any manually written callback functions.

We also merged cellTask and listTask in the Dsl.scala version of benchmark:

def listTask: Task[List[Int]] = reset {
  List(!(!Each(inputDslTasks)), !(!Each(inputDslTasks)))
}

This time, Cats Effect is not benchmarked, due to the lack of a ListT in Cats. The benchmark results are shown below:

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| CartesianProduct.dsl | thread-pool | 50 | 283.450 | ± 3.042 |
| CartesianProduct.dsl | current-thread | 50 | 1884.514 | ± 47.792 |
| CartesianProduct.future | thread-pool | 50 | 91.233 | ± 1.333 |
| CartesianProduct.future | current-thread | 50 | 150.234 | ± 20.396 |
| CartesianProduct.monix | thread-pool | 50 | 28.597 | ± 0.265 |
| CartesianProduct.monix | current-thread | 50 | 120.068 | ± 17.676 |
| CartesianProduct.scalaz | thread-pool | 50 | 31.110 | ± 0.662 |
| CartesianProduct.scalaz | current-thread | 50 | 87.404 | ± 1.734 |

The benchmark result of Cartesian product in direct style DSLs

Despite requiring only trivial manual liftM calls in the for comprehension, the monad transformer approach causes terrible computational performance in comparison to manually calling traverseM. In contrast, the performance of Dsl.scala even improved when cellTask was inlined into listTask.

Conclusion

Task in Dsl.scala is a simple vanilla continuation-passing style function, whose performance is comparable to Monix and Cats Effect.

When using a direct style DSL, our !-notation is the fastest implementation among for comprehensions, Scala Async, and Scala Continuation. In particular, when performing a complex task that manipulates collections, our !-notation can be 12.5 times faster than all the other direct style DSLs when running in the current thread, and more than 3.1 times faster when running in a thread pool.
