
Benchmarks: Dsl.scala vs Monix vs Cats Effect vs Scalaz Concurrent vs Scala Async vs Scala Continuation

杨博 (Yang Bo) edited this page May 8, 2018 · 26 revisions

We created some benchmarks to evaluate the computational performance of code generated by our compiler plug-in. In particular, we are interested in how our !-notation and other direct style DSLs affect performance in an effect system that supports both asynchronous and synchronous effects.

Although Dsl.scala provides keywords that adapt to monads and other effect systems (see domain.cats and domain.scalaz), the preferred effect system for Dsl.scala is Task, a type alias for a vanilla continuation-passing style function, defined as:

type !![Domain, Value] = (Value => Domain) => Domain
type TaskDomain = TailRec[Unit] !! Throwable
type Task[Value] = TaskDomain !! Value
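Expanding the aliases, a Task[Value] is a curried function that first takes a success continuation and then a failure handler. The following minimal sketch, using only the standard library, shows how such a task can be constructed and run; the helper names now and blockingAwait are ours for illustration, not part of Dsl.scala:

```scala
import scala.util.control.TailCalls.{TailRec, done}

object TaskSketch {
  type !![Domain, Value] = (Value => Domain) => Domain
  type TaskDomain = TailRec[Unit] !! Throwable
  type Task[Value] = TaskDomain !! Value

  // A task that immediately succeeds with a pure value:
  // it simply passes the value to the success continuation.
  def now[A](a: A): Task[A] = successContinuation => successContinuation(a)

  // Run a task by supplying a success continuation and a failure handler,
  // then force the resulting trampoline.
  def blockingAwait[A](task: Task[A]): A = {
    var result: Option[A] = None
    task { a => _ =>
      result = Some(a)
      done(())
    } { e =>
      throw e
    }.result
    result.get
  }
}
```

For example, `TaskSketch.blockingAwait(TaskSketch.now(42))` evaluates to `42`.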

Our benchmarks measured the performance of Task in Dsl.scala, along with other combinations of effect systems and direct style DSLs, listed in the following table:

The combinations of effect systems and direct style DSLs being benchmarked:

| Effect system | Direct style DSL |
| --- | --- |
| Dsl.scala Task, an alias of vanilla continuation-passing style functions | !-notation provided by Dsl.scala |
| Scala Future | Scala Async |
| Scala Continuation library | Scala Continuation compiler plug-in |
| Monix tasks | for comprehension |
| Cats Effect | for comprehension |
| Scalaz Concurrent | for comprehension |

The performance of recursive tasks in each effect system

The purpose of the first benchmark is to determine the performance of recursive functions in various effect systems, especially when a direct style DSL is used.

The performance baseline

In order to measure the performance impact of direct style DSLs, we first have to establish a performance baseline for each effect system. We created benchmarks for the most efficient implementation of a sum function in each effect system. These benchmarks perform the following computation:

  • Creating a List[X[Int]] of 1000 tasks, where X is the task data type of the effect system.
  • Performing recursive right-associated "binds" on each element to add the Int to an accumulator, finally producing an X[Int] as a task of the sum result.
  • Running the task and blocking to await the result.

Note that the tasks in the list are executed either in the current thread or in a thread pool. Each task returns a simple pure value, because we want to measure the overhead of the effect systems, not the tasks themselves.

The “bind” operation refers to the primitive sequencing operation of each effect system. For Monix tasks, Cats Effect, Scalaz Concurrent, and the Scala Continuations library, the “bind” operation is flatMap; for Dsl.scala, it is cpsApply, which may or may not be equivalent to flatMap, depending on the type of the current domain.
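To make “bind on a CPS function” concrete, the following simplified sketch (our own Cont alias, ignoring failure handling and trampolining) shows that sequencing two CPS tasks is nothing more than function application with a new callback:

```scala
object CpsBind {
  // Simplified continuation-passing style: a task is a function
  // from a continuation to Unit.
  type Cont[A] = (A => Unit) => Unit

  // "Bind" for vanilla CPS functions is plain function application:
  // run the first task with a continuation that starts the second.
  def bind[A, B](task: Cont[A])(f: A => Cont[B]): Cont[B] =
    continue => task(a => f(a)(continue))

  // A task that immediately passes a pure value to its continuation.
  def pure[A](a: A): Cont[A] = continue => continue(a)
}
```

For example, `CpsBind.bind(CpsBind.pure(1))(i => CpsBind.pure(i + 2))` is a `Cont[Int]` that passes `3` to its continuation.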

We use the !-notation to perform the cpsApply in Dsl.scala. The !-notation produces exactly the same Java bytecode as manually passing a callback function to cpsApply:

def loop(tasks: List[Task[Int]], accumulator: Int = 0)(callback: Int => TaskDomain): TaskDomain = {
  tasks match {
    case head :: tail =>
      // Expand to: implicitShift(head).cpsApply(i => loop(tail, i + accumulator)(callback))
      loop(tail, !head + accumulator)(callback)
    case Nil =>
      callback(accumulator)
  }
}

Direct style DSLs for the other effect systems are deliberately not used here; raw flatMap calls are used instead, to avoid any performance degradation. The following code block shows the benchmark code for Scala Futures. The code for the other effect systems is similar.

def loop(tasks: List[Future[Int]], accumulator: Int = 0): Future[Int] = {
  tasks match {
    case head :: tail =>
      head.flatMap { i =>
        loop(tail, i + accumulator)
      }
    case Nil =>
      Future.successful(accumulator)
  }
}

The benchmark result is shown below (a larger score is better):

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| RawSum.cats | thread-pool | 1000 | 799.072 | ± 3.094 |
| RawSum.cats | current-thread | 1000 | 26932.907 | ± 845.715 |
| RawSum.dsl | thread-pool | 1000 | 729.947 | ± 4.359 |
| RawSum.dsl | current-thread | 1000 | 31161.171 | ± 589.935 |
| RawSum.future | thread-pool | 1000 | 575.403 | ± 3.567 |
| RawSum.future | current-thread | 1000 | 876.377 | ± 8.525 |
| RawSum.monix | thread-pool | 1000 | 743.340 | ± 11.314 |
| RawSum.monix | current-thread | 1000 | 55421.452 | ± 251.530 |
| RawSum.scalaContinuation | thread-pool | 1000 | 808.671 | ± 3.917 |
| RawSum.scalaContinuation | current-thread | 1000 | 17391.684 | ± 385.138 |
| RawSum.scalaz | thread-pool | 1000 | 722.743 | ± 11.234 |
| RawSum.scalaz | current-thread | 1000 | 15895.606 | ± 235.992 |

The benchmark result of sum for performance baseline

The Task alias of continuation-passing style function used with Dsl.scala is quite fast: Dsl.scala, Monix, and Cats Effect take the top three positions for tasks running in the current thread, and all score within a few percent of the best result for tasks running in a thread pool.

The performance impact of direct style DSLs

In this section, we present the performance impact of introducing different syntax notations. For vanilla CPS functions, we added one more !-notation to avoid manually passing the callback as in the previous benchmark. For the other effect systems, we refactored the previous sum benchmarks to use Scala Async, Scala Continuation’s @cps annotations, and for comprehensions, respectively:

// Left associated sum in Dsl.scala
def loop(tasks: List[Task[Int]]): Task[Int] = _ {
  tasks match {
    case head :: tail =>
      !head + !loop(tail)
    case Nil =>
      0
  }
}

// Right associated sum in Dsl.scala
def loop(tasks: List[Task[Int]], accumulator: Int = 0): Task[Int] = _ {
  tasks match {
    case head :: tail =>
      !loop(tail, !head + accumulator)
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scala Future
def loop(tasks: List[Future[Int]]): Future[Int] = async {
  tasks match {
    case head :: tail =>
      await(head) + await(loop(tail))
    case Nil =>
      0
  }
}

// Right associated sum in Scala Future
def loop(tasks: List[Future[Int]], accumulator: Int = 0): Future[Int] = async {
  tasks match {
    case head :: tail =>
      await(loop(tail, await(head) + accumulator))
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scala Continuation
def loop(tasks: List[() => Int @suspendable]): Int @suspendable = {
  tasks match {
    case head :: tail =>
      head() + loop(tail)
    case Nil =>
      0
  }
}

// Right associated sum in Scala Continuation
def loop(tasks: List[() => Int @suspendable], accumulator: Int = 0): Int @suspendable = {
  tasks match {
    case head :: tail =>
      loop(tail, head() + accumulator)
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scalaz Concurrent
def loop(tasks: List[Task[Int]]): Task[Int] = {
  tasks match {
    case head :: tail =>
      for {
        i <- head
        accumulator <- loop(tail)
      } yield i + accumulator
    case Nil =>
      Task(0)
  }
}

// Right associated sum in Scalaz Concurrent
def loop(tasks: List[Task[Int]], accumulator: Int = 0): Task[Int] = {
  tasks match {
    case head :: tail =>
      for {
        i <- head
        r <- loop(tail, i + accumulator)
      } yield r
    case Nil =>
      Task.now(accumulator)
  }
}

Note that the reduced sum can be implemented with either left-associated or right-associated recursion. The above code contains benchmarks for both cases. The benchmark results are shown below:

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| LeftAssociatedSum.cats | thread-pool | 1000 | 707.940 | ± 10.497 |
| LeftAssociatedSum.cats | current-thread | 1000 | 16165.442 | ± 298.072 |
| LeftAssociatedSum.dsl | thread-pool | 1000 | 729.122 | ± 7.492 |
| LeftAssociatedSum.dsl | current-thread | 1000 | 19856.493 | ± 386.225 |
| LeftAssociatedSum.future | thread-pool | 1000 | 339.415 | ± 1.486 |
| LeftAssociatedSum.future | current-thread | 1000 | 410.785 | ± 1.535 |
| LeftAssociatedSum.monix | thread-pool | 1000 | 742.836 | ± 9.904 |
| LeftAssociatedSum.monix | current-thread | 1000 | 19976.847 | ± 84.222 |
| LeftAssociatedSum.scalaContinuation | thread-pool | 1000 | 657.721 | ± 9.453 |
| LeftAssociatedSum.scalaContinuation | current-thread | 1000 | 15103.883 | ± 255.780 |
| LeftAssociatedSum.scalaz | thread-pool | 1000 | 670.725 | ± 8.957 |
| LeftAssociatedSum.scalaz | current-thread | 1000 | 5113.980 | ± 110.272 |

The benchmark result of left-associated sum in direct style DSLs

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| RightAssociatedSum.cats | thread-pool | 1000 | 708.441 | ± 9.201 |
| RightAssociatedSum.cats | current-thread | 1000 | 15971.331 | ± 315.063 |
| RightAssociatedSum.dsl | thread-pool | 1000 | 758.152 | ± 4.600 |
| RightAssociatedSum.dsl | current-thread | 1000 | 22393.280 | ± 677.752 |
| RightAssociatedSum.future | thread-pool | 1000 | 338.471 | ± 2.188 |
| RightAssociatedSum.future | current-thread | 1000 | 405.866 | ± 2.843 |
| RightAssociatedSum.monix | thread-pool | 1000 | 736.533 | ± 10.856 |
| RightAssociatedSum.monix | current-thread | 1000 | 21687.351 | ± 107.249 |
| RightAssociatedSum.scalaContinuation | thread-pool | 1000 | 654.749 | ± 7.983 |
| RightAssociatedSum.scalaContinuation | current-thread | 1000 | 12080.619 | ± 274.878 |
| RightAssociatedSum.scalaz | thread-pool | 1000 | 676.180 | ± 7.705 |
| RightAssociatedSum.scalaz | current-thread | 1000 | 7911.779 | ± 79.296 |

The benchmark result of right-associated sum in direct style DSLs

The result demonstrates that the !-notation provided by Dsl.scala is faster than all the other direct style DSLs in the right-associated sum benchmark. The Dsl.scala version of sum consumes a constant amount of memory during the loop, because we implemented tail-call detection in our CPS-transform compiler plug-in, and the Dsl interpreter for Task uses a trampoline technique. On the other hand, Monix tasks, Cats Effect, and Scalaz Concurrent showed significant performance degradation, because they cost O(n) memory due to the map call generated by the for comprehension, even though those effect systems also have built-in trampolines. In general, the performance of recursive monadic binds in a for comprehension suffers from this inefficient map.
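The extra map comes from desugaring: a for comprehension always ends in a map, even when the yielded value is returned unchanged. A sketch with Scala Futures (the object and method names are ours for illustration) shows where it appears:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object ForComprehensionCost {
  // The right-associated step
  //   for { i <- head; r <- rest(i) } yield r
  // desugars to a flatMap followed by a map over the identity function,
  // allocating one extra map frame per list element:
  def desugared(head: Future[Int], rest: Int => Future[Int]): Future[Int] =
    head.flatMap(i => rest(i).map(r => r))

  // A hand-written bind needs no trailing map at all:
  def handWritten(head: Future[Int], rest: Int => Future[Int]): Future[Int] =
    head.flatMap(rest)
}
```

Both functions compute the same result; only the desugared version pays for the redundant map on every recursive step.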

The performance of collection manipulation in effect systems

The previous sum benchmarks measured the performance of manually written loops, but usually we use higher-order functions to manipulate collections. We want to know how those higher-order functions can be expressed in direct style DSLs, and how performance is affected by doing so.

In this section, we will present the benchmark result for computing the Cartesian product of lists.

The performance baseline

As in the sum benchmarks, we created benchmarks to maximize the performance of the Cartesian product. Our benchmarks build the Cartesian product with traverseM for Scala Future, Cats Effect, Scalaz Concurrent, and Monix tasks. The following code block shows the benchmark code for Scala Future.

import scala.async.Async.{async, await}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scalaz.std.list._
import scalaz.std.scalaFuture._
import scalaz.syntax.all._

def cellTask(taskX: Future[Int], taskY: Future[Int]): Future[List[Int]] = async {
  List(await(taskX), await(taskY))
}

def listTask(rows: List[Future[Int]], columns: List[Future[Int]]): Future[List[Int]] = {
  rows.traverseM { taskX =>
    columns.traverseM { taskY =>
      cellTask(taskX, taskY)
    }
  }
}

Scala Async or a for comprehension is used in the element-wise task cellTask, but the collection manipulation in listTask is kept as manually written higher-order function calls, because neither Scala Async nor for comprehensions support traverseM.
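For readers unfamiliar with traverseM: in Scalaz, xs.traverseM(f) behaves like traversing with f and then flattening the nested results. A rough equivalent for Scala Futures, written as our own helper specialized to List (not the Scalaz implementation itself):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object TraverseMSketch {
  // Scalaz's xs.traverseM(f) is, for this Future/List case, equivalent to
  // traversing the list with f and then flattening the List[List[B]].
  def traverseM[A, B](xs: List[A])(f: A => Future[List[B]]): Future[List[B]] =
    Future.traverse(xs)(f).map(_.flatten)
}
```

Nesting two such calls, as listTask does, flattens a list of rows of cells into the final Cartesian product.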

The benchmark for Dsl.scala is entirely written in !-notation:

def cellTask(taskX: Task[Int], taskY: Task[Int]): Task[List[Int]] = _ {
  List(!taskX, !taskY)
}

def listTask(rows: List[Task[Int]], columns: List[Task[Int]]): Task[List[Int]] = {
  cellTask(!Each(rows), !Each(columns))
}

The Each keyword is available here because it is adaptive: it can be used not only in the List[_] domain, but also in any (_ !! Coll[_]) domain, as long as Coll is a Scala collection type that supports the CanBuildFrom type class.

We didn’t benchmark Scala Continuation here, because the higher-order functions for List do not work with Scala Continuation.

The benchmark result is shown below:

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| RawCartesianProduct.cats | thread-pool | 50 | 136.415 | ± 1.939 |
| RawCartesianProduct.cats | current-thread | 50 | 1346.874 | ± 7.475 |
| RawCartesianProduct.dsl | thread-pool | 50 | 140.098 | ± 2.062 |
| RawCartesianProduct.dsl | current-thread | 50 | 1580.876 | ± 27.513 |
| RawCartesianProduct.future | thread-pool | 50 | 100.340 | ± 1.894 |
| RawCartesianProduct.future | current-thread | 50 | 93.678 | ± 1.829 |
| RawCartesianProduct.monix | thread-pool | 50 | 142.071 | ± 1.299 |
| RawCartesianProduct.monix | current-thread | 50 | 1750.869 | ± 18.365 |
| RawCartesianProduct.scalaz | thread-pool | 50 | 78.588 | ± 0.623 |
| RawCartesianProduct.scalaz | current-thread | 50 | 357.357 | ± 2.102 |

The benchmark result of Cartesian product for performance baseline

Monix tasks, Cats Effect, and the vanilla CPS functions created from Dsl.scala are still the top three scoring effect systems.

The performance of collection manipulation in direct style DSLs

We then refactored the benchmarks into direct style DSLs. The following code block shows the version for Scala Future, written with the ListT monad transformer provided by Scalaz. The benchmarks for Monix tasks and Scalaz Concurrent are rewritten in a similar style.

import _root_.scalaz.syntax.all._
import _root_.scalaz.ListT
import _root_.scalaz.std.scalaFuture._

def listTask(rows: List[Future[Int]], columns: List[Future[Int]]): Future[List[Int]] = {
  for {
    taskX <- ListT(Future.successful(rows))
    taskY <- ListT(Future.successful(columns))
    x <- taskX.liftM[ListT]
    y <- taskY.liftM[ListT]
    r <- ListT(Future.successful(List(x, y)))
  } yield r
}.run

With the help of the ListT monad transformer, we are able to merge cellTask and listTask into one function in a direct style for comprehension, avoiding any manually written callback functions.

We also merged cellTask and listTask in the Dsl.scala version of benchmark:

def listTask: Task[List[Int]] = reset {
  List(!(!Each(inputDslTasks)), !(!Each(inputDslTasks)))
}

This time, Cats Effect is not benchmarked, due to the lack of a ListT in Cats. The benchmark results are shown below:

| Benchmark | executedIn | size | Score | Error |
| --- | --- | --- | --- | --- |
| CartesianProduct.dsl | thread-pool | 50 | 283.450 | ± 3.042 |
| CartesianProduct.dsl | current-thread | 50 | 1884.514 | ± 47.792 |
| CartesianProduct.future | thread-pool | 50 | 91.233 | ± 1.333 |
| CartesianProduct.future | current-thread | 50 | 150.234 | ± 20.396 |
| CartesianProduct.monix | thread-pool | 50 | 28.597 | ± 0.265 |
| CartesianProduct.monix | current-thread | 50 | 120.068 | ± 17.676 |
| CartesianProduct.scalaz | thread-pool | 50 | 31.110 | ± 0.662 |
| CartesianProduct.scalaz | current-thread | 50 | 87.404 | ± 1.734 |

The benchmark result of Cartesian product in direct style DSLs

Despite requiring only trivial manual liftM calls in the for comprehension, the monad transformer approach causes terrible computational performance in comparison to manually calling traverseM. In contrast, the performance of Dsl.scala even improved when cellTask was inlined into listTask.

Conclusion

Task in Dsl.scala is a simple vanilla continuation-passing style function, whose performance is comparable to Monix and Cats Effect.

When using a direct style DSL, our !-notation is the fastest implementation among for comprehensions, Scala Async, and Scala Continuation. In particular, when performing a complex task that manipulates collections, our !-notation can be 12.5 times faster than all the other direct style DSLs when running in the current thread, and more than 3.1 times faster when running in a thread pool.
