Monday, 18 January 2016

Scala vs. Go TCP Benchmark - why Scala scores so much higher (NOT)

UPDATE: A few days after posting my findings in the comments section of the article, the post was deleted. Click here to view view it at webarchive.org.

The claim: Scala is almost 7 times faster in TCP ping-pong

Some colleagues recently brought this benchmark to my attention, which compares Scala and Go performance over TCP. The authors claim that there's a six-fold performance improvement when using Scala to exchange ping-pong messages over TCP! The scenario goes like this:
  • a server process is listening for TCP connections on a port
  • a client process is fired up, which opens 100 TCP connections to the server
  • each of these connections is used so that the two exchange 10,000 pairs of 4-byte messages
  • the client sends the 4 bytes "Ping" and as soon as the server receives them, it replies with the 4 bytes "Pong"
Apparently the authors got an average time per ping-pong roundtrip of about 1.6 micros for Scala and 11 micros for Go. To quote:
To our surprise, Scala did quite a bit better than Go, with an average round trip time of ~1.6 microseconds (0.0016 milliseconds) vs. Go’s ~11 microseconds (0.011 milliseconds).

This claim seemed absurd to me. However, this is an old post, and there were 38 comments below it. I read through them and none of them seemed to explain what's happening. Some mentioned memory use due to the 4-byte array being recreated (with such a small footprint I doubt it's an issue). Some mentioned differing socket options/buffer-sizes and one in particular mentioned Naggle's algorithm, but I'm not convinced either (the benchmark alternates between reading/writing, so Naggle's algorithm will NOT get a chance to delay transmission). Some mentioned buffering in user space (e.g. use buffio package in Go) but again that can't be it (the test alternates between reading and writing just FOUR bytes).

In short, none of this made sense, so I decided to investigate.

Verifying the results

Naturally, the first thing I did was run the benchmark myself in disbelief. The process is like this: you launch the server which just sits there waiting for connections forever. You then launch the client, which creates the 100 connections, does the 10,000 ping-pongs per connection, and finally prints out an estimate how many microseconds each ping-pong took. The way it does this is by measuring the entire duration from start to end, and then dividing by 1 million, which is the total number of ping-pongs (100 times 10,000).

I should note that I ran all this on a late 2013 MacBook Pro, which has 16GB of RAM and a quad-core i7-4850HQ with hyper-threading. The software used was Scala 2.11.7 on Java 1.8.0_51, and Go 1.5.2. I measured 3 times for each of them (Scala/Go) and picked the best of 3 reported times to publish here. The results are:

Scala Go
1.742 9.887

Ouch! Go appears to be slower by a factor of 5.675 (that's 9.887 divide by 1.742)! I say appears because I simply can not believe this. Something has got to be wrong here. The next thing I did, was measure using mixed implementations, which one of the comments noted were unexpected. The "mixed implementation" test is basically to use the Scala client against the Go server and vice-versa.

UNIMPORTANT SIDE NOTE: Before I show the mixed implementation test results for my machine, I should mention that I made a minor change in the Scala code to set the connection backlog size to 200. This was the only way I could get the Go client to reliably connect to the Scala server. For more information see the documentation at which explains that the default is 50 concurrent connection initiations, before rejections start happening. This has no significant effect on the measurements. You can repeat them yourself (with and without this change) and you will see that they yield identical results. So here is the code I used for the Scala server in this second experiment, with the comment BACKLOG CHANGE marking the line I modified:

//SERVER
import java.net._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
 
object main{
 
    def handleClient(s: Socket) : Unit = {
      val in = s.getInputStream
      val out = s.getOutputStream
      while(s.isConnected){
        val buffer = Array[Byte](4)
        in.read(buffer)
        out.write("Pong".getBytes)
      }
    }
 
    def main(args: Array[String]){
      val server = new ServerSocket(1201, 200)  // BACKLOG CHANGE
      while(true){
        val s: Socket = server.accept()
        Future { handleClient(s) }
      }
    }
}

Anyway, let's get back to the mixed implementation results. I ran the servers and clients with crossed implementations and here's what I got:

Go server - Scala client Scala server - Go client
3.546 149.671

According to the results of my second experiment, the Scala client was half as good (3.546/1.742 is about 2) when running against a Go server. So it looks like the Go server is so bad that it will halve the speed with which the client can run. Let's say you believe that. Explain to me then why the Go client scored an astonishingly bad 149.671 when running against the "faster" Scala server? Since the server is so much faster, it should at least preserve the performance of the Go client, or even improve it (if the bottleneck is really the Go server)! This does not make sense.

Proving that something is broken on the Scala side

At this point I was very suspicious of the Scala code. One of the comments mentioned that there is a bug in the Scala code which defines a buffer variable of size 1 instead of 4, so is unable to read the entire "Ping/Pong" message content. This was dismissed on the ground of "we always send an entire Pong anyway" which didn't convince me either.

I decided to modify the source a bit to actually count and report the number of bytes read by the server. See the comments ADDITION below for my modifications. Basically I add a global counter which gets incremented after every read in a thread-safe manner. The number of bytes actually read is returned by the read() method.

//SERVER
import java.net._
import java.util.concurrent.atomic.AtomicInteger            //ADDITION
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
 
object main{
    val totalRead = new AtomicInteger(0)

    def handleClient(s: Socket) : Unit = {
      val in = s.getInputStream
      val out = s.getOutputStream
      while(s.isConnected){
        val buffer = Array[Byte](4)
        val actuallyRead = in.read(buffer)
        if (actuallyRead > 0)                   // ADDITION
          totalRead.addAndGet(actuallyRead)
        out.write("Pong".getBytes)
      }
    }
 
    def main(args: Array[String]){
      val server = new ServerSocket(1201, 200)  // BACKLOG CHANGE
      var client = 0
      while(true){
        val s: Socket = server.accept()
        client += 1
        val clientId = client
        val f = Future { handleClient(s) } 
        f onComplete {                          // ADDITION
          case _ => println("After client %d, total bytes read: %d".format(clientId, totalRead.get))
        }
      }
    }
}

Now, before you run this, consider for a moment what you'd expect. You have 100 connections, with 10,000 pings coming in from each one. Since each ping is 4 bytes long, you'd expect 40,000 bytes to be read from each connection, for a total of 4,000,000 bytes across all clients. Now fire up the Scala server and then the Scala client. Here's what I saw in my output:

...
After client 98, total bytes read: 996120
After client 99, total bytes read: 999428
After client 100, total bytes read: 1000921

So basically the server has only done a quarter of the expected work in terms of receiving data. We have silently lost about three quarters of the incoming bytes! What's the deal here? Does the client never send them? Or does the server never read them? Well, according to Java's documentation for the read() method, an attempt is made to read at least one byte. It blocks until at least one byte is available, and then reads as many bytes as possible, trying to fill the buffer provided as a parameter. So let's run the Go client against the same Scala server and see what happens:

...
After client 99, total bytes read: 3994164
After client 100, total bytes read: 3994164
After client 90, total bytes read: 3994164
After client 98, total bytes read: 3994164
After client 95, total bytes read: 3994164
After client 73, total bytes read: 3994164
After client 85, total bytes read: 3994164

We get almost all our bytes now, but not quite all of them. But evidently even though the Scala server is perfectly capable of receiving almost all of the data (even if it reads it one byte at a time) we can see that in practice there's something in the Scala client that causes it to stop after reading about a quarter of the stream!

Now, we can see from the server loop that as long as the socket is connected, the server will keep reading incoming data. So it just needs the client to stick around long enough with the connection open: the Scala server will eventually read through all the bytes. We can see that the Go client stays connected long enough to exchange all data, so the problem must be that the Scala client does not stick around to do the entire exchange! If that is the case, the number the Scala client reports must be wrong!

Let's just fix things and do a proper comparison

At this point I decided to simply change both the Scala client and the Scala server to use 4-byte buffers. I'm not so much worried about read() returning less than 4 bytes (as some comments suggested) because we are doing loopback communication here with 4-byte payloads: there's no chance these will be split across multiple TCP segments. I expect read() to always return 4 bytes as long as the buffer is at least that big. So here's the Scala code:

//SERVER
import java.net._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
 
object main{
 
    def handleClient(s: Socket) : Unit = {
      val in = s.getInputStream
      val out = s.getOutputStream
      while(s.isConnected){
        val buffer = Array[Byte](0, 0, 0, 0)        // BUG FIX
        in.read(buffer)
        out.write("Pong".getBytes)
      }
    }
 
    def main(args: Array[String]){
      val server = new ServerSocket(1201, 200)      // BACKLOG CHANGE
      while(true){
        val s: Socket = server.accept()
        Future { handleClient(s) }
      }
    }
}

And here is the fixed client:

//CLIENT
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.language.postfixOps
import java.net._
 
object main{
 
    def ping(timesToPing: Int) : Unit = {
        val socket = new Socket("localhost", 1201)
        val out = socket.getOutputStream
        val in = socket.getInputStream
        for (i <- 0 until timesToPing) {
            out.write("Ping".getBytes)
            val buffer = Array[Byte](0, 0, 0, 0)        // BUG FIX
            in.read(buffer)
        }
        socket.close
    }
 
    def main(args: Array[String]){
        var totalPings = 1000000
        var concurrentConnections = 100
        var pingsPerConnection : Int = totalPings/concurrentConnections
        var actualTotalPings : Int = pingsPerConnection*concurrentConnections
 
        val t0 = (System.currentTimeMillis()).toDouble
        var futures = (0 until concurrentConnections).map{_ => 
            Future(ping(pingsPerConnection))
        }
 
        Await.result(Future.sequence(futures), 1 minutes)
        val t1 = (System.currentTimeMillis()).toDouble
        println(1000*(t1-t0)/actualTotalPings)
    }
}

I now re-ran the benchmark tests and here are the results:

Scala Go Go server - Scala client Scala server - Go client
9.622 9.887 11.621 11.252

Suddenly the difference is negligible, less than 1 microsecond! We just need to make sure we're comparing the same thing! If you want, you can run the version of the Scala server that logs the number of bytes exchanged. You will then see that it is exactly 4,000,000 and that happens for both the Go and the Scala clients.

Explaining the impact of the 1-byte buffer

But why does this buffer size makes such a huge difference? The reason is that when the Scala server receives the first ping, it can immediately loop 4 times and send 4 pongs to the client, without having to wait for another ping from the Scala client. Similarly, when the Scala client receives the first pong it can immediately loop 4 times and send 4 more pings, without having to wait for the server to send another pong from the Scala server. Note that this happens for each message: whenever the client sends a ping, the server replies with 4 pongs. Similarly, whenever the client receives one of those pongs, it sends out 4 new pings instead of one. This means that the client and the server can be running at the same time rather than waiting for each other, due to the ever-increasing size of the burst. On the other hand, the Go versions empty the incoming buffer on both sides after each read. This means that when they loop to do the next read, they must wait for their counterparty to receive and send the next message. In effect, in the Go versions, there is never both a client and a server ready to run at the same time on different CPU cores! One always ends up in WAITING state so that the other can run and provide the data it is waiting for. They are running in lock-step, alternating from one to the other!

Another interesting dynamic here is the kernel-space buffering present in TCP. The operating system has a kernel-space buffer for receiving TCP data, which in the Scala case is combined with user-space buffering that happens in the JVM. The reason for the user-space buffering is due to the fact that Java's java.net.Socket.getInputStream() returns a buffered implementation. The reason this is important, is because system calls are expensive. System calls are the way the operating system kernel communicates and exchanges data with user processes. When a process has to fetch incoming data for the TCP socket it executes a system call. On a x86 architecture, this means that a software interrupt must happen, causing a jump to the interrupt hander the kernel has installed, with the CPU switching from protected to real mode, which then saves process state and starts copying memory between kernel and user space and so on... The JVM wants to avoid doing this as much as possible so it will create a large buffer in user-space to exploit each system call to the maximum. Calls to read() from the InputStream use the user-space buffer until it is depleted, at which point another expensive system call is made. The Go version does not have user-space buffering, but as you will soon see (read on) it would not have made a difference anyway.

So back to the system calls and the buffering in Scala: when the Scala bug is present, after the first ping is sent by its client, the server makes a system call and receives 4 bytes in user space. It them immediately loops 4 times and sends 4 pongs worth 16 bytes to the client in a burst (due to the bug which it reads one byte at a time from the ping which is already in the user-space buffer). Now, the client is waiting for a pong and when its read() executes it gets 4 of them! These 16 bytes are moved to user space by the InputStream in a SINGLE system call, allowing the client to then loop SIXTEEN times without waiting or doing a system call and send 16 new 4-byte pings worth 256 bytes! This effect is amplified by a factor of 4 with every iteration of the Scala code. So the first system call of the server gets 4 bytes, the next one 64, the next one 1024, and so on. Similarly, the client side gets 4 bytes, then 16, then 256, then 4096, and so on! Scala makes far fewer expensive system calls, as more and more data is moved across kernel-space and user-space with each one of them. Meanwhile, the Go code sends and receives all 4 bytes each time, draining the buffers altogether, and doing exactly 10,000 expensive system calls on each side, as intended. In fact, after each iteration, both server and client WAIT until the data is copied by the operating system across the TCP connection.

I should note however, that even if buffered I/O was used with GO it would not make a difference: each party waits for the entire ping (server) or pong (client) message to arrive, before sending its reply. So even if there was user-space buffering present in Go, it would never get a chance to carry more than 4 bytes over from kernel space to user space, because there would never be more that 4 bytes available! In the Go version (unlike the buggy Scala version), always has exactly 4 bytes in transit in either direction! The fact that user space buffering isn't important becomes evident when we fix the Scala code: there's no significant performance advantage. The user-space buffering is only important if there's more data travelling on the TCP channel than what the client side asks to read (which was the case with single-byte buffers in Scala).

Executive Summary

The reason for the perceived faster Scala implementation was the single-byte buffer. This had the following effects:

  1. The Scala client sent bursts of increasing size (1, 16, 256 and so on) of pings to the server, which replied in a similar fashion (4, 64, 1024 and so on) with pong bursts. This allowed the two to run concurrently, instead of waiting for each other. In contrast, the Go versions were running in alternating fashion, with each client and server go-routine switching between READY and WAITING state.
  2. Not only was the Scala client never switching to WAITING state during its 10,000 iterations, it also stopped immediately afterward reading just 10,000 bytes from the socket instead of 40,000 (because each iteration reads just one byte). At that point it would STOP and repor the time it took to read 10,000 bytes from each connection.
  3. During that time the user-space buffering of the Scala client was combined with the increased size of each burst, causing Scala to make far fewer system calls for copying data between user space and kernel space buffers. The Go version would make exactly 10,000 system calls for reading on each side, whereas the Scala version probably made less than 10!

As soon as the Scala version is fixed, it immediately drops to performance comparable to Go. The difference of less than 1 microsecond on my machine is negligible.

Concluding remarks

This "benchmark" is really about stress-testing the underlying operating system. It says nothing about Go or Scala. You could write the same code in any half-way decent language (Java, C++, Python, Ruby, etc) and get almost identical results (that is, as long as the two implementations are really doing the same thing).

The language is not really important in this case. It's the underlying OS that is doing most of the work here. Our processes spend 99% of their time either waiting for the I/O to complete, or in kernel-space, executing Operating System code for transferring data. That code is NOT the code that we wrote, but rather the code of our OS which was most likely written in C or C++, and is common no matter what language you used to write the benchmark.

The only thing this can be used for is to take an implementation of the test (the proper bug-free Scala one, or the Go one) and run it on the same hardware but with different underlying OS (Linux, FreeBSD, OS X, Windows...) to see which TCP/IP stack works faster for "loopback" (on same host) communication.

Or perhaps you could use it to fine-tune settings for the OS to improve the performance. In this case it would simply be a tool to review what impact changes of various networking-stack-related settings of the OS would have on I/O workloads such as this one (small alternating messages).

No comments:

Post a Comment