Saturday, April 18, 2009

Object#extend leaks memory on Ruby 1.9.1

The garbage collector is a really strange business in Ruby land. For me, it is currently the major performance drain. If you are not aware of its limitations, here's a list:

  1. The GC is mark and sweep: it needs to scan the whole heap on each run, so its cost grows linearly with heap size, O(n).
  2. The GC cannot be interrupted and hence all threads must wait for it to finish (shameful pause on big heaps).
  3. The GC stores its mark bits in the objects themselves, destroying any benefit of copy-on-write.
  4. The GC does not (edit: usually) give memory back to the system. What goes in does not (edit: usually) go out.
  5. It is a bit on the conservative side, meaning garbage can stay around because the GC cannot be sure that it is garbage.


Needless to say, some of these are being addressed, especially 3 and 5. But the patches are not yet accepted in the current Ruby release. I believe though that they will find their way to 1.8.x, which is being maintained by Engine Yard. The EY guys are really working hard to solve the issues of Ruby as a server platform, which is the most popular use for it today thanks to Rails.


Alas, my issue today involves Ruby 1.9.1 (it does not affect 1.8.x). See, I have built this toy server to experiment with multi-process applications and some Unix IPC facilities. I made the design a bit modular to make it easier to test and debug different aspects of the stack. So I have these tcp and http handler modules that extend the connection object (a socket) whenever a connection is accepted. Here's a sample:


conn = server_socket.accept
conn.extend HttpHandler
..
..


This worked really well, and I was even able to chain handlers to get more stack functionality (a handler will simply include those that it requires). All was great, until I looked at memory usage.

I discovered that after showering the server with requests it will start to grow in size. This is acceptable as it is making way for new objects. But given the way the GC works it should have allocated enough heap locations after a few of those ab runs. On the contrary, even when I am hitting the same file with ab the server keeps growing. After 10 or more ab runs (each doing 10000 requests) it is still consuming more memory. So I suspected there is a leak somewhere. I tested a hello world and found that the increase was very consistent. Every 10K requests the process gains 0.1 to 0.2 MB (10 to 20 bytes per request). So I started removing components one after another till I was left with a bare server that only requires socket and reactor.

 

When I tested that server the process started to gain memory, then after 3 or 4 ab runs it stabilized. It would no longer increase its allocated memory no matter how many times I ran ab on it. So the next logical move was to re-insert the first level of the stack (the tcp handler module). Once I did that the issue started appearing again. So the next test was to disable the use of the tcp handler but still decorate my connections with it. The issue still appeared. Since the module does not define the Module#extended hook to do any work when it extends an object, it became clear that Object#extend itself is the guilty party.

Instead of Object#extend I tried reopening the BasicSocket class and including the required module there. After doing that memory usage pattern resembled the bare bones server. It would increase for a few runs and then remain flat as long as you are hitting the same request.
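To make the difference between the two approaches concrete, here is a minimal sketch. The Connection class and the label method are stand-ins made up for illustration, not the toy server's actual code, and singleton_class is only available from Ruby 1.9.2 on.

```ruby
# Hypothetical handler module, standing in for the real tcp/http handlers.
module HttpHandler
  def label
    "http"
  end
end

# Hypothetical stand-in for the class of the accepted socket.
class Connection; end

# Approach 1: decorate a single object with Object#extend.
conn = Connection.new
conn.extend(HttpHandler)

# The module lands in the object's own singleton class, not in Connection.
p conn.singleton_class.include?(HttpHandler) # → true
p Connection.include?(HttpHandler)           # → false

# Approach 2: reopen the class once and include the module there.
class Connection
  include HttpHandler
end

# Every instance now gets the handler methods without any per-object work.
p Connection.new.label # → "http"
```

With the first approach the per-object work is repeated for every accepted connection; with the second it happens once, when the class is reopened.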

To isolate the problem further I created this script:

# This code is Ruby 1.9.x and above only

@extend = ARGV[0]

module BetterHash
  def blabla
  end
end

unless @extend
  class Hash
    include BetterHash
  end
end

t = Time.now
1_000_000.times do
  s = {}
  s.extend BetterHash if @extend 
end
after = Time.now - t
puts "done with #{GC.count} gc runs after #{after} seconds"
sleep # so that it doesn't exit before we check the memory

using extend:
351 GC runs, 9.108 seconds, 18.7 MB

using include:
117 GC runs, 0.198 seconds, 2.8 MB

Besides being much faster, the resulting process was much smaller, by around 16MB. I suspect that the leak is around 16 bytes or a little less per extend invocation. This means that a server that uses a single extend per request will grow by around 160KB after every 10K requests. Not that huge, but it will pile up fast if the server is left running for a while under heavy load.
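The per-call estimate follows directly from the two measurements. Here is a small sketch of the arithmetic; it assumes 1 MB means 1,000,000 bytes and that the whole size difference is due to extend.

```ruby
# Process sizes reported above for one million iterations.
EXTEND_MB  = 18.7
INCLUDE_MB = 2.8
ITERATIONS = 1_000_000

# Extra bytes retained per extend call (assuming 1 MB = 1,000,000 bytes).
def leak_per_extend
  ((EXTEND_MB - INCLUDE_MB) * 1_000_000 / ITERATIONS).round(1)
end

# Growth in KB after the given number of requests, one extend per request.
def growth_kb(requests)
  (leak_per_extend * requests / 1000).round
end

puts "#{leak_per_extend} bytes per extend"       # 15.9
puts "#{growth_kb(10_000)} KB per 10K requests"  # 159
```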



A quick grep in Rails sources showed that this pattern is being used heavily throughout the code. But it is used to extend base classes rather than objects. Hence it will not be invoked on every request and the effect will be mostly limited to the initial start size (a few bytes actually). You should avoid using it dynamically at request serving time though, till it gets fixed.

 

Monday, April 13, 2009

A fast, simple, pure Ruby reactor library

Please welcome Reactor, a reactor library with the very original name of "Reactor".

What is a reactor any way?

A reactor library is one that provides an asynchronous event handling mechanism. Ruby already has a couple of those. The most prominent are EventMachine and Rev.

Many high performing Ruby applications like Thin and Evented Mongrel are utilizing EventMachine for event handling. Both Rev and EventMachine build atop native reactor implementations written in C or C++. While this ensures high performance, it makes some integration aspects with Ruby a bit quirky, sometimes even at a noticeable performance cost.

This is why I thought of building Reactor: a much simpler reactor library in pure Ruby that attempts to use as many of Ruby's built-in classes and standard libraries as possible. It only provides a minimal API that does not attempt to be too smart. It differs from EventMachine and Rev in the following aspects:


  1. Pure Ruby, no C or C++ code involved
  2. Very small (~100 lines of code)
  3. Uses the vanilla Ruby socket and server implementations
  4. Decent (high) performance on Ruby 1.9.1
  5. Ruby threading friendly (naturally)
  6. You can have multiple reactors running (like Rev and unlike EventMachine)

Usage is simple. Here's a simple echo server that uses Reactor:

require 'reactor'
require 'socket'

reactor = Reactor::Base.new
server = TCPServer.new("0.0.0.0", 8080)
reactor.attach(:read, server) do |server|
  conn = server.accept
  conn.write(conn.gets)
  conn.close
end
reactor.run # blocking call, will run forever

The server is a normal Ruby TCPServer. It attaches itself to the reactor and asks to be notified if there is data to be read on the wire. A block is provided that will handle those notifications. Alternatively, the server can implement a notify_readable method that will be fired instead.

Any IO object can be attached to the reactor but it doesn't make much sense to attach actual files since they will block upon reading or writing anyway. Sockets and pipes will work in a non-blocking manner though.

Reactor is using Ruby's IO.select behind the scenes. This limits its ability to scale in comparison to something like EventMachine or Rev which are able to utilize Epoll and Kqueue which scale much better. This is not a major concern though. Most servers listen to a few fds most of the time, which is a bit faster when using select. Besides one can hope that Ruby will be able to use Epoll and Kqueue some day which will translate to direct benefit to Reactor.
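For readers who have not seen a select-based loop before, here is a toy sketch of the idea. TinyReactor, its attach and tick methods, and the demo helper are all hypothetical names made up for this illustration; this is not Reactor's actual source or API, and it only handles read events.

```ruby
# A toy select-based reactor. NOT the actual Reactor source, just the idea.
class TinyReactor
  def initialize
    @readers = {} # maps each IO object to its callback
  end

  # Ask to be notified when +io+ has data to read.
  def attach(io, &callback)
    @readers[io] = callback
  end

  # One loop iteration: wait up to +timeout+ seconds, then fire the callback
  # of every IO that became readable. Returns how many callbacks fired.
  def tick(timeout = 0.1)
    return 0 if @readers.empty?
    ready, = IO.select(@readers.keys, nil, nil, timeout)
    return 0 unless ready
    ready.each { |io| @readers[io].call(io) }
    ready.size
  end
end

# Usage: a pipe stands in for a socket.
def demo_tiny_reactor
  reader, writer = IO.pipe
  received = []
  reactor = TinyReactor.new
  reactor.attach(reader) { |io| received << io.read_nonblock(1024) }
  writer.write("ping")
  reactor.tick
  received
ensure
  reader.close if reader
  writer.close if writer
end

p demo_tiny_reactor
```

Because IO.select takes the full list of IO objects on every call, the cost of each iteration grows with the number of attached descriptors, which is the scaling limit mentioned above.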

Sunday, April 05, 2009

Ruby Strikes Back

If you are not following Mauricio Fernandez's blog then please do yourself a favor and subscribe to it. Mauricio's writings are very interesting and informative. In one of his posts Mauricio gives a record of re-implementing his blog in OCaml using the OCsigen (webserver + framework) library. Mauricio did some benchmarking for the OCsigen environment against Rails and even a C fastcgi implementation. Naturally one would expect OCaml to be orders of magnitude faster than Ruby. But the benchmark showed really abysmal performance for Rails vs. OCsigen. We are talking 260 requests per second vs. 4500 requests per second for a single process test! That's roughly a 17X difference! I decided that Ruby can do better.

Looking at what OCsigen offers revealed that Rails is overkill in comparison. I thought that for Ruby a nice alternative can be the mystery webserver + framework called Unicycle (never heard of it? you don't know what you're missing). Since OCsigen offers LWt (a light weight cooperative threading library) at its core for concurrency, I added a fiber wrapper to Unicycle's request processing path so that we get a similar overhead (the testing was done using Ruby 1.9.1).

Here are my results:

Hello World - Unicycle, Single Process: 7378 requests/second

Please note that this is running on my Intel Mobile Core 2 Duo 2.0 GHz processor vs. the 3 GHz desktop AMD Athlon64 that was used for the original tests (it should be roughly 50% faster than my mobile Core 2).

I decided to take it further still. Mauricio mentioned that he was able to get performance above 2000 req/s from OCsigen when benchmarking the blog page that we are discussing here. So I created a sqlite database (he mentioned somewhere that he is using sqlite) and inserted the same blog entry (with very little modifications) in a structured manner. I didn't bother with comments though (out of being lazy). Sequel was used to connect and fetch the record from the database and an rhtml template that is rendered using Erubis. The result was a page very similar to the original blog post. ApacheBench was used to benchmark the page.

Unicycle + Fibers + Sequel (Sqlite) + Erubis, Single Process: 1296 requests/second

During the time of testing the Unicycle process was between 13MB and 21MB (that's 3x to 4x the size of the OCsigen process)

Considering that the components found in any laptop are usually inferior to their desktop counterparts, I believe this at least equals the figure reported for OCsigen's performance.

How can Ruby achieve such performance? By careful selection of components:

First off, Ruby 1.9.1. Everybody should start using it for their next project. It is much faster and much easier on memory.

Unicycle is built atop EventMachine and the EventMachine HTTP Server. Both are C based speed demons. Unicycle itself is a minimal framework that doesn't attempt to be so smart.

Erubis is a nice surprise. Pure Ruby and decently fast are not commonly found together but kudos to the authors of Erubis, they somehow did it.

Conclusion

Ruby is faster than OCaml. Wrong! OCaml is a lot faster than Ruby. But thanks to hard work by some prominent Rubyists you can have a Ruby setup that performs decently enough to make you proud.

Thursday, January 15, 2009

We will NOT go down in the night

Friday, November 14, 2008

Ruby Networking on Steroids

Ruby provides several socket classes for various connection protocols. Those classes are arranged in a strange and convoluted hierarchy.
This ASCII diagram explains the hierarchy:

IO
|
BasicSocket
|
|-- IPSocket
|   |
|   |-- TCPSocket
|   |   |
|   |   |-- TCPServer
|   |   |
|   |   |-- SocksSocket
|   |
|   |-- UDPSocket
|
|-- Socket
|
|-- UNIXSocket
    |
    |-- UNIXServer

The BasicSocket class provides some common methods but you cannot instantiate it. You have to use one of its subclasses. Three branches come out of BasicSocket: one implements the IP protocols (and their descendants), another implements UNIX domain sockets, and a third provides a generic wrapper over BSD sockets. The first problem with this branching strategy is that, while the Socket class could serve as a parent class to both UNIXSocket and IPSocket, the implementer chose to create a separate path for each of them. The result is a lot of code duplication in the implementation, which makes maintaining those classes a lot harder than it should be.

A prime example of this is the recent addition of non-blocking features to the I/O and socket classes. Only the Socket class was lucky enough to get a connect_nonblock method. The other classes sadly didn't get it. It is very important to be able to initiate network connections in a non-blocking manner if you are using an evented framework (like NeverBlock for example).

What makes the problem worse is that major Ruby network libraries overlook the Socket class and use TCPSocket or UNIXSocket. Net/HTTP for example uses TCPSocket. Since NeverBlock tries to work in harmony with most Ruby libraries, it attempts to make up for this inconsistency by altering the default hierarchy of socket classes. Ruby allows you to un-define constants in an object. We remove the TCPSocket and UNIXSocket classes and redefine them by inheriting from Socket and defining some methods to make up for any lost functionality.
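The constant-swapping trick can be sketched with toy classes. GenericSocket and LegacySocket below are hypothetical stand-ins (playing the roles of Socket and TCPSocket); this is not NeverBlock's actual code, only the Ruby mechanism it relies on.

```ruby
# Hypothetical stand-in for Socket, the class with the non-blocking method.
class GenericSocket
  def connect_nonblock
    :nonblocking
  end
end

# Hypothetical stand-in for TCPSocket, defined on a separate branch.
class LegacySocket
  def connect
    :blocking
  end
end

# Reopening LegacySocket with a different superclass would raise a
# "superclass mismatch" TypeError, so the constant is removed first...
Object.send(:remove_const, :LegacySocket)

# ...and then defined again, this time inheriting from GenericSocket.
class LegacySocket < GenericSocket
  # make up for the functionality lost with the old definition
  def connect
    :blocking
  end
end

p LegacySocket.superclass           # → GenericSocket
p LegacySocket.new.connect_nonblock # → :nonblocking
```

Code that refers to the constant by name after the swap picks up the new class, which is why libraries written against the old class can keep working unchanged.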

After modifying the Socket classes, NeverBlock support was integrated. This was done by rewriting the connect, read and write methods so that they detect the presence of a NeverBlock fiber and operate in an asynchronous way accordingly. If you use the new socket classes in a non-NeverBlock context or in NeverBlock's blocking mode they will resort to the old blocking implementation.

So here is an example. First we will create a server using EventMachine that takes 1 second to process each request.

server.rb

require 'eventmachine'

class Server < EM::Connection
  # handle requests here
  def receive_data(data)
    # set the response to be sent after 1 second
    EM.add_timer(1) do
      send_data "HTTP/1.1 200 OK\r\n\r\ndone"
      close_connection_after_writing
    end
  end
end

EM.run do
  EM.start_server('0.0.0.0', 8080, Server)
end


Second we will create a client that will issue requests to the server

client.rb

require 'neverblock'
require 'net/http'

EM.run do
  @pool = NB::FiberPool.new(20)
  20.times do
    @pool.spawn do
      # parse the URL first: a plain String has no host or port methods
      url = URI.parse("http://localhost:8080")
      res = Net::HTTP.start(url.host, url.port) { |http| http.get('/') }
    end
  end
end

Issuing 20 GET requests in NeverBlock fibers causes them to run concurrently. Even though our server takes one full second to process each request, they all return after approximately 1 second.

Here is a blocking version

blocking_client.rb

require 'net/http'

20.times do
  # parse the URL first: a plain String has no host or port methods
  url = URI.parse("http://localhost:8080")
  res = Net::HTTP.start(url.host, url.port) { |http| http.get('/') }
end


The blocking client finishes after around 20 seconds.

Here's a teaser graph



The really good thing is that we used the Net/HTTP library transparently. Any Ruby library that relies on Ruby sockets will benefit from NeverBlock and gain the ability to run in a concurrent manner.

What does that mean?

Originally, NeverBlock only supported concurrent database access for PostgreSQL and MySQL. While this was good and all, databases are usually the bottleneck of most applications, unless you have something like a database cluster which can truly absorb any load. This was a shame, since NeverBlock is meant for high levels of concurrency that are only available with massively scalable back ends. With this new development, however, we are now one step closer to tapping into this realm of high performance and scalable web applications. Read on.

Enter AWS and the cloud

Amazon Web Services provide an example of a massively scalable backend that is accessible via HTTP. Services like S3, SimpleDB and SQS are all a URL away. Such services have a higher latency than your nearby database server but they more than make up for that by being able to absorb all the requests you throw at them. Most of the Ruby libraries for accessing AWS rely on Net/HTTP in some way or another. This means we get NeverBlock support for those libraries. Now this is big news for those Ruby applications (including Rails ones) that rely on AWS or a similar backend. For those types of apps, forget about a pool of 10 or 20 fibers. We are talking about a pool of 1000 fibers here. Even higher numbers could be possible (once a nasty file descriptor bug in Ruby 1.9 is fixed).

Why Not Threads?

I have been claiming that Ruby fibers are faster than Ruby threads[1]. I have seen that in my tests but those were usually limited to testing a single performance metric. So I decided to simulate a very scalable back end and see which approach offers more scalability. For testing purposes I created two client applications. One is threaded and the other is based on NeverBlock. In the NeverBlock version I did not use the fiber pool though; I was creating a new fiber per operation to mimic the threaded app behavior. The simulated scalable back end consisted of an EventMachine based server that waits for a certain time before responding with 200 OK. The delay simulates back end processing and network latencies. I tested using 0, 10, 50, 100 and 500 ms as delay values. Another client application was written that worked in the normal blocking mode for comparison.

The clients were tested using Ruby 1.8.6 and 1.9.1. The only exception was the NeverBlock client, which was only tested with 1.9.1. This is because the current fiber implementation for Ruby 1.8.x is based on threads, so it would only reflect the performance of a threaded implementation. Ruby 1.8 was added to the mix because I noticed scalability and performance problems with the Ruby 1.9 threading implementation; 1.8 proved to have a (sometimes) faster and more scalable threading implementation.

The application will attempt to issue 1000 requests to the back end server and will try to do so in a concurrent fashion (except for the blocking version of course).

Here are the results



And the results in ASCII format (numbers in cells are requests/sec)

Server Delay         0ms    10ms   50ms   100ms  500ms
Ruby1.8 Blocking     2000   19     16     10     2
Ruby1.9 Blocking     2400   19     17     10     2
Ruby1.8 Threaded     1050   800    670    536    415
Ruby1.9 Threaded     618    470    451    441    395
Ruby1.9 NeverBlock   2360   1997   1837   1656   1031

Let's try to explain the results. For a server that has no delay whatsoever (a utopian assumption) we see that the blocking clients offer the greatest performance. Ruby 1.9 in blocking mode comes first, mainly because Ruby 1.9 is faster than Ruby 1.8 and also comes with a faster Net/HTTP library[1]. Why is blocking faster? Simply because the evented server is processing the requests serially and the latency is minimal. The request handler sends a response and returns immediately, so the server does not get a chance to process requests concurrently. This is the fastest that you can drive your processor.

The NeverBlock implementation comes in as a very close second to the fastest client, which shows that the overhead of using fibers is not that much. Actually we are cheating a bit here, because we make up for the overhead by sending the requests concurrently: while the server is still processing them serially, we are able to do the fiber pause and resume work.

Needless to say, NeverBlock is much ahead of the threaded clients (either 1.8 or 1.9) when working with the zero latency server. We also see that 1.8 threads are considerably faster than 1.9's.

When we start adding a simulated delay to the server we see that the blocking clients fall dramatically from the first position to the last. They become so slow that they are really not suitable for use in that setting any more. Please note that the results for the 500ms delay are extrapolations. I was too annoyed by the idea of waiting 500 seconds for a test to run, twice!

On the other hand, threaded and NeverBlock implementations are much less affected even though they lose ground as we increase the delay. NeverBlock maintains its lead though over threaded clients. It is generally 2.5X faster.

Here is a graph of the NeverBlock advantage over the fastest threaded client



And in ASCII format

Server Delay           0ms       10ms      50ms      100ms     500ms
NeverBlock Advantage   124.76%   149.63%   174.18%   208.96%   148.43%
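The advantage figures can be recomputed from the requests/sec table above. This sketch takes the advantage to be the NeverBlock throughput relative to the fastest threaded client at each delay, minus one, as a percentage; the constant and method names are made up for the example.

```ruby
# Requests/sec from the results table, one entry per server delay
# (0ms, 10ms, 50ms, 100ms, 500ms).
RESULTS = {
  "Ruby1.8 Threaded"   => [1050, 800, 670, 536, 415],
  "Ruby1.9 Threaded"   => [618, 470, 451, 441, 395],
  "Ruby1.9 NeverBlock" => [2360, 1997, 1837, 1656, 1031]
}

# Percentage by which NeverBlock beats the fastest threaded client at each delay.
def neverblock_advantage
  threaded = RESULTS.values_at("Ruby1.8 Threaded", "Ruby1.9 Threaded")
  best_threaded = threaded.transpose.map(&:max)
  RESULTS["Ruby1.9 NeverBlock"].zip(best_threaded).map do |nb, th|
    # Rational arithmetic avoids float rounding surprises on exact halves
    Rational((nb - th) * 100, th).round(2).to_f
  end
end

p neverblock_advantage # → [124.76, 149.63, 174.18, 208.96, 148.43]
```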

Aside from the NeverBlock advantage, the numbers themselves are very impressive. A single process can achieve ~1000 operations per second given that we have half a second of processing and network latency. In a multi-process setup we should be able to achieve a lot more than that. For example, forking another NeverBlock client on my dual core notebook, which hosts both the client and the server apps, adds a 50% performance gain.

Conclusion

NeverBlock really shines when the back end is highly scalable. The only problem I met was a Ruby1.9 bug that crashed the client when the file descriptors exceeded 1024. I hope this could be fixed as it will enable us to extract more performance from each process. Expect the socket support to be officially added to NeverBlock soon.

Friday, November 07, 2008

My US Visa Status

So, RubyConf 2008 has started. I was supposed to be presenting NeverBlock to the audience there. I didn't make it but thank God my coworker and friend Yasser was able to go to Florida. I couldn't make it because I am still waiting for my visa clearance (the Americans changed presidents while I am still waiting!). The status kept saying "Under Processing" till a few days before the conference date. It changed to show the following:

Wednesday, September 03, 2008

Building the Never Blocking Rails, Making Rails 12X Faster

They told you it can't be done, they told you it has no scale. They told you lies!

What if you suddenly had the ability to serve multiple concurrent requests in a single Rails instance? What if you had the ability to multiplex IO operations from a single Rails instance?

No more what ifs. It has been done.

I was testing NeverBlock support for Rails. For testing I built a normal Rails application. Nothing abnormal here: you get the whole usual Rails deal, routes, controllers, ActiveRecord models and eRuby templates. I am using the Thin server for serving the application and PostgreSQL as a database server. The only difference is that I was not using the PostgreSQL adapter; rather, I was using the NeverBlock::PostgreSQL adapter.

All I needed to do was to name the adapter neverblock_postgresql instead of postgresql in database.yml and add require 'never_block/server/thin' to my production.rb.

All this was working with Ruby 1.9, so I had to comment out the body of the load_rubygems method in config/boot.rb which is not needed in Ruby1.9 anyway.

Now what difference does this thing make?

It allows you to process multiple requests concurrently from a single Rails instance. It does this by utilizing the async features of the PG client interface coupled with Fibers and the EventMachine to provide transparent async operations.

So, when a Rails action issues any ActiveRecord operation it will be suspended and another Rails action can kick in. The first one will be resumed once PostgreSQL has provided us with the data.
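The suspend and resume dance can be illustrated with plain Ruby fibers. This is only a sketch of the control flow: the "database" is simulated by resuming the fibers by hand, and run_actions and its log messages are made up for the example, not NeverBlock's API.

```ruby
# Simulates two "actions" that each issue a slow query.
def run_actions
  log = []
  waiting = [] # fibers suspended until their simulated query returns

  actions = %w[a b].map do |name|
    Fiber.new do
      log << "#{name}: query sent"
      result = Fiber.yield # suspend until the "database" answers
      log << "#{name}: got #{result}"
    end
  end

  # Start every action; each runs until it sends its query and yields.
  actions.each do |fiber|
    fiber.resume
    waiting << fiber
  end

  # Later, the event loop delivers the results and resumes each fiber.
  waiting.each_with_index { |fiber, i| fiber.resume("row #{i}") }
  log
end

puts run_actions
```

Both queries are "in flight" before either action continues, which is what allows a single process to overlap the waiting time of several requests.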

To make a quick test, I created a controller which would use an AR model to issue the following sql command "select sleep(1)". (sleep does not come by default with PostgreSQL, you have to implement it yourself). I ran the applications with the normal postgresql adapter and used apache bench to measure the performance of 10 concurrent requests.

Here are the results:
Server Software: thin
Server Hostname: localhost
Server Port: 3000

Document Path: /forums/sleep/
Document Length: 11 bytes

Concurrency Level: 10
Time taken for tests: 10.248252 seconds
Complete requests: 10
Failed requests: 0
Write errors: 0
Total transferred: 4680 bytes
HTML transferred: 110 bytes
Requests per second: 0.98 [#/sec] (mean)
Time per request: 10248.252 [ms] (mean)
Time per request: 1024.825 [ms] (mean, across all concurrent requests)
Transfer rate: 0.39 [Kbytes/sec] received


Almost 1 request per second. Which is what I expected. Now I switched to the new adapter, restarted thin and redid the test.

Here are the new results:

Server Software: thin
Server Hostname: localhost
Server Port: 3000

Document Path: /forums/sleep/
Document Length: 11 bytes

Concurrency Level: 10
Time taken for tests: 1.75797 seconds
Complete requests: 10
Failed requests: 0
Write errors: 0
Total transferred: 4680 bytes
HTML transferred: 110 bytes
Requests per second: 9.30 [#/sec] (mean)
Time per request: 1075.797 [ms] (mean)
Time per request: 107.580 [ms] (mean, across all concurrent requests)
Transfer rate: 3.72 [Kbytes/sec] received


Wow! a 9x speed improvement! The database requests were able to run concurrently and they all came back together.
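The improvement can be checked against the two reports. This sketch uses the mean time per request of the whole batch of 10 concurrent requests (10248.252 ms and 1075.797 ms) as the elapsed time; the constant and method names are made up for the example.

```ruby
BATCH_SIZE = 10 # requests issued concurrently in each test

# Mean time (ms) for the batch, taken from the two ApacheBench reports above.
BLOCKING_MS   = 10248.252
NEVERBLOCK_MS = 1075.797

# Throughput in requests per second for a batch that took +batch_ms+.
def requests_per_second(batch_ms)
  (BATCH_SIZE / (batch_ms / 1000.0)).round(2)
end

# How many times faster the NeverBlock adapter completed the batch.
def speedup
  (BLOCKING_MS / NEVERBLOCK_MS).round(2)
end

puts requests_per_second(BLOCKING_MS)   # 0.98
puts requests_per_second(NEVERBLOCK_MS) # 9.3
puts speedup                            # 9.53
```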

I decided to simulate various workloads and test the new implementation against the old one. I devised the workloads taking into account that the test machine had rather bad IO performance, so I decided to use queries that would not tax the IO but would still require PostgreSQL to take its time. The workloads were categorized as follows:

First a request would issue a "select 1" query, which is the fastest I can think of; then, for the different work loads:

1 - Very light work load, every 200 requests, one "select sleep(1)" would be issued
2 - Light work load, every 100 requests, one "select sleep(1)" would be issued
3 - Moderate work load, every 50 requests, one "select sleep(1)" would be issued
4 - Heavy work load, every 20 requests, one "select sleep(1)" would be issued
5 - Very heavy work load, every 10 requests, one "select sleep(1)" would be issued
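
One way to read the workload definitions is sketched below. It assumes requests are numbered from 1, that every request issues "select 1", and that the slow query is added on every Nth request; the WORKLOADS table and queries_for are hypothetical names, not the test application's code.

```ruby
# Workload name => how many requests pass between slow queries.
WORKLOADS = {
  "very light" => 200,
  "light"      => 100,
  "moderate"   => 50,
  "heavy"      => 20,
  "very heavy" => 10
}

# Queries issued by the given request under the given workload.
def queries_for(request_number, workload)
  queries = ["select 1"]
  every = WORKLOADS.fetch(workload)
  queries << "select sleep(1)" if (request_number % every).zero?
  queries
end

# Number of slow queries in a run of 1000 requests, per workload.
WORKLOADS.each_key do |name|
  slow = (1..1000).count { |n| queries_for(n, name).include?("select sleep(1)") }
  puts "#{name}: #{slow} slow queries"
end
```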


I tested those workloads against the following

1 - 1 Thin server, normal postgreSQL Adapter
2 - 2 Thin servers (behind nginx), normal postgreSQL Adapter
3 - 4 Thin servers (behind nginx), normal postgreSQL Adapter
4 - 1 Thin server, neverblock postgreSQL Adapter


I tested with 1000 queries and a concurrency of 200 (the multiple thin servers were having problems above that figure; the new adapter scaled up to 1000 with no problems, usually with similar or slightly better results).

Here are the graphed results:



For the NeverBlock Thin server I was using a pool of 12 connections. As you can see from the results, under the very heavy workload it would perform on par with a cluster of 12 Thin servers. Generally the NeverBlock Thin server easily outperforms the 4 Thin cluster. The margin increases as the work load gets heavier.

And here are the results for scaling the number of concurrent connections for a NeverBlock::Thin server



Traditionally we used to spawn as many thin servers as we can till we run out of memory. Now we don't need to do so, as a single process will maintain multiple connections and would be able to saturate a single cpu core, hence the perfect setup seems to be a single server instance for each processor core.

But to really saturate a CPU one has to do all the IO requests in a non-blocking manner, not just the database. This is exactly the next step after the DB implementation is stable, to enrich NeverBlock with a set of IO libraries that operate in a seemingly blocking way while they are doing all their IO in a totally transparent non-blocking manner, thanks to Fibers.

I am now wondering about the possibilities, the reduced memory footprint gains and what benefits such a solution can bring to the likes of dreamhost and all the Rails hosting companies.