
Ruby Networking on Steroids


Ruby provides several socket classes for various connection protocols. Those classes are arranged in a strange and convoluted hierarchy.
This ASCII diagram shows that hierarchy:

IO
|
BasicSocket
  |
  |-- IPSocket
  |     |
  |     |-- TCPSocket
  |     |     |
  |     |     |-- TCPServer
  |     |     |
  |     |     |-- SOCKSSocket
  |     |
  |     |-- UDPSocket
  |
  |-- Socket
  |
  |-- UNIXSocket
        |
        UNIXServer

The BasicSocket class provides some common methods but you cannot instantiate it; you have to use one of the subclasses. We have three branches coming out of BasicSocket. One implements the IP (and descendant) protocols, another implements the UNIX domain sockets protocol, and a third provides a generic wrapper over BSD sockets. The first problem with this branching strategy is that while the Socket class could serve as a parent class for both UNIXSocket and IPSocket, the implementer chose to create a separate path for each of them. The result is a lot of code duplication in the implementation, which makes maintaining those classes much harder than it should be.

A prime example of this is the recent addition of non-blocking features to the I/O and socket classes. Only the Socket class was lucky enough to get an accept_nonblock method; the other classes sadly didn't. It is very important to be able to initiate network connections in a non-blocking manner if you are using an evented framework (like NeverBlock, for example).
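For illustration, here is roughly what initiating a non-blocking connection looks like using the plain Socket class, the one class that exposes the needed primitive. This is just a sketch, not NeverBlock code:

require 'socket'

sock = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)
sockaddr = Socket.pack_sockaddr_in(8080, '127.0.0.1')
begin
  # returns (or raises EINPROGRESS) immediately instead of blocking
  sock.connect_nonblock(sockaddr)
rescue Errno::EINPROGRESS
  # the connection attempt is in flight; hand the socket to the
  # event loop and continue once it becomes writable
end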

What makes the problem worse is that major Ruby network libraries overlook the Socket class and use TCPSocket or UNIXSocket instead. Net/HTTP, for example, uses TCPSocket. Since NeverBlock tries to work in harmony with most Ruby libraries, it makes up for this inconsistency by altering the default hierarchy of socket classes. Ruby allows you to un-define constants in an object, so we remove the TCPSocket and UNIXSocket classes and redefine them by inheriting from Socket, defining some methods to make up for any lost functionality.
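A minimal sketch of that constant swapping trick might look like the following; the real NeverBlock code defines more methods to restore the lost functionality:

require 'socket'

# remove the original top-level class and redefine it on top of Socket
Object.send(:remove_const, :TCPSocket)

class TCPSocket < Socket
  def initialize(host, port)
    super(Socket::AF_INET, Socket::SOCK_STREAM, 0)
    connect(Socket.pack_sockaddr_in(port, host))
  end
end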

After modifying the socket classes, NeverBlock support was integrated. This was done by rewriting the connect, read and write methods so that they detect the presence of a NeverBlock fiber and operate in an asynchronous way accordingly. If you use the new socket classes in a non-NeverBlock context, or in NeverBlock's blocking mode, they resort to the old blocking implementation.
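A rough sketch of how such a method might branch; NB.neverblocking? and NB.wait are made-up names standing in for whatever NeverBlock actually uses internally:

class TCPSocket < Socket
  def read(length)
    if NB.neverblocking?        # hypothetical check: inside a NeverBlock fiber?
      begin
        read_nonblock(length)   # attempt an immediate non-blocking read
      rescue Errno::EAGAIN
        NB.wait(:read, self)    # hypothetical: pause this fiber till readable
        retry
      end
    else
      super                     # plain blocking read outside NeverBlock
    end
  end
end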

So here is an example. First we will create a server using EventMachine that takes 1 second to process each request.

server.rb

require 'eventmachine'

class Server < EM::Connection
  # handle requests here
  def receive_data data
    # set the response to be sent after 1 second
    EM.add_timer(1) do
      send_data "HTTP/1.1 200 OK\r\n\r\ndone"
      close_connection_after_writing
    end
  end
end

EM.run do
  EM.start_server('0.0.0.0', 8080, Server)
end


Second, we will create a client that issues requests to the server.

client.rb

require 'neverblock'
require 'net/http'
require 'uri'

EM.run do
  @pool = NB::FiberPool.new(20)
  20.times do
    @pool.spawn do
      url = URI.parse("http://localhost:8080")
      res = Net::HTTP.start(url.host, url.port) { |http| http.get('/') }
    end
  end
end

Issuing the 20 GET requests in NeverBlock fibers causes them to run concurrently. Even though our server takes one full second to process each request, all of them return after approximately 1 second.

Here is a blocking version

blocking_client.rb

require 'net/http'
require 'uri'

20.times do
  url = URI.parse("http://localhost:8080")
  res = Net::HTTP.start(url.host, url.port) { |http| http.get('/') }
end


The blocking client finishes after around 20 seconds.

Here's a teaser graph



The really good thing is that we used the Net/HTTP library transparently. Any Ruby library that relies on Ruby sockets will benefit from NeverBlock and gain the ability to run in a concurrent manner.

What does that mean?

Originally, NeverBlock only supported concurrent database access for PostgreSQL and MySQL. While this was good and all, the database is usually the bottleneck of most applications, unless you have something like a database cluster that can truly absorb any load. This was a shame, since NeverBlock is meant for the high levels of concurrency that are only possible with massively scalable back ends. With this new development, however, we are one step closer to tapping into this realm of high performance, scalable web applications. Read on.

Enter AWS and the cloud

Amazon Web Services provide an example of a massively scalable back end that is accessible via HTTP. Services like S3, SimpleDB and SQS are all a URL away. Such services have a higher latency than your nearby database server, but they more than make up for that by being able to absorb all the requests you throw at them. Most of the Ruby libraries for accessing AWS rely on Net/HTTP in some way or another, which means we get NeverBlock support for those libraries for free. This is big news for Ruby applications (including Rails ones) that rely on AWS or a similar back end. For those types of apps, forget about a 10 or 20 fiber pool; we are talking about a 1000 fiber pool here. Even higher numbers could be possible (once a nasty file descriptor bug in Ruby 1.9 is fixed).

Why Not Threads?

I have been claiming that Ruby fibers are faster than Ruby threads[1]. I have seen that in my tests, but those were usually limited to a single performance metric. So I decided to simulate a very scalable back end and see which approach offers more scalability. For testing purposes I created two client applications: one is threaded and the other is based on NeverBlock. In the NeverBlock version I did not use the fiber pool, though; I created a new fiber per operation to mimic the threaded app's behavior. The simulated scalable back end consists of an EventMachine based server that waits for a certain time before responding with 200 OK. The delay simulates back end processing and network latencies. I tested using 0, 10, 50, 100 and 500 ms as delay values. A third client application was written that works in the normal blocking mode, for comparison.

The clients were tested using Ruby 1.8.6 and 1.9.1. The only exception was the NeverBlock client, which was only tested with 1.9.1; the current fiber implementation for Ruby 1.8.x is based on threads, so it would only reflect a threaded implementation's performance. Ruby 1.8 was included because I noticed scalability and performance problems with the Ruby 1.9 threading implementation, and Ruby 1.8 proved to have a (sometimes) faster and more scalable threading implementation.

Each client application attempts to issue 1000 requests to the back end server, and tries to do so in a concurrent fashion (except for the blocking version, of course).
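The threaded client, for instance, would look something like this; a sketch of the test shape, not the exact code used:

require 'net/http'
require 'uri'

url = URI.parse('http://localhost:8080')
threads = []
1000.times do
  threads << Thread.new do
    Net::HTTP.start(url.host, url.port) { |http| http.get('/') }
  end
end
threads.each(&:join) # wait for all requests to finish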

Here are the results



And the results in ASCII format (numbers in cells are requests/sec)

Server Delay         0ms    10ms   50ms   100ms  500ms

Ruby1.8 Blocking     2000   19     16     10     2
Ruby1.9 Blocking     2400   19     17     10     2
Ruby1.8 Threaded     1050   800    670    536    415
Ruby1.9 Threaded     618    470    451    441    395
Ruby1.9 NeverBlock   2360   1997   1837   1656   1031

Let's try to explain the results. For a server that has no delay whatsoever (a utopian assumption), we see that the blocking clients offer the greatest performance. Ruby 1.9 in blocking mode comes first, mainly because Ruby 1.9 is faster than Ruby 1.8 and comes with a faster Net/HTTP library[1]. Why is blocking faster? Simply because the evented server is processing the requests serially and the latency is minimal. The request handler sends a response and returns immediately, so the server never gets a chance to process requests concurrently. This is the fastest that you can drive your processor.

The NeverBlock implementation comes in as a very close second to the fastest client, which shows that the overhead of using fibers is not that much. Actually, we are cheating a bit here: we make up for the overhead by sending the requests concurrently, and while the server is still processing them serially we are able to handle the fiber pauses and resumes while the server is working.

Needless to say, NeverBlock is way ahead of the threaded clients (whether 1.8 or 1.9) when working with the zero latency server. We also see that 1.8 threads are considerably faster than 1.9's.

When we start adding a simulated delay to the server, the blocking clients fall dramatically from first position to last. They become so slow that they are really not suitable for use in that setting any more. Please note that the results for the 500ms delay are extrapolations; I was too annoyed by the idea of waiting 500 seconds for a test to run, twice!

On the other hand, the threaded and NeverBlock implementations are affected much less, even though they lose ground as the delay increases. NeverBlock maintains its lead over the threaded clients, though; it is generally 2.5X faster.

Here is a graph of the NeverBlock advantage over the fastest threaded client



And in ASCII format

Server Delay           0ms      10ms     50ms     100ms    500ms

NeverBlock Advantage   124.76%  149.63%  174.18%  208.96%  148.43%

Aside from the NeverBlock advantage, the numbers themselves are very impressive. A single process can achieve ~1000 operations per second given half a second of processing and network latency. In a multi-process setup we should be able to achieve a lot more. For example, forking another NeverBlock client on my dual core notebook, which hosts both the client and the server apps, adds a 50% performance gain.

Conclusion

NeverBlock really shines when the back end is highly scalable. The only problem I met was a Ruby 1.9 bug that crashed the client when the file descriptors exceeded 1024. I hope this gets fixed, as it would enable us to extract more performance from each process. Expect socket support to be officially added to NeverBlock soon.

Ruby Fibers Vs Ruby Threads


Ruby 1.9 fibers are touted as lightweight concurrency primitives that are much lighter than threads. I noticed a sizable impact when I was benchmarking an application that made heavy use of fibers, so I wondered: what if I switched to threads instead? After some time fighting with threads I decided I needed to write something specific for this comparison. I wrote a small application that spawns a number of fibers (or threads) and then reports the time this operation took. I also recorded the VM size after the operation (all created fibers and threads are still reachable, hence no garbage collection). I did not measure the cost of context switching for both approaches; maybe another time.
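The benchmark was essentially of this shape; a reconstruction under the description above, not the original script:

require 'benchmark'

n = (ARGV[0] || 10_000).to_i

# keep references around so nothing gets garbage collected
fibers, threads = [], []

fiber_time = Benchmark.realtime do
  n.times { fibers << Fiber.new { Fiber.yield } } # bodies never run; we only pay creation cost
end
puts "#{n} fibers created in #{fiber_time} seconds"

thread_time = Benchmark.realtime do
  n.times { threads << Thread.new { sleep } }     # threads start immediately and sleep
end
puts "#{n} threads created in #{thread_time} seconds"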

Here are the results for creation time:



And the results for memory usage:



Conclusion

Fibers are much faster to create than threads, and they eat much less memory too. There is also a limit on the number of threads in 1.9: I maxed out at 3070 threads, while fibers were not complaining when I created 100,000 of them (though they took 203 seconds and occupied a whopping 500MB of RAM).

Faster IO for Ruby with Postgres


Or 40% faster DB Access for your Ruby applications!

In a previous post I talked a bit about event based programming for Ruby. I mentioned the EventMachine/Asymy combo as a means of doing asynchronous database operations, hence freeing up the Ruby runtime to do other things while it waits on database I/O. Even better, developers need not worry about using a different programming model: with the help of Ruby fibers we can continue to program in the same old way while fibers do all the twisted work underneath. Very promising indeed, but one big elephant in the room was the immaturity of the current solution. Asymy is still in its infancy; it is based on the super slow pure Ruby MySQL driver, not to mention that it is fairly incomplete as well.


So, what can we do about the elephant in the room? There is an Arabic proverb that basically says "Nothing can beat iron but iron" and this is exactly what we are going to do. Enter Postgres, the database with a realistic, unfriendly elephant mascot. Go away dolphins, a real elephant is in the room now.

Surprisingly, Postgres happens to have an excellent asynchronous client API. It allows you to do almost all operations in a non blocking way. More surprisingly, the Postgres driver for Ruby covers almost all those asynchronous API calls. The driver was originally written by Matz (yes, the man himself) in 1997. It was later updated by ematsu in 1999, and now we have an update fresh from the oven, from March 2008, by Jdavis. If you go through the C source code you will find many hidden gems. The methods are fairly well documented, and you will discover that the driver has a blocking method that wraps the asynchronous calls internally, but does so in a Ruby-threading-friendly way, so a threaded application will not block on Postgres SQL commands. A good thing, but I am more interested in the asynchronous side of the fence.

Let's walk through the API and see how we can use it to do non blocking database access. First you will need to install the gem ("sudo gem install pg"). Then you need to require 'pg' in your code.

One problem, though, before we start: the current driver has a nasty little bug that prevents you from setting the connection to nonblocking mode. It is actually a bug in the parameter count defined in the Ruby interface; a simple switch from 0 to 1 fixes it. To save you time and sweat I have provided a replacement gem with the modified sources (till the bug is fixed upstream). Now let's get back to the code.

Here's how to get things started
require 'pg'
# I have configured postgres to run in *trusted* mode
# so I don't need to supply a password
conn = PGconn.new({:host=>'localhost',:user=>'postgres',:dbname=>'evented'})
conn.setnonblocking(true)
This way our connection is ready for async operations. Now we need to start sending some SQL commands over it. Normally we would use the PGconn#exec method, but that method blocks while waiting on Postgres. So instead we will use PGconn#send_query, which returns immediately, without waiting for Postgres to actually process the SQL command. Here's how we are going to use it:
conn.send_query("select * from users where name like '%am%'")
# the method will return immediately (or raise an exception in case of an error)
But wait, where are the results? Normally we expect the call to return with the data. So where is my data? The results are being processed right now on the server side; we can continue to do other things till they come. But how do we know when they arrive? It turns out that this is easy as well. The PGconn instance provides a method that returns the connection's socket descriptor, PGconn#socket. We retrieve that socket descriptor and wrap it in a Ruby IO object by calling:
io = IO.new(conn.socket)
Now we have a nice IO object whose activity we can be notified of via a select call. For the uninitiated, event based programming is done by having a tight loop that runs forever. Within this loop we check whether IO events have happened, and if so we respond to them. One efficient way of doing this is the Ruby Kernel#select method (a wrapper for the UNIX select). The select method works like this: you provide it with three lists, one of sockets you need to read from, one of sockets you need to write to, and a third of sockets whose errors you are interested in. The call returns an array of the sockets that are ready for reading/writing, or nil if none is ready.

We will use select as follows:
# the method that will be called if input is ready
def process_command(conn)
  # we will detail the implementation soon
end

loop do
  # we supply a list of sockets we need to read from
  # (only our io object in this case), we nullify
  # the other lists and we set a timeout
  res = select([io], nil, nil, 0.001)
  # of course this needs to be done in a cleaner way
  process_command(conn) unless res.nil?
end
This way, whenever there is data to read from the socket we will not get nil (we will get an array, actually), so we can call process_command. When process_command gets called it knows that there is data in the connection to be read, so it calls the PGconn#consume_input method, after which it checks whether the connection is busy or not. If it is still busy, it does nothing (it will act on a later event). On the other hand, if the connection is not busy, we start calling the PGconn#get_result method and appending what we get to the results collected so far. We keep doing that till we get a nil result, which indicates the end of the command and the readiness of the connection to accept further commands. Here is what the method looks like:
def process_command(conn)
  conn.consume_input
  unless conn.is_busy
    res, data = 0, []
    while res != nil
      res = conn.get_result
      res.each { |d| data.push d } unless res.nil?
    end
    # we are done, we need to put this data somewhere
  end
end
Several things should be noted. First, one cannot process several commands on the same connection at once; you need several connections to achieve parallel command processing. Second, the model described above works in the twisted way; to get things working the normal way you can use Ruby fibers (or continuations, but those apparently leak memory).

I have put together a couple of Ruby classes that implement a nonblocking connection pool and a fiber pool; you can find them here. Using those you can write code that looks like this:
require 'fiber_pool'
require 'fibered_connection_pool'

options = {:host=>'localhost', :user=>'postgres', :dbname=>'evented'}

cpool = FiberedConnectionPool.new(options, 12)
# second param is the number of connections to spawn, defaults to 8
# note that one more connection than those will be spawned. This one
# will be used for processing blocking requests.

fpool = FiberPool.new(100)
# the number of fibers to spawn, defaults to 50

100.times do
  fpool.spawn do
    cpool.exec(some_sql_command, true)       # true means async
    cpool.exec(some_other_sql_command, true)
    cpool.exec(yet_another_sql_command, true)
  end
end

# our event loop
loop do
  res = select(cpool.sockets, nil, nil, 0) # check for something to read
  # IO is monkey patched to be able to hold a reference to the connection
  res.first.each { |s| s.connection.process_command } if res
end
This works as follows: once a fiber calls cpool.exec, the query is sent to the pool for processing and the fiber is paused, giving way for another one to start processing. The other one will pause as well once it hits a cpool.exec. Later, during the event loop, you get notifications of query completions (in any order) and the fiber associated with each finished query is resumed. Note that commands issued in the same fiber will run sequentially, while those issued from different fibers will interleave. This is effectively what is achieved by threading, but without its costs.
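A hedged sketch of what cpool.exec might be doing under the hood; the method and accessor names here are guesses, the real code lives in the linked classes:

class FiberedConnectionPool
  def exec(sql, async = false)
    return blocking_exec(sql) unless async  # assumed fallback path
    conn = acquire_connection               # assumed: may pause the fiber if the pool is empty
    conn.fiber = Fiber.current              # remember which fiber to wake up
    conn.send_query(sql)                    # returns immediately
    Fiber.yield                             # pause; process_command resumes us with the result
  end
end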

Performance:

I am sure my code could use some tweaking, but I am getting very good results already. During benchmarking I found that the cost of instantiating fibers can be high (the cost of pausing and resuming is high as well, but unavoidable), so I created a pool of fibers that can be reused (a very naive implementation that could use lots of improvement).

I tested by issuing a group of long and short queries together. You provide the test program with the number of long queries and the multiplier it should use for short queries; i.e. ruby test.rb 10 20 will iterate 10 times, each iteration issuing a long query and then 20 short queries. It does this in a blocking and then a nonblocking way, reporting the time taken for each to complete and the percentage of performance increase/decrease.



I tested for 10, 50 and 100 long queries with the following multipliers (1, 2, 5, 10, 50, 100). The graph shows the performance gain for each number of queries vs the multiplier. For example, 50 long queries with a multiplier of 10 (i.e. 500 short queries) achieve a 39.6% reduction in query execution time. I have repeated many of the tests several times (not all of them, too lazy to do that). The repeated tests showed consistent results, so I am pretty confident in the presented numbers.


Here is the full list:

Ratio    Long   Short   Blocking (s)   Non-blocking (s)   Advantage

1:2      10     20      0.56           0.50               10.27%
         50     100     2.55           2.26               11.19%
         100    200     5.15           4.46               13.53%

1:5      10     50      0.55           0.40               27.04%
         50     250     2.72           1.83               32.82%
         100    500     5.45           3.63               33.39%

1:10     10     100     0.60           0.40               33.76%
         50     500     3.01           1.82               39.67%
         100    1000    5.90           3.65               38.13%

1:20     10     200     0.72           0.45               38.12%
         50     1000    3.43           2.10               38.73%
         100    2000    6.83           4.33               36.53%

1:50     10     500     0.98           0.62               36.57%
         50     2500    4.78           3.23               32.36%
         100    5000    9.74           8.68               10.93%

1:100    10     1000    1.46           0.94               35.40%
         50     5000    7.42           5.17               30.31%
         100    10000   14.27          12.68              11.15%

The area I would like to focus on for performance tuning is the size of the fiber pool. The test is a bit sensitive to it, so I believe I can gain a bit more performance at insane query counts if I optimize my fiber pool a bit. Setting the initial size very high certainly helps, but eats too much memory to be usable.

A final note. I am playing with using this alongside an EventMachine based HTTP server. It works OK but is a CPU hog, probably due to using select in next_tick calls within EM's event loop. I would love to be able to provide EM with a list of IO objects and a callback instead of being required to use it to open the connection. Nevertheless, even though in many cases the nonblocking DB implementation is slower than a blocking one in the HTTP serving arena, I managed to get ~800 req/s vs ~500 req/s for a very typical use case: a request that runs a long query followed by many short ones. Impressive, to say the least. I might even try to hack EM to support the feature I need and then see what performance that could yield.

UPDATE

Apparently one can get more performance for the blocking requests if the fiber pool is initiated AFTER the blocking calls, possibly due to the VM being impacted by the memory increase. Rerunning some of the tests showed fractional improvements for the blocking case. On the other hand, I tried some of the tests while another process was doing heavy I/O (RDoc generation), and the performance gain jumped to an amazing 76% in one of the tests (it was generally between 51% and 76%).

Untwisting the Event Loop


Have you ever wondered why your Rails application is so memory hungry while it is not really trying to fully utilize your CPUs? To saturate your CPUs you have to run a large number of Thin (or Mongrel or whatever) instances. Why is that? We all know that the Ruby interpreter is not able to utilize more than one CPU (or not more than one CPU at a time, in the case of 1.9). But why can't Ruby (or maybe it's Rails?) utilize the processors efficiently? Let's look for an answer to this question.

First off, what happens in a typical Rails action? The Rails framework does some request mapping and routing, which is mostly CPU work (if we consider memory latency negligible). Then a few requests are sent to the database to retrieve some data, after which comes a rendering process that is mostly CPU work as well.
def show
  @user = User.find(params[:id]) # db access
  @events = Events.find(:all)    # another db access
  render :action => :show       # rendering
end

The problem here comes with the database part of the action. Calls to the database block processing till results come back from the DBMS. During that time Rails is frozen, not trying to do anything else till the call ends. The good news is that threads can help here (even Ruby's green threads): a blocked thread gives way to other threads till it is back in the ready state, thus filling those slots with some useful processing. Sounds good enough? NO!

Sadly, Rails is NOT thread safe. You cannot use threads to do parallel processing in Rails. So why not something like Merb? I hear you say. Well, Merb and threads will be able to interleave CPU operations and help with time spent on IO, such as fetching data from some other service. But they won't save you when you do database IO, for the simple reason that calling C extensions blocks the whole Ruby interpreter. Yes, you read that correctly the first time: nothing can be scheduled while a native call is in progress. Since database drivers are mostly C extensions, they suffer from this. Your nice SELECT statement keeps the whole Ruby interpreter on hold till it is finished.

But there must be a solution to this. We cannot all be left high and dry with interpreters eating our memory while not really using our CPUs.

Enter EventMachine and AsyMy

For those who are not in the loop of events (pun intended), there happens to be another approach to this problem: event based (read: asynchronous) IO. In this mode of operation you request an IO operation and tell the event loop what to do when the request is fulfilled (either fully or partially). An excellent library for event handling exists for Ruby: Francis' EventMachine (used internally by the Thin server and the evented flavour of Mongrel). But still, using EventMachine does not magically solve all our problems. The question keeps popping up: what to do about database access? AsyMy to the rescue! AsyMy, written by Thomas Ptacek, is an evented driver for MySQL that operates in an asynchronous fashion. A quick example looks like:
connection.execute('SELECT * from events') do |headers, data|
  # do something with headers and data
  pp headers
  pp data
end
Asymy is still at a very early stage: the performance is horrible (it is based on the darn slow pure Ruby MySQL driver) and it comes with many rough corners (I was not able to run INSERTs and UPDATEs without hacking it, and I am still not able to run the callbacks for those). Nevertheless, this is a formidable achievement on the road to a very fast single threaded implementation.

Here's how our action would look if there were an Asymy adapter for ActiveRecord:
# this is probably wrong but it can illustrate
# the twisted nature of evented programming
def show
  User.find(params[:id]) do |result_set|
    @user = result_set
    Events.find(:all) do |result_set|
      @events = result_set
      @events.each do |event|
        event.owner = @user
        if event != @events.last
          event.save
        else
          event.save do |ev|
            render :action => :show
          end
        end
      end
    end
  end
end
We had to twist the function flow to make use of the evented nature of the new driver. Instead of flowing normally, execution is scattered among the different callbacks. This is one of the areas where event based programming makes you change the way you think about program flow, a hurdle for many developers and a show stopper for some. No wonder the event library for Python is called Twisted.

Why not untangle this with Fibers?

Fibers are lightweight concurrency primitives introduced in Ruby 1.9. How lightweight? Well, they don't come at zero cost, but in long running requests the weight they add can be negligible. Fibers provide a form of cooperative (rather than preemptive) concurrency inside a single thread (you cannot pass fibers between threads, you have been warned). Fibers enjoy the ability to pause and resume like continuations, but they don't suffer from the memory leaks continuations have. When we use this feature wisely we can unwind the action code above to look like this:
def show
  @user = User.find(params[:id])
  @events = Events.find(:all).each do |event|
    event.owner = @user
    event.save
  end
end
Huh? This is the normal action code we are used to. Well, using fibers we can write this and still have things done under the hood in an evented way.

To make things clear we need to illustrate Fibers with an example:
require 'fiber'

fiber = Fiber.new do
  # do something
  Fiber.yield another_thing
  # do yet another thing
end

yielded = fiber.resume # => runs the fiber till the yield,
                       #    returns the yielded value
                       #    and pauses the fiber where it is
fiber.resume # => re-runs the fiber from the point where it was paused
fiber.resume # => no more statements to run, raises an exception
Let's see how this can be useful for dispatching controller actions (this code will preferably live in the server itself):
Fiber.new do
  Dispatcher.dispatch(controller, action, req, res)
  send_response res
end.resume
Inside the action we call the find method repeatedly. This method could be implemented like this:
class DataStore
  def find(*args)
    query = construct_query(*args)
    fiber = Fiber.current # grab the current fiber
    conn.execute(query) do |headers, data|
      fiber.resume convert_to_objects(data)
    end
    Fiber.yield # pause till the callback resumes us with the result
  end
end
This way, whenever the code hits a find method it passes the query to the db driver and returns immediately, pausing the fiber and giving room for other requests to be processed. Once the data comes back from the db server, the callback runs and resumes the fiber (passing it the result of the query). The result gets passed back to the caller of the function, and the original action method continues till completion (or till it is paused again by another find).

Roger Pack has a nice writeup (with actual working code) on the Evented Fibered combo here.

Charles Jolley implemented a similar thing here. It is called Pipelined and while it is more obtrusive than the approach described above, it still has the advantage of being optional. Pipelined uses continuations and hence is available to Ruby 1.8 (and Rails).

I am still ironing out and tying things together (and doing lots of benchmarks), and I should tell you that I have ditched AsyMy for now for another alternative, which I will attempt to discuss in detail in another blog post.