How fast can the asyncio module and PostgreSQL work together?
Tasks where I/O matters
It's [web services] becoming just a proxy in front of database systems, or specialized web services, that sends back JSON responses to the web browser, and let it handle all the templating and all the display work.
A new development era (2013) by Tarek Ziadé
That blog post by Tarek Ziadé was quite relevant back then, three years ago, and it's even more relevant now. Nowadays there are a lot of services of different types that do almost nothing except receive a request, send a new request to a third-party service such as a DB, and return some part of the received data to the user. These are simple chats and message-exchange services, data collection services, recommendation engines, etc... These are the places where I/O actually matters, and that is the spot where Python's asyncio module fits perfectly.
I got very inspired after reading the article uvloop: Blazing fast Python networking, so I decided to try to build an asynchronous prototype of a service that would use all the new and cool async approaches in Python to get as many requests per second as possible. The important thing to understand is that asyncio does not actually make Python itself faster or more productive; that is what projects like PyPy and Pyston are for. What the async approach can increase is the throughput of applications that don't do a lot of computation but do a lot of I/O; basically, it is very efficient for the proxy-like parts of an application.
As a suitable problem I decided to build a very primitive Ad Serving Service that works directly with the DB in a simple proxy-like manner: it logs the received request data to the DB, fetches some data from the DB based on the request parameters, and serves that data to the user. That is all; the service doesn't include any additional control logic or computation. I'd like to describe some of that experience in this post.
The technology stack used:
httptools, because of its speed: in a few words, it's 10 times faster than aiohttp and the built-in asyncio loop, and it completely suits my task. For more details on this please check the amazing performance benchmark report from the MagicStack guys.
Why PostgreSQL? Well, release 9.5 brought substantial performance improvements for multi-CPU machines and introduced the new GROUP BY analysis features GROUPING SETS, CUBE and ROLLUP, which makes it very suitable for warehousing and OLAP, and in the context of building an AdServer those are exactly the features that are required. I also highly recommend checking out this presentation by Josh Berkus about using and setting up PostgreSQL as a DW DB.
Working with PostgreSQL in an async way
The psycopg documentation on asynchronous support is worth quoting here:

"When an asynchronous query is being executed, connection.isexecuting() returns True. Two cursors can't execute concurrent queries on the same asynchronous connection.

There are several limitations in using asynchronous connections: the connection is always in autocommit mode and it is not possible to change it. So a transaction is not implicitly started at the first query and it is not possible to use methods commit() and rollback(): you can manually control transactions using execute() to send database commands such as BEGIN, COMMIT and ROLLBACK. Similarly set_session() can't be used but it is still possible to invoke the SET command with the proper default_transaction_... parameter.

COPY commands are not supported either in asynchronous mode, but this will be probably implemented in a future release."
There is, to be sure, a nice library, aiopg, that makes things easier, but since it's based on the psycopg adapter it shares the same limitations for asynchronous connections. This was actually a bit disappointing to me, especially the cursors that can't execute concurrent queries... At some point it almost looks like it demands a multi-threading/multiprocessing approach to get maximum throughput. I hope they will add support for concurrent queries on asynchronous connections. It also seems to put some limitations on setting the transaction type to read-only, and the same probably goes for setting the isolation level.
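Since one asynchronous connection can only run one query at a time, the practical workaround is several connections, each with its own cursor. A toy model of that idea, with no real DB involved (FakeConnection and its lock are my stand-ins for psycopg's one-query-per-connection rule):

```python
import asyncio

class FakeConnection:
    """Models psycopg's rule: one query at a time per connection."""
    def __init__(self):
        self._busy = asyncio.Lock()

    async def execute(self, query):
        async with self._busy:          # a second query on the same
            await asyncio.sleep(0.01)   # connection would wait here
            return 'result of ' + query

async def main():
    pool = [FakeConnection() for _ in range(4)]
    queries = ['q%d' % i for i in range(8)]
    # Round-robin the queries over the pool: queries on different
    # connections genuinely overlap, queries on the same one queue up.
    coros = [pool[i % len(pool)].execute(q) for i, q in enumerate(queries)]
    return await asyncio.gather(*coros)

results = asyncio.run(main())
print(len(results))  # 8
```

With 4 fake connections the 8 queries complete in roughly two "rounds" instead of eight; this is essentially what a connection pool buys you under psycopg's constraint.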
For the test DB I created two simple tables, one with some hacks to increase write performance for the request log:
```sql
create unlogged table tbl (ID SERIAL PRIMARY KEY, data Json);
ALTER TABLE tbl SET (autovacuum_enabled = false, toast.autovacuum_enabled = false);
```
and another one to retrieve data from and serve to the user, with a good number of rows to represent the variety of possible ad variants to show:
```sql
create table t_random as select s, uuid_generate_v4() from generate_Series(1,50000000) s;
CREATE UNIQUE index on t_random (uuid_generate_v4);
```
My original intention was to use one connection with a single cursor for writing the request log to the DB and one connection with multiple cursors for reading data, but in the end I decided to use a single cursor per connection with a queue of tasks for each worker. So both types of workers, for writing to and reading from the DB, look pretty much the same:
```python
async def worker(loop, queue, conn_param, handler=None, dump_size=1):
    conn = await aiopg.connect(**conn_param, loop=loop)
    async with conn.cursor() as cur:
        values = []
        while True:
            values.append(await queue.get())
            if len(values) > dump_size:
                await cur.execute('\n'.join(values))
                values[:] = []
                if handler is not None:
                    handler(await cur.fetchall())
```
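Stripped of the DB specifics, the batching idea can be shown as a self-contained sketch. Here FakeCursor stands in for the aiopg cursor, and the None sentinel is something I added so the loop can terminate (the real worker loops forever):

```python
import asyncio

class FakeCursor:
    """Stand-in for an aiopg cursor: just records what would be executed."""
    def __init__(self):
        self.batches = []

    async def execute(self, sql):
        self.batches.append(sql)

async def worker(queue, cur, dump_size=3):
    values = []
    while True:
        stmt = await queue.get()
        if stmt is None:            # sentinel: stop the worker
            break
        values.append(stmt)
        if len(values) >= dump_size:
            # one round-trip per batch instead of one per statement
            await cur.execute('\n'.join(values))
            values.clear()

async def main():
    queue = asyncio.Queue()
    cur = FakeCursor()
    task = asyncio.create_task(worker(queue, cur))
    for i in range(6):
        await queue.put("insert into tbl (data) values ('%d');" % i)
    await queue.put(None)
    await task
    return cur.batches

batches = asyncio.run(main())
print(len(batches))  # 2 batches of 3 statements each
```

The payoff is that producers just `put()` into the queue and never block on the DB, while the worker amortizes the round-trip cost over the whole batch.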
The writing worker accumulated 10000 log entries before writing them to the DB, while the reading worker queried the DB for each and every request and then responded to the client with that data. The code in the example is a bit different from what I actually wrote, but it represents what is going on quite accurately. I also needed to create my own HTTP protocol handler to have a custom and fast HTTP header parser:
```python
class HttpProtocol(asyncio.Protocol):
    __slots__ = ('_loop', '_transport', '_request', '_parser',
                 '_url', '_headers', '_read_queue', '_write_queue')

    def __init__(self, loop, read_queue, write_queue):
        self._loop = loop
        self._read_queue = read_queue
        self._write_queue = write_queue
        self._reset()

    def _reset(self):
        self._request = None
        self._parser = None
        self._headers = []
        self._url = None

    def on_url(self, url):
        self._url = url

    def on_header(self, name, value):
        self._headers.append((name, value))

    def on_headers_complete(self):
        self._request = (self._url, self._headers,
                         self._parser.get_http_version())
        self._loop.create_task(self._write_queue.put( *params* ))
        self._loop.create_task(self._read_queue.put( *params* ))
        self._reset()

    def connection_made(self, transport):
        self._transport = transport

    def connection_lost(self, exc):
        self._transport = None

    def data_received(self, data):
        self._parser = httptools.HttpRequestParser(self)
        self._parser.feed_data(data)
```
The HttpProtocol class is pretty much the same as the implementation from the MagicStack guys. The really different and most important part here is creating tasks in the on_headers_complete callback that put some request-related params into the queues. I guess that is it for the code; if you're interested you can find the full source code in my repo, but it is very buggy code with lots of possible issues, and I do not recommend using it for any purpose other than fun...
All tests were performed with the wrk tool on an Amazon EC2 t2.micro instance, using RDS PostgreSQL 9.5 (db.t2.small).
First test shows the speed of logging data to PostgreSQL DB:
```
> wrk -t17 -c500 -d30s http://127.0.0.1:8888/
Running 30s test @ http://127.0.0.1:8888/
  17 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    21.33ms   38.17ms 440.43ms   89.86%
    Req/Sec   696.26    484.09     5.43k    66.68%
  328070 requests in 30.08s, 1.55GB read
Requests/sec:  10905.09
Transfer/sec:     52.70MB
```
Second test shows the speed of both logging and retrieving data at same time:
```
> wrk -t17 -c500 -d30s http://127.0.0.1:8888/
Running 30s test @ http://127.0.0.1:8888/
  17 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   148.90ms   64.22ms 462.27ms   89.39%
    Req/Sec   215.01     84.39   363.00     80.28%
  101320 requests in 30.10s, 9.37MB read
Requests/sec:   3365.84
Transfer/sec:    318.83KB
```
The new asyncio module has its sweet spot in Python development, and it's getting more and more attractive with the appearance of new libraries and tools around it, especially for developing simple web-based services where you need to handle a lot of requests in a single thread without much computation. I will definitely use asyncio for simple tasks of data aggregation/retrieval and other cases of simple proxy-like behaviour, but I wouldn't use it in more complex applications where I/O doesn't matter that much or more control is required. Finally, I highly recommend the article Asynchronous Python and Databases by Mike Bayer, the creator of SQLAlchemy.