Wednesday, February 06, 2008

Using pgpool-II for persistent connections

If you have high-traffic php applications connecting to a PostgreSQL backend, and you're not using pgpool for persistent connections, you're missing out on a great way to slash the load on your db server.

You may not know it, but postgresql persistent connections in PHP don't work very well -- in fact, they are widely acknowledged to be quite broken. If you want to connect to a Postgres database from your PHP application, you are forced to establish and tear down a connection to the DB server on each load.

Enter pgpool, which is a way to establish persistent connections to the database on a layer external to the PHP (or any other) framework. Pgpool is essentially a daemon that listens on a separate port on your client, proxies all db traffic to the actual database server, and keeps connections open after your application has disconnected. In fact, it does a lot more than that, but if you're interested in a complete set of features, you can just head over to the pgpool readme.

Here's a very telling example of the effect pgpool had for us. We have a cluster of three web nodes that connect to a single PostgreSQL server for all its content. At peak times, we serve about a 100-150 requests per second across all nodes. Even though we heavily rely on Memcache in order to avoid making too many trips to the database, it's still pretty likely that most of these hits will result in communication with the DB server. Today, after running pgpool for several weeks in testing, I have enabled it on our production cluster to the following effect:

These graphs are from MRTG -- the top one is the load graph for the DB server, and the bottom is the accumulative graph for the cluster traffic. As you can see, the effect is pretty drastic. Before pgpool, our 5-minute average load on the DB server was hovering at just under 1 (this is a dual-processor system, so full load is 2). Traffic peaks could push us close to 2 or even over -- sometimes very much over. For example, observe the traffic peak we had at 12PM yesterday that pushed us to about 2.7 -- and that's just normal usage pattern. Sometimes things turn pretty bleak when we are hit by a distributed spambot network that tries to crawl all of our pages from a bazillion of worldwide zombie nodes. Memcache is great, but it doesn't help in such cases, as we only cache data for several hours. If a page is infrequently accessed, it won't be in cache and will thus necessitate a trip to the db.

Now, observe how the graph looks after pgpool is enabled at around 9AM today and we start using persistent connections (and just persistent connections, no query caching or replication). Average 5-minute load drops from 1 to about 0.3-0.4 -- in other words, a full 60-70% reduction in load just by not having to constantly establish and tear down PostgreSQL connections. And a traffic peak that's more pronounced than yesterday's pushes the load to about 1, instead of 2.7.

Installing pgpool is probably the quickest and easiest way to reduce the load on your PostgreSQL server in a web environment context. All you have to do is "yum install postgresql-pgpool-II", configure your /etc/pgpool.conf, start the service, and tell your web applications to connect to localhost:9999 instead of the actual db server. Boom, instant load reduction.


Unknown said...

You mean you installed a pg-pool on each your web node ?

Unknown said...

That's correct.

Unknown said...

Nice, thank you for the quick overview.

That would help me. ;-)

Anonymous said...

Good Post. I'm a newbie and recently installed pgpool and wanted to run some #s past you to see if that's kind of what you saw when you initiated the service.

I'm using PHP 5, pgpool II 2.0.1 and postgres 8.2.7.

Once I configured pgpool and made the chnages to PHP to connect to postgres thru pgpool, I noticed a .010 - .020 increase in the time to process

My PHP script is very simple. Connect to the database, execute a qry to select 12 records and spit it out in a table.

Here are some execution times:

W/o PGPOOL - .036 - .038 seconds
W/ PGPOOL - .046 - .059 seconds

By comparison, if I use pg_pconnect, I get .013 seconds.

Is this what you saw. Your execution times went up per script going thru PGPOOL?


Anonymous said...

U can also try this:
IMHO fast and better for simple pooling

Jack Võ said...

can pgpool-II nead some Perl modules?

Vitaly said...

Is it possible to install pgpool on the web server where php scripts are executed and not on every db node?

Unknown said...

Yes, you would normally install pgpool directly on the web nodes and not on the db servers.