Wednesday, October 25, 2006

PHP and Database Connection Pooling

During my recent vacation I worked on this blog entry, which talks about connection pooling in PHP and some interesting results IBM recently published in a Zend Developer Zone article. I just saw Christopher Jones’ blog entry on a new feature coming in Oracle 11g, so I’ve slightly adjusted it to include the news.

As is apparent from the likes of Yahoo!, Facebook and other large Web 2.0 companies that serve billions of page views per month, PHP can indeed be deployed in a way that scales extremely well. Naturally, with any technology that has to scale to huge volumes of traffic across large clusters of servers, there are always scalability issues that need resolving. Common issues that IT personnel deal with are monitoring and management of their clusters, application deployment, file system scalability, network topology, high availability and database scalability.

In my experience, databases have had a long history of being the typical bottleneck in PHP applications. There are many reasons for that, including:
A) Dynamic Web applications tend to be heavily database driven, using the database both as a searchable content store (CMS) and for transactional operations (e-commerce).
B) PHP developers prefer writing PHP code to SQL, which often leads to suboptimal queries and to the majority of the processing logic being moved into PHP.
C) Web sites often serve millions of users, requiring PHP to deal with very heavy traffic. This in turn puts huge loads on the databases.
D) There is a lack of good, cheap tools for finding bottlenecks in database queries (such tools have always been very expensive; it will be interesting to see how MySQL's new service, which also promises performance monitoring, changes this).
E) PHP works best in multi-process environments such as Apache. There are many benefits to such a model, especially stability, but it also typically requires each process to manage its own connection to the database.

What I'd like to talk about is the last point. Most production PHP deployments prefer to use persistent database connections in order to save the overhead of opening a new connection on each request. This overhead includes not only the TCP/IP connection and the handshake with the database, but also server-side resource allocation such as processes, threads and the connection context, all of which can take considerable time.
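To make the difference concrete, here is a minimal sketch using PDO; the DSN, credentials and table name are made up for illustration, and the same idea applies to the pconnect-style functions such as mysql_pconnect(), oci_pconnect() or pg_pconnect():

<?php
// Regular connection: opened on this request, torn down when it ends.
$db = new PDO('mysql:host=db.example.com;dbname=app', 'webuser', 'secret');

// Persistent connection: the Apache worker process keeps the underlying
// connection open after the request finishes and reuses it on the next
// request it happens to serve.
$pdb = new PDO('mysql:host=db.example.com;dbname=app', 'webuser', 'secret',
               array(PDO::ATTR_PERSISTENT => true));

$stmt = $pdb->query('SELECT COUNT(*) FROM news_items');
echo $stmt->fetchColumn();
?>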

Using persistent database connections with the multi-process model creates a huge problem for large PHP clusters. Let's assume that the average PHP server is configured to spawn 200 Apache worker processes. For companies running clusters with dozens or hundreds of PHP servers, this can end up being a huge strain on database servers that have to deal with thousands of concurrent connections. For example, a 100-server cluster will typically require 20,000 concurrent connections. The performance hit on the database server can be devastating, for reasons including large memory consumption and crippling context-switch overhead caused by the sheer number of threads and/or processes.
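To put rough numbers behind that, here is an illustrative Apache prefork configuration; the values are assumptions picked to match the arithmetic above, not a tuning recommendation:

# httpd.conf (prefork MPM) -- illustrative values only
StartServers         20
MinSpareServers      20
MaxSpareServers      50
# Up to 200 worker processes per server, each potentially holding its
# own persistent database connection.
MaxClients          200
# Workers are never recycled, so persistent connections stay open.
MaxRequestsPerChild   0

# With one persistent connection per worker:
#   200 connections/server x 100 servers = 20,000 connections
# that the database has to keep alive concurrently.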

Such problems are not foreign to IBM's DB2 team. They have faced such multi-process architectures in the past and implemented a feature called the connection concentrator. This feature efficiently handles large numbers of connections that have short-lived transactions, which are very typical of PHP applications. Just recently IBM ran some PHP-specific benchmarks of this feature and wrote an article describing the results. These results really impressed me, as they managed to easily scale to 10,000 concurrent DB connections (roughly equivalent to a PHP cluster of 50 servers) on a standard two-socket, dual-core processor Linux machine with 4 GB of memory. And not only was this on what is considered a lightweight machine for a database, there was memory to spare.
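For those curious how the concentrator is switched on: as far as I recall it kicks in once you allow more client connections than coordinating agents, along these lines (parameter names and values from memory, so check the article and the DB2 docs rather than taking this verbatim):

db2 update dbm cfg using MAX_COORDAGENTS 200
db2 update dbm cfg using MAX_CONNECTIONS 10000
db2stop
db2start

With a configuration like this, up to 10,000 client connections are multiplexed onto a pool of 200 agents, so each PHP worker can keep its persistent connection open without tying up a dedicated database process.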

Although stability of the multi-process model has always been one of the main selling points for Apache, it has often been criticized for not allowing application-server-like sharing of database connections between the various worker processes. This affects not only PHP but also other languages such as Python and Perl. Application server advocates have tended to point this out time and again, when in fact such connection sharing is a workaround for database servers that don't scale in this kind of scenario. Why should we solve this problem by abandoning the rock-solid multi-process model rather than having the databases handle this common and increasingly prevalent runtime architecture? The traditional J2EE application server model is dying in favor of a more loosely coupled, robust and stable set of technologies that all interoperate and scale well together. Whether it's LAMP, OPAL, WIMP or other combinations of technologies, the world is moving away from monolithic systems to heterogeneous, loosely-coupled systems. In such environments it becomes increasingly important for each part of the system to scale in a variety of configurations and system architectures.

IBM has shown with this feature that they are successfully playing the loosely coupled game. Now Oracle is following and adding a feature which is also supposed to solve this problem and allow them to scale better in a heavy-duty PHP environment. It will be interesting to see how their performance numbers stack up against IBM’s when the feature is released.