WebCom secrets: How we hosted 70,000 domains on one Apache instance

A chief virtue of time is that it provides distance. Time is the 4th dimension we live in, and it gives us the opportunity to share what once was. It has been 12 years since I was let go from Verio, almost as much time as I worked for WebCom/Verio/NTT, and I feel there is enough distance between then and now to share some secrets without fear of reprisal.

WebCom did things differently: we pioneered name-based virtual hosting and we learned how to do more with less. Back when WebCom was starting to do name-based hosting, it was common for providers to put 2,000 IP addresses on an SGI machine running IRIX. I assume the allure of SGI had to do with decent horsepower and an OS with a BSD-derived network stack that could host a lot of IP addresses per NIC; back then the BSD network stack was considered to be one of the best.

When I started we had HP PA-RISC machines, a Sun 4/330, and a Windows NT 3.51 486 running MS SQL Server (Sybase). By the end of the year we’d signed a lease on a Sun SPARCserver 1000E, a piece of “big iron” at the time. I think we had 4 SuperSPARC processors and 512MB of RAM. We looked at offering IP-based hosting on Sun, but their OS only allowed up to 255 IPs per NIC. We briefly considered an inexpensive array of SCO Unix boxes, but Linux was never in the running because Chris considered it an immature OS. I spent my entire career there championing Linux, and winning.

We decided to go the Big Ole Server route with Sun, first with the S1000E, then an Enterprise 4000 in 1997. Early on we ran Netscape Enterprise Server, a commercial web server product from Netscape written by the same people who wrote NCSA httpd. It was a modular web server with a plugin architecture, and it could be extended by writing NSAPI modules to perform actions in the chain of operations. Apache wasn’t really on the radar at this point. Chris wrote the first name-based hosting plugin for Netscape; that solution lasted us until around 20,000 domains, at which point the underlying architecture of Netscape became a bottleneck.

I proposed a stopgap measure to help spread the load: since the vast majority of our content was static media and web pages, we could put a reverse caching layer in front of the Netscape server and take most of the traffic off of it.

We purchased 2 BIG/IP machines for load balancing, which spread the incoming traffic among 10 Squid reverse proxy cache boxes running Red Hat 5. These were Intel Apollo desktop computers with 180MHz Pentium Pro processors, 256MB of memory, and an 8GB HDD, which gave us around 2GB of in-memory cache storage and about 40GB of HDD cache space. This approach had some teething issues, primarily because of how cache invalidation is done with reverse proxies; the solution never worked fully seamlessly, and it had a fatal flaw: latency.

Back then, latency was a major problem with reverse proxy caches. The web hosting industry was increasingly being held hostage by 3rd-party performance metric websites that sought to prove who was the fastest. One of them, Keynote Systems, would spider your site and track the latency and performance of your servers. Our proxy caching solution scored consistently low in the Keynote rankings, which put pressure on me to deliver a magic bullet.

Apache to the rescue

The magic bullet came in the form of the Apache web server. Apache had matured significantly since we started using the Netscape server, and it was by then considered the de facto standard for web hosting on Unix, though there were still some IIS holdouts who thought Microsoft was the one true way.

The solution to our performance problems was a clean-sheet redesign of our web hosting platform. I evaluated every single feature we used in the Netscape server and scoured the Apache docs to find matching features. 4 years of growth showed that Apache could do almost all of what we needed out of the box. Almost. I had to write a couple of Apache modules to implement custom user matches (http://www.webcom.com/username), but there was no solution for hosting 70,000 domains in a single Apache instance. I tried starting Apache with 20,000+ domains defined as vhosts: the config file was enormous, it took forever for Apache to parse, the memory footprint was large, and the performance just wasn’t there.
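
As an aside, the user-match modules themselves were the easy part. I can’t publish the originals, so the following is only a hypothetical sketch of how a /username match can be hung off Apache 1.3’s translate_name hook; the module name and paths are made up, not WebCom’s actual layout.

/* Hypothetical reconstruction of a /username translation module for
 * Apache 1.3; the real WebCom modules were never published. */
#include "httpd.h"
#include "http_config.h"

static int usermatch_translate(request_rec *r)
{
    /* Treat the first path component as a username and map the request
     * onto that user's content directory.  The real module presumably
     * validated the name against customer data first. */
    if (r->uri[0] != '/' || r->uri[1] == '\0')
        return DECLINED;

    /* /export/users is an assumption, not the real WebCom layout */
    r->filename = ap_pstrcat(r->pool, "/export/users", r->uri, NULL);
    return OK;
}

module MODULE_VAR_EXPORT usermatch_module = {
    STANDARD_MODULE_STUFF,
    NULL,                   /* module initializer */
    NULL, NULL,             /* per-directory config create/merge */
    NULL, NULL,             /* per-server config create/merge */
    NULL,                   /* command table */
    NULL,                   /* content handlers */
    usermatch_translate,    /* URI-to-filename translation */
    NULL, NULL, NULL,       /* check_user_id, auth checker, access checker */
    NULL,                   /* type checker */
    NULL,                   /* fixups */
    NULL,                   /* logger */
    NULL,                   /* header parser */
    NULL, NULL,             /* child init/exit */
    NULL                    /* post read-request */
};

The user matching was easy; the 70,000 vhosts were the real obstacle.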

There are 2 ways you can look at an obstacle: you can treat it as immovable, or you can think creatively around it. I chose the latter.

Sun implemented /tmp using tmpfs, an in-memory filesystem that shares system RAM with the disk cache. Linux has a very similar design today; the models are nearly identical. The tmpfs on Sun was VERY fast at handling lots of inodes in a single directory, probably an order of magnitude faster than UFS or VxFS. I exploited this capability to put 140,000+ symlinks in a single directory in /tmp and used it like a very simple database.
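
The idea is easier to show than to describe. Here is a minimal standalone sketch of the technique; the paths are illustrative, not the actual WebCom layout.

/* Use a tmpfs-backed directory of symlinks as a key->value store:
 *   key   = the symlink name (a domain name)
 *   value = the symlink target (the directory holding that domain's content)
 */
#include <stdio.h>
#include <unistd.h>

#define LINKDIR "/tmp/vhosts"   /* hypothetical tmpfs directory */

/* Store or replace one entry */
static int put(const char *domain, const char *docroot)
{
    char path[1024];
    snprintf(path, sizeof(path), "%s/%s", LINKDIR, domain);
    unlink(path);                   /* drop any existing entry */
    return symlink(docroot, path);  /* the value lives in the link target */
}

/* Look up one entry; readlink() never follows the link, it just returns its target */
static int get(const char *domain, char *docroot, size_t len)
{
    char path[1024];
    ssize_t n;
    snprintf(path, sizeof(path), "%s/%s", LINKDIR, domain);
    if ((n = readlink(path, docroot, len - 1)) == -1)
        return -1;                  /* unknown domain */
    docroot[n] = '\0';              /* readlink() does not NUL-terminate */
    return 0;
}

int main(void)
{
    char dir[1024];
    put("www.example.com", "/export/content/example");
    if (get("www.example.com", dir, sizeof(dir)) == 0)
        printf("www.example.com -> %s\n", dir);
    return 0;
}

No daemon, no locking protocol, no connection handling: the OS arbitrates all access, and the whole table lives in RAM.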

Back then we had Sybase, which was fully transactional, and connections were considered expensive. If we had made our web server dependent on Sybase for every lookup (this is what happened early on with the NSAPI module), a database hiccup or restart would have stopped services from working. A short outage to something like a control panel (self service) is okay, but it cannot be allowed to affect web hosting. So for mission-critical applications that needed database lookups, we would generate DBM files as a key->value store. The key was usually a single text string and the value would be a delimited chunk of row data. We would run stored procedures to update the DBM files periodically, then rename them into place.
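
A sketch of that pattern using the classic ndbm interface; the file names and value layout here are illustrative, not the actual WebCom schema.

/* Rebuild a fresh DBM map offline from the database, then rename() it over
 * the live copy so readers never depend on Sybase being up. */
#include <ndbm.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DBM *db;
    datum key, val;
    char *k = "www.example.com";
    char *v = "/export/content/example|customer42|active"; /* delimited row data */

    /* Build the new map under a temporary name (classic ndbm creates .dir and .pag files) */
    if ((db = dbm_open("/var/maps/domains.new", O_RDWR | O_CREAT, 0644)) == NULL)
        return 1;

    key.dptr = k;  key.dsize = strlen(k) + 1;
    val.dptr = v;  val.dsize = strlen(v) + 1;
    dbm_store(db, key, val, DBM_REPLACE);   /* one store per row from the stored procedure */
    dbm_close(db);

    /* rename() is atomic, so readers see either the old map or the new one,
     * never a half-written file. */
    rename("/var/maps/domains.new.dir", "/var/maps/domains.dir");
    rename("/var/maps/domains.new.pag", "/var/maps/domains.pag");
    return 0;
}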

The limitation of using DBM files was that there was no really clean way to signal a process to refresh its handle on the DBM file. We built signal handlers into some long-running processes to do this, but for Apache it would not have been feasible because of the delay in processing during a reload.

Abusing tmpfs

I used the tmpfs as a key->value database: the symlink name was the key and the contents of the symlink were the value. We would create 140,000+ symlinks, each named for the DNS name of a domain, with the symlink pointing at the directory where that domain’s content was stored. The Apache server would check whether the HTTP request was for the base server (the server without a vhost defined) or for a pre-configured vhost. If there was no vhost directive for the request, it would look up a symlink for “www.hostname.com” or “hostname.com”, perform a readlink() to obtain the directory the hostname pointed at, then inject that into the document root of the request.
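
Distilled from the patch below, the lookup inside Apache amounts to roughly this (simplified; the actual patch handles the readlink() failure case a little differently):

/* Assumes the Apache 1.3 headers (httpd.h etc.); see the full patch below. */
static const char *vhost_document_root(request_rec *r, const char *fallback)
{
    char target[1024];
    int n;
    char *key;

    /* Only kick in for requests that landed on the base server and carry a Host header */
    if (!r->server->namevirtual_symlink_dir || !r->hostname
        || r->connection->server != r->connection->base_server)
        return fallback;

    /* symlink name = hostname, symlink target = that domain's document root */
    key = ap_pstrcat(r->pool, r->server->namevirtual_symlink_dir,
                     "/", r->hostname, NULL);

    if ((n = readlink(key, target, sizeof(target) - 1)) == -1)
        return fallback;            /* unknown domain */

    target[n] = '\0';               /* readlink() does not NUL-terminate */
    return ap_pstrdup(r->pool, target);
}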

We created a single base configuration file that was appropriate for all customer domains, then added a vhost for www.webcom.com, which served our static HTML, CGI services, and the /username redirects.
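
For illustration, the shape of that configuration was roughly the following; the directive name comes from the patch below, while the paths and address are invented.

# Base server: handles every customer domain that has no explicit vhost.
NameVirtualHostSymlinkDirectory /tmp/vhosts
DocumentRoot /export/content/default

# The one real vhost: our own site, CGI services, and the /username redirects.
<VirtualHost 192.0.2.1>
    ServerName www.webcom.com
    DocumentRoot /export/webcom/htdocs
</VirtualHost>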

By using the tmpfs as a fast, lightweight, OS-arbitrated flat database, we could host 70,000 (and more) domains on a single instance of Apache. It was FAST, so fast that we removed the entire proxy layer and just pointed all customer traffic at the main server. The capacity of Apache running on a 12-processor Enterprise 4000 was easily 6-10x that of Netscape Enterprise Server.

About the time I made these changes to Apache, I started having practical and philosophical concerns about taking Open Source Software and modifying it for commercial use without contributing those changes back to the upstream project. The real practical business problem with that model is that every time a new Apache release comes out, you have to port your patches to the new release; I had to do this a couple of times because of Apache CVEs. I also surmised that we could have set the standard had we released our code, instead of someone else doing it later. Someone else did do it later.

It’s been over 20 years since I wrote this code; it’s time for it to make its way into the world, even if it has no practical value anymore.

diff -r -u apache_1.3.6/src/include/httpd.h apache_1.3.6_webcom/src/include/httpd.h
--- apache_1.3.6/src/include/httpd.h	2002-07-25 01:38:03.000000000 -0700
+++ apache_1.3.6_webcom/src/include/httpd.h	2002-07-25 01:42:00.000000000 -0700
@@ -306,7 +306,7 @@
  * the overhead.
  */
 #ifndef HARD_SERVER_LIMIT
-#define HARD_SERVER_LIMIT 256
+#define HARD_SERVER_LIMIT 397
 #endif
 
 /*
@@ -894,6 +894,8 @@
     int limit_req_line;      /* limit on size of the HTTP request line    */
     int limit_req_fieldsize; /* limit on size of any request header field */
     int limit_req_fields;    /* limit on number of request header fields  */
+
+    char *namevirtual_symlink_dir;	/* pedward - virtualhost hack */
 };
 
 /* These are more like real hosts than virtual hosts */
diff -r -u apache_1.3.6/src/main/http_config.c apache_1.3.6_webcom/src/main/http_config.c
--- apache_1.3.6/src/main/http_config.c	2002-07-25 01:38:05.000000000 -0700
+++ apache_1.3.6_webcom/src/main/http_config.c	2002-07-25 01:39:47.000000000 -0700
@@ -1325,6 +1325,8 @@
     s->limit_req_fieldsize = main_server->limit_req_fieldsize;
     s->limit_req_fields = main_server->limit_req_fields;
 
+/* pedward - virtualhost hack */
+    s->namevirtual_symlink_dir = NULL;
     *ps = s;
 
     return ap_parse_vhost_addrs(p, hostname, s);
diff -r -u apache_1.3.6/src/main/http_core.c apache_1.3.6_webcom/src/main/http_core.c
--- apache_1.3.6/src/main/http_core.c	2002-07-25 01:38:06.000000000 -0700
+++ apache_1.3.6_webcom/src/main/http_core.c	2002-07-25 01:42:04.000000000 -0700
@@ -494,7 +494,26 @@
 API_EXPORT(const char *) ap_document_root(request_rec *r) /* Don't use this! */
 {
     core_server_config *conf;
+	char	*p;
 
+/* pedward - virtualhost hack */
+    if (r->server->namevirtual_symlink_dir && r->connection->server == r->connection->base_server && r->hostname) {
+	char	link[1024];
+	int	i;
+
+        p = ap_pstrcat(r->pool, r->server->namevirtual_symlink_dir, "/", r->hostname, NULL);
+
+	if ((i=readlink(p, link, sizeof(link))) != -1) {
+		link[i]='\0';
+	} else {
+		return p;
+	}
+
+	p = ap_pstrdup(r->pool, link);
+
+        return p;
+    }
+	
     conf = (core_server_config *)ap_get_module_config(r->server->module_config,
 						      &core_module); 
     return conf->ap_document_root;
@@ -672,6 +691,11 @@
 {
     core_dir_config *d;
 
+/* pedward - virtualhost hack */
+    if (r->server->namevirtual_symlink_dir && r->connection->server == r->connection->base_server) {
+	return r->hostname;
+    }
+
     d = (core_dir_config *)ap_get_module_config(r->per_dir_config,
 						&core_module);
     if (d->use_canonical_name & 1) {
@@ -2650,6 +2675,17 @@
 }
 #endif
 
+/* pedward - virtualhost hack */
+static const char *set_virtual_symlink_directory(cmd_parms *cmd, void *dummy, char *arg) 
+{
+    if (*arg && arg[strlen(arg)-1] == '/') {
+		arg[strlen(arg)-1]='\0';
+    }
+
+    cmd->server->namevirtual_symlink_dir = ap_pstrdup(cmd->pool, arg);
+    return NULL;
+}
+
 /* Note --- ErrorDocument will now work from .htaccess files.  
  * The AllowOverride of Fileinfo allows webmasters to turn it off
  */
@@ -2875,6 +2911,9 @@
   (void*)XtOffsetOf(core_dir_config, limit_req_body),
   OR_ALL, TAKE1,
   "Limit (in bytes) on maximum size of request message body" },
+/* pedward - virtualhost hack */
+{ "NameVirtualHostSymlinkDirectory", set_virtual_symlink_directory, NULL, RSRC_CONF, TAKE1,
+  "Set the namevirtual host symlink directory"},
 { NULL }
 };
 
@@ -2902,9 +2941,28 @@
 	&& (r->server->path[r->server->pathlen - 1] == '/'
 	    || r->uri[r->server->pathlen] == '/'
 	    || r->uri[r->server->pathlen] == '\0')) {
+
+/* pedward - virtualhost hack */
+    if (r->server->namevirtual_symlink_dir && r->connection->server == r->connection->base_server && r->hostname) {
+	char	link[1024];
+	int	i;
+	char	*p;
+
+        p = ap_pstrcat(r->pool, r->server->namevirtual_symlink_dir, "/", r->hostname, NULL);
+
+	if ((i=readlink(p, link, sizeof(link))) != -1) {
+		link[i]='\0';
+		p = link;
+	}
+
+        r->filename = ap_pstrcat(r->pool, p,
+				 (r->uri + r->server->pathlen), NULL);
+    } else {
         r->filename = ap_pstrcat(r->pool, conf->ap_document_root,
 				 (r->uri + r->server->pathlen), NULL);
     }
+
+    }
     else {
 	/*
          * Make sure that we do not mess up the translation by adding two
@@ -2917,8 +2975,25 @@
 				     NULL);
 	}
 	else {
-	    r->filename = ap_pstrcat(r->pool, conf->ap_document_root, r->uri,
-				     NULL);
+/* pedward - virtualhost hack */
+	    if (r->server->namevirtual_symlink_dir && r->connection->server == r->connection->base_server && r->hostname) {
+		char	link[1024];
+		int	i;
+		char	*p;
+
+		p = ap_pstrcat(r->pool, r->server->namevirtual_symlink_dir, "/", r->hostname, NULL);
+
+		if ((i=readlink(p, link, sizeof(link))) != -1) {
+			link[i]='\0';
+			p = link;
+		}
+
+                r->filename = ap_pstrcat(r->pool, p, r->uri,
+					     NULL);
+	    } else {
+                r->filename = ap_pstrcat(r->pool, conf->ap_document_root, r->uri,
+					     NULL);
+	    }
 	}
     }
 
diff -r -u apache_1.3.6/src/main/http_vhost.c apache_1.3.6_webcom/src/main/http_vhost.c
--- apache_1.3.6/src/main/http_vhost.c	2002-07-25 01:38:07.000000000 -0700
+++ apache_1.3.6_webcom/src/main/http_vhost.c	2002-07-25 01:39:48.000000000 -0700
@@ -665,6 +665,8 @@
     const char *hostname = r->hostname;
     char *host = ap_getword(r->pool, &hostname, ':');	/* get rid of port */
     size_t l;
+/* pedward */
+	char *p;
 
     /* trim a trailing . */
     l = strlen(host);
@@ -672,6 +674,12 @@
         host[l-1] = '\0';
     }
 
+	p = host;
+	while (*p) {
+		*p = tolower(*p);
+		p++;
+	}
+
     r->hostname = host;
 }
 

2 thoughts on “WebCom secrets: How we hosted 70,000 domains on one Apache instance”

  1. WebCom could never have survived without your genius, Perry. The smartest thing Chris and I ever did was hire you.

    Bluntly, WebCom’s single host architecture was a product of the fact that I was a very raw and green sysadmin when we started, and it was easier for me to conceptualize and manage a single system, rather than dealing with distributed systems (the technologies for which were very primitive to non-existent in any form we could afford in early 1994). Another reason for the “big iron” approach was support… we could pay Sun Microsystems $20-30k a year for “Gold” level support, which I leaned on heavily. When something broke at an OS level, I had someone to scream at (and I did).

    Two stories here:

    a) at one point, the web server was locking up due to the filesystem not accepting writes; fixing it required rebooting the server, which in turn resulted in a multi-hour fsck of the non-journaled filesystem; it turns out that there was a bug in the filesystem (I recall something about an NMS “hole” or some such) and my screams got their third level support to issue us a pre-release patch

    b) the Sybase database server was originally tied directly to the web server: every web page hit was logged directly to the database, and then we ran reports and resource usage billing against it; when we ran Sybase 9, this wasn’t a problem… but Sybase 10 was a disaster, and regularly hung under load (as I recall, that release ultimately kind of killed their company) and refused to accept further connections, which in turn hung the web server; after this happened a few times, Chris and I decided to disconnect the logging function from the database and have the web server write to a log processing tool that would handle the task of uploading hit records to the database. Which solved that problem, but created another problem in that the database server (that Sparc 1000e) was running at 95% of capacity, and any time we wound up accumulating a backlog of pending transactions (such as Sybase crashing), the system simply could never catch up.

  2. Yep, I remember custlogd, the daemon that would process the logs and bill customers. Chris did tie logging into the database early on, and we kept that strategy but decentralized it. Snoopy and Woodstock were 2 Linux boxes I built to run the log processor daemon; we would read logs over NFS from the E4000 (implementing external locking on NFS, since filesystem locking sucked at that point) and the log processor daemon would insert them into an array of inexpensive redundant Sybase boxes. Custlogd was then modified to connect to these databases, aggregate the log data, then insert records into the main provisioning database. This process worked well, except that we learned “normalization” was a bad word where performance was concerned.

    We had 1 single clustered table that contained all log records; when a record was processed, the processor would update the table to mark that record as deletable. We would try to perform batch deletes, but this caused us no end of locking havoc.

    I think the ultimate solution ended up being to delete records at very short intervals. As a MySQL Support Engineer I recognize these problems today, and there are a lot of best practices we didn’t know about back then. Today you could just delete the record immediately or do deletes in batches of 10,000 or less.

    Ironically, you could sidestep these short-lived tables entirely by renaming tables, or by using partitions and then dropping partitions.

    One of the interesting perspectives is that I work for Oracle, who bought Sun, who bought MySQL AB. When it comes to known bugs, if there is a patch, getting a hotfix is typically a fairly short turnaround. Our problems with Solaris didn’t end there; they culminated in a major faux pas when trying to deploy NetApp F740s with Solaris 2.5.1: you can’t.

    We tried to get Solaris 2.5.1 working, but it was just buggy, and Sun kept telling us to upgrade to 2.6, so we planned for this. What we didn’t account for was that the Veritas FS (vxfs) was a kernel module from a 3rd party, and we didn’t have that module for Solaris 2.6. We performed the upgrade and discovered we couldn’t access anything but the root filesystem AND the Netapps! We mirrored our root volume to a second disk for redundancy purposes, but by the time we decided we were just going to roll back, the cron job that mirrored the root volume had already run!

    Faced with no turning back, we screamed at Sun until someone emailed us the .so file we needed to get our systems working, installed it, rebooted, and the system was back up. We were down for about 12 hours that night and that was also my first all-nighter. The biggest problem with doing these types of maintenance tasks after hours is that people lose their cognitive edge as the day wears on and mistakes are made at critical junctures.

    It all seems trivial right now, with more industry experience under my belt than my age back then!
