New: 2-Dec-96
Klaus Weide found a bug in the patch.
(See his message and reply below.)
-John Heidemann
----------------------------------------------------------------------
X-url:
To: new-httpd@hyperreal.com
Subject: mmap patch for Apache-1.0.5
Date: Fri, 24 May 1996 11:01:41 -0700
From: John Heidemann
At ISI I'm looking at web server performance as part of the LSAM
project (http://www.isi.edu/div7/lsam/). As part of this analysis we
found an optimization to Apache performance: by using memory-mapped
files (rather than stdio), CPU utilization can be reduced when sending
large files.
The attached patch implements this optimization in Apache-1.0.5.
Performance is examined in more detail in the long comment at the
beginning of the patch.
Although the patch is for Apache-1.0.5, the port to 1.1bX should be
fairly easy. If people think that the patch is suitable for inclusion
in a future release of Apache (probably 1.2), then I will do the port.
Comments?
-John Heidemann
USC/ISI
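
To illustrate the idea described in the message above, here is a minimal sketch (not the patch itself). The helper send_file_mmap() is hypothetical: it maps a whole file read-only and hands the mapping straight to write(), so no stdio buffer sits between the file cache and the socket. Mapping the file in one piece is an assumption made to keep the example short; the patch below maps large files in segments and falls back on stdio when mmap fails.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: send all of `in` to `out`
 * with mmap()/write() instead of fread()/fwrite().
 * Returns the number of bytes written, or -1 on error.
 */
long send_file_mmap(FILE *in, FILE *out)
{
    struct stat st;
    char *map;
    long sent = 0;
    int in_fd, out_fd;

    assert(in != NULL && out != NULL);

    /* Flush pending stdio output before writing to the raw descriptor. */
    fflush(out);
    in_fd = fileno(in);
    out_fd = fileno(out);

    if (fstat(in_fd, &st) == -1)
        return -1;
    if (st.st_size == 0)
        return 0;                   /* a zero-length mmap() is an error */

    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, in_fd, 0);
    if (map == MAP_FAILED)
        return -1;                  /* a caller could fall back on stdio here */

    while (sent < (long)st.st_size) {
        ssize_t w = write(out_fd, map + sent, (size_t)(st.st_size - sent));
        if (w == -1) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            munmap(map, (size_t)st.st_size);
            return -1;
        }
        sent += (long)w;            /* write() may be partial; keep going */
    }

    munmap(map, (size_t)st.st_size);
    return sent;
}
```

A caller would pass the already-open file and the client stream, much as send_fd() receives them, and treat a return of -1 as the signal to use the stdio path instead.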
----------------------------------------------------------------------
Date: Fri, 22 Nov 1996 11:11:58 -0600 (CST)
From: Klaus Weide
To: johnh@dash.isi.edu
Subject: bug in "mmap patch for Apache-1.0.5"?
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
(I am referring to the version found at
as of today.)
Hello,
I looked over the patch mentioned above, and it appears to me that
there is a flaw in the logic which determines the `segment_length'
for the (first) mmap() call. This would only be relevant if
(1) send_fd_mmap() is called on a FILE with the position indicator
different from the start of the file, and
(2) the initial `remaining_length' is smaller than a full
MMAP_SEGMENT_SIZE.
The following is the relevant part of your patch:
+    /* set up for initial mapping */
+    segment_start = start_ftell & ~(MMAP_SEGMENT_SIZE-1);
+    o = start_ftell & (MMAP_SEGMENT_SIZE-1);
+    remaining_length = r->finfo.st_size - start_ftell;
+
+    while (!c->aborted && remaining_length) {
+        segment_length = MMAP_SEGMENT_SIZE;
+        if (segment_length > remaining_length)
+            segment_length = remaining_length;
+        if (segment_length == 0)
+            break;
+
+        map = mmap(NULL, (size_t)segment_length, [... , ... ,]
+                   r_fd, (off_t)segment_start);
         [ rest of while loop ]
For example, in the following situation:
  segment_start
  |
  |<------------- MMAP_SEGMENT_SIZE -------------->|
  v                                                |
  |------------------------------------------------|-----------------|
                  ^                         ^               ^
                  |                         |               |
             start_ftell                    |       r->finfo.st_size
                                            |
                     (effective end of mmapped region, *TOO SHORT*)
the call to mmap would effectively happen as
    mmap(NULL, (r->finfo.st_size - start_ftell), ..., segment_start);
when it *should* be
    mmap(NULL, (r->finfo.st_size - segment_start), ..., segment_start);
NOTE: I have only looked at your patch and do not know the rest of
the apache code. If I have overlooked something obvious, or
misunderstand mmap(), please let me know. [But as far as I and the man
pages I consulted know, mmap() doesn't care about a current file
position, it always maps from the beginning of a file.]
Klaus
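
The arithmetic in the message above can be checked with a small sketch. The two helper names are hypothetical and exist only for illustration; MMAP_SEGMENT_SIZE takes its value from the patch. The first helper computes the length the posted code passes to the first mmap() call, the second the length needed for a mapping that begins at segment_start to reach the last byte to be sent.

```c
#include <assert.h>
#include <stddef.h>

#define MMAP_SEGMENT_SIZE ((size_t)8 * 1024 * 1024)   /* value used in the patch */

/* Length the posted code passes to the first mmap() call. */
size_t first_map_length_as_posted(size_t file_size, size_t start_ftell)
{
    size_t remaining;

    assert(start_ftell <= file_size);
    remaining = file_size - start_ftell;
    return remaining < MMAP_SEGMENT_SIZE ? remaining : MMAP_SEGMENT_SIZE;
}

/*
 * Length needed when the mapping begins at segment_start: the bytes to
 * send plus the o bytes between segment_start and start_ftell.
 */
size_t first_map_length_needed(size_t file_size, size_t start_ftell)
{
    size_t o = start_ftell & (MMAP_SEGMENT_SIZE - 1);

    return first_map_length_as_posted(file_size, start_ftell) + o;
}
```

For a 100000-byte file with the position indicator at 30000, the posted code asks for 70000 bytes while 100000 are needed; with the position indicator at 0 the two lengths agree, which is the case where the flaw cannot show up.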
----------------------------------------------------------------------
X-url:
To: Klaus Weide
Subject: Re: bug in "mmap patch for Apache-1.0.5"?
In-reply-to:
Date: Mon, 02 Dec 1996 14:59:31 -0800
From: John Heidemann
On Fri, 22 Nov 1996 11:11:58 CST, Klaus Weide wrote:
>(I am referring to the version found at
>
>as of today.)
>
>...
>
> I looked over the patch mentioned above, and it appears to me that
>there is a flaw in the logic which determines the `segment_length'
>for the (first) mmap() call. This would only be relevant if
>
> (1) send_fd_mmap() is called on a FILE with the position indicator
> different from the start of the file, and
> (2) the initial `remaining_length' is smaller than a full
> MMAP_SEGMENT_SIZE.
>
>...
>
There is an error as you outline---thanks for letting me know.
The correct code would have
    map = mmap(NULL, (size_t)(segment_length + o), [... , ... ,]
               r_fd, (off_t)segment_start);
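
As a rough sketch of how a mapping length of "bytes to send plus o" fits into a walk over a whole file, the hypothetical helper plan_segments() below only computes the mmap() offset and length for each pass without touching any file. The struct, the function name, and the choice to recompute the offset on every pass are assumptions made for this illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define MMAP_SEGMENT_SIZE ((size_t)8 * 1024 * 1024)   /* value used in the patch */

/* One planned mmap() call: where it starts, how long it is, what gets sent. */
struct segment {
    size_t map_start;   /* file offset passed to mmap(), segment aligned */
    size_t map_len;     /* length passed to mmap(): skip + send_len */
    size_t skip;        /* bytes before the first byte to send ("o") */
    size_t send_len;    /* bytes actually written from this mapping */
};

/*
 * Plan the mappings needed to send bytes [start, file_size).
 * Fills at most `max` entries of `out` and returns how many were filled.
 */
size_t plan_segments(size_t file_size, size_t start,
                     struct segment *out, size_t max)
{
    size_t count = 0;
    size_t pos = start;

    assert(start <= file_size);
    while (pos < file_size && count < max) {
        size_t map_start = pos & ~(MMAP_SEGMENT_SIZE - 1);
        size_t skip = pos - map_start;
        size_t send_len = MMAP_SEGMENT_SIZE - skip;

        if (send_len > file_size - pos)
            send_len = file_size - pos;     /* last, partial segment */

        out[count].map_start = map_start;
        out[count].map_len = skip + send_len;
        out[count].skip = skip;
        out[count].send_len = send_len;
        count++;
        pos += send_len;                    /* lands on the next boundary */
    }
    return count;
}
```

Only the first mapping can have a non-zero skip; every later pass starts exactly on a segment boundary, so its mapping length equals the number of bytes it sends.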
>The following is the relevant part of your patch:
>+    /* set up for initial mapping */
>+    segment_start = start_ftell & ~(MMAP_SEGMENT_SIZE-1);
>+    o = start_ftell & (MMAP_SEGMENT_SIZE-1);
>+    remaining_length = r->finfo.st_size - start_ftell;
>+
>+    while (!c->aborted && remaining_length) {
>+        segment_length = MMAP_SEGMENT_SIZE;
>+        if (segment_length > remaining_length)
>+            segment_length = remaining_length;
>+        if (segment_length == 0)
>+            break;
>+
>+        map = mmap(NULL, (size_t)segment_length, [... , ... ,]
>+                   r_fd, (off_t)segment_start);
>         [ rest of while loop ]
This bug was not discovered until now because:
(1) Apache 1.0 always has start_ftell == 0.
(2) mmap rounds mappings up to whole-page granularity.
I've updated the patch on my web page.
Note that the patch is (as of today) untested.
-John
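
Reason (2) in the reply above can be illustrated with a little arithmetic. The helpers below are hypothetical, and the 4096-byte page size used in the usage note is an assumption; real code would ask the system, for example with sysconf(_SC_PAGESIZE). The sketch only models the rounding and makes no claim about what any particular OS lets a process touch beyond the requested length.

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to a multiple of page_size, which must be a power of two. */
size_t round_up_to_page(size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (len + page_size - 1) & ~(page_size - 1);
}

/*
 * Returns 1 if a mapping requested with short_len bytes still spans
 * needed_len bytes once its length is rounded up to whole pages.
 */
int rounding_hides_shortfall(size_t short_len, size_t needed_len,
                             size_t page_size)
{
    return round_up_to_page(short_len, page_size) >= needed_len;
}
```

With 4096-byte pages, a request for 9000 bytes where 9100 are needed is rounded up to 12288 bytes, so the shortfall stays hidden; a request for 70000 bytes where 100000 are needed is rounded up to only 73728 bytes, so it does not.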
----------------------------------------------------------------------
Index: http_protocol.c
===================================================================
RCS file: /nfs/gost/CVSroot/external/apache/src/http_protocol.c,v
retrieving revision 1.1
retrieving revision 1.4
diff -u -u -r1.1 -r1.4
--- http_protocol.c 1996/04/04 17:56:44 1.1
+++ http_protocol.c 1996/05/23 22:58:13 1.4
@@ -532,13 +532,260 @@
return fread (buffer, sizeof(char), bufsiz, r->connection->request_in);
}
+#ifdef ISI_MMAP
+/***********************************************************************
+ *
+ * ISI_MMAP patch
+ * --------------
+ * John Heidemann,
+ *
+ *
+ * Apache 1.0.5 (and NCSA 1.5) use stdio to send out file data.
+ * Stdio is good for piecing together headers, but it's not
+ * the best choice for bulk-data transfer because it incurs
+ * several unnecessary data copies.
+ *
+ * With stdio you see the following copies to send out a file:
+ * disk -> fs/vm-cache -> stdio buffer -> user buffer
+ * -> stdio buffer -> mbufs -> network device
+ * (6 copies)
+ *
+ * Instead of using stdio, we should memory-map the file
+ * and then write that memory directly out to the network.
+ * Mmap/write eliminates the stdio buffer copies:
+ * disk -> fs/vm-cache -> mbufs -> network device
+ * (3 copies)
+ * With mmap, the data never hits user-space.
+ *
+ *
+ * What is the result of mmapping instead of stdio?
+ * ------------------------------------------------
+ *
+ * In cases where your web server is CPU-bound and mmapping is
+ * effective, you should see higher throughput with mmapping. In
+ * cases where your web server is not CPU-bound, you should see
+ * lower CPU utilization.
+ *
+ * Mmapping is only effective for ``large'' files; for extremely small
+ * files the cost of setting up the mmap exceeds the cost of simply
+ * doing the extra data copies. In this case ``large'' is an
+ * OS- and hardware- dependent value; for SunOS 4.1.3 on Sparc-10s
+ * the balance seems to be at about 10k.
+ *
+ * When are web servers CPU bound? A Sparc-10 can saturate a 10Mb/sec
+ * Ethernet with CPU to spare. With Myrinet (a 640Mb/sec network, see
+ * http://www.myri.com), CPU usage becomes an issue. With Sparc-10s,
+ * we found that mmapping allows ~2Mb/sec better performance than stdio
+ * for files larger than 25KB (maximum throughput is 18Mb/sec for 10MB
+ * files). For Sparc-20/71s we see about the same performance gain
+ * (maximum throughput is ~39Mb/sec for 10MB files). (These
+ * measurements are between two unloaded machines with the same CPU
+ * type connected through a single Myrinet switch. A modified Apache
+ * server ran on one machine, and a single client ran on the other
+ * machine, requesting the same file 50 times in a row. Files were
+ * stored in tmpfs on the server.)
+ *
+ * Servers are also CPU bound when there are many clients hitting a
+ * single server. We ran WebStone (with the ``Silicon Surf''
+ * filelist) with and without the mmap patch on a Sparc-20/71 server
+ * with two Sparc-10 clients over Myrinet. The mmap-enhanced
+ * server handles about 0.5-1.5 additional connections per second
+ * as the number of clients varies from 2 to 24. (The total
+ * number of connections per second ranges from 14.9 to 35.4.)
+ *
+ *
+ * About the implementation
+ * ------------------------
+ *
+ * The implementation had several goals:
+ * - minimal changes
+ * - make the new code look like the old code
+ * - check all errors
+ * - fall back on stdio at the slightest problem, if we can
+ * In general, write something that people will run in a production
+ * web server.
+ *
+ * There is one possible resource leak: mmap segments must be released
+ * upon aborts. I check all error returns, but it looks like
+ * timer-expirations lead to longjmps. To get around this problem,
+ * we probably should add the mmap segment to the resource
+ * pool cleanups.
+ *
+ *
+ * How to use
+ * ----------
+ *
+ * To use this implementation, apply the patch to Apache 1.0.*,
+ * add -DISI_MMAP to AUX_CFLAGS in the Makefile or Configuration,
+ * and re-build.
+ *
+ *
+ * Disclaimer
+ * ----------
+ *
+ * DISCLAIMER OF WARRANTY. THIS PATCH IS PROVIDED "AS IS". The
+ * University of Southern California MAKES NO REPRESENTATIONS OR
+ * WARRANTIES, EXPRESS OR IMPLIED. By way of example, but not
+ * limitation, the University of Southern California MAKES NO
+ * REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY
+ * PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE
+ * COMPONENTS OR DOCUMENTATION WILL NOT INFRINGE ANY PATENTS,
+ * COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS. The University of Southern
+ * California shall not be held liable for any liability nor for any
+ * direct, indirect, or consequential damages with respect to any
+ * claim by the user or distributor of this patch or any
+ * third party on account of or arising from this Agreement or the use
+ * or distribution of this patch.
+ *
+ */
+
+
+#include <sys/types.h>
+#include <sys/mman.h>
+/* work around deficient system headers (ex. SunOS 4.1.3) */
+#ifndef MAP_FILE
+#define MAP_FILE 0
+#endif /* ! MAP_FILE */
+
+/*
+ * On SunOS 4.1.3, the break-even point
+ * between mmap and stdio
+ * (as measured by bandwidth over Myrinet between Sparc-10 hosts)
+ * seems to be at ~10000B.
+ * Your mileage may vary.
+ */
+#define MMAP_THRESHOLD (8*1024)
+#define MMAP_SEGMENT_SIZE (8*1024*1024)
+/*
+ * Currently we write data in 32KB chunks,
+ * 4x more than with fread/fwrite.
+ * Larger chunks => fewer system calls => lower CPU utilization.
+ * ...*but* we have a timer going and we don't want the timer
+ * to expire before we're through (or we'll be sorry).
+ */
+#define MMAP_WRITE_SIZE (32*1024)
+#define MMAP_AGAIN (-2)
+
+/*
+ * To avoid data copies,
+ * send_fd_mmap uses mmap/write instead of stdio.
+ *
+ * Another interface difference:
+ * send_fd doesn't necessarily leave either the file passed in (f),
+ * or r->connection->client in a usable state.
+ * See the comment at the end for details.
+ *
+ * - John Heidemann, , 960411
+ */
+long send_fd_mmap(FILE *f, request_rec *r)
+{
+    int r_fd, w_fd, start_ftell;
+    caddr_t map;
+    size_t remaining_length, segment_length;
+    off_t segment_start;
+    int total_bytes_sent = 0;
+    int w, n, o;
+    conn_rec *c = r->connection;
+
+    /* First, clean up file. */
+    fflush(f);
+    start_ftell = (off_t)ftell(f);
+    r_fd = fileno(f);
+
+    fflush(c->client);
+    w_fd = fileno(c->client);
+
+    /* set up for initial mapping */
+    segment_start = start_ftell & ~(MMAP_SEGMENT_SIZE-1);
+    o = start_ftell & (MMAP_SEGMENT_SIZE-1);
+    remaining_length = r->finfo.st_size - start_ftell;
+
+    while (!c->aborted && remaining_length) {
+        /* bytes to send from this segment (its first o bytes are skipped) */
+        segment_length = MMAP_SEGMENT_SIZE - o;
+        if (segment_length > remaining_length)
+            segment_length = remaining_length;
+
+        map = mmap(NULL, (size_t)(segment_length + o), PROT_READ, MAP_SHARED|MAP_FILE,
+                   r_fd, (off_t)segment_start);
+        /*
+         * If mmap failed and we haven't done anything else yet,
+         * fall back on stdio by returning MMAP_AGAIN.
+         * send_fd recognizes this message and picks up.
+         */
+        if (map == (caddr_t) -1)
+            return total_bytes_sent ? total_bytes_sent : MMAP_AGAIN;
+        n = segment_length;  /* bytes to send; o + n is always the mapped length */
+
+        /*
+         * xxx: we write in larger chunks than send_fd,
+         * possibly therefore requiring larger timeout values.
+         */
+        while (n && !c->aborted) {
+            w = MMAP_WRITE_SIZE;
+            if (n < MMAP_WRITE_SIZE)
+                w = n;
+            w = write(w_fd, &map[o], w);
+            if (w == -1) {
+                munmap(map, (size_t)(o + n));
+                return total_bytes_sent;
+            }
+            reset_timeout(r);
+            total_bytes_sent += w;
+            n -= w;
+            o += w;
+        }
+
+        (void) munmap(map, (size_t)(o + n));
+        remaining_length -= segment_length;
+        segment_start += MMAP_SEGMENT_SIZE;  /* next mapping starts on the next boundary */
+        o = 0;  /* set up for next pass */
+    }
+
+    /*
+     * Upon return, whether f or c->client are usable
+     * is unspecified (and therefore OS dependent).
+     *
+     * In most OSes, it should be OK to go back and use them.
+     *
+     * In the worst case, they may have to be re-created with code like:
+     *     dup(w_fd);
+     *     fclose(c->client);            -- out with the old
+     *     c->client = fdopen(w_fd);     -- in with the new
+     *
+     * This problem can be fixed in Apache-1.1 which uses its own
+     * stdio-equivalent which will have known behavior.
+     */
+
+    return total_bytes_sent;
+}
+#endif /* ISI_MMAP */
+
long send_fd(FILE *f, request_rec *r)
{
char buf[IOBUFSIZE];
long total_bytes_sent;
register int n,o,w;
conn_rec *c = r->connection;
-
+
+#ifdef ISI_MMAP
+    /*
+     * Be very conservative about invoking mmap.
+     * The file stats must be valid, we must have a regular
+     * file, and we must have ``enough'' data to send that
+     * mmapping is worthwhile. If so, try it out.
+     * If we try it and it doesn't work, fall back
+     * on stdio if we can.
+     */
+    if (r->finfo.st_mode &&
+        S_ISREG(r->finfo.st_mode) &&
+        r->finfo.st_size - ftell(f) > MMAP_THRESHOLD) {
+        total_bytes_sent = send_fd_mmap(f, r);
+        if (total_bytes_sent != MMAP_AGAIN)
+            return total_bytes_sent;
+        /* MMAP_AGAIN => fall through and do stdio anyway */
+    }
+#endif /* ISI_MMAP */
total_bytes_sent = 0;
while (!r->connection->aborted) {
while ((n= fread(buf, sizeof(char), IOBUFSIZE, f)) < 1