[FASTCGI] Threaded C fcgiapp implementation problems and questions

Jonathan Gray jgray at streamy.com
Wed Apr 22 15:54:50 EDT 2009


Hello,

I have a multithreaded, C FastCGI script using the fcgiapp library running
on top of lighttpd.

I'm having a recurring problem in my production environment that crops up
after a few days straight of load at around 20-40 concurrent connections.

This script is implementing something called COMET
http://en.wikipedia.org/wiki/Comet_(programming)

It's basically using AJAX/XHR requests to simulate pushing to the client. 
The user opens an AJAX request to the script, and the server holds the
response open until a message arrives (the script connects to a central
server which sends messages to clients), or until we time it out.  On the
Wikipedia page, this is described as Ajax with long polling /
XMLHttpRequest long polling.

This has been working for a very long time, but recently, as load has been
increasing, we started to see some weird behavior.

All of a sudden, lighttpd/mod_fastcgi will start to reject all new
connections.  The log shows this error:

2009-04-01 12:21:33: (mod_fastcgi.c.3005) got proc: pid: 3664 socket:
unix:/home/user/cgi/socks/event.sock-0 load: 25
2009-04-01 12:21:33: (mod_fastcgi.c.2494) unexpected end-of-file (perhaps
the fastcgi process died): pid: 3664 socket:
unix:/home/user/cgi/socks/event.sock-0


The process is not dead; there are 24 other connections that are being
handled properly at that moment.  When these new requests come in, the
script does not see them at all (i.e. FCGX_Accept_r does not return).

After all the existing connections have dropped, it will then continue
normal operation and start to accept new connections:

2009-04-01 12:22:00: (mod_fastcgi.c.1515) released proc: pid: 3664 socket:
unix:/home/user/cgi/socks/event.sock-0 load: 2
2009-04-01 12:22:01: (mod_fastcgi.c.1515) released proc: pid: 3664 socket:
unix:/home/user/cgi/socks/event.sock-0 load: 1
2009-04-01 12:22:03: (mod_fastcgi.c.1515) released proc: pid: 3664 socket:
unix:/home/user/cgi/socks/event.sock-0 load: 0

and then

2009-04-01 12:22:03: (mod_fastcgi.c.3005) got proc: pid: 3664 socket:
unix:/home/user/cgi/socks/event.sock-0 load: 1

The same PID (the process never crashed) then does start to see new
connections again, and things run for another few days without problems
before the same thing happens again.


The design of my application differs from the example threaded application
because I do not keep a thread per connection, rather I use queues,
timers, hash tables, etc to track the state of sessions and their
FCGX_Request.

Since I can't just use one FCGX_Request per thread, as done in the example,
I pre-allocate a large array of FCGX_Requests of size
MAX_ALLOC_REQUESTS.  I then loop through this array, advancing one
index each time.  The array is large enough that I never get anywhere
close to reusing a request that has not already been FCGX_Finish_r'd.
(It is set to 25,000 right now; in benchmarking I'm trying to get over
10k, and I am nowhere near that in production, where the bug happens.)

Is this a sane approach?  Could I be messing something up by allocating
so many and calling FCGX_InitRequest on each?

  for (i = 0; i < MAX_ALLOC_REQUESTS; i++)
      FCGX_InitRequest(&reqs[i], 0, 0);


I am locking around the accept, so the accepting of connections is
single-threaded:

    pthread_mutex_lock(&accept_mutex);
    rc = FCGX_Accept_r(&reqs[curreq]);
    nextreq = (curreq + 1) % MAX_ALLOC_REQUESTS;
    pthread_mutex_unlock(&accept_mutex);


I have two potential threads that can close the connection.  In all cases,
the closing of the connection follows the form:

    FCGX_FPrintF(request->out,"...");
    FCGX_Finish_r(request);


That request is still part of the reqs[] array and will be passed again,
much later, to FCGX_Accept_r.

Again, is this right?  I read that FCGX_Finish_r is thread-safe, so I'm
not locking around it; there are potentially two threads running
FCGX_FPrintF and FCGX_Finish_r simultaneously (but ALWAYS on different
FCGX_Requests).


I have increased all kinds of system limits (file/socket descriptor
limits, memory limits).  I have watched the requests "loop" through the
big array of pre-allocated ones, and they are reused without a problem.

One thing I'm also not sure about is how keep-alives and pipelining might
interact with what I'm doing.  Looking at the lighttpd status page, I
sometimes noticed that connections, after they are out of handle-req and
the script has returned/FCGX_Finish_r'd them, sit in a 'read' state for
some time.  The only pointer I've gotten on that is that lighttpd might
be trying to read another request from the client?


I'd really appreciate any kind of help.  I'm a bit stuck and in any case
could use some best practices advice.

Thanks.

Jonathan Gray

