[Sks-devel] wserver_timeout value causing cascading failure?

From:	Jonathon Weiss
Subject:	[Sks-devel] wserver_timeout value causing cascading failure?
Date:	Mon, 24 Apr 2017 13:53:32 -0400

Hi All,

As the maintainer of what is probably the most heavily used key-server
on the net, I've run into a problem that I wanted to discuss here.

An important note here is that I'm using Apache as a proxy for SKS (on
80, 443, and 11371).

If I understand how SKS works, it can accept and hold onto multiple
client connections at once, but only processes them serially.

I think what's going on is something like the following:

1) multiple client connections come in and are passed from Apache to
   SKS (possibly while SKS is working on a previous query).

2) SKS works on the first query and returns the answer

3) for some reason the owner of the second query has disappeared (I
   assume this is because the client gives up, and maybe hits reload or
   something, and Apache notices that the client is gone and drops all
   connection state)

4) SKS waits 'wserver_timeout' (default 60) seconds, and gives up and
   goes on to the next connection.

5) By then the next client has given up as well, so SKS sits through
   another timeout, and the problem expands out of control.
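
The feedback loop in steps 1-5 can be illustrated with a toy model.
This is a minimal Python sketch, not SKS code: the function name
`simulate`, the 10 second client patience, the 1.5 second gap between
arrivals, the 0.5 second service time, and the single slow query that
creates the initial backlog are all assumptions picked for illustration.
Only the serial processing and the per-connection timeout come from the
description above.

```python
def simulate(timeout, n_requests=400, arrival_gap=1.5, service_time=0.5,
             patience=10.0, stall_index=50, stall_extra=30.0):
    """Toy model of one serial worker fed by evenly spaced requests.

    Returns how many requests were reached only after their client had
    already given up. All parameters except `timeout` are illustrative
    assumptions, not measured SKS behaviour.
    """
    clock = 0.0      # time at which the server is next free
    abandoned = 0
    for i in range(n_requests):
        arrival = i * arrival_gap
        start = max(clock, arrival)        # connections are handled in order
        if start - arrival > patience:
            # The client is gone: the server blocks for the full timeout.
            abandoned += 1
            clock = start + timeout
        else:
            # One request (stall_index) is unusually slow and seeds the backlog.
            extra = stall_extra if i == stall_index else 0.0
            clock = start + service_time + extra
    return abandoned


for t in (60, 4, 1):
    print(f"timeout={t:>2}s -> {simulate(t)} of 400 requests abandoned")
```

In this model the backlog only drains when the timeout is shorter than
the gap between arrivals; otherwise each dead connection costs more time
than it takes for the next request to show up. That is a property of the
sketch, not a measured fact about SKS.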

One obvious way to break out of this cycle is a period with no incoming
requests that is long enough for all of the timeouts on existing
connections to expire.  On a mostly idle server, this may be fairly easy
to achieve (especially if queries normally arrive at a rate of less than
one per minute).

I have no idea what the average request rate is for a pool member, but
pgp.mit.edu handles 125k-175k /pks/lookup queries a day (or in round
numbers, roughly 1.5 - 2 queries per second).  Obviously, that doesn't
leave a lot of windows for long timeouts.
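
The round numbers above can be checked directly.  The snippet below is
plain arithmetic in Python; the helper name `per_second` is made up for
this example, and the only inputs are the 125k-175k daily figure and the
60 second default for wserver_timeout mentioned earlier.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def per_second(queries_per_day):
    """Average request rate, assuming queries are spread evenly over the day."""
    return queries_per_day / SECONDS_PER_DAY


for per_day in (125_000, 175_000):
    rate = per_second(per_day)
    # Requests that arrive, on average, while one 60 s timeout is running.
    print(f"{per_day}/day = {rate:.2f}/s, "
          f"about {rate * 60:.0f} arrivals per 60 s timeout")
```

Real traffic is burstier than an even spread, so these are averages
only.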

My solution has been to set wserver_timeout=1 (and some less effective
timeout tuning on the Apache side), on the theory that Apache running
on the same server ought to be able to hand off the query really
quickly.  It will take a few more problem-free days for me to be fully
confident, but wserver_timeout=1 very much looks like it has solved
the problem.  For a while I was running with wserver_timeout=4, but
that proved insufficient.


This all leaves me with several questions:

1) Does anyone see any flaws in my analysis or work-around?

2) Has anyone else encountered anything like this?

3) Any suggestions on what to do if/when wserver_timeout=1 becomes
   insufficient?

4) Any chance of detecting this sort of problem in sksd and skipping
   the timeout altogether?
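
On question 4, one general technique for noticing that a peer has gone
away before waiting on it is a non-blocking peek at the socket.  The
sketch below is Python rather than SKS's OCaml and says nothing about
how sksd is actually structured; `peer_has_hung_up` is a hypothetical
helper.  It also only helps if Apache really does close its side of the
proxied connection when the client disappears, which is assumed above
rather than verified.

```python
import select
import socket


def peer_has_hung_up(conn: socket.socket) -> bool:
    """Return True if the remote end has closed and no data is pending."""
    # Zero timeout: poll the socket without blocking.
    readable, _, _ = select.select([conn], [], [], 0)
    if not readable:
        return False  # nothing to read yet; the peer may still be alive
    try:
        data = conn.recv(1, socket.MSG_PEEK)  # look without consuming
    except ConnectionResetError:
        return True
    # A readable socket that yields no bytes has reached end-of-file.
    return data == b""
```

A server loop could call this before committing to a long read timeout
and drop the connection immediately when it returns True.  A peer that
is silent but still connected is not detected this way, so a timeout of
some kind is still needed.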


        Jonathon

        Jonathon Weiss <address@hidden>
        MIT/IS&T/Infrastructure Design & Engineering
        Cloud Platforms (Server Operations)
