Ticket #302 (closed task: fixed)

Opened 2 years ago

Last modified 1 year ago

plosone is hosed

Reported by: russ Assigned to: ronald
Priority: critical Milestone:
Component: ambra Version:
Keywords: crash block flakiness Cc:
Blocking: Blocked By:

Description

howdy!

things are slower than slow right now on production.

i've disabled plosone on the webheads, but you can see what's going on if you set a hosts rule pointing www.plosone.org at 209.237.227.199

i tried restarting mulgara (stop, then start) and things seemed to improve for maybe 15 seconds.

help!

Dependency Graph

Change History

03/08/07 13:52:34 changed by russ

oh, the server at 209.237.227.199 is connecting to plosone01, look at plosone01 and plostopaz01 if you want to see what's happening on the stack...

03/08/07 15:29:10 changed by russ

  • owner changed from somebody to stevec.
  • component changed from topaz to plos-one.

it looks like this was a plosone issue and not a mulgara issue.

we restarted mulgara, fedora, and topaz with no real luck - it would work for about one page and then die.

we noticed that restarting apache would fix things - again for one page load.

we restarted plosone, and everything came back up. it took about an hour to rebuild the cache on plosone01. it was not pulling cache data from plosone02 (who was, presumably, also hosed).

we've got the site up and public again, with both webheads pointing to the first stack while plosone02 rebuilds.

plosone02 does not seem to be getting cache data from plosone01 - its rebuild is equally slow.

i have a couple of theories:

1. i messed things up with my new multicast iptables rules. i've checked and double checked, and things seem happy. tcpdump is showing multicast traffic between the two servers. i'll try adding a logging rule.

2. we've never left plosone and topaz up this long before (weeks!) perhaps they fall over if they stay up too long.

another note - we restart apache every day. we do this because we noticed that apache would eventaually end up with all it's children hung over time. we were blaming 64-bit flakiness for this...but it's exactly the same thing we were seeing with this crash. i suspect that plosone, or AJP, or mod_jk is doing something evil that doesn't get resolved without a plosone restart.

03/08/07 15:51:23 changed by russ

after plosone02 finished rebuilding i did another restart and another wget run, and it seems to be rebuilding much faster now, although not as zippy as i remember it.

certainly the home page only took a minute or two to come up as opposed to 15 minutes.

so i guess the two things we need are for someone to check and see if the multicast/caching is working, and for someone to look at the plosone logs for this crash and see if there's anything they can figure out...

03/08/07 16:26:38 changed by ebrown

I was able to run the multicast test time server/client successfully. I ran the server on plosone01 and the client on plostopaz01. So it is working on those two machines anyway.

(follow-up: ↓ 6 ) 03/08/07 16:36:59 changed by stevec

plosone caches do not exchange cache data. they only exchange dirty notices.

(in reply to: ↑ 5 ) 03/09/07 08:15:40 changed by russ

Replying to stevec:

plosone caches do not exchange cache data. they only exchange dirty notices.

i've created #303 as a request for documentation on cache behavior on both the plosone and topaz-fedora-mulgara sides.

03/09/07 08:16:25 changed by russ

also! during the crash, we were able to successfully run the annotations tool while the web site was unresponsive. this is another data point that suggests mulgara was not the problem here.

03/12/07 09:25:19 changed by russ

since the crash, we've had a return/worsening of our apache crash issues.

when we first launched, apache would hang up and become completely unresponsive every few days. i theorized that apache children were getting hung, and eventually all the children would be hung and the server would be completely unresponsive. i didn't originally consider that plosone or mod_jk could be the culprit.

we "fixed" this by restarting apache every day.

since the plosone crash, apache is crashing before the end of the day.

i've disabled the load balancer for now - dns traffic for www.plosone.org is now pointing to a single webhead - to see if we can get some evidence either way that mod_jk/plosone is the problem.

we could really use some information on what happened to plosone on thursday. between the mulgara issues and this, we're starting to spend a substantial proportion of our time babysitting plosone and this cannot continue.

03/12/07 09:40:23 changed by russ

rich noted the following warnings - lots of them - on saturday in the plosone log. this was after the daily apache restart, and before apache started having issues again monday early AM. rich restarted apache anyway, and the warnings did not abate.

Mar 10, 2007 1:21:12 PM org.apache.jk.core.MsgContext? action WARNING: Error sending end packet java.net.SocketException?: Broken pipe at java.net.SocketOutputStream?.socketWrite0(Native Method) at java.net.SocketOutputStream?.socketWrite(SocketOutputStream?.java:92) at java.net.SocketOutputStream?.write(SocketOutputStream?.java:136) at org.apache.jk.common.ChannelSocket?.send(ChannelSocket?.java:531) at org.apache.jk.common.JkInputStream?.endMessage(JkInputStream?.java:112) at org.apache.jk.core.MsgContext?.action(MsgContext?.java:293) at org.apache.coyote.Response.action(Response.java:182) at org.apache.coyote.Response.finish(Response.java:304) at org.apache.jk.server.JkCoyoteHandler?.invoke(JkCoyoteHandler?.java:204) at org.apache.jk.common.HandlerRequest?.invoke(HandlerRequest?.java:282) at org.apache.jk.common.ChannelSocket?.invoke(ChannelSocket?.java:767) at org.apache.jk.common.ChannelSocket?.processConnection(ChannelSocket?.java:697) at org.apache.jk.common.ChannelSocket?$SocketConnection?.runIt(ChannelSocket?.java:889) at org.apache.tomcat.util.threads.ThreadPool?$ControlRunnable?.run(ThreadPool?.java:684) at java.lang.Thread.run(Thread.java:595)

03/12/07 14:02:27 changed by russ

during the outage on thursday, eric was able to connect directly to plosone on port 8080, which further suggests this an AJP or mod_jk issue:

Okay. I just tunneled my browser to plosone01:8080 and for the few things I did it seemed reasonable. But I'll go look at the logs for earlier.

03/12/07 14:06:24 changed by russ

for some reason, after the plosone logs rotated friday morning (3/9), all the logging from thursday (3/8) *after* the restart is missing.

also, plosone is logging into the previous day's logfile. i'll open a separate ticket about this.

so, if you want logs up to the crash, please see plosone.log.2007-03-07

03/12/07 14:09:34 changed by russ

action plan:

we've changed DNS so that all traffic for www.plosone.org is hitting plosweb01 and the first stack.

i've suspended the daily restarts of apache.

we're hoping we can induce a crash on apache or plosone01 so we can investigate what's going on. when the crash happens, we'll switch to the second stack and poke around on the first.

(follow-up: ↓ 14 ) 03/12/07 17:19:33 changed by stevec

From what I was seeing, some requests were hanging (w/o ever reaching plosone), and other were occasionally getting through with no problems, and then everthing seemed OK, and then some hung and some others went through again. That always makes me suspect some sort of connection issue where threads are blocking waiting for something to free. I did find this tidbit:

The most important setting in high-load environments is the number of processor threads used by the Tomcat AJP connector. By default, this is only 15, and you should increase it to match Apache's maxThreads setting (256 by default):

<Connector port="8009" minProcessors="5" maxProcessors="256" protocol="AJP/1.3" />

And we don't have those max threads set as far as I know. I'm not sure this is causing the problem, but maybe it is worth looking into. Maybe someone else knows for sure if this would cause the problems we've been seeing. It seems like it would be fairly easy to test this on a staging setup by bringing the min/max down to a small number.

(in reply to: ↑ 13 ) 03/12/07 18:26:04 changed by russ

Replying to stevec:

The most important setting in high-load environments is the number of processor threads used by the Tomcat AJP connector. By default, this is only 15, and you should increase it to match Apache's maxThreads setting (256 by default):

we're not using apache worker, we're using apache prefork, so the relevant setting is maxclients. which is set to 128 on each server. so, max 256. however, in practice the max is around 100 total since the servers are lightly loaded.

i'll add min/max to conf/server.xml on the plosone servers and see if that helps.

03/12/07 18:32:23 changed by russ

oops, i did it again.

that was easy.

i watched netstat while the ship sank. netstat on plosone servers reported >200 sockets open, while netstat on plosweb servers reported ~60 sockets open. so for some reason sockets are getting orphaned on plosone.

once things are restarted, i'm sure you all will be able to see the same behavior.

since restarting plosone servers closes all the connections, i'm going to restart each plosone server on alternating days, running the wget script after each restart.

this is just a temporary fix. we need a real fix soon. in addition to the max/min processors, there are a number of timeout parameters for mod_jk's worker.properties file which might make a difference, but i'm not sure which ones apply.

03/14/07 10:13:49 changed by russ

another crash this AM. in the process of restarting.

ronald suggests adding connectionTimeout="3600000" to the ajp connector in server.xml

i'm doing this and eric's fix for the logging issue before restart.

there are also some mod_jk timeout flags that are supposed to fix the "hung tomcat" problem - sound familiar? however, i don't know what a reasonable value for these variables are.

see: http://tomcat.apache.org/connectors-doc/reference/workers.html and search for hung

03/15/07 07:54:25 changed by russ

more research on the mod_jk workers.properties issue suggests that mod_jk needs timeouts to match the AJP connector time out, so i added those as well yesterday PM.

as of now, the open socket numbers are very close on both sets of web-plosone servers, and we had no reported outages overnight.

i think our timeout (1 hour) is a little long, and could still result in a clog if we were under really heavy load (which hasn't happened yet) depending on the average lifetime of an apache child.

here's hoping this is resolved with the timeouts

03/15/07 16:33:11 changed by russ

  • status changed from new to closed.
  • resolution set to fixed.

i'm resolving this provisionally, as plosone was not the culprit in todays crash.

03/16/07 06:28:17 changed by ronald

  • status changed from closed to reopened.
  • resolution deleted.

I'm reopening as the underlying problem hasn't been fixed, namely that netstat is showing active connections on plosone01 which have no counterpart plosweb01 (and visa versa), and ip_conntrack is showing yet another slightly different list. My current suspicion has to do with ip_conntrack timing out entries that are still active.

(follow-up: ↓ 21 ) 03/26/07 12:37:07 changed by stevec

  • owner changed from stevec to ronald.
  • status changed from reopened to new.

reopened it, but didn't assign it to yourself. are you still keeping tabs on this?

(in reply to: ↑ 20 ) 03/26/07 23:30:02 changed by ronald

  • status changed from new to assigned.

Replying to stevec:

reopened it, but didn't assign it to yourself. are you still keeping tabs on this?

Sorta.

04/06/07 13:19:43 changed by russ

things seem to be working nicely, now that we're not using ip_conntrack for connections to and from port 8009.

i've been doing some spot checks and the connections match on both sides.

i've documented the issue in our wiki.

i'm happy to stop investigating here if you are - things have been stable on the plosone side for a number of weeks now.

04/09/07 09:59:53 changed by russ

  • status changed from assigned to closed.
  • resolution set to fixed.

04/09/07 15:39:53 changed by ronald

It would be really nice to know why exactly ip_conntrack was having problems. Unfortunately, I currently don't have the time to chase this down further.

08/07/07 16:25:51 changed by

  • milestone deleted.

Milestone Bugs deleted