Ticket #475 (closed defect: fixed)

Opened 2 years ago

Last modified 1 year ago

mulgara crash - all pages timing out

Reported by: russ Assigned to: russ
Priority: critical Milestone:
Component: topaz Version: 0.7
Keywords: Cc:
Blocking: Blocked By:

Description

i came home around 10:30 to find nagios alerts waiting for me.

plosone was unresponsive. lots of timeouts, connection refused, etc. in plosone.log

in mulgara.log, nothing since 8pm.

[root@ploskow01 plos]# /sbin/service mulgara status
mulgara (17165) is running from /usr/local/topaz/mulgara
  PID USER      RSS %MEM     ELAPSED     TIME %CPU COMMAND
17165 topaz    1886768 11.4 05:44:24 00:21:31  6.2 java
java    17165 topaz    5u  IPv6 6357470       TCP *:9091 (LISTEN)
java    17165 topaz   11u  IPv6 6357512       TCP 127.0.0.1:9291 (LISTEN)
java    17165 topaz   13u  IPv6 6357522       TCP *:60618 (LISTEN)
[root@ploskow01 plos]# tail /var/log/topaz/mulgara.log
2007-07-10 19:54:06,404 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:05:08,756 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:05:08,757 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:05:08,757 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:08:09,380 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:09:09,583 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:10:09,784 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:10:09,784 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:11:09,987 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
2007-07-10 20:11:09,987 INFO  DatabaseSession> Closing session [Finalizer org.mu
lgara.resolver.DatabaseSession]
[root@ploskow01 plos]# /sbin/service mulgara stop
Stopping mulgara (17165)
[root@ploskow01 plos]#

after a mulgara restart, everything is chipper.

i understand that not much has changed in mulgara for 0.7, and it does feel like a network issue. certainly, things have changed now that plosone talks to mulgara directly.

it feels like the good old days all over again.

i'm editing iptables on mulgara and plosone boxes to accept all traffic from other servers in the stacks (previously i was using -m state --state NEW, and then accepting all ESTABLISHED and RELATED traffic, but i know we had some ipconntrack issues in the past before we figured out the right AGP settings...)

if someone could look over the tomcat settings with a fine toothed comb i'd appreciate it.

also, please tell me what kind of logging you want when these kinds of things happen.

thanks!

Dependency Graph

Change History

07/10/07 23:04:43 changed by amit

  • owner changed from pradeep to ronald.
  • milestone changed from Bugs to 0.71.

Russ, #472 is already open on this, so please add to that. Opening individual tickets about the same problem causes problems here..:) I will leave this open, but please add to 472 instead of this with regards to Mulgara issues showing this type of behaviour.

07/11/07 03:08:52 changed by ronald

Russ, please, please, please get a thread dump if things are hung. I've said this over and over again. Also, for good measure the output of netstat might help.

As to why things were hung, I can only see parts, but it's clear that one operation got the write lock and never released it, so everybody else was stuck waiting for it (149 folks, at which point tomcat hit the max-threads limit).

Following the last write lock, that process then issued a rollback shortly after, which should have released the write lock, but it doesn't look like it.

OK, how did we get here? plosone02 attempted to do a query, which in turn first triggered a begin-transaction. On mulgara this acquired the write-lock, but then failed with "Unable to begin transaction" in beginTransactionalBlock() - the underlying exception unfortunately is only logged at level debug, so we might want to up the log level to debug for org.topazproject.mulgara.ws.ItqlInterpreterBeanWrapper.

Now the following looks like a bug in Mulgara's transaction stuff: the exception does not cause the write-lock to be released. However, at this point the autoCommit flag hasn't been set to false (i.e. the transaction marked as started), so the subsequent rollback issued by us is rejected with a "Attempt to rollback transaction outside transaction". And bingo, we're stuck. The only way to release the lock now is to close the database-session.

How to fix? For one we should know why the transaction fails to begin in the first place - see the log-level comment above to track this down. Second, in light of this mulgara bug we can try to discard and close the database session on errors from rollback (our wrappers make sure that rollback is called in case of any error elsewhere). Third, we should probably upgrade mulgara...

And lastly, why is this happening with the switch to 0.7? Not sure, but my most likely guess is that we just didn't have the begin-transaction failures before and therefore didn't run into this bug. Alternatively, I think we may have been more aggressive at tossing away (closing) database-sessions on errors.

07/11/07 05:35:05 changed by ronald

Forgot to mention: [3149] address the connection/session close on failed rollback.

(follow-up: ↓ 6 ) 07/11/07 09:59:23 changed by russ

sorry about the lack of thread dumps :(

i don't see a logger named org.topazproject.mulgara.ws.ItqlInterpreterBeanWrapper? in /usr/local/topaz/mulgara/common/classes/log4j.xml.

should i add one, or do i want to be editing a different file?

(follow-up: ↓ 11 ) 07/11/07 10:00:54 changed by russ

i wanted to note the differences between this crash and #472

in #472, the crash happens as a result of some other action - publishing an article for example - and only fetchArticle.action fails.

in this crash, it happened at random and all pages broken.

it could definitely be the same issue though, and we can close this as a dupe if you like.

(in reply to: ↑ 4 ) 07/11/07 10:24:26 changed by amit

Replying to russ:

sorry about the lack of thread dumps :( i don't see a logger named org.topazproject.mulgara.ws.ItqlInterpreterBeanWrapper? in /usr/local/topaz/mulgara/common/classes/log4j.xml. should i add one, or do i want to be editing a different file?

You will have to add one. I also strongly recommend getting more familiar with log4j.xml and what all you can do with it. It will help in the long term.

07/11/07 10:42:00 changed by russ

working on it :)

i've added the following to log4j.xml and restarted mulgara:

  <logger name="org.topazproject.mulgara.ws.ItqlInterpreterBeanWrapper">
    <level value="debug" />
  </logger>

07/11/07 14:08:26 changed by russ

  • summary changed from total and complete mulgara crash to mulgara crash - all pages timing out.

07/11/07 15:33:55 changed by ronald

Thanks for logging. Aha! Now we see

Caused by: javax.transaction.NotSupportedException: Nested transactions not supported

Ok, as an (intended) side-effect we also now have the database-session instances being logged, i.e. we can see how those are being (re)used in the different threads. And lo and behold, there's obviously something wrong: one thread has a given dbs instance and is processing a longish query, and suddenly another thread is given the same dbs, which then leads to errors...

From this I can confirm we're either somehow re-using connections (i.e. dbs's) on the client, or tomcat (or axis) are getting their cookies and http-sessions confused. More on that after further analysis.

07/11/07 18:25:12 changed by ronald

Latest update: I think I've found the bug: due to a slight omission in the configuration we end up sharing otm session instance between between all requests in a given http-session, and if two requests happen at the same time for the same http-session then we get effectively get the nested transactions problem. And the plosone logs confirm that we're seeing two such requests when the problem occurred (at least the one at 13:43:04).

(in reply to: ↑ 5 ) 07/11/07 18:33:36 changed by ronald

Replying to russ:

i wanted to note the differences between this crash and #472 in #472, the crash happens as a result of some other action - publishing an article for example - and only fetchArticle.action fails. in this crash, it happened at random and all pages broken. it could definitely be the same issue though, and we can close this as a dupe if you like.

No, I don't think the two are the same. At last not in #472's original form (stuff timing out but then things fixing themselves).

07/12/07 10:35:02 changed by russ

i had the rare pleasure of being awoken by SMS alerts at 4am and restarting mulgara.

i got a thread dump for you.

let me know if you'd like me to configure nagios to send you all SMS messages as well

i think we're averaging about 2 hours of downtime per day since the 0.7 upgrade, or ~92% uptime :(

07/12/07 20:34:48 changed by amit

Closed in r3170.

07/14/07 13:16:46 changed by amit

  • owner changed from ronald to russ.

Emailed the patch to R&R.

07/16/07 11:34:14 changed by russ

  • status changed from new to assigned.

07/19/07 11:09:04 changed by amit

Any update on this?

07/20/07 17:33:59 changed by russ

  • status changed from assigned to closed.
  • resolution set to fixed.

looks like patch has fixed.

10/29/07 21:13:41 changed by

  • milestone deleted.

Milestone 0.71 deleted