Ticket #265 (closed defect: fixed)

Opened 2 years ago

Last modified 1 year ago

mulgara heap space memory bug

Reported by: russ Assigned to: rich
Priority: medium Milestone:
Component: infrastructure Version:
Keywords: restart mulgara Cc:
Blocking: Blocked By:

Description (Last modified by russ)

on monday morning's backup, the site was broken after restart, but apparently resolved itself eventually

or perhaps rich restarted the stack and didn't tell me.

in any case, after the restart (around 5am monday morning), my wget script returned site errors for all articles. also, we had a number of reports of disappearing annotations. both of these are consistent with mulgara having problems in my experience.

as far as i can tell from the cron output, everything stopped and restarted fine.

please investigate what happened monday and advise on how to avoid this happening in the future.

Dependency Graph

Attachments

HeapMon.jar (4.4 kB) - added by ronald on 02/28/07 06:19:23.
Simple app to monitor heap usage and dump threads when it gets too large. Run with 'java -jar HeapMon? >& mon.log'. See 'java -jar HeapMon? -x' for options.
libyjpagent.so (0.7 MB) - added by ronald on 02/28/07 06:22:01.
Yourkit profiler agent library (version 6.0.12)
resolver-store-base-1.0.0.jar (88.4 kB) - added by ronald on 03/28/07 02:51:42.
Patched resolver-store jar to fix out-of-heap errors.

Change History

01/24/07 23:10:25 changed by ronald

  • status changed from new to assigned.

So mulgara temporarily went out of heap space, causing some queries to fail. Looking at the logs it's not clear why that would happen. We could try running those exact queries again to see if it's data dependent.

Some notes:

  • while the problem occurred shortly after the restart, I don't think it's really related to that because 4 minutes passed with 1366 successful queries between the start and the heap error.
  • The queries that failed were all permissions related. These queries usually take around 500 - 600 ms, but in the failure case took 25 seconds before the heap exhaustion.
  • While these queries were running, other queries started and completed successfully, including some permission queries that started before the heap exhaustion and successfully completed afterwards.
  • The heap exhaustion occurred sporadically for a few minutes only. However, for the next 8 hours some queries were failing with a 'Nested transaction not supported' error. This then stopped for 10 hours, only to resume and continue till the next restart. Our guess is that this is because of a database-session that is in a bad state after the heap exhaustion - this would explain the sporadic failures and the stop, but not the resumption after 10 hours (there are no visible itql errors in logs in this 10 hour timespan).
  • We need to enable debug-level logging for org.topazproject.mulgara.ws.ItqlInterpreterBeanWrapper - this will let us see what database sessions are reused and to clarify if the above guess is correct or not.
  • This heap space exhaustion has occurred before (specifically on the 2nd, 12th, 15th, and 18th).

However, currently we have no idea how to prevent this - expanding the heap is unlikely to help, as this seems like an issue of runaway code (indicated by the fact that these queries are running 50 times longer than normal before they exhaust the heap space).

01/25/07 18:54:20 changed by pradeep

(In [2290]) re #265. Remove a 'bad' handle from itql handle pool.

Currently a 'bad' handle is anything that threw an exception at us.

This should reduce the recurring 'Nested transaction not supported' errors. (at least for xacml pool. the WsTopazContext? pool is yet to be fixed.)

01/26/07 15:16:00 changed by ronald

Russ, could we get the backup of ploskow01:/var/lib/topaz/data from Monday morning? I made a copy of the database a couple days ago and ran the queries that had failed against it, and they worked fine. While I'm not too optimistic about being able to reproduce this problem, I would like to give it a shot. The nice thing about Monday's failure (as opposed to the earlier ones) is that it happened just a few minutes after the backup, and before any inserts or deletes were made. So this is as good a shot as we can hope for in trying to reproduce the problem (and since there were only around 1360 queries before the problem, it's even feasible to rerun the same sequence of queries).

Maybe the best place to put the backup would be somewhere on plofail01?

02/06/07 15:38:56 changed by russ

i've restored /var/lib/topaz/data from ploskow01 as of 20070122 to ~topazdev/backup/20070122 on plosfail01.

02/12/07 04:52:38 changed by ronald

Ok, I tried to reproduce the problem on the backup, but haven't been successful. I replayed the exact sequence of queries up to and past where the error first occurred, and things ran fine. I did this with 1, 2, 3, and 4 client threads. I also picked the queries that were done just before the failed ones plus the failed ones, and ran them in several threads in a loop, pounding mulgara for about 50 minutes, and no dice.

So, thanks in part to some suggestions on the mulgara-dev list, I've written a small app that uses JMX and debugger API to monitor mulgara and get thread and heap dumps if the heap usage goes over some threshold. This would need to be run, as user topaz, on ploskow01. It can be started once and left alone (it'll handle mulgara being restarted) - thread dumps are sent to standard out and heap dumps to /tmp/head.dump.<N> . For this app to work, we'll also need to add a file /etc/sysconfig/mulgara with the following contents:

export MULGARA_TOMCAT_OPTS="-Xmx400m -XX:MaxPermSize=256m -Dcom.sun.management.jmxremote.port=21212 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

This enables JMX on the server.

02/15/07 16:04:20 changed by russ

  • owner changed from ronald to russ.
  • priority changed from critical to medium.
  • status changed from assigned to new.

cool.

after i upgrade to 0.5-1 is complete, i will test this on my dev and staging servers, ask you to take a look, and then we'll go for it on live.

this hasn't happened on restart since the original event (according to my cache seeding script, which reported errors last time) so i am downgrading the severity.

02/15/07 16:05:05 changed by russ

  • status changed from new to assigned.

02/17/07 20:35:30 changed by ronald

The error has occurred several times since then: on the 25th, 26th, 27th, 2nd, 14th, 15th, and 16th. When it exactly it occurs is random, so it occurring soon after the restart on the 22nd was pure luck.

Btw., I'm uploading a new HeapMon.jar with a slightly higher default threshold than before and where things like the threshold and location of the dumps can be specified on the command line.

Also, we're looking into the JProfiler and YourKit profilers, and may want to use those - that will require copying a jar to the system and modifying the /etc/sysconfig/mulgara config.

02/20/07 14:51:56 changed by russ

  • owner changed from russ to ronald.
  • status changed from assigned to new.

i copied HeapMon?.jar to /usr/local/topaz/bin on plosfail01 and got an error.

[root@plosfail01 bin]# java -jar HeapMon?.jar -h Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException?: 1

at HeapMon?.main(HeapMon?.java:36)

[root@plosfail01 bin]# echo $JAVA_HOME /opt/jdk1.5.0_09

[root@plosfail01 bin]# whereis java java: /etc/java /usr/lib/java /usr/share/java /opt/jdk1.5.0_09/bin/java

02/21/07 12:23:30 changed by ronald

Sorry, I forgot: -h is a valid switch that expects an argument. So try `java -jar HeapMon.jar -foo' for help. Or just run it without any arguments - the defaults should be ok.

Btw., this needs to be run on the same machine and as the same user as the mulgara process, as it needs to be able to access the process directly.

02/21/07 15:39:06 changed by russ

hmm. now i seem to be having issues getting jmx enabled.

i've got the correct invocation in /etc/sysconfig/mulgara

however, when i start mulgara, and then run a ps, there are no jmx arguments on the commandline.

additionally, port 21212 is not open, and when i run HeapMon?.jar i get errors in mon.log complaining that it can't connect to JMX...

[root@plosfail01 ~]# more mon.log Heaps will be written to '/tmp/heap.dump.*' when heap usage exceeds 314572800 bytes Waiting for server to start: host = 'localhost', port = 21212 Trying to connect: jmx = service:jmx:rmi:///jndi/rmi://localhost:21212/jmxrmi, pid = 8008 java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException? [Root exception is java.rmi.ConnectException?: Connection ref used to host: localhost; nested exception is:

java.net.ConnectException?: Connection refused]

do i need to install anything to get jmx working?

02/21/07 15:41:36 changed by russ

here's what i've got in ps:

9865 pts/24 Sl 0:07 /opt/jdk1.5.0_09/bin/java -server -Xmx400m -XX:MaxPermSize=256m -Djavax.net.ssl.trustStore=/etc/topaz/keystore.jks -Djavax.net.ssl

.trustStorePassword=changeit -Djava.endorsed.dirs=/usr/local/topaz/mulgara/common/endorsed -classpath /opt/jdk1.5.0_09/lib/tools.jar:/usr/local/topaz/mulgara /bin/bootstrap.jar:/usr/local/topaz/mulgara/bin/commons-logging-api.jar -Dcatalina.base=/usr/local/topaz/mulgara -Dcatalina.home=/usr/local/topaz/mulgara -Dj ava.io.tmpdir=/usr/local/topaz/mulgara/temp org.apache.catalina.startup.Bootstrap start

02/21/07 18:26:04 changed by ronald

Aarg! Sorry. /etc/sysconfig/mulgara should be exporting mulgara_TOMCAT_OPTS (note the lowercase mulgara). I.e.

export mulgara_TOMCAT_OPTS="-Xmx400m -XX:MaxPermSize=256m -Dcom.sun.management.jmxremote.port=21212 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

(see also /etc/sysconfig/plosone for a similar entry).

02/22/07 09:56:27 changed by russ

  • description changed.
  • summary changed from on monday morning's backup, the site was broken after restart, but apparently resolved itself eventually to mulgara heap space memory bug.

perfect! i'll have this up and running on production mulgara later today. i'll write to /tmp/mon.log

02/28/07 06:16:12 changed by ronald

  • status changed from new to assigned.

Due to HeapMon's problem getting a heap-dump sometimes (see #291) I'm modifying the approach here. What we'll do is use the yourkit profiler agent which can be told to generate a heap dump when the memory usage exceeds a given threshold. We'll still run HeapMon, but a reduced version which only gets thread-dumps (via JMX) and hence should not run into the problem in #291 .

I'm attaching a new HeapMon and the profiler's shared library. In order for java to find the shared lib, you'd either have to modify '/usr/local/topaz/bin/tomcat' to set LD_LIBRARY_PATH appropriately, or (and this is probably easier) just copy the lib to '/opt/jdk1.5.0_09/jre/lib/amd64/'.

Ok, so here's a summary of what to do:

  • copy the attached libyjpagent.so to '/opt/jdk1.5.0_09/jre/lib/amd64/' . You can test by running 'java -agentlib:yjpagent=help' which should print out a help text.
  • edit '/etc/sysconfig/mulgara' to contain
       export mulgara_TOMCAT_OPTS="-Xmx400m -XX:MaxPermSize=256m -Dcom.sun.management.jmxremote.port=21212 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -agentlib:yjpagent=disablealloc,disablecounts,disablej2ee,dir=/tmp/,usedmem=75"
    
    (the -agentlib:... at the end is new).
  • copy the new HeapMon.jar somewhere and run as before (e.g. 'nohup java -jar HeapMon -p 21212 >& /tmp/mon.log &'

The heap dumps will be dumped into /tmp/Bootstrap-<date>-low-memory.memory.

02/28/07 06:19:23 changed by ronald

  • attachment HeapMon.jar added.

Simple app to monitor heap usage and dump threads when it gets too large. Run with 'java -jar HeapMon? >& mon.log'. See 'java -jar HeapMon? -x' for options.

02/28/07 06:22:01 changed by ronald

  • attachment libyjpagent.so added.

Yourkit profiler agent library (version 6.0.12)

02/28/07 16:53:06 changed by russ

mulgara restart is not working correctly with this.

however, mulgara stop; mulgara start (which is what we actually do with the nightly backups) is okay.

can you confirm that everything is running happily on plosfail01? it's okay if mulgara restart doesn't work temporarily. i just want some reassurance :)

###

i put all the files in the right places, edited sysconfig/mulgara, and restarted mulgara.

i ran heapmon as 'nohup java -jar /usr/local/topaz/bin/HeapMon -p 21212 >& /tmp/mon.log &'

then i tried /sbin/service mulgara restart. this eventually returned, however mulgara status reported "mulgara not running". then i tried mulgara start and was told "No PID file for mulgara, but port 9091 already in use by pid p14118".

i had to kill heapmon, then kill the mulgara process in order to restart mulgara.

i tried again with mulgara stop; mulgara start and things worked normally, with an error in mon.log followed by a successful reconnection.

03/03/07 03:20:20 changed by ronald

First of all, things seem to fine on plosfail01 currently. HeapMon is logging the heap sizes as expected, and the mulgara jvm is running with the correct arguments.

However, the restart problem is a bit annoying and it's unclear why that could be happening. The fact that the PID file got removed but the process was still running is doubly odd: the shell script does not remove the PID file unless the process is truly gone. I'll try again to reproduce this.

03/05/07 16:16:58 changed by russ

this us up and running on production mulgara

03/28/07 02:49:31 changed by ronald

I think I have a patch for this, but since we can't reproduce the problems outside of production would you be interested in patching production in order to test this? It involves replacing one jar file in the mulgara war (specifically /usr/local/topaz/mulgara/webapps/mulgara-service/WEB-INF/lib/resolver-store-base-1.0.0.jar)

Note that the fix itself is extremely safe: I've commented out a finalizer that was only logging some information, and that only at level debug.

03/28/07 02:51:42 changed by ronald

  • attachment resolver-store-base-1.0.0.jar added.

Patched resolver-store jar to fix out-of-heap errors.

03/28/07 12:09:46 changed by rich

Mulgara ran out of heap space again this morning. I generated a thread dump, applied the patch and restarted Mulgara.

03/28/07 13:36:32 changed by amit

  • owner changed from ronald to rich.
  • status changed from assigned to new.

Thanks Rich. I guess we wait till tomorrow morning.

03/31/07 13:17:52 changed by ronald

So far things are looking good. If things stay stable for another week then I think we can definitely close this.

04/09/07 09:59:33 changed by russ

  • status changed from new to closed.
  • resolution set to fixed.

08/07/07 16:25:51 changed by

  • milestone deleted.

Milestone Bugs deleted