Ticket #288 (closed defect: duplicate)

Opened 2 years ago

Last modified 1 year ago

during backup this morning, a hang. after stack restart, site error on most article pages

Reported by: russ
Assigned to: somebody
Priority: critical
Milestone:
Component: topaz
Version:
Keywords: crash block flakiness
Cc:
Blocking:
Blocked By:

Description

i guess i'll restart the stack again and pray.

it seems like things are getting flakier and flakier in fedora/mulgara land...

if this is really restart related, the quickest solution might be to develop a backup strategy that doesn't require restarts...

Change History

02/21/07 08:02:35 changed by russ

are we caching site errors, so that if the first call to an article gives an error we never try again?
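A minimal sketch of the behaviour this question is about, not the actual topaz caching code: a cache that stores only successful results, so a failed first call to an article gets retried on the next request. The names `SuccessOnlyCache`, `get` and `loader` are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Hashable, TypeVar

V = TypeVar("V")


class SuccessOnlyCache:
    """Caches loader results, but never caches a failure."""

    def __init__(self) -> None:
        self._entries: Dict[Hashable, object] = {}

    def get(self, key: Hashable, loader: Callable[[], V]) -> V:
        if key in self._entries:
            return self._entries[key]  # cache hit: loader is not called
        value = loader()  # if this raises, nothing is stored for `key`
        self._entries[key] = value  # only reached when the load succeeded
        return value
```

If the real cache instead stored the error page under the article key, every later request would see the stale error until the entry expired or was flushed, which would match "most article pages" staying broken after the restart.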

02/21/07 08:13:09 changed by russ

on my first restart, while i was waiting for the home.action page to build and cache itself, i saw the following errors in topaz.log, at least one of which is mulgara related.

are we hitting that mulgara heap bug again?

2007-02-21 07:33:04,749 INFO ArticleServicePortSoapBindingImpl> method getArticleInfos threw an exception [http-8008-Processor23 org.topazproject.ws.article.ArticleServicePortSoapBindingImpl] AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode:
 faultString: org.mulgara.query.QueryException: Couldn't build query
 faultActor:
 faultNode:
 faultDetail:
    {http://xml.apache.org/axis/}hostname:ploskow01.localdomain

2007-02-21 07:43:13,078 ERROR Message> java.io.IOException: [http-8008-Processor23 org.apache.axis.Message] ClientAbortException: java.net.SocketException: Broken pipe

02/21/07 08:51:08 changed by amit

02/21/07 08:58:53 changed by russ

yes, that's the correct hostname for our mulgara server. we named it when we were still using kowari.

02/21/07 09:58:46 changed by rich

out of memory errors in mulgara.out since restart:

Feb 21, 2007 8:09:33 AM org.apache.catalina.startup.Catalina start
INFO: Server startup in 851 ms
Exception in thread "Multicast Server Thread" java.lang.NullPointerException
    at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.createCachePeersPayload(MulticastKeepaliveHeartbeatSender.java:138)
    at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.run(MulticastKeepaliveHeartbeatSender.java:107)
Exception in thread "RMI RenewClean-[192.168.66.18:48231,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@1d4c0]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[192.168.66.18:48362,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@1d4c0]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[192.168.66.18:48311,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@1d4c0]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[192.168.66.17:48636,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@1d4c0]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[192.168.66.17:47911,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@1d4c0]" java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[192.168.66.17:48029,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@1d4c0]" java.lang.OutOfMemoryError: Java heap space
Feb 21, 2007 8:46:09 AM org.apache.tomcat.util.threads.ThreadPool$ControlRunnable run
SEVERE: Caught exception (java.lang.OutOfMemoryError: Java heap space) executing org.apache.tomcat.util.net.LeaderFollowerWorkerThread@e8bac45, terminating thread
Exception in thread "RMI RenewClean-[192.168.66.17:49704,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@1d4c0]" java.lang.OutOfMemoryError: Java heap space
Feb 21, 2007 8:49:19 AM org.apache.coyote.http11.Http11Processor process
SEVERE: Error processing request
java.lang.OutOfMemoryError: Java heap space
Exception in thread "RMI RenewClean-[192.168.66.17:49733,net.sf.ehcache.distribution.ConfigurableRMIClientSocketFactory@1d4c0]" java.lang.OutOfMemoryError: Java heap space

02/21/07 12:40:24 changed by ronald

The out-of-memory errors are random - we've seen quite a few of them lately: just after midnight today, last Monday, last Saturday, etc. I'm still convinced this isn't restart related - today it didn't happen till 20 minutes after the restart.

Also, after the 5 o'clock restart things seem to have run fine for 35 minutes, at which point we got a bunch of "Stale resolvers found", and a few minutes later things seem to have stopped. So again, I'm not sure this is restart related, unless it's the extra load after restart (see #289).

02/21/07 15:44:01 changed by russ

  • status changed from new to closed.
  • resolution set to duplicate.

i think we have enough tickets open for this issue, including #265, so i'm closing this one.

the site was restored after disabling the webheads, starting up verrrry slowly, and rebuilding the cache on one stack before starting the other.

we're going to try doing our backup/restarts without shutting down the plosone services, so that we don't need to rebuild the cache. that should reduce the load after restart, and hopefully will make this issue less likely to occur until we get a real fix.

08/07/07 16:25:51 changed by

  • milestone deleted.

Milestone Bugs deleted