Ticket #321 (closed defect: fixed)

Opened 2 years ago

Last modified 1 year ago

ecache out of sync between stacks

Reported by: russ Assigned to: pradeep
Priority: high Milestone:
Component: topaz Version:
Keywords: xacml cache Cc:
Blocking: Blocked By:

Description

interesting!

an article published yesterday was working on stack 2, but showing a site error on stack 1.

in the log for topaz01, we discovered that topaz01 was denied access to the article, due to deny-inactive.

this article was ingested yesterday. steve suggested that the XACML cache on topaz01 didn't pick up on the fact that the article was published.

a restart of topaz01 fixed this problem (presumably its cache was dropped and it rebuilt from the other members)

2007-04-04 10:25:47,513 INFO  DenyBiasedPEP> 'deny-inactive' denies 'anonymous:user/' to do 'annotations:listAnnotations' on 'info:doi/10.1371/journal.pone.0000348' [http-8008-Processor273 org.topazproject.xacml.DenyBiasedPEP]

it would be nice to know why the caches got out of sync (or what happened, if that's not the case) in this instance.

thanks!

Dependency Graph

Change History

04/04/07 10:42:38 changed by amit

  • owner changed from somebody to pradeep.
  • priority changed from medium to high.

Yes, it would be nice to know. Given our architecture, losing a cache message can be problematic over multiple stacks. Assigining to Pradeep and increasing the priority at least till we put some debug statements to try and figure this out.

(in reply to: ↑ description ) 04/04/07 12:08:30 changed by pradeep

Replying to russ:

interesting! an article published yesterday was working on stack 2, but showing a site error on stack 1.

The ingest was done on stack 1 at 3:02PM on 4/3. The article state was changed to active at 3:21PM on 4/3 on stack 1.

in the log for topaz01, we discovered that topaz01 was denied access to the article, due to deny-inactive.

listAnnotations was denied access on both topaz01 and topaz02. The article service permissions are fine for both topaz01 and topaz02.

this article was ingested yesterday. steve suggested that the XACML cache on topaz01 didn't pick up on the fact that the article was published.

Article was ingested on topaz01. It was the Annotation service that didn't know that the article state had changed to 'active'.

a restart of topaz01 fixed this problem (presumably its cache was dropped and it rebuilt from the other members)

Looks like topaz02 requires a restart also. Please go ahead and do that now.

[snip]

it would be nice to know why the caches got out of sync (or what happened, if that's not the case) in this instance.

It is strange that the Article service on both topaz01 and topaz02 was updated by the mulgara trigger - but not the Annotation service. So doesn't look like a network issue. Will keep you posted.

(follow-up: ↓ 5 ) 04/04/07 12:11:43 changed by russ

this is happening again!

there is a single annotation on article 318.

it comes up on stack 2, but topaz01 is throwing a XACML error:

2007-04-04 12:08:50,242 INFO  DenyBiasedPEP> 'deny-anonymous' denies 'anonymous:user/' to do 'annotations:getAnnotationInfo' on 'info:doi/10.1371/annotation/505' [http-8008-Processor19 org.topazproject.xacml.DenyBiasedPEP]

i'm going to leave this alone for now, in case it's helpful for you to poke at it. please let me know if you still need either topaz server restarted and i will do when i'm back from lunch...

(follow-up: ↓ 6 ) 04/04/07 13:42:07 changed by ronald

I won't rule out a bug in cache-invalidation trigger rules on mulgara. Pradeep, it would be good if you could review those (2nd pair of eyes).

(in reply to: ↑ 3 ) 04/04/07 13:47:34 changed by pradeep

Replying to russ:

this is happening again! there is a single annotation on article 318. it comes up on stack 2, but topaz01 is throwing a XACML error:

Permission grants on annotation/505 did not propagate to Annotation service on topaz01. The grants propagated fine to Annotation service on topaz02. (btw: The grants were applied to Permission service on topaz01.) This can happen if the loop-back interface is not set as Multicast. Please check. (I don't have root access to check this).

As I mentioned earlier for article 348, Annotation service on topaz02 is out of sync and needs to be restarted. For annotation/505, the Annotation service on topaz01 is out of sync and needs to be restarted.

(in reply to: ↑ 4 ; follow-up: ↓ 7 ) 04/04/07 13:57:02 changed by pradeep

Replying to ronald:

I won't rule out a bug in cache-invalidation trigger rules on mulgara. Pradeep, it would be good if you could review those (2nd pair of eyes).

The thing is Article Service on both topaz01 and topaz02 were aware of the article state change. That is only possible if the trigger worked. Annotation service on both topaz01 and topaz02 are out of loop for some reason.

Is it possible to get a list of open ports used by EhCache? for RMI as well as TCP/UDP on topaz01, topaz02 and mulgara?

(in reply to: ↑ 6 ) 04/04/07 13:58:45 changed by pradeep

Replying to pradeep:

Replying to ronald:

I won't rule out a bug in cache-invalidation trigger rules on mulgara. Pradeep, it would be good if you could review those (2nd pair of eyes).

The thing is Article Service on both topaz01 and topaz02 were aware of the article state change. That is only possible if the trigger worked. Annotation service on both topaz01 and topaz02 are out of loop for some reason. Is it possible to get a list of open ports used by EhCache? for RMI as well as TCP/UDP on topaz01, topaz02 and mulgara?

Oops I meant Multicast and unicast ports opened.

04/04/07 15:06:49 changed by pradeep

Suggest we enable net.sf.ehcache logging at debug level on topaz and mulgara. This should shed some light into what is going on. log4j.xml for topaz is located at /usr/local/topaz/topaz/common/classes and for mulgara is located at /usr/local/topaz/mulgara/common/classes. The entry to be added is:

 <logger name="net.sf.ehcache">
   <level value="debug">
 </logger>

04/04/07 15:26:11 changed by russ

Mutlicast

loopback is set to multicast on both topaz01 and topaz02.

both topaz servers iptables are configured to accept all traffic on the loopback interface. this is the first rule in the table, so none of the multicast related iptables rules should be affecting this.

Recent Changes

the only recent change that i'm aware of is the patch that ronald provided to fix the mulgara heap space bug.

Open Ports

can you give me some guidance on getting the list of open ports for ehcache?

Logging

i'll make the logging change before restarting the two topaz stacks (serially so as not to lose the cache). once they are both up and running i'll do the same for mulgara.

(follow-ups: ↓ 12 ↓ 13 ) 04/04/07 15:38:05 changed by russ

it's necessary to close the level tag correctly:

<logger name="net.sf.ehcache">
   <level value="debug" />
 </logger>

this is up on topaz02, and there are a lot of interesting log entries about no route to host for various (pingable) ipaddresses etc.

04/04/07 15:44:40 changed by russ

all three services have been restarted with the new logging rules

(in reply to: ↑ 10 ) 04/04/07 17:30:22 changed by pradeep

Replying to russ:

it's necessary to close the level tag correctly:

Sorry. Thanks.

{{{ <logger name="net.sf.ehcache"> <level value="debug" /> </logger> }}} this is up on topaz02, and there are a lot of interesting log entries about no route to host for various (pingable) ipaddresses etc.

Will take a look.

(in reply to: ↑ 10 ; follow-up: ↓ 14 ) 04/04/07 19:39:02 changed by pradeep

Replying to russ:

this is up on topaz02, and there are a lot of interesting log entries about no route to host for various (pingable) ipaddresses etc.

This is the reason for caches going out of sync. I tried a telnet with one of the log messages:

topazdev@plostopaz02 topaz]$ telnet 192.168.66.17 59360
Trying 192.168.66.17...
telnet: connect to address 192.168.66.17: No route to host
telnet: Unable to connect to remote host: No route to host

Is there some firewall rule preventing this?

(in reply to: ↑ 13 ; follow-up: ↓ 15 ) 04/05/07 09:53:29 changed by russ

Replying to pradeep:

Is there some firewall rule preventing this?

absolutely! 59360 is not on our list of required ports. can you tell me which machines need to accept traffic on this port?

it seems odd to me - this port has never been open. we haven't been restarting the topaz servers nightly for at least a month now - and we're just beginning to have this problem?

can you give me a complete list of which ports need to be open on which machines?

(in reply to: ↑ 14 ) 04/05/07 11:21:08 changed by pradeep

Replying to russ:

Replying to pradeep:

Is there some firewall rule preventing this?

absolutely! 59360 is not on our list of required ports. can you tell me which machines need to accept traffic on this port? it seems odd to me - this port has never been open. we haven't been restarting the topaz servers nightly for at least a month now - and we're just beginning to have this problem? can you give me a complete list of which ports need to be open on which machines?

The port is dynamically given by the OS. So I don't think there is any way to control that. So basically allowing mulgara, topaz01, topaz02, and fedora01 to talk to each other on any TCP port is how the firewall should be configured.

(follow-up: ↓ 18 ) 04/05/07 14:11:33 changed by russ

okay, i can do that and i'll update our install instructions.

however, i'm still confused.

nothing has changed in iptables recently, and we're just now having this problem. is there any reason this has suddenly started to happen?

can you tell me what data is sent over regular channels and what goes over multicast? since the port is dynamically allocated when the service starts, how does the discover happen? (just curious)

thanks!

thanks!

(follow-up: ↓ 19 ) 04/05/07 14:32:26 changed by russ

wow, two thanks in that last comment.

okay, i get it now. before ronald's patch, we had to restart mulgara ALL THE TIME. we still restart mulgara every night. perhaps restarting mulgara fixes the problem, so it was unusual for anyone to notice it - since it only affected new annotations, or new articles, and we really were restarting mulgara a lot.

with the new iptables rules, i see lots of happy debug statements in the topaz logs. i wonder if this explains some of the slowness eric has seen on our multi-stack install vs. yours.

let me know when i can remove the debug logging.

thanks! (just one)

(in reply to: ↑ 16 ) 04/05/07 15:19:15 changed by pradeep

Replying to russ:

okay, i can do that and i'll update our install instructions. however, i'm still confused. nothing has changed in iptables recently, and we're just now having this problem. is there any reason this has suddenly started to happen?

At plosone launch, we had the same issue and you had updated the firewall rules to allow these unicast tcp communication between the ehcache peers. So something must have changed after that.

can you tell me what data is sent over regular channels and what goes over multicast? since the port is dynamically allocated when the service starts, how does the discover happen? (just curious)

The way it works is that multicast is used for peer discovery. The heart beat messages that each peer sends over multicast contains a URL where it is listening for cache updates. So the rcving peer simply adds the URL to its list of event listeners and notifies all on any cache update.

(in reply to: ↑ 17 ) 04/05/07 15:22:16 changed by pradeep

Replying to russ:

wow, two thanks in that last comment. okay, i get it now. before ronald's patch, we had to restart mulgara ALL THE TIME. we still restart mulgara every night. perhaps restarting mulgara fixes the problem, so it was unusual for anyone to notice it - since it only affected new annotations, or new articles, and we really were restarting mulgara a lot.

This is probably the case. Also because of mulgara exceptions etc. we may not have noticed these in the log.

with the new iptables rules, i see lots of happy debug statements in the topaz logs. i wonder if this explains some of the slowness eric has seen on our multi-stack install vs. yours.

Great!

let me know when i can remove the debug logging.

If things are working correctly as expected, you can remove the debug logging.

thanks! (just one)

:)

04/06/07 13:57:25 changed by russ

  • status changed from new to closed.
  • resolution set to fixed.

08/07/07 16:25:51 changed by

  • milestone deleted.

Milestone Bugs deleted