Ticket #995 (closed clarification: fixed)

Opened 4 months ago

Last modified 2 months ago

article-state cache key is not cleared when article is published

Reported by: russ Assigned to: russ
Priority: critical Milestone: 0.9.1
Component: ambra Version: 0.9-rc1
Keywords: Cc:
Blocking: Blocked By:

Description (Last modified by pradeep)

if anyone tries to access an article before its published, no one can access the article ever.

this is an old bug that seems to have risen from the dead.

here's how to reproduce:

  1. ingest an article
  2. try to view the article as an anonymous user - you'll get a site error
  3. publish the article.
  4. try to view the article as an anonymous user - you'll STILL get a site error
  5. clear the article-state cache
  6. now you can view the article as an anonymous user.

Note that #984 is marked as a duplicate of this. That was related to permit-admin cache.

Dependency Graph

Attachments

mulgara.trace.log.gz (2.0 kB) - added by russ on 07/30/08 11:13:44.
ambra.trace2.log.gz (18.1 kB) - added by russ on 08/12/08 15:45:56.
ambra log for comment 10
mulgara.trace2.log.gz (29.3 kB) - added by russ on 08/12/08 15:46:10.
mulgara log for comment 10

Change History

07/23/08 11:16:55 changed by rich

  • owner changed from rich to russ.

Please verify on production that it's not an iptables or firewall issue.

07/23/08 17:11:14 changed by russ

initially reproduced on single boxes, but pradeep helped solve firewall problems here.

cannot reproduce on test multi-stack instance.

continues to happen the vast majority of the time on the production instance.

so, something related to traffic, or ehcache/overflowToDisk? or i still have firewall misconfiguration on production?

07/26/08 21:52:27 changed by amit

Please use wireshark to narrow the problem down.

07/30/08 11:13:12 changed by russ

  • owner changed from russ to ronald.

wireshark is the tool of last resort :)

i confirmed from mulgara logs (with org.topazproject.mulgara.resolver.CacheInvalidator? at debug) that mulgara is not sending invalidations when the topaz01 server is open to the public.

if topaz01 is isolated from public traffic, then i reliably see invalidations from mulgara on each delete, ingest, and publish action on topaz01.

if topaz01 is serving public traffic, then i rarely see invalidations on delete, and i usually see invalidations on ingest and publish (but not always).

if invalidation fails on two actions in the delete-ingest-publish cycle, then anon users are either denied access to published articles, or allowed access to unpublished articles.

i cannot reliably reproduce this, but i will upload the same log snippet from mulgara that i posted to the list, in which mulgara does not send invalidations on delete.

perhaps the resolver gets confused when multiple connections/transactions are open from the same IP address?

07/30/08 11:13:44 changed by russ

  • attachment mulgara.trace.log.gz added.

08/01/08 09:18:44 changed by pradeep

  • description changed.

(follow-up: ↓ 8 ) 08/04/08 00:06:47 changed by ronald

  • status changed from new to assigned.

I can't reproduce the original bug description on a single-box stack.

Also, looking at the provided log, I do see invalidations for the article-state cache:

2008-07-29 14:03:40,410 DEBUG CacheInvalidator> Invalidating key 'info:doi/10.1371/journal.pone.0000000.g001' in cache 'article-state' [CacheInvalidator-Worker org.topazproject.mulgara.resolver.CacheInvalidator]
...
2008-07-29 14:03:40,411 DEBUG CacheInvalidator> Invalidating key 'info:doi/10.1371/journal.pone.0000000' in cache 'article-state' [CacheInvalidator-Worker org.topazproject.mulgara.resolver.CacheInvalidator]
...
2008-07-29 14:03:40,427 DEBUG CacheInvalidator> Invalidating key 'info:doi/10.1371/journal.pone.0000000.g001' in cache 'article-state' [CacheInvalidator-Worker org.topazproject.mulgara.resolver.CacheInvalidator]
...
2008-07-29 14:04:08,933 DEBUG CacheInvalidator> Invalidating key 'info:doi/10.1371/journal.pone.0000000.g001' in cache 'article-state' [CacheInvalidator-Worker org.topazproject.mulgara.resolver.CacheInvalidator]
2008-07-29 14:04:08,933 DEBUG CacheInvalidator> Invalidating key 'info:doi/10.1371/journal.pone.0000000' in cache 'article-state' [CacheInvalidator-Worker org.topazproject.mulgara.resolver.CacheInvalidator]

Russ, I need to see the mulgara updates being sent too - either from the ambra logs (e.g. with org.topazproject.mulgara.itql.TIClient at level debug) or from the mulgara logs (e.g. with org.mulgara.resolver.DatabaseSession at level trace).

A note on caching and queueing: updates queued and the queue periodically processed. This means you may not see the invalidations happen immediately (though normally they happen fairly quickly as the current config specifies the queue to be processed every 100ms).

Also, query results are cached, so that only one query instead of two needs to be run, and so that write operations are not delayed unnecessarily. However, this introduces a potential problem under the following sequence of events, as demonstrated on the article state:

  1. the ambra cache has no entry in the article-state cache for the article
  2. the article's state is public
  3. in a first transaction the article state is set to private
  4. a request forces the load of the article-state cache entry for the article (which will see the article as private)
  5. in a second transaction the article state is set to public
  6. the last three operations happen so quickly that the queued update handling from the first transaction has not had a chance to be processed

In this scenario no update would be sent, because by the time the queued update is processed the state will have been reset to it's original value and therefore it looks like no change happened.

But this doesn't seem to be the problem here.

08/04/08 01:45:05 changed by amit

  • owner changed from ronald to russ.
  • status changed from assigned to new.

(in reply to: ↑ 6 ) 08/04/08 11:24:56 changed by russ

Replying to ronald:

I can't reproduce the original bug description on a single-box stack.

neither can i. it only happens on multi-box, in production, with public traffic on the topaz server

Also, looking at the provided log, I do see invalidations for the article-state cache:

yes, those correspond to the publish and ingest actions. only the delete (the first action) failed to trigger invalidations.

since the ambra log on topaz01 had public traffic, and thus is rather large and congested, and you have access to it, i didn't pull out the corresponding topaz log snippet. i will provide a corresponding ambra log on my next run.

Russ, I need to see the mulgara updates being sent too

will do

A note on caching and queueing: updates queued and the queue periodically processed. This means you may not see the invalidations happen immediately (though normally they happen fairly quickly as the current config specifies the queue to be processed every 100ms).

hmm. i will wait some time between actions to ensure that it's not a queue issue on the next run.

08/04/08 11:25:18 changed by russ

  • status changed from new to assigned.

08/12/08 15:43:01 changed by russ

  • status changed from assigned to new.

i think i have what you need.

i upped org.mulgara.resolver.DatabaseSession? to trace and re-ran this experiment.

i deleted, ingested, then published pone.0000000.

i waited over 1s between each action.

in the mulgara log, from the timings, it looks like only the publish action resulted in invalidations from mulgara.

i'm attaching the mulgara and ambra logs for the time period in question. in the ambra log, you'll see a deleteArticle.action, an ingestArchives.action, and finally a publishArchives.action. we're using ntpd, so the times should correlate pretty tightly.

i don't believe that any other users were in the admin during this experiment.

08/12/08 15:45:56 changed by russ

  • attachment ambra.trace2.log.gz added.

ambra log for comment 10

08/12/08 15:46:10 changed by russ

  • attachment mulgara.trace2.log.gz added.

mulgara log for comment 10

08/12/08 15:46:30 changed by russ

  • owner changed from russ to ronald.

08/18/08 03:38:40 changed by ronald

  • owner changed from ronald to russ.

Thanks for the logs. I'm pretty sure I've found the bug, in mulgara, and fixed it in MulgaraSvn:1146 - a new mulgara snapshot has been uploaded to our maven repository, so rebuilding the mulgara-service jar and topaz-mulgara rpm should give you the fixes.

Regarding what exactly the problem was: the CacheInvalidator relies on proper calls to its XAResource object in order to know which transaction is currently active in any given thread. This is used to maintain a list of changes for each transaction, which is then processed on a commit or discarded on a rollback. However, Mulgara was not making the proper calls to XAResource, so that changes could get associated with the wrong transaction (including an already completed transaction) or with no transaction at all (the "null" transaction). This in turn meant that changes could effectively get dropped or delayed for arbitrarily long times.

While this should theoretically occur on any setup, it was effectively a race-condition (depending on how much traffic Mulgara is getting) and hence may only show up under the right circumstances; and this explains why you didn't see the problem on production when you took ambra offline. I tried reproducing the problem on a single-box install with artificial additional Mulgara traffic, but was not able to. So while I'm fairly certain this bug was the source of these invalidation problems, I'm going to leave this ticket open until you try it out on production.

08/18/08 07:48:17 changed by amit

  • type changed from defect to clarification.

08/18/08 12:03:10 changed by amit

  • milestone changed from 0.9.0 to 0.9.1.

Moving it to 0.9.1 as the the only way we can test it is unfortunately on PLoS production.

09/12/08 13:24:33 changed by russ

  • status changed from new to closed.
  • resolution set to fixed.

afaict, this is fixed in 0.9 final release.