Ticket #528 (closed defect: duplicate)

Opened 1 year ago

Last modified 9 months ago

Ingestion should be standalone application.

Reported by: russ Assigned to: jsuttor
Priority: medium Milestone: 0.9.0
Component: ambra Version: 0.8.2-SNAPSHOT
Keywords: admin Cc:
Blocking: Blocked By:

Description

often, during ingest, we'll start getting nagios alerts about plosone

it always resolves after the ingest is through - there's no error per se, just mulgara blocking on something for a looooong time i guess.

it's not nice to make users wait longer than 10 secs for content just because we're posting new information. there needs to be a better way to handle locking on ingest and publication.

Service Ok[2007-08-07 13:41:03] SERVICE ALERT: plosweb01;www.plosone.org;OK;HARD;3;HTTP OK HTTP/1.1 200 OK - 129003 bytes in 4.747 seconds
Service Critical[2007-08-07 13:31:03] SERVICE ALERT: plosweb01;www.plosone.org;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
Service Critical[2007-08-07 13:29:33] SERVICE ALERT: plosweb01;www.plosone.org;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
Service Critical[2007-08-07 13:28:34] SERVICE ALERT: plosweb01;www.plosone.org;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
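
A rough way to read the alert lines above: the sketch below (Python, not part of the Ambra codebase) parses them and measures the window from the first CRITICAL check to the following OK. The regular expression and the names `parse_alerts` and `outage_window` are illustrative assumptions based only on the four lines shown here.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the four alert lines in this ticket:
# [timestamp] SERVICE ALERT: host;service;state;kind;attempt;message
ALERT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] SERVICE ALERT: "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);"
    r"(?P<kind>[^;]*);(?P<attempt>\d+);(?P<message>.*)"
)

LOG = """\
Service Ok[2007-08-07 13:41:03] SERVICE ALERT: plosweb01;www.plosone.org;OK;HARD;3;HTTP OK HTTP/1.1 200 OK - 129003 bytes in 4.747 seconds
Service Critical[2007-08-07 13:31:03] SERVICE ALERT: plosweb01;www.plosone.org;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
Service Critical[2007-08-07 13:29:33] SERVICE ALERT: plosweb01;www.plosone.org;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds
Service Critical[2007-08-07 13:28:34] SERVICE ALERT: plosweb01;www.plosone.org;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
"""


def parse_alerts(text):
    """Return (timestamp, state, kind) tuples sorted oldest first."""
    alerts = []
    for line in text.splitlines():
        match = ALERT_RE.search(line)
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
            alerts.append((ts, match["state"], match["kind"]))
    return sorted(alerts)


def outage_window(alerts):
    """Return (first CRITICAL, first OK after it), or None if either is missing."""
    start = next((ts for ts, state, _ in alerts if state == "CRITICAL"), None)
    if start is None:
        return None
    end = next((ts for ts, state, _ in alerts if state == "OK" and ts > start), None)
    if end is None:
        return None
    return start, end


start, end = outage_window(parse_alerts(LOG))
print(f"{start} -> {end}: {(end - start).total_seconds():.0f} seconds")
```

For these four lines the window is 749 seconds (13:28:34 to 13:41:03). That is the time between the first failed check and the first successful one, so it is an upper bound on the visible slowdown rather than a measured downtime.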

Change History

08/07/07 14:25:06 changed by amit

  • owner changed from jsuttor to russ.

This will require a redesign and rework of ingestion. Are you sure you want this at critical level as it will take away resources from other 0.9 features?

08/07/07 15:09:01 changed by russ

  • owner changed from russ to amit.

since 0.7, we have down time every week on ingest and publication of articles and susanne has to spend hours and hours dealing with timeouts, errors, etc. and i have to spend hours handling publication.

is this going to get better or worse when we go to 0.9 and are publishing for 2 journals instead of just one?

we can make some improvements to the process by having a separate admin vhost if you all can help with that (see #521)

IMO, anything that causes consistent outages on a production site is critical and needs to be fixed now.

also, this is a piece of the topaz infrastructure that has been bad from the start - nothing good can come of waiting to fix this as we add more journals, more articles, more features.

i'll ask rich to comment - i really can't tell you whether critical bug A comes before critical feature B.

08/07/07 15:31:53 changed by rich

0.9 features and NTD launch are the priorities....

08/07/07 15:43:06 changed by amit

  • owner changed from amit to russ.

Please take a look at my #521 comment. The essence there is to ensure admin pages get redirected to a single Publishing App server. You guys probably know how to do that much better than I do. If that works, then I would prefer finishing up NTD and one good new feature for 0.9.

08/07/07 15:55:42 changed by russ

  • owner changed from russ to amit.

#521 makes it possible for me to push all the production work back on susanne - but it doesn't fix any of the underlying problems.

ingest will still cause long load times. we'll still have to take the site down to publish. it's an improvement to process in many ways but no fix for this issue.

08/07/07 16:06:24 changed by amit

  • owner changed from amit to russ.

Sorry Russ. I confused the thread. You are right. #521 will not improve the load time. The reason ingestion is taking a long time is the image transformation being done as part of the ingestion process. If that is becoming painful, we have to move the transformation out into standalone tools. Let me know if you want this in 0.9, as it requires a bit of work.

(follow-up: ↓ 8 ) 08/07/07 16:14:10 changed by russ

  • owner changed from russ to amit.

this issue is not about ingest taking a long time.

it's about the fact that, during ingest, the site is so slow that we get nagios alerts saying the site is down.

the slowness is intermittent - it doesn't happen on every article. i don't know what causes it. the behavior is similar to the other issues we've had involving mulgara locking.

i don't know if that's related to image transform or not.

i thought we were going to full command line ingest for 0.9 - i certainly would suggest that we do - but it's unclear to me whether that has any bearing on this problem since i imagine mulgara does the same thing for command line or browser based ingest.

(in reply to: ↑ 7 ) 08/07/07 16:23:23 changed by amit

  • owner changed from amit to russ.

Replying to russ:

this issue is not about ingest taking a long time. it's about the fact that, during ingest, the site is so slow that we get nagios alerts saying the site is down.

Please keep in mind that the image transformation is taking place on the server. Our logs seem to indicate that this is taking the most time and I am *assuming* image transformation is also overloading the CPUs.

the slowness is intermittent - it doesn't happen on every article. i don't know what causes it. the behavior is similar to the other issues we've had involving mulgara locking.

That might be based on the number of images within the article.

i don't know if that's related to image transform or not. i thought we were going to full command line ingest for 0.9 - i certainly would suggest that we do - but it's unclear to me whether that has any bearing on this problem since i imagine mulgara does the same thing for command line or browser based ingest.

The command line ingest has been on the wish list and not assigned a priority. Please confirm with Rich and let us know.

08/07/07 16:25:51 changed by

  • milestone deleted.

Milestone Bugs deleted

08/07/07 16:26:41 changed by russ

i'll check cpu usage on next ingest - but i think it's unlikely that this is the cause - it really feels like mulgara is blocking and imageMagick isn't that big of a hog in my experience.

i'll update after next week's ingest.

08/07/07 16:27:00 changed by russ

  • status changed from new to assigned.

(follow-up: ↓ 18 ) 08/07/07 16:32:04 changed by amit

One more thing...Mulgara allows only one write transaction at a time and others block waiting for the write to finish. If the ingestion takes a long time, it will keep a write transaction open blocking others making it appear to be slow.

08/07/07 16:35:43 changed by russ

hmm. so is it the case that mulgara opens a write before imageMagick starts, and keeps it open until imageMagick finishes?

i'll check and see how long imageMagick is taking as well...

08/07/07 16:40:13 changed by amit

That is a possibility. I don't know the precise sequence of operations in ingest.

08/07/07 16:47:25 changed by ronald

Amit, the problem is not the image-transformations. Yes, they take time, and yes, they use cpu, but they don't cause the site to block. What causes the site to block is the article cache being cleared and both plosone's having to load all articles and associated objects, which takes several minutes, and which blocks all other activity that needs to talk to mulgara. And no, moving the image-resizing and other stuff out into a standalone app will not help at all here.

The only things that will help are A) to optimize the home-page's and browse-page's queries (instead of loading all articles as now), and/or B) for mulgara to support multiple simultaneous read transactions. Another option would be to avoid blowing away the caches and instead support incremental updates of the internal data structures.

(follow-up: ↓ 19 ) 08/07/07 16:54:02 changed by russ

hmm...as i understand it the article cache is only cleared on PUBLISH, this problem is happening on INGEST, i don't think it's the same thing...

08/07/07 16:54:57 changed by russ

rich says nothing takes priority above the enhancements necessary for 0.9 - so as much as i want these things fixed soon i understand that they may or may not happen in 0.9.

(in reply to: ↑ 12 ) 08/07/07 17:03:18 changed by ronald

Replying to amit:

One more thing...Mulgara allows only one write transaction at a time and others block waiting for the write to finish. If the ingestion takes a long time, it will keep a write transaction open blocking others making it appear to be slow.

Actually, Mulgara allows only a single transaction at a time. Period. Independent of whether it's read-only or read-write.

Also, there are many transactions involved during ingest: one for the basic ingest (processing the zip, loading the mulgara, fedora, and search), and one for each resized image. The basic ingest takes about 50 seconds; the other ones are much shorter (10 seconds).
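
To make the arithmetic in the comment above concrete, here is a back-of-the-envelope model in Python (not Mulgara code). It assumes a single transaction slot as described, uses the approximate 50-second and 10-second figures quoted above, and takes the 10-second limit from the Nagios alerts in the description. The function names are hypothetical.

```python
# Back-of-the-envelope model of a store that allows one transaction at a time.
NAGIOS_TIMEOUT = 10  # seconds, from the "Socket timeout after 10 seconds" alerts
BASIC_INGEST = 50    # seconds, approximate figure quoted in the thread
IMAGE_RESIZE = 10    # seconds, approximate figure quoted in the thread


def wait_for_slot(arrival, txn_length, txn_start=0):
    """Seconds a request arriving at `arrival` waits for the running transaction."""
    txn_end = txn_start + txn_length
    if txn_start <= arrival < txn_end:
        return txn_end - arrival
    return 0


def times_out(arrival, txn_length, timeout=NAGIOS_TIMEOUT):
    """True if the wait alone exceeds the monitoring timeout."""
    return wait_for_slot(arrival, txn_length) > timeout


def blocked_seconds(txn_length):
    """Count whole-second arrival offsets inside the transaction that time out."""
    return sum(times_out(t, txn_length) for t in range(txn_length))


print("basic ingest:", blocked_seconds(BASIC_INGEST))
print("image resize:", blocked_seconds(IMAGE_RESIZE))
```

Under these assumptions, a request arriving in the first 40 seconds of the 50-second basic ingest waits longer than the 10-second timeout, while a single 10-second image transaction does not push a request past it. The model ignores the page's own load time (4.747 seconds in the OK alert), so the real margin is tighter.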

(in reply to: ↑ 16 ) 08/07/07 17:13:35 changed by ronald

Replying to russ:

hmm...as i understand it the article cache is only cleared on PUBLISH, this problem is happening on INGEST, i don't think it's the same thing...

Ooops, my bad, you're right. Sorry, I see now that you're talking about a different issue with ingest. In that case it must be those 50 seconds for the initial ingest. Note that most of that time (40+ seconds) is spent loading mulgara, fedora, and search, so moving the zip processing out into a standalone app still won't help. It looks like the Fedora loading is taking the most time, so moving that outside the mulgara transaction would help.

08/10/07 17:12:49 changed by russ

  • owner changed from russ to amit.
  • status changed from assigned to new.

i'm going to assign this back to amit.

it sounds to me that this might be an issue even after we move to command line ingest if the main issue is that mulgara is waiting for fedora to finish.

post 0.8, perhaps someone can investigate further?

08/10/07 19:43:54 changed by ronald

Btw., didn't you have the same issue with 0.6? This part of the code hasn't really changed - it just got moved from topaz to plosone.

08/13/07 10:08:07 changed by russ

i don't *think* we had this problem with 0.6, but if you remember we did then you're probably right.

09/24/07 12:19:31 changed by russ

  • priority changed from critical to high.
  • summary changed from long page load times during ingest to ingest causes site to hang - mulgara is locked for at least 10 seconds on some ingests.

i guess i agree that this is not critical, but it's still a big problem.

let's look into moving long operations (loading fedora, imageMagick) outside of mulgara transactions.

(follow-up: ↓ 25 ) 10/01/07 14:32:01 changed by ronald

I just noticed why this problem wasn't there in 0.6, but started with 0.7: we used to have a transaction covering just the mulgara inserts, but now the transaction scope covers the whole ingest, including fedora and lucene (but not the image scaling).
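
The difference in transaction scope described above can be sketched as follows. This is an illustrative Python simulation, not the actual ingest code: `FakeStore`, `ingest_wide`, `ingest_narrow`, and the per-step durations are all hypothetical, since the thread gives no per-step breakdown.

```python
from contextlib import contextmanager

# Hypothetical step durations in seconds; the thread gives no per-step numbers.
STEP_SECONDS = {"mulgara": 5, "fedora": 30, "lucene": 10}


class FakeStore:
    """Stand-in for a store that allows only one transaction at a time."""

    def __init__(self):
        self.clock = 0   # simulated seconds elapsed
        self.locked = 0  # simulated seconds spent inside a transaction

    @contextmanager
    def transaction(self):
        started = self.clock
        try:
            yield
        finally:
            self.locked += self.clock - started

    def run(self, step):
        # Advance the simulated clock instead of doing real work.
        self.clock += STEP_SECONDS[step]


def ingest_wide(store):
    """Scope as described for 0.7: one transaction around the whole ingest."""
    with store.transaction():
        store.run("mulgara")
        store.run("fedora")
        store.run("lucene")


def ingest_narrow(store):
    """Scope as described for 0.6: only the mulgara inserts are transactional."""
    with store.transaction():
        store.run("mulgara")
    store.run("fedora")
    store.run("lucene")


wide, narrow = FakeStore(), FakeStore()
ingest_wide(wide)
ingest_narrow(narrow)
print("wide:", wide.locked, "of", wide.clock, "seconds locked")
print("narrow:", narrow.locked, "of", narrow.clock, "seconds locked")
```

Total ingest time is the same in both variants; only the time other requests are locked out changes. One likely cost of the narrow scope is that a failure while loading fedora or lucene is no longer rolled back together with the mulgara inserts, so some cleanup step would be needed.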

(in reply to: ↑ 24 ) 10/01/07 14:36:17 changed by amit

Replying to ronald:

I just noticed why this problem wasn't there in 0.6, but started with 0.7: we used to have a transaction covering just the mulgara inserts, but now the transaction scope covers the whole ingest, including fedora and lucene (but not the image scaling).

What was the reason for the shift?

10/29/07 20:55:31 changed by amit

  • owner changed from amit to russ.
  • summary changed from ingest causes site to hang - mulgara is locked for at least 10 seconds on some ingests to Ingestion should be standalone application..
  • version changed from 0.7 to 0.8.2-SNAPSHOT.
  • milestone set to 0.8.2.

Updated because of r4040.

10/29/07 21:23:53 changed by ronald

Btw, not creating all those unnecessary fedora objects during ingest should help a bit here. Over a quarter of all objects in fedora are the category objects, and the RELS-EXT streams are not needed either. Because this stuff is completely unused, we can just stop creating it on ingest, i.e. no data migration or any other changes are needed.

11/08/07 17:47:22 changed by rich

  • owner changed from russ to rich.
  • priority changed from high to medium.
  • milestone deleted.

12/27/07 18:21:11 changed by jsuttor

  • owner changed from rich to jsuttor.
  • milestone set to pubApp_0.8.3.

01/07/08 18:26:38 changed by jsuttor

  • keywords set to admin.
  • status changed from new to closed.
  • resolution set to wontfix.

Ingest will remain part of the Admin console and will be significantly reworked:

#722, #550, #713, #675

closing this ticket.

01/07/08 19:11:58 changed by amit

  • status changed from closed to reopened.
  • resolution deleted.

I don't get this. What is the reason for closing this? Is there another ticket covering this issue? I am worried that issues will get dropped if they exist only in our minds and not on Trac.

03/03/08 11:24:52 changed by rich

  • milestone changed from 0.8.3 to 0.9.0.

03/19/08 15:34:58 changed by amit

  • status changed from reopened to closed.
  • resolution set to duplicate.

Duplicate of #848.