Ticket #999 (closed defect: wontfix)

Opened 4 months ago

Last modified 4 months ago

prepare-sip demands that "journal" be part of doi. should take doi from article xml instead.

Reported by: russ Assigned to: rich
Priority: critical Milestone: 0.9.0
Component: ambra Version: 0.9-rc1
Keywords: Cc:
Blocking: Blocked By:

Description

in addition to breaking the 0.8 functionality, and entering incorrect DOIs into our system, this is also a plos-ism that should be removed before the 0.9 public release.

most articles we ingest have a doi with the following format:

doi/10.1371/journal.abcd.1234567

issue image articles have a different format:

doi/10.1371/image.abcd.i01.v01

prepare sip is adding a "journal" to the dois in ext-link and graphic tags in the image article.

ingest is adding a "journal" to the doi of that actual article on ingest.

the article doi is defined in the article xml, specifically in the <article-id pub-id-type="doi"> tag. any uri fixes made by prepare_sip need to respect that doi, no matter what it is. prepare_sip can't make any assumptions about doi format.

ingest must also respect the doi defined by article xml, and cannot make assumptions about doi format.

i'll attach before and after XML for a recently ingested image article.

Dependency Graph

Attachments

image.pcbi.v04.i07.xml (5.1 kB) - added by russ on 07/28/08 10:01:15.
this is the prepare-sip transformed xml for image.pcbi.v04.i07
image.pcbi.v04.i07.xml.orig (5.2 kB) - added by russ on 07/28/08 10:10:44.
this is the original xml, before prepare-sip
image.pcbi.v04.i07.xml.new (5.2 kB) - added by russ on 07/28/08 12:36:44.
after updating xml with an absolute URI on the graphic tag
image.pcbi.v04.i07.g001.tif (335.1 kB) - added by russ on 07/28/08 12:37:18.
this is the only other item in the article zip. zip up this and an xml file, and you can run prepare-sip to try and duplicate my findings..

Change History

07/28/08 10:01:15 changed by russ

  • attachment image.pcbi.v04.i07.xml added.

this is the prepare-sip transformed xml for image.pcbi.v04.i07

07/28/08 10:10:44 changed by russ

  • attachment image.pcbi.v04.i07.xml.orig added.

this is the original xml, before prepare-sip

07/28/08 10:18:01 changed by russ

  • summary changed from ambra prepare_sip and ingest are adding an extra, incorrect, "journal" to doi strings to ingest is adding an extra, incorrect, "journal" to doi strings.

ahh, okay, i see what's going on.

$ diff -bBdw image.pcbi.v04.i07.xml image.pcbi.v04.i07.xml.orig
79c83
<                 <graphic xlink:href="info:doi/10.1371/journal.image.pcbi.v04.i07.g001" alt-version="no" mimetype="image" position="float"
xlink:type="simple"/>
---
>                 <graphic xlink:href="image.pcbi.v04.i07.g001.tif"/>

we're not giving an absolute URI in the xml, so prepare_sip is prepending "info:doi/10.1731/journal.". presumably this is because the typesetting error that prepare-sip is trying to fix leaves off the journal.

although we could try to make prepare-sip more complex to catch this, it's probably best fixed on our end, when we generate the image xml.

i'll fix this, and then see if we still have the doi problem on ingest.

07/28/08 11:11:42 changed by amit

Ronald has repeatedly pointed this out and asked for this to be fixed upstream. prepare_sip goes through a specific cycle fixing problems that should be caught and fixed earlier in the cycle.

07/28/08 12:17:11 changed by russ

amit, we're not talking about the errors from charlesworth which, unfortunately, we need prepare-sip to catch and fix.

image articles go through a completely different process, in-house, which we control.

07/28/08 12:22:34 changed by russ

  • owner changed from russ to ronald.

i tried fixing the image article xml by using an absolute URI on the graphic tag.

prepare-sip returns a NPE.

looks like we're actually not allowed to fix these errors upstream :)

here's the updated tag:

<graphic xlink:href="info:doi/10.1371/image.pcbi.v04.i07.g001"/>

here's the output from prepare-sip:

[root@plosstage01 image.pcbi.v04.i07]# /usr/local/topaz/bin/prepare-sip -v image.pcbi.v04.i07.zip SIP for image.pcbi.v04.i07.zip

manifest added article links fixed

Processing file: image.pcbi.v04.i07.zip article-type: Issue Image img-set-name: #default img-set: found Created temp file: /tmp/tmp_51906image.pcbi.v04.i07.g001.tif

error: java.lang.NullPointerException?

}}}

for some reason, manifest.xml still lists a bad URI for the article and the image. it looks like prepare-sip is unhappy unless there's a "journal" in the doi. this was a bad call. let's discuss options for fixing in the next conference call.

07/28/08 12:29:15 changed by russ

  • summary changed from ingest is adding an extra, incorrect, "journal" to doi strings to prepare-sip demands that "journal" be part of doi. should take doi from article xml instead..

07/28/08 12:33:20 changed by amit

  • owner changed from ronald to russ.

Please add your edited XML file to the ticket.

07/28/08 12:36:44 changed by russ

  • attachment image.pcbi.v04.i07.xml.new added.

after updating xml with an absolute URI on the graphic tag

07/28/08 12:37:18 changed by russ

  • attachment image.pcbi.v04.i07.g001.tif added.

this is the only other item in the article zip. zip up this and an xml file, and you can run prepare-sip to try and duplicate my findings..

07/28/08 12:37:44 changed by russ

  • owner changed from russ to amit.

07/29/08 11:28:04 changed by ronald

The "problem" is occuring while creating the manifest, i.e. when it has to guess what the heck is what. I think the best would be if you created a proper manifest as part of preparing the zip, then no guessing has to be done. See the dtd and the manifest in a prepared sip for examples.

07/29/08 12:15:12 changed by amit

  • owner changed from amit to russ.

07/29/08 12:17:49 changed by russ

  • owner changed from russ to rich.

assigning to rich to consider.

personally, i think that 0.9 should work the same as 0.8 wrt how image article ingest works, and that if production changes were needed for 0.9, that requirement should have been laid out at the beginning of this milestone, not at the end.

07/29/08 16:18:41 changed by rich

  • owner changed from rich to amit.

If we insist on having end users create their own manifests, then we should provide a generic prepare_sip that does the ImageMagick? transformations and respects the configuration around the transform. The article page and slideshows expect the large, medium, and small PNGs to exist.

If we decide that it's also up to the end user to figure out how to create the resized images, then we should break out the prepare-sip specific configuration for ImageMagick? into a separate config file.

07/29/08 17:48:05 changed by amit

  • owner changed from amit to rich.

Sorry, I have not understood what ImageMagick? has to do with this ticket. To be clear, prepare sip is pretty much dependent on what the input package is going to look like, so there cannot be a generic prepare sip. What can be specified in the input format that ambra can ingest and we can provide an example.

ImageMagick? transforms can again be PLoS specific and not necessarily needed by others. Also I strongly recommend that folks actually look at what PrepareSIP.groovy is doing. It is just a useful combinator script which calls other scripts to do the actual processing.

08/01/08 17:13:32 changed by amit

  • status changed from new to closed.
  • resolution set to wontfix.

I am closing this. Here are the reasons:

  • prepare_sip is going to be customer specific
  • If PLoS is going to create different SIP packages based on images created internally vs. the ones received from AP/CW then they will have to modify it accordingly
  • Encoding different style of DOI's is a bad idea. I would recommend not to use different patterns.
  • Someone in PLoS will have to decide where all this information will need to be encoded and how complicated the corresponding rules will need to be. Ronald has created MANIFEST.xml to exactly avoid trying to guess what the rules are and where the information needs to be pulled from (file name, article XML, file created internally requires different DOI vs. one created by CW/AP etc).