Ticket #768 (closed defect: fixed)

Opened 1 year ago

Last modified 10 months ago

ingest constructs illegal mulgara queries if empty elements exist

Reported by: russ
Assigned to: ronald
Priority: low
Milestone:
Component: ambra
Version: 0.8.2-SNAPSHOT
Keywords: ingest
Cc:
Blocking:
Blocked By:

Description

it looks like something in the XML, probably around references, is causing a variable expansion to fail, resulting in ingest trying to insert invalid triples such as:

<info:doi/10.1371/journal.pgen.0010042/bibliographicCitation> <http://rdf.plos.org/RDF/hasEditorList> $bn_w383aab2aaac14a
$bn_w383aab2aaac14a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq>
$bn_w383aab2aaac14a <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1> <info:doi/10.1371/profile/30255e57-c8f6-428b-b60e-70be2d4d9379>

i did a query to see what hasEditorList is supposed to look like and got this:

select $s $p $o
from <local:///topazproject#ri>
where $s $p $o
and $p <mulgara:is> <http://rdf.plos.org/RDF/hasEditorList>

<solution>
      <s resource="info:doi/10.1371/journal.pgen.0030029/reference#pgen-0030029-b001"/>
      <p resource="http://rdf.plos.org/RDF/hasEditorList"/>
      <o blank-node="_node2284559"/>
</solution>

i will upload the entire query that mulgara is rejecting and the XML for the article in question.

i'm willing to examine the article XML for errors if you can give me some idea of the mechanism here and what i need to look for.

Attachments

ITQL (196.0 kB) - added by russ on 01/24/08 11:03:26.
failed mulgara query on ingest of pgen e10042
pgen.0010042.zip (1.7 MB) - added by russ on 01/24/08 11:03:51.
article package for pgen e10042
error (21.4 kB) - added by russ on 01/25/08 12:38:12.
error trace on ingest with empty source tag in reference 58

Change History

01/24/08 11:03:26 changed by russ

  • attachment ITQL added.

failed mulgara query on ingest of pgen e10042

01/24/08 11:03:51 changed by russ

  • attachment pgen.0010042.zip added.

article package for pgen e10042

01/24/08 11:05:09 changed by russ

  • milestone set to pubApp_0.8.2.1.

01/24/08 20:12:09 changed by ronald (in reply to: description)

Replying to russ:

> it looks like something in the XML, probably around references, is causing a variable expansion to fail, resulting in ingest trying to insert invalid triples such as:
>
> <info:doi/10.1371/journal.pgen.0010042/bibliographicCitation> <http://rdf.plos.org/RDF/hasEditorList> $bn_w383aab2aaac14a
> $bn_w383aab2aaac14a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq>
> $bn_w383aab2aaac14a <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1> <info:doi/10.1371/profile/30255e57-c8f6-428b-b60e-70be2d4d9379>

There's nothing wrong with these inserts - they're perfectly valid iTQL. The $bn_... variables are blank nodes.

> i did a query to see what hasEditorList is supposed to look like and got this:
>
> select $s $p $o
> from <local:///topazproject#ri>
> where $s $p $o
> and $p <mulgara:is> <http://rdf.plos.org/RDF/hasEditorList>
>
> <solution>
>       <s resource="info:doi/10.1371/journal.pgen.0030029/reference#pgen-0030029-b001"/>
>       <p resource="http://rdf.plos.org/RDF/hasEditorList"/>
>       <o blank-node="_node2284559"/>
> </solution>

Yup, this looks correct too.

So, what is the error you were getting?

01/25/08 12:25:29 changed by russ

  • summary changed from ingest of genetics e10042 results in invalid ITQL in insert query to ingest constructs illegal mulgara queries if empty elements exist.
  • milestone deleted.

i got a similar error on a file that had an empty <caption> element in the <fig> element for one of the figures.

removing the empty caption tag allowed the article to ingest.

i see with 10042 that one of the references has an empty source tag. i'm going to see if removing that fixes it.

looks like ingest is getting confused by empty tags and is generating invalid mulgara queries. i uploaded the entire mulgara query that caused the error - can you see anything else wrong with it?

i'll grab the entire error for the failed 10042 ingest in a sec...

01/25/08 12:37:46 changed by russ

  • priority changed from high to low.

indeed, removing the empty <source/> tag from reference 58 resolved the problem.

these empty tags are valid per the DTD (although semantically odd), so we might want to handle them without an error, as a low priority.

i'm uploading the entire error we got with the empty <source/> tag, with the actual mulgara query snipped out since it's already uploaded to this ticket.
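
As a stopgap until ingest handles this case, empty elements such as the <source/> and <caption> tags mentioned above could be located before ingest with a small script. The following is only a sketch: the function name and the sample XML are made up for illustration, and a real article file with a DOCTYPE and external entities may need a DTD-aware parser rather than the standard-library one used here.

```python
import xml.etree.ElementTree as ET


def find_empty_elements(xml_text):
    """Return the paths of elements that have no child elements and no text.

    Hypothetical helper, not part of ambra. Elements that carry only
    attributes are reported too, so the result needs a manual review.
    """
    root = ET.fromstring(xml_text)
    empty = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        has_text = bool((elem.text or "").strip())
        if len(elem) == 0 and not has_text:
            empty.append(here)
        for child in elem:
            walk(child, here)

    walk(root, "")
    return empty


# Made-up fragment that mimics the two cases reported in this ticket.
SAMPLE = (
    "<article>"
    "<ref id='b58'><citation><source/><year>2004</year></citation></ref>"
    "<fig><caption></caption></fig>"
    "</article>"
)

print(find_empty_elements(SAMPLE))
# prints ['/article/ref/citation/source', '/article/fig/caption']
```

The paths are plain tag paths without positional indexes, which is enough to point at the kind of element to look for in the article XML.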

01/25/08 12:38:12 changed by russ

  • attachment error added.

error trace on ingest with empty source tag in reference 58

01/26/08 13:52:56 changed by ronald

  • owner changed from jsuttor to ronald.
  • status changed from new to assigned.

Thanks for the info. I see it now: the empty source tag translates to an empty title in the rdf/xml:

<rdf:Description rdf:about="info:doi/10.1371/journal.pgen.0010042/reference#pgen-0010042-b58">
  <dc:title rdf:datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"></dc:title>
</rdf:Description>

This by itself isn't a problem, but the source:head/plos/libs/article-util/src/main/resources/org/plos/article/util/RdfXmlToTriples.xslt script seems to have a bug where it doesn't handle empty elements properly, even though there's explicit code and comments for that case.
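
To make the failing case concrete, here is a rough sketch that scans rdf/xml for property elements that declare an rdf:datatype but have no content, like the dc:title above. The helper name is hypothetical, and the dc namespace URI is an assumption (the usual Dublin Core elements namespace), since the snippet above does not show its namespace declarations.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # assumed binding for the dc: prefix

RDF_XML = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="info:doi/10.1371/journal.pgen.0010042/reference#pgen-0010042-b58">
    <dc:title rdf:datatype="{RDF_NS}XMLLiteral"></dc:title>
  </rdf:Description>
</rdf:RDF>"""


def empty_typed_literals(rdf_xml):
    """List (subject, property tag, datatype) for typed properties with no content."""
    root = ET.fromstring(rdf_xml)
    hits = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            datatype = prop.get(f"{{{RDF_NS}}}datatype")
            # No child elements and no character data at all: an empty literal.
            if datatype and len(prop) == 0 and not (prop.text or ""):
                hits.append((subject, prop.tag, datatype))
    return hits


for subject, prop, datatype in empty_typed_literals(RDF_XML):
    print(subject, prop, datatype)
```

This only detects the input pattern; it does not reproduce what the XSLT script does with it.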

02/29/08 05:20:59 changed by ronald

  • status changed from assigned to closed.
  • resolution set to fixed.

(In [4860]) Fix #768: first of all, we use parseType="Literal" instead of converting the content to a string ourselves and setting the datatype to rdf:XMLLiteral. The latter causes problems with empty elements because they result in empty literals, and the spec has some really weird rules regarding these when they're a typed literal (see http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/ sections 7.2.16 and 7.2.21, and test014). RdfXmlToTriples is doing the correct thing here, but not what we want, hence the switch to use parseType="Literal".

Second, RdfXmlToTriples has been updated to try and preserve (internal) whitespace in literals.
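
The difference between the two serializations named in the commit message can be sketched as follows. This is an illustration only, not the code from [4860]: the function names are invented, the dc namespace URI is assumed, and the real conversion is done elsewhere in the ingest pipeline rather than in Python.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # assumed binding for the dc: prefix

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def title_as_typed_literal(content):
    """Old approach: the markup is stored as a string with an explicit datatype."""
    elem = ET.Element(
        f"{{{DC_NS}}}title",
        {f"{{{RDF_NS}}}datatype": RDF_NS + "XMLLiteral"},
    )
    elem.text = content  # any markup in content is escaped on output
    return ET.tostring(elem, encoding="unicode")


def title_as_parse_type_literal(content):
    """New approach: the markup is embedded as XML under parseType="Literal"."""
    # Wrap the fragment so that mixed text and elements parse as one tree.
    wrapper = ET.fromstring(f"<wrapper>{content}</wrapper>")
    elem = ET.Element(f"{{{DC_NS}}}title", {f"{{{RDF_NS}}}parseType": "Literal"})
    elem.text = wrapper.text
    elem.extend(list(wrapper))
    return ET.tostring(elem, encoding="unicode")


for content in ("Genes <i>in vivo</i>", ""):
    print(title_as_typed_literal(content))
    print(title_as_parse_type_literal(content))
```

With empty content, the first function yields the empty typed literal that triggered this ticket, while the second yields an empty element marked parseType="Literal".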