Ticket #70 (closed enhancement: fixed)

Opened 2 years ago

Last modified 5 months ago

Script to extract more data out of existing ingested articles.

Reported by: ronald
Assigned to: ebrown
Priority: critical
Milestone:
Component: topaz
Version: 0.8
Keywords: article ingest mulgara resolver
Cc:
Blocking:
Blocked By:

Description

UCSD would like to do some additional processing during ingest of an article:

  • find mentions of proteins and insert links to the protein database
  • create additional RDF?

So we need to support some sort of ingest modules.

Related to this is proper config support for defining which modules and XSLT stylesheets to use (these are hardcoded at the moment).

Change History

07/31/06 01:04:55 changed by ronald

  • keywords changed from articles ingest to article ingest.

05/29/07 17:40:07 changed by amit

  • keywords changed from article ingest to article ingest mulgara resolver.
  • summary changed from Support additional processing during article ingest to Support additional processing during article ingest (mulgara resolver).
  • version set to 0.8.
  • milestone changed from TBD to 0.8.

We should look at Pradeep's idea of creating a Mulgara resolver for this.

06/20/07 12:08:32 changed by amit

  • owner changed from ronald to ebrown.
  • priority changed from high to critical.

I think we are going with Eric's idea of writing a script to add the additional meta-data.

07/03/07 21:04:23 changed by amit

Here is some additional information we need to be able to extract from the article and load into Mulgara:

  • Article Type
  • Authors (in the right order)
  • Journal
  • Volume
  • Issue
  • Email addresses of authors (maybe link to user would do)
  • Publisher
  • Article Categories
  • Article subject(s)
  • Dates (xsd:date type)
  • Copyright (URI to capture the actual copyright?)
  • Title
  • Page count (Will it help for eTOC?)

Basically, look at article-meta and see which of its elements make sense to extract.

07/03/07 21:05:41 changed by amit

  • summary changed from Support additional processing during article ingest (mulgara resolver) to Script to extract more data out of existing ingested articles..

07/06/07 01:50:07 changed by ebrown

(In [3108]) re #70 Initial pass at script to extract more data from fedora

This is just a first pass that actually works, so that feedback on the results this script produces can be gathered. Of particular interest is what data I'm moving, the actual predicates, and whether they all belong in Article or some belong in ObjectInfo, etc.

Example Usage:

/usr/local/topaz/bin/rungroovy /usr/local/topaz/scripts/migration.groovy -a info:doi/10.1371/journal.pone.0000056

You can then go into itql and do the following:

/usr/local/topaz/bin/runitql
%mode = "table reduce exp quote"
%trunc = 50
select $p $o from <local:///topazproject#ri> where <info:doi/10.1371/journal.pone.0000056> $p $o;

You should see things like:

<topaz:pageCount>          5
<topaz:copyrightYear>      2006
<topaz:issue>              1
<topaz:volume>             1
<topaz:articleType>        'research-article'
<topaz:copyrightStatement> 'Vigne, Frelin. This is an open-access article di...'
<topaz:publisherName>      'Public Library of Science'
<topaz:journalTitle>       'PLoS ONE'
<topaz:authors>            _node483
<topaz:affiliations>       'Institut National de la Sant?\195?\131?\194?\169 et de la Recherche...'
<topaz:affiliations>       'Universit?\195?\131?\194?\169 de Nice Sophia Antipolis'
<topaz:body>               '<body><sec id=\'s1\'><title>Introduction</title><p...'
52 rows

The intent is to use the fedora-client to iterate over all the articles it knows about. For example, something like the following lists all the articles in fedora:

$FEDORA_HOME/client/bin/fedora-find localhost 9090 pid '?' http | grep pid | grep 'pone.[0-9]*$' | awk '{print $2}'

If the output from that is looped around this migration.groovy script, all articles will get migrated.
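
A minimal sketch of that loop, built only from the two commands quoted in this comment. The function names and the failure log file are hypothetical and not part of the topaz scripts. It also assumes the pid printed by fedora-find can be passed to -a unchanged; if the pid needs converting to the info:doi/ form first, that would go in migrate_article.

```shell
#!/bin/sh
# Sketch only: function names and the log file are illustrative assumptions.

# List article pids using the fedora-find pipeline quoted above.
list_article_pids() {
    "$FEDORA_HOME/client/bin/fedora-find" localhost 9090 pid '?' http |
        grep pid | grep 'pone.[0-9]*$' | awk '{print $2}'
}

# Migrate one article. Assumption: the pid can be passed to -a as-is;
# if it cannot, convert it here before calling the script.
migrate_article() {
    /usr/local/topaz/bin/rungroovy /usr/local/topaz/scripts/migration.groovy -a "$1"
}

# Run the migration for every listed article, recording failed pids
# in a log file instead of stopping at the first failure.
migrate_all() {
    failed_log="${1:-migration-failed.txt}"
    : > "$failed_log"
    list_article_pids | while read -r pid; do
        [ -n "$pid" ] || continue
        # Redirect stdin so the migration cannot swallow the pid list.
        migrate_article "$pid" </dev/null || echo "$pid" >> "$failed_log"
    done
    # Succeed only if no article failed.
    [ ! -s "$failed_log" ]
}
```

Running `migrate_all failed.txt` would then attempt every article; any pid left in failed.txt can be retried individually with the example usage above.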

IMO, the migration script itself should leave behind some kind of predicate indicating which version of the migration script was run, i.e. a version number that we would not want the actual plosone application to use.

07/12/07 12:39:43 changed by ebrown

(In [3164]) fixes #481 support <a:created> data model change with otm using <xsd:dateTime> instead of string

Migration script clearly needs to handle these types of things as well (re #70) - however these are on annotations, not ingested articles.

07/23/07 14:18:56 changed by ebrown

(In [3245]) re #70 Match migration to Amit's model changes

Note that there are still quite a few "compromises" here and/or missing migrations, because we are being so strict about mapping existing data to RDF standards when fields or data sometimes exist only on one side, are ambiguous, are of a different data type, and/or have incompatible structures.

07/24/07 01:19:17 changed by ebrown

(In [3252]) Convert article deletion to otm (re #70)

07/24/07 17:32:43 changed by ebrown

  • status changed from new to closed.
  • resolution set to fixed.

Was leaving this open for ingestion issues. But I see there is #99 for that.

07/16/08 11:00:34 changed by

  • milestone deleted.

Milestone 0.8 deleted