Ticket #99 (closed enhancement: wontfix)

Opened 2 years ago

Last modified 1 year ago

Improve meta-data extraction during ingest

Reported by: ronald Assigned to: ronald
Priority: critical Milestone:
Component: ambra Version: 0.8
Keywords: ingest article Cc: ebrown
Blocking: Blocked By:

Description

More and better meta-data needs to be extracted during ingest:

  • split category/subcategory and store them separately
  • extract related articles
  • should dc:creator and dc:contributor point to nodes, with names and affiliations hanging off that?
  • dates other than the epub?


Change History

08/08/06 14:04:14 changed by ronald

Regarding pub-date: we need to store the date the article is ingested, and (for alerts) the date on which the article is actually published (opened to general viewing). Also, it's not clear whether the epub pub-date will already be set in the incoming PMC, and if it is, whether we should replace it.

08/11/06 09:39:55 changed by ebrown

  • cc set to ebrown.

Please make the pub-date field type <xsd:date>. We may want to change that to <xsd:dateTime> in the future as timeline and alerts get more sophisticated, so we should discuss. But #116 currently assumes it will get a date that is typed with <xsd:date>.

08/11/06 16:48:20 changed by amit

  • priority changed from unassigned to high.

08/17/06 02:32:37 changed by ronald

(In [468]) Addresses #72 and #99: support inserting RDF directly into the triplestore. For this the RDF element in the object has been renamed to the more accurate RELS-EXT, and a new element rdf:RDF is supported as a child of ObjectList which takes an RDF/XML fragment.

Conversion from RDF/XML to triples is done via a stylesheet taken from http://www.semanticplanet.com/library/Main/RdfToTriplesStylesheet and slightly modified to generate iTQL triples instead of N3 triples.

08/17/06 03:33:06 changed by ronald

(In [471]) Addresses #99 and #116: add xsd:date datatype to dc:date .
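
As a rough illustration of what typing dc:date with xsd:date means, the sketch below formats a date as a typed literal. The function name and the N-Triples-style output syntax are assumptions made for this example; the changeset's actual iTQL output is not shown in this ticket.

```python
from datetime import date

# Standard XML Schema datatype URI for xsd:date.
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def typed_date_literal(d: date) -> str:
    """Render a date as a typed literal (hypothetical helper).

    The N-Triples-style syntax used here is an assumption; the ingest
    code may serialize typed literals differently.
    """
    return f'"{d.isoformat()}"^^<{XSD_DATE}>'


if __name__ == "__main__":
    print(typed_date_literal(date(2006, 8, 17)))
```

A consumer such as #116 could then rely on the value being a plain calendar date (YYYY-MM-DD) with no time component.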

09/05/06 01:36:43 changed by ronald

(In [580]) Addresses #99 and #116: set additional dates, namely dateSubmitted, dateAccepted, issued, and available. Alerts should probably use the available date.

10/02/06 16:37:22 changed by ronald

  • status changed from new to assigned.
  • milestone changed from TBD to october16.

category extraction for oct 16.

10/10/06 00:12:37 changed by ronald

(In [772]) Addresses #99: the subjects are split along the first '/' into main- and sub-category, or just main-category if no '/' is present. These are then stored under a separate node as

  <article> <topaz:hasCategory> <cat>
  <cat> <topaz:mainCategory> 'foo' 
  <cat> <topaz:subCategory> 'bar' 

Note that dc:subject still contains the full categories.
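
The split rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from [772]: the function names and the string form of the triples are assumptions, and only the predicate names come from the comment above.

```python
from typing import List, Optional, Tuple


def split_category(subject: str) -> Tuple[str, Optional[str]]:
    """Split a subject along the first '/' into (main, sub).

    Returns (subject, None) when no '/' is present. Hypothetical helper.
    """
    main, sep, sub = subject.partition("/")
    return (main, sub if sep else None)


def category_triples(article: str, cat: str, subject: str) -> List[str]:
    """Build the triples for one category node (illustrative string form)."""
    main, sub = split_category(subject)
    triples = [
        f"<{article}> <topaz:hasCategory> <{cat}>",
        f"<{cat}> <topaz:mainCategory> '{main}'",
    ]
    # The sub-category triple is only emitted when the subject had a '/'.
    if sub is not None:
        triples.append(f"<{cat}> <topaz:subCategory> '{sub}'")
    return triples


if __name__ == "__main__":
    for line in category_triples("article", "cat", "foo/bar"):
        print(line)
```

Because only the first '/' is used, a subject with several slashes keeps everything after the first one in the sub-category.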

10/10/06 00:13:56 changed by ronald

  • milestone changed from october16 to TBD.

05/29/07 17:40:46 changed by amit

  • version changed from 0.5-SNAPSHOT to 0.8.
  • milestone changed from TBD to 0.8.

06/20/07 12:09:43 changed by amit

  • priority changed from high to critical.

Please coordinate with Eric as he writes his script to update the production servers for already ingested articles.

06/20/07 15:18:57 changed by amit

  • component changed from topaz to plos-one.

08/03/07 02:30:27 changed by ronald

(In [3318]) Generate bib-citation and references too. Addresses #99 and matches migration code in [3136] et al.

08/08/07 17:22:06 changed by amit

  • milestone changed from 0.8 to 0.9.

I think the features for 0.8 are in, but we need to revisit this for the next milestone which is focused on data migration.

09/02/07 22:34:05 changed by amit

  • status changed from assigned to closed.
  • resolution set to wontfix.

I am closing this one, as incremental feature improvements to ingest make little sense. We need to redesign ingest for the next major change. Small additions will probably keep going in, but the ingest design is reaching its limitations (as we had expected).

09/02/07 22:41:40 changed by

  • milestone deleted.

Milestone 0.9 deleted