Currently Indexed Fields
| Field Name | NLM Article markup |
| identifier | $meta/article-id[@pub-id-type = 'doi'] |
| title | $meta/title-group/article-title |
| date | $meta/pub-date |
| dateSubmitted | $meta/history/date[@date-type = 'received'] |
| dateAccepted | $meta/history/date[@date-type = 'accepted'] |
| creator | $meta/contrib-group/contrib[@contrib-type = 'author'] |
| contributor | $meta/contrib-group/contrib[@contrib-type = 'contributor'] |
| subject | $meta/article-categories/subj-group[@subj-group-type = 'Discipline']/subject |
| description | $meta/abstract |
| publisher | $jnl-meta/publisher/publisher-name |
| body | /article/body |
| rights | $meta/copyright-statement |
| language | always "en" |
| type | always "http://purl.org/dc/dcmitype/Text" |
| format | always "text/xml" |
| journal-title | $meta/journal-title |
| volume | $meta/volume |
| issue | $meta/issue |
| elocation | $meta/elocation-id |
| editor | $meta/contrib-group/contrib[@contrib-type = 'editor'] |
| issn | $meta/issn |
| reference | (ref-list) |
| annotations | Mulgara |
| user profiles | Mulgara |
All stored metadata is potentially searchable; see the otm-models for what data is stored, pmc2obj.xslt for what is extracted during ingest, and SearchService.java for what search fields are mapped to what fields in the model.
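As a rough illustration of how the table above maps index fields to NLM markup, here is a minimal Python sketch. It is not the pmc2obj.xslt logic. It assumes that $meta stands for /article/front/article-meta, and the sample article, its DOI, the `FIELD_PATHS` subset and the `extract_fields` helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: $meta in the table corresponds to /article/front/article-meta.
# Only a subset of the table is shown; paths are relative to that element.
FIELD_PATHS = {
    "identifier": "article-id[@pub-id-type='doi']",
    "title": "title-group/article-title",
    "dateAccepted": "history/date[@date-type='accepted']",
    "creator": "contrib-group/contrib[@contrib-type='author']",
    "editor": "contrib-group/contrib[@contrib-type='editor']",
    "volume": "volume",
    "issue": "issue",
}

# Made-up sample article, not taken from any real journal.
SAMPLE = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1234/example.0001</article-id>
      <title-group><article-title>An Example Title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="editor">
          <name><surname>Roe</surname><given-names>Sam</given-names></name>
        </contrib>
      </contrib-group>
      <volume>3</volume>
      <issue>2</issue>
    </article-meta>
  </front>
  <body><p>Body text.</p></body>
</article>"""


def text_of(element):
    """Return all text inside an element with whitespace normalised."""
    return " ".join(" ".join(element.itertext()).split())


def extract_fields(xml_text):
    """Map index field names to lists of text values found in the article."""
    root = ET.fromstring(xml_text)
    meta = root.find("front/article-meta")
    fields = {}
    for name, path in FIELD_PATHS.items():
        values = [text_of(el) for el in meta.findall(path)]
        if values:  # fields absent from the article are simply omitted
            fields[name] = values
    body = root.find("body")
    if body is not None:
        fields["body"] = [text_of(body)]
    return fields


fields = extract_fields(SAMPLE)
print(fields["title"], fields["creator"])
```

Names come out in document order (surname first) because the sketch just concatenates the text inside each contrib element.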
Notes:
- The article citation is generated from the article metadata. Since these fields are already indexed, it's not necessary to index the article citation separately. Article citation has been removed from the use cases.
- The only use case for a user to search article references would be to search the author name AND title together. Searching either field on its own returns a lot of false positives, so this search may not be too useful for users. Is there other functionality that can be used over the parsed reference fields?
Additional Fields
The following fields are not indexed (but should be):
| Field Name | NLM Article markup |
| aff | $meta/aff |
| acknowledgements | $meta/ack |
| conflict | $meta/fn[@fn-type = 'conflict'] |
| financial-disclosure | $meta/fn[@fn-type = 'financial-disclosure'] |
| any footnote type | $meta/fn[@fn-type = '<value>'] |
| author-notes | $meta/author-notes |
| secondary subject | $meta/article-categories/subj-group[@subj-group-type = 'Discipline']/subject/subject |
Notes:
- The affiliation field <addr-line> is a bit of a mess. Authors can put any information they want into this field, and it is not cleaned or verified. As a result, the content follows no rules and would be difficult to parse. Even if it were easy to parse, the values would not be consistent (e.g. UCLA vs. UC Los Angeles vs. University of California Los Angeles). From the PLoS standpoint, <addr-line> should not be parsed and stored in a field comparable to the UserProfile organizationName.
- We don't have nested subj-groups, and the article XML will not be changed to use nested subj-groups. The secondary category is already there: it's the subject name after the '/'. For example:
  <subj-group subj-group-type="Discipline">
    <subject>Neuroscience/Cognitive Neuroscience</subject>
    <subject>Neuroscience/Neurodevelopment</subject>
    <subject>Neuroscience/Psychology</subject>
  </subj-group>
  Primary category = Neuroscience
  Secondary categories = Cognitive Neuroscience, Neurodevelopment, Psychology
- Figure/table/media metadata is already indexed in the body of the article. This information has been removed from the use cases.
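The primary/secondary split described in the subj-group note can be sketched in a few lines of Python. This is only an illustration of the "name after the '/'" rule. The `split_subjects` helper is hypothetical, and it assumes each subject contains at most one level of nesting.

```python
def split_subjects(subjects):
    """Split 'Primary/Secondary' subject strings into two de-duplicated lists.

    Assumption: the text before the first '/' is the primary category and
    everything after it is the secondary category.
    """
    primary, secondary = [], []
    for subject in subjects:
        head, sep, tail = subject.partition("/")
        head, tail = head.strip(), tail.strip()
        if head and head not in primary:
            primary.append(head)
        if sep and tail and tail not in secondary:
            secondary.append(tail)
    return primary, secondary


primary, secondary = split_subjects([
    "Neuroscience/Cognitive Neuroscience",
    "Neuroscience/Neurodevelopment",
    "Neuroscience/Psychology",
])
print(primary)    # ['Neuroscience']
print(secondary)  # ['Cognitive Neuroscience', 'Neurodevelopment', 'Psychology']
```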
Index Capabilities
- Index (or remove from the index) article XML on:
  - Publish, delete, re-publish
- Index (or remove from the index) annotations on:
  - Add, delete, update
- Index (or remove from the index) user profiles on:
  - Add, delete, update
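The lifecycle events listed above could be wired to index updates roughly as follows. This is a sketch with an in-memory dictionary standing in for the real index. The `InMemoryIndex` class, the `ACTIONS` table and the event strings are assumptions made for illustration, not existing code.

```python
# Hypothetical mapping of (content kind, event) to an index action.
ACTIONS = {
    ("article", "publish"): "index",
    ("article", "re-publish"): "index",
    ("article", "delete"): "remove",
    ("annotation", "add"): "index",
    ("annotation", "update"): "index",
    ("annotation", "delete"): "remove",
    ("user profile", "add"): "index",
    ("user profile", "update"): "index",
    ("user profile", "delete"): "remove",
}


class InMemoryIndex:
    """Toy stand-in for the search index, keyed by (kind, id)."""

    def __init__(self):
        self.docs = {}

    def handle(self, kind, event, doc_id, fields=None):
        """Apply one lifecycle event to the index."""
        action = ACTIONS.get((kind, event))
        if action is None:
            raise ValueError(f"unsupported event {event!r} for {kind!r}")
        if action == "index":
            # Re-indexing replaces any earlier version of the document.
            self.docs[(kind, doc_id)] = dict(fields or {})
        else:
            # Removing a document that is not indexed is a no-op.
            self.docs.pop((kind, doc_id), None)


index = InMemoryIndex()
index.handle("article", "publish", "a1", {"title": "First"})
index.handle("article", "re-publish", "a1", {"title": "First, revised"})
index.handle("article", "delete", "a1")
print(len(index.docs))  # 0
```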
Simple Search
Simple search includes the following article fields:
- abstract
- title
- author
- body
Simple search should also include:
- annotations
- article reference list
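To make the field lists above concrete, here is a small Python sketch of a simple search that checks a default set of fields and can be widened to the proposed ones. The matching rule (every query term must appear as a case-insensitive substring), the `simple_search` helper, the field key names and the sample documents are assumptions for illustration only.

```python
SIMPLE_SEARCH_FIELDS = ("abstract", "title", "author", "body")
# Proposed additions from the list above (key names are assumptions).
PROPOSED_FIELDS = ("annotations", "references")


def _as_text(value):
    """Flatten a string or a list of strings into one string."""
    if isinstance(value, (list, tuple)):
        return " ".join(str(v) for v in value)
    return str(value)


def simple_search(docs, query, fields=SIMPLE_SEARCH_FIELDS):
    """Return ids of documents in which every query term appears in some field."""
    terms = query.lower().split()
    hits = []
    for doc in docs:
        text = " ".join(_as_text(doc.get(f, "")) for f in fields).lower()
        if terms and all(term in text for term in terms):
            hits.append(doc["id"])
    return hits


# Made-up sample documents.
DOCS = [
    {"id": "a1", "title": "Neural development in mice",
     "abstract": "We study cortex growth.", "author": ["Jane Doe"],
     "body": "Methods and results.",
     "annotations": ["Great figure on synapse density"]},
    {"id": "a2", "title": "Protein folding",
     "abstract": "A folding survey.", "author": ["Sam Roe"],
     "body": "Neural networks are not discussed.", "annotations": []},
]

print(simple_search(DOCS, "neural doe"))
print(simple_search(DOCS, "synapse", SIMPLE_SEARCH_FIELDS + PROPOSED_FIELDS))
```

With the default fields, a term that only occurs in an annotation finds nothing. Adding the proposed fields is what makes it searchable.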
Search Capabilities
- Search any of the indexed fields.
- Search relationships in Mulgara:
  - Search for all papers by an author
  - Search for all annotations by a user
  - Etc.
- Search across one or more journals (user specified)
- Cross-published articles should show up in search (#1027)
- Support query syntax
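The journal restriction and the cross-publishing requirement interact: an article should match if any journal it appears in is among those the user selected. A minimal Python sketch follows, assuming each article record carries a home journal plus a list of journals it is cross-published in. The record layout, the `filter_by_journals` helper and the journal names are hypothetical.

```python
def journals_of(article):
    """All journals an article appears in: its home journal plus cross-publications."""
    return {article["journal"], *article.get("cross_published_in", ())}


def filter_by_journals(articles, selected):
    """Keep articles that appear in at least one of the selected journals.

    An empty selection is treated as "search all journals".
    """
    selected = set(selected)
    if not selected:
        return list(articles)
    return [a for a in articles if journals_of(a) & selected]


# Made-up records for illustration.
ARTICLES = [
    {"id": "a1", "journal": "Journal A"},
    {"id": "a2", "journal": "Journal B", "cross_published_in": ["Journal A"]},
    {"id": "a3", "journal": "Journal C"},
]

print([a["id"] for a in filter_by_journals(ARTICLES, ["Journal A"])])  # ['a1', 'a2']
```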
Sorting Search Results
- Sort by relevance (number of words?)
- Sort in chronological order
- Sort in reverse chronological order
- Sort by article type
- Sort by rating
- Sort by article level metrics
  - Number of article views and/or PDF/XML downloads
  - Number of citations
  - Overall impact (combination of all article level metrics)
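Several of the sort orders above can be expressed as plain sort keys once the metrics are available on each result. The Python sketch below assumes they are. How "overall impact" should combine the metrics is not defined here, so the weighted sum and its weights are purely hypothetical, as are the sample records and the `sort_results` helper.

```python
from datetime import date

# Made-up search results with made-up metric values.
RESULTS = [
    {"id": "a1", "date": date(2008, 1, 10), "views": 100, "citations": 5, "rating": 4.0},
    {"id": "a2", "date": date(2009, 3, 2), "views": 300, "citations": 1, "rating": 3.5},
    {"id": "a3", "date": date(2007, 6, 20), "views": 50, "citations": 9, "rating": 4.5},
]

# Hypothetical weights; the real combination of metrics is an open decision.
IMPACT_WEIGHTS = {"views": 1, "citations": 40}


def overall_impact(article, weights=IMPACT_WEIGHTS):
    """Weighted sum of article level metrics (illustrative only)."""
    return sum(weight * article.get(metric, 0) for metric, weight in weights.items())


# Each entry is (key function, descending?).
SORT_ORDERS = {
    "chronological": (lambda a: a["date"], False),
    "reverse-chronological": (lambda a: a["date"], True),
    "rating": (lambda a: a["rating"], True),
    "views": (lambda a: a["views"], True),
    "citations": (lambda a: a["citations"], True),
    "impact": (overall_impact, True),
}


def sort_results(results, order):
    """Return result ids sorted by the named order."""
    key, descending = SORT_ORDERS[order]
    return [a["id"] for a in sorted(results, key=key, reverse=descending)]


for name in SORT_ORDERS:
    print(name, sort_results(RESULTS, name))
```

Relevance and article type sorting are left out because they depend on the search engine's scoring and on how article types are ranked.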
Open Questions
- To sort an article by a metric (e.g. page views), would the metric have to be stored in Mulgara?
- Can an annotation type be used for article metrics?
- Can the annotation store multiple metrics (e.g. page views, PDF downloads, # citations, etc.)?
- As the metadata or content models could change, is there a way for an admin to add new fields and re-index?
