Ticket #270 (closed defect: fixed)

Opened 2 years ago

Last modified 1 year ago

search service returns invalid XML document

Reported by: stevec Assigned to: ebrown
Priority: unassigned Milestone:
Component: topaz Version:
Keywords: Cc:
Blocking: Blocked By:

Description

The search service returns a snippet of the body text, and it truncates this snippet string regardless of what might be there. I believe we had a problem with XML tags getting truncated, which was fixed. It appears entities have the same problem. For example, search for the string "root apical papilla" on plosone.org and you'll get a site error. The problem is that article 52 has a snippet that contains &apos and then ends there. There is no terminating semicolon, so the XML parser fails.
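To make the failure mode concrete, here is a minimal sketch using the JDK's built-in XML parser (not the parser plosone actually uses, which the ticket does not name). The class and method names are hypothetical; the two sample strings are modelled on the snippet quoted in this ticket.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class TruncatedEntityDemo {
    // Returns true if the string parses as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // An entity reference without its closing ';' is a fatal well-formedness error.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, `isWellFormed("<field>Alzheimer&apos ...</field>")` is false, while `isWellFormed("<field>Alzheimer&apos;s</field>")` is true.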

Change History

01/26/07 15:34:34 changed by ronald

  • owner changed from somebody to ebrown.

Hmm, I remember seeing code to deal with this - ah yes, topaz-lucene-impl/src/main/java/org/topazproject/fedoragsearch/topazlucene/Statement.java, line 175. Wonder why this doesn't work here.

Oh, I think I see the problem: what gets returned to plosone is

  <field name="body" snippet="yes"> to the square <span class="highlight">root</span> of the number of Alzheimer&apos ...  are proportional to the square <span class="highlight">root</span> of the number</field>

Note the snippet=yes - this implies snippetsMax > 0, but that branch of the code does not deal with this problem. Hmm, looks tricky: maybe the best way is to first resolve all entity references, then build the snippet list, and then re-escape whatever needs escaping.
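The resolve, cut, re-escape order suggested above could look roughly like the sketch below. It is not the code in Statement.java: the class and method names are hypothetical, plain truncation stands in for the real snippet building, and only the five predefined XML entities are handled (an assumption, since no DTD is available).

```java
// Hypothetical sketch of "resolve entities, cut, then escape again".
final class SnippetEscaping {
    // Resolve the five predefined XML entities. "&amp;" goes last so that
    // "&amp;lt;" becomes the literal text "&lt;" rather than "<".
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    // Escape again. "&" goes first so the entities we emit are not re-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Cut on the resolved text, so a cut can never land inside an entity.
    static String snippet(String escapedBody, int maxChars) {
        String plain = unescape(escapedBody);
        String cut = plain.length() <= maxChars ? plain : plain.substring(0, maxChars);
        return escape(cut);
    }
}
```

For example, `snippet("Alzheimer&apos;s disease", 10)` yields `Alzheimer&apos;`, whereas cutting the escaped string directly at 14 characters yields the broken `Alzheimer&apos`.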

01/26/07 17:17:30 changed by ebrown

  • status changed from new to assigned.

This can happen on any field. We don't have any DTD or entity definitions at this point. So I'm not sure how to resolve anything but a fixed list of entities.

(BTW, Try searching on "amp". We ought to add standard entities to the stop-list.)

I think there's a trick. I can get the list of fragments, hack those, and combine them myself. The only thing I don't think I'll be able to cope with is somebody who searches on an entity itself (as the fragment will be "&<span class="highlight">amp</span>;"). I'm drawing a blank on how to work around that.
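One way the "hack the fragments and combine them" idea might look is sketched below. This is an assumption, not the routine committed for this ticket: the class name, the regular expression, and the " ... " separator (taken from the snippet quoted earlier) are all illustrative. It only drops an unterminated entity at the end of a fragment and does not handle the highlighted-entity case described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: clean each fragment, then join them.
final class FragmentCleaner {
    // "&", an optional "#", then name or number characters running to the very
    // end of the fragment with no terminating ';'.
    private static final Pattern DANGLING_ENTITY =
        Pattern.compile("&#?[A-Za-z0-9]*\\z");

    // Remove an unterminated entity reference at the end of a fragment.
    static String stripDanglingEntity(String fragment) {
        return DANGLING_ENTITY.matcher(fragment).replaceFirst("");
    }

    // Clean every fragment and join them the way the snippet field is laid out.
    static String combine(List<String> fragments) {
        List<String> cleaned = new ArrayList<>();
        for (String fragment : fragments) {
            cleaned.add(stripDanglingEntity(fragment));
        }
        return String.join(" ... ", cleaned);
    }
}
```

Here `stripDanglingEntity("number of Alzheimer&apos")` gives `number of Alzheimer`, and a fragment ending in a complete `&apos;` is left untouched.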

01/26/07 19:22:14 changed by ebrown

  • status changed from assigned to closed.
  • resolution set to fixed.

(In [2301]) fixes #270 search service returns invalid XML

This fix does two things:

  • Retrieves snippets separately and runs entity stripping routine
  • Stripping routine now strips all entities > 7 characters. This should remove any entities that are partially highlighted, e.g. when searching on "amp".

08/07/07 16:25:51 changed by

  • milestone deleted.

Milestone Bugs deleted