Information on using and implementing fedoragsearch
- FedoraInfo - General information about Fedora
- http://defxws2006.cvt.dk/fedoragsearch/ - Documentation and download
- Fedora Mailing Lists
- Primary developers: Gert Pedersen, Thierry Michel
- Topaz configuration: source:private/eric/fedoragsearch
- Lucene's flexible Query Syntax
Discovery
Discovery was for #146/#156. Found the following:
fgs can be configured to index the article xml [656] or the first data-stream of a supported mime-type it finds (text/plain, text/html, application/pdf) [654]. It also indexes meta-data, which can be searched depending on how you write your search string. By default (with little work on my part), the transform output lists the following meta-data as indexed:
- All dublin-core data
- PID
- type = FedoraObject
- state = Active
- createDate
- lastModifiedDate
- contentModel = PlosArticle
It is trivial (xslt) for us to add, remove and customize these as long as they are ingested into fedora.
Note that lucene supports quite complex query expressions. Everything that is indexed (even the complete article text) is in a field. Which fields are searched when the search text is not qualified with a field name is configured in index.properties. (Currently, uva.access is the name of the field containing the article text.)
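As a rough illustration of field-qualified queries, here is a minimal Python sketch that builds Lucene query strings. The helper names are made up for this example, and the field names should be checked against what the transform actually emits; only uva.access, PID and state are taken from the notes above.

```python
# Characters with special meaning in the classic Lucene query syntax.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\')


def escape_term(text):
    """Backslash-escape Lucene special characters in a search term."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


def field_query(field, text):
    """Return field:term, quoting the text as a phrase if it contains spaces."""
    escaped = escape_term(text)
    if " " in text:
        escaped = '"' + escaped + '"'
    return field + ":" + escaped


def all_of(*clauses):
    """Combine clauses so that every one of them must match."""
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Restrict a full-text phrase search to active objects.
    print(all_of(field_query("uva.access", "gene expression"),
                 field_query("state", "Active")))
```

Escaping matters mostly for PIDs, which contain a colon that lucene would otherwise read as a field separator.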
Installation
See the external fedoragsearch link above. However, some of the documentation there is a bit unclear, so it is best to review a few concepts and parameters here.
Fedoragsearch currently has the concept of dealing with multiple repositories. When it discusses repositories, these are actually separate instances of fedora. From discussions on the fedora mailing list, my impression is that 2.2 may ship with fedoragsearch, probably configured for just the one repository that the installed fedora represents. (Don't know.) The concept of it dealing with separate repositories was a bit confusing for me at first, especially because its default/demo configuration seems to be able to talk to two repositories. BTW, the names of the repositories are completely arbitrary and only exist to make looking up configuration files possible / simpler.
For topaz, we will probably support one repository. Thus, there will be a few configuration files in WEB-INF/classes/config (assuming we reference our repository and repository index as Topaz and TopazIndex respectively):
- fedoragsearch.properties - Main configuration file (that references others)
- repository/Topaz/repository.properties - Information on how to access the fedora repository webapp, foxml files and what mime types to expect.
- repository/Topaz/repositoryInfo.xml - Information returned by one of fedoragsearch's APIs.
- index/TopazIndex/index.properties - Configures indexing and searching defaults.
- index/TopazIndex/indexInfo.xml - Information returned by one of fedoragsearch's APIs
- index/TopazIndex/xxxxxFoxmlToLucene.xslt - Converts foxml files to xml that lucene understands how to import.
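The layout above can be sketched as a small Python helper that derives the expected paths from the repository and index names and reports which ones are missing from a deployed webapp. The function names are hypothetical, and the xslt base name is passed in by the caller because its real name depends on the deployment.

```python
import os
import posixpath

CONFIG_ROOT = "WEB-INF/classes/config"


def config_files(repository, index, foxml_xslt):
    """Return the relative paths of the configuration files listed above."""
    repo_dir = posixpath.join(CONFIG_ROOT, "repository", repository)
    index_dir = posixpath.join(CONFIG_ROOT, "index", index)
    return [
        posixpath.join(CONFIG_ROOT, "fedoragsearch.properties"),
        posixpath.join(repo_dir, "repository.properties"),
        posixpath.join(repo_dir, "repositoryInfo.xml"),
        posixpath.join(index_dir, "index.properties"),
        posixpath.join(index_dir, "indexInfo.xml"),
        posixpath.join(index_dir, foxml_xslt + ".xslt"),
    ]


def missing_files(webapp_root, paths):
    """Return the paths that do not exist as files under webapp_root."""
    return [p for p in paths
            if not os.path.isfile(os.path.join(webapp_root, p))]


if __name__ == "__main__":
    # The names below are the ones assumed in these notes.
    for path in config_files("Topaz", "TopazIndex", "topazFoxmlToLucene"):
        print(path)
```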
There are also a large number of default .xslt files for translating results to html for REST implementations and SOAP implementations. These files are defaulted in the .properties files but can almost always be overridden via the actual APIs.
In my testing, I had to hard-code absolute path names in a number of locations to get things to work. I don't know if there is a way around that without touching fedoragsearch code or not. (Though Ronald indicated the right way to implement this in Java.)
Current test configuration is checked into source:private/eric/fedoragsearch.
Usage
Both REST and SOAP interfaces are offered: (actual APIs are almost exactly the same)
- http://localhost:9090/fedoragsearch/rest
- http://localhost:9090/fedoragsearch/services (soap) - download wsdl from here
REST
The default REST interfaces provide a bunch of drop-down selections that only offer access to the two repository configurations that ship with fedoragsearch. Thus, using them can be a bit painful. However, it is quite easy to change the CGI parameters and still test with the REST interfaces.
The REST html pages can easily be customized, but it takes a bit of work.
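To bypass the drop-downs, a test URL can be assembled by hand. The Python sketch below assumes the REST parameters carry the same names as the SOAP arguments of gfindObjects, plus an operation parameter that names the method. Both are assumptions to verify against the URLs the default pages generate, and the paging defaults are placeholders rather than fedoragsearch's own.

```python
from urllib.parse import urlencode

REST_BASE = "http://localhost:9090/fedoragsearch/rest"


def gfind_url(query, index_name, hit_page_start=1, hit_page_size=10,
              base=REST_BASE):
    """Build a REST search URL with explicit CGI parameters."""
    params = [
        ("operation", "gfindObjects"),   # assumed parameter name
        ("query", query),
        ("hitPageStart", hit_page_start),  # placeholder default
        ("hitPageSize", hit_page_size),    # placeholder default
        ("indexName", index_name),
    ]
    # urlencode percent-escapes the query, e.g. the colon in field:term.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(gfind_url("uva.access:malaria", "TopazIndex"))
```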
SOAP
The most interesting APIs are updateIndex() and gfindObjects():
Method Name: gfindObjects
  In #0: query ((u'http://www.w3.org/2001/XMLSchema', u'string'))
  In #1: hitPageStart ((u'http://www.w3.org/2001/XMLSchema', u'long'))
  In #2: hitPageSize ((u'http://www.w3.org/2001/XMLSchema', u'int'))
  In #3: snippetsMax ((u'http://www.w3.org/2001/XMLSchema', u'int'))
  In #4: fieldMaxLength ((u'http://www.w3.org/2001/XMLSchema', u'int'))
  In #5: indexName ((u'http://www.w3.org/2001/XMLSchema', u'string'))
  In #6: resultPageXslt ((u'http://www.w3.org/2001/XMLSchema', u'string'))
  Out #0: gfindObjectsReturn ((u'http://www.w3.org/2001/XMLSchema', u'string'))

Method Name: updateIndex
  In #0: action ((u'http://schemas.xmlsoap.org/soap/encoding/', u'string'))
  In #1: value ((u'http://schemas.xmlsoap.org/soap/encoding/', u'string'))
  In #2: repositoryName ((u'http://schemas.xmlsoap.org/soap/encoding/', u'string'))
  In #3: indexName ((u'http://schemas.xmlsoap.org/soap/encoding/', u'string'))
  In #4: indexDocXslt ((u'http://schemas.xmlsoap.org/soap/encoding/', u'string'))
  In #5: resultPageXslt ((u'http://schemas.xmlsoap.org/soap/encoding/', u'string'))
  Out #0: updateIndexReturn ((u'http://schemas.xmlsoap.org/soap/encoding/', u'string'))
updateIndex
To update the index with current data from the repository, you can call updateIndex() as follows:
updateIndex("fromFoxmlFiles", None, "Topaz", "TopazIndex", "topazFoxmlToLucene", "copyXml")
(Undocumented) Options for action allow for plenty of flexibility:
- createEmpty - Must be called once (if there is no repository yet)
- fromFoxmlFiles
- fromPid
- deletePid
Probably createEmpty and fromFoxmlFiles could be called daily and most of the necessary indexing functionality would be done. (Though it would disrupt any search functions while it was working.)
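A daily rebuild along those lines might look like the Python sketch below. It is not tied to any particular SOAP library: update_index stands for any callable with the updateIndex() signature, such as a method on a client proxy generated from the wsdl. The default names are the ones assumed in these notes, and passing None as the value for createEmpty is a guess.

```python
# Actions listed above for the first updateIndex() argument.
ACTIONS = ("createEmpty", "fromFoxmlFiles", "fromPid", "deletePid")


def update_index_args(action, value=None, repository="Topaz",
                      index="TopazIndex",
                      index_doc_xslt="topazFoxmlToLucene",
                      result_page_xslt="copyXml"):
    """Return updateIndex() arguments in WSDL order, rejecting unknown actions."""
    if action not in ACTIONS:
        raise ValueError("unknown updateIndex action: %r" % (action,))
    return (action, value, repository, index, index_doc_xslt, result_page_xslt)


def rebuild_index(update_index):
    """Recreate the index from scratch: createEmpty, then fromFoxmlFiles.

    update_index is any callable taking the six updateIndex() arguments.
    Returns the two results in call order.
    """
    return [update_index(*update_index_args(action))
            for action in ("createEmpty", "fromFoxmlFiles")]


if __name__ == "__main__":
    # Stand-in for a real SOAP call: just show what would be sent.
    rebuild_index(lambda *args: print("updateIndex%r" % (args,)))
```

Taking the callable as a parameter keeps the sequencing testable without a running fedoragsearch instance.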
NB: To get indexing to work properly, xxxxFoxmlToLucene.xslt must have been properly implemented.
Note: It does appear that DTDs are being validated (and likely not cached) and thus indexing can take some time. See #172.
gfindObjects
I only tested this via REST. But it seemed to work just fine.
(I did suggest to Gert that he change hitPageStart to an int so that it would be more compatible with more SOAP implementations.)
