Humanities Computing

How is computing used in humanities scholarship? How does information technology impact teaching and learning?
Topics include: Digital libraries, electronic publishing, scholarly communication, web remediation of humanities scholarship, etc.

Thursday, September 02, 2004

Swish-e, searching XML/TEI

http://swish-e.org/

Swish-e is an open source indexer/search engine. It excels at indexing
(X)HTML files, but indexes plain text and XML files almost as easily.
It comes with C, PHP, and Perl APIs, and it runs under (over?) Unix as
well as Windows operating systems.

I am/will be using swish-e as the underlying indexer for searches
against TEI documents. Specifically, I have been marking up sets of
literature in TEI. I then convert the sets into a number of formats
such as plain text, XHTML, PDF, various Palm flavors, etc. I then use
swish-e to index the XHTML because swish-e makes it easy to pull
out the meta tags of HTML head elements and make them field searchable,
while the body of the text remains free-text searchable. I could
have almost as easily indexed the raw TEI files, but then I would have
to deal with transforming the XML before it gets to the browser. ("I
know. There are many ways to do that.") See:

http://infomotions.com/alex2/
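
To make the field/free-text split concrete, here is a minimal sketch in
Python. It is not how swish-e itself works internally, and the sample
document, the class name FieldExtractor, and the function name
extract_fields are all made up for illustration. It only shows the idea:
meta tags in the head become named fields, and the body becomes one
stream of free text.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs and the text inside <body>.

    Hypothetical helper for illustration only; swish-e does this work itself.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}      # meta name -> content (the field-searchable part)
        self.body_parts = []  # text chunks found in <body> (the free-text part)
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "meta" and name:
            # Self-closing <meta ... /> tags also arrive here by default.
            self.fields[name.lower()] = attrs.get("content") or ""
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside the body.
        if self._in_body and data.strip():
            self.body_parts.append(data.strip())


def extract_fields(xhtml):
    """Return (fields, body_text) for one XHTML document given as a string."""
    parser = FieldExtractor()
    parser.feed(xhtml)
    parser.close()
    return parser.fields, " ".join(parser.body_parts)


# Made-up sample document, standing in for XHTML generated from a TEI file.
SAMPLE = """<html><head><title>Walden</title>
<meta name="author" content="Thoreau, Henry David" />
<meta name="title" content="Walden" />
</head><body><h1>Walden</h1>
<p>I went to the woods because I wished to live deliberately.</p>
</body></html>"""

fields, text = extract_fields(SAMPLE)
print(fields)
print(text)
```

An indexer that keeps the two outputs apart can answer "author contains
Thoreau" from the fields and "woods" from the body text, which is the
distinction described above.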

I have also been fiddling with Plucene, a Perl port of Lucene, a
Java-based indexer/search engine library:

http://search.cpan.org/dist/Plucene/

Unlike swish-e, Lucene and Plucene are only libraries. Swish-e is an
indexer/search engine binary as well as a library.

1 Comment:

Blogger A's Log said...

Hi Hope, I hope to implement a searchable index of TEI docs using Lucene & Java, but am wondering what the best approach to indexing TEI XML is, given that someone might want to do field-specific searches, e.g. 'poems only' or 'place names only'. Someone advised me that an XML-oriented method, e.g. XPath/XQuery, would be the best approach. Where does one start?

July 10, 2008 at 10:24 AM  
