Kino Search Contrib
KinoSearch is a Perl implementation of the Apache Lucene search engine, which itself is written in Java. This contrib integrates that indexed search engine into Foswiki. With KinoSearch you build an index over all webs, including attachments such as Word, Excel and PDF files. Based on that index you get a fast search over all topics and attachments. You need this contrib if:
- your wiki has grown so large that the default search is too slow, or
- you want to search attachments as well as topics.
Usage
See the KinoSearch topic for user documentation.
Searching With KinoSearch
The kinosearch script uses a template called kinosearch.tmpl to render the results. You can override it in the same way as any other template: create kinosearch.yourskin.tmpl and Set SKIN = yourskin,pattern.
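As a rough sketch, the skin-specific copy of the template could be created from the shell like this. The helper name make_skin_template and the location of the templates directory are assumptions for illustration, not part of the contrib; adjust the path to your installation.

```shell
#!/bin/sh
# Hypothetical helper: copy kinosearch.tmpl to kinosearch.<skin>.tmpl so the
# copy can be edited without touching the shipped template.
# $1 = templates directory (assumed to be foswiki/templates), $2 = skin name
make_skin_template() {
    src="$1/kinosearch.tmpl"
    dest="$1/kinosearch.$2.tmpl"
    if [ ! -f "$src" ]; then
        echo "not found: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        # Refuse to overwrite an existing skin template.
        echo "already exists: $dest" >&2
        return 1
    fi
    cp "$src" "$dest" && echo "$dest"
}

# Example call (the path is a placeholder):
# make_skin_template /path/to/your/foswiki/templates yourskin
```

After editing the copy, the SKIN preference still has to be set so that the new template is picked up.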
There is also the KinoSearch topic with a form ready to use with the kinosearch script.
If you have the Foswiki:Extensions/KinoSearchPlugin, you can use the REST handler instead. The syntax is identical to that of the kinosearch script.
- http://foswiki.org/bin/rest/KinoSearchPlugin/search
- cd foswiki/bin ; ./rest KinoSearchPlugin.search
Note: REST handlers currently require the user to be authenticated. If you want unauthenticated users to be able to search, use the kinosearch script instead.
The following form submits a query to the kinosearch script. The installation instructions are detailed below.
Integrating KinoSearch into Foswiki's Internal SEARCH (experimental)
If you set $Foswiki::cfg{RCS}{SearchAlgorithm} = 'Foswiki::Store::SearchAlgorithms::Kino'; (a setting in the Store settings section of configure), Foswiki will use the KinoSearch index for every built-in search it can, including WebSearch. For regex searches it falls back to the Forking search algorithm.
If you want Foswiki's WebSearch to also show attachment results (when you select the 'Both body and title' option), you must also set {KinoSearchContrib}{showAttachments}=1 and add kino to the front of your SKIN setting.
This feature is experimental because KinoSearch does not do partial matching: searching for TAG will not match text like %TAG{"something"}%, only instances where the word TAG is separated by whitespace. Foswiki's SEARCH expects partial (substring) matches to be found everywhere.
Note: This currently only works for Foswiki 1.0.x.
RSS Feeds
RSS 2.0 feeds can be set up for any search results. To access the feed, append &rss=on;skin=none to the end of the search URL. The default templates include a link to the feed on the results page.
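A minimal sketch of building such a feed URL from the shell follows. The helper name feed_url and the example URL are illustrative assumptions; copy the real search URL from your browser after running a search.

```shell
#!/bin/sh
# Append the RSS parameters named in the text to an existing search URL.
# $1 = full URL of a search results page (assumed to already contain a query string)
feed_url() {
    printf '%s&rss=on;skin=none\n' "$1"
}

# Placeholder URL: host, web and query parameter are illustrative only.
feed_url 'http://example.org/bin/kinosearch/Main/?search=foswiki'
```

The resulting URL can then be given to any feed reader that supports RSS 2.0.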
Indexing
Creating a New Index
Each topic's text body, title, form fields and attached documents are indexed.
Run the kinoindex script manually after installation to create the index files used by KinoSearch.
You can also schedule a weekly or monthly crontab job to recreate the index files, or run the script manually when you take your server down for maintenance. To prevent browser access, the script is placed outside the public bin folder.
- cd foswiki/kinosearch/bin ; ./kinoindex
Updating the Index
The kinoupdate script uses each web's .changes file to find topic modifications. In addition, a .kinoupdate file in each web directory stores the timestamp of the last time the script ran on that web. When the script is executed, it first checks whether any topics have been updated since the last run. Those topics are removed from the index and then indexed again.
- cd foswiki/kinosearch/bin ; ./kinoupdate
This script should be run from an hourly crontab job. Like kinoindex, it is placed outside the public bin folder.
# m h dom mon dow command
35 * * * * cd /path/to/your/foswiki/kinosearch/bin ; ./kinoupdate
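To illustrate the idea of selecting topics changed since the last run, here is a minimal sketch. It is not the kinoupdate implementation. It assumes that each line of .changes is tab-separated with the topic name in the first field and an epoch timestamp in the third, and that .kinoupdate holds a single epoch timestamp; check both against your own data directory before relying on it.

```shell
#!/bin/sh
# Hypothetical helper, not part of the contrib.
# $1 = path to a web's .changes file
#      (assumed layout: topic<TAB>user<TAB>epoch<TAB>revision)
# $2 = epoch timestamp of the previous update run
# Prints each topic changed after that timestamp exactly once.
changed_since() {
    awk -F '\t' -v since="$2" '$3 + 0 > since + 0 { print $1 }' "$1" | sort -u
}

# Example call (paths are placeholders):
# changed_since /path/to/your/foswiki/data/Main/.changes \
#     "$(cat /path/to/your/foswiki/data/Main/.kinoupdate)"
```

A topic that was saved several times since the last run appears only once in the output, which matches the remove-then-reindex behaviour described above.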
Attachment File Types to be Indexed
By default, the following file types are indexed:
- .txt
- .html
- .xml
- .doc
- .docx
- .xls
- .xlsx
- .ppt
- .pptx
- .pdf
You can change this with the $Foswiki::cfg{KinoSearchContrib}{IndexExtensions} setting in configure.
If you add other file extensions, they are treated as ASCII files. If needed, you can add more specialised stringifiers for further document types (see Indexing Further Document Types).
Indexing of Form Fields
All form fields are indexed. To do this, the form templates are checked and the fields they define are indexed. In addition, the name of a topic's form is stored in the field form_name. How to search for this is described below.
Note: kinoupdate only indexes the form fields that existed when the initial index was created. If you add a form, or add a new field to an existing form, you should create a new index with kinoindex.
Installation Instructions
Backend for Indexing Word 2003 Documents
To index Word 2003 documents (.doc) you will need to install one of the following:
- antiword (recommended)
- abiword
- wvWare
You can then select the tool to use in configure.
Backend for PDF
To index .pdf files you need to install xpdf-utils.
Backend for PPT
To index .ppt files you need to install ppthtml.
Backends for DOCX, PPTX
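To index these file types, you will need to install the following tools from Sourceforge:
- docx2txt (http://sourceforge.net/projects/docx2txt/)
- pptx2txt (http://sourceforge.net/projects/pptx2txt/)
Then set the command path to these tools in configure.
For .xlsx files, please refer to the dependencies under Contrib Info.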
Installing the AddOn
You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.
Open configure, and open the "Extensions" section. Use "Find More Extensions" to get a list of available extensions. Select "Install".
If you have any problems, or if the extension isn't available in configure, you can still install manually from the command line. See http://foswiki.org/Support/ManuallyInstallingExtensions for more help.
You only need the Foswiki:Extensions/KinoSearchPlugin if you are using the REST handlers or the %KINOSEARCH% macro. Otherwise you are fine without it.
Additional packages are required, as listed in the dependencies under Contrib Info.
Configuration
A number of settings must be set in configure before you can use the Contrib.
Test of the Installation
- Test if the installation was successful:
- Check that antiword, abiword or wvHtml is in place: type antiword, abiword or wvHtml at the prompt and check that the command exists.
- Check that pdftotext is in place: type pdftotext at the prompt and check that the command exists.
- Check that ppthtml is in place: type ppthtml at the prompt and check that the command exists.
- Change the working directory to kinosearch/bin in the Foswiki installation directory.
- Run ./kinoindex
- Once it has finished, open a browser window and point it to the System.KinoSearch topic.
- Type a query and check the results.
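The command checks in the list above can also be scripted. The following is a minimal sketch that only tests whether the commands are found on the PATH; the function names are made up for illustration, and it does not verify that the tools actually work or that configure points at them.

```shell
#!/bin/sh
# Succeeds if the named command can be found on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Print the first of the given commands that is available; fail if none is.
first_available() {
    for tool in "$@"; do
        if have "$tool"; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Report the state of the backends named in the checklist.
report_backends() {
    # Only one of the .doc backends is needed.
    if doc_tool=$(first_available antiword abiword wvHtml); then
        echo ".doc backend: $doc_tool"
    else
        echo ".doc backend: MISSING (install antiword, abiword or wvWare)"
    fi
    for tool in pdftotext ppthtml; do
        if have "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: MISSING"
        fi
    done
}

report_backends
```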
Test of Stringification with ks_test
Some users report problems with stringification: the kinoindex script fails or takes too long on attachments, or kinosearch does not return correct results. Sometimes this is caused by installation errors, especially in the backends used for stringification. ks_test gives you the opportunity to test stringification in advance.
Usage:
ks_test stringify file_name
(I plan to extend ks_test, but at the moment the only supported command is stringify.)
The output shows which stringifier was used and the result of the stringification.
Example:
/path/to/foswiki/kinosearch/bin$ ./ks_test stringify /path/to/foswiki/KinoSearchContrib/test/unit/KinoSearchContrib/attachement_examples/Simple_example.doc
Used stringifier: Foswiki::Contrib::KinoSearchContrib::StringifyPlugins::DOC_antiword
Stringified text:
Simple example Keyword: dummy Umlaute: Grober, Uberschall, Anderung
You can see that the stringifier DOC_antiword is used and that the resulting text looks correct.
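To check several attachments in one go, the call shown above can be wrapped in a loop. This is a minimal sketch under the assumption that ks_test prints a line starting with "Used stringifier:" as in the example output; the function name show_stringifiers is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run "ks_test stringify" on each file and print which
# stringifier was reported for it.
# $1 = path to the ks_test script, remaining arguments = files to test
show_stringifiers() {
    ks_test=$1
    shift
    for file in "$@"; do
        # Keep only the line naming the stringifier; tolerate missing output.
        line=$("$ks_test" stringify "$file" 2>/dev/null | grep '^Used stringifier:' || true)
        printf '%s: %s\n' "$file" "${line:-no stringifier reported}"
    done
}

# Example call (paths are placeholders):
# show_stringifiers /path/to/foswiki/kinosearch/bin/ks_test /path/to/attachments/*
```

Files for which no stringifier is reported are good candidates for a closer look with a direct ks_test call.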
Upgrading From SearchEngineKinoSearchAddOn
If you previously used the SearchEngineKinoSearchAddOn (either on TWiki or on Foswiki), you will have noticed that it has been repackaged as a Contrib in line with Foswiki standards. The plugin is available separately at Foswiki:Extensions/KinoSearchPlugin.
You will also need to move your settings from Main.SitePreferences into configure.
Finally, the following settings have been renamed:
- $Foswiki::cfg{KinoSearchLogDir} --> $Foswiki::cfg{KinoSearchContrib}{LogDirectory}
- $Foswiki::cfg{KinoSearchIndexDir} --> $Foswiki::cfg{KinoSearchContrib}{IndexDirectory}
Further Development
There is certainly a lot more that can be done with this Add-on, such as adding more stringifiers, improving performance and making it more robust. See Foswiki:Tasks/KinoSearchContrib for currently open tasks.
Indexing Further Document Types
The indexing of attached documents is realised in two steps:
- The content of the document is converted to an ASCII string. This is called stringification.
- This ASCII string is indexed with KinoSearch, as in any indexing application.
To index different types of documents, specialised stringifiers are needed, i.e. classes that extract the ASCII text from a document. This contrib implements a plug-in mechanism so that additional stringifiers can be added without changing the existing code. All stringifier plugins are stored in the directory lib/Foswiki/Contrib/KinoSearchContrib/StringifierPlugins.
You can add new stringifier plugins simply by adding new files there. At a minimum, a plugin must:
- inherit from Foswiki::Contrib::KinoSearchContrib::StringifyBase
- register itself with __PACKAGE__->register_handler($application, $file_extension);
- implement the method $text = stringForFile ($filename)
All the stringifiers have unit tests associated with them, and we encourage you to provide unit tests for any you wish to contribute. See Foswiki:Development/UnitTests for more information on unit testing.
Contrib Info
| Author(s): | Foswiki:Main.MarkusHesse, Foswiki:Main.SvenDowideit & Foswiki:Main.AndrewJones |
| Copyright: | © 2007, Foswiki:Main.MarkusHesse; © 2009, Foswiki Contributors |
| Release: | 1.22 |
| Version: | 5428 (2009-11-04) |
| Change History: | |
| 04 Nov 2009: | v 1.22, Foswikitask:Item2326: small documentation improvement -- Foswiki:Main.IngoKappler |
| 24 Sep 2009: | v 1.21, Foswikitask:Item1363: port to Foswiki -- Foswiki:Main.WillNorris. Rename to KinoSearchContrib and split plugin into KinoSearchPlugin; add stringifiers for .docx, .pptx and .xlsx; change the kinosearch script to work with FSA; Moved settings into configure; Commands now set in configure; Replaced system() calls with Foswiki::Sandbox->sysCommand(); Handle passworded MS Office files; Make the index more robust if it somehow encounters binary files; Can now specify skipped topics; updated and simplified docs; added doc for TipsContrib; update templates; Foswikitask:Item8246: fix checking of access controls -- Foswiki:Main.AndrewJones |
| 06 Nov 2008: | v 1.20, minor revert to stop crash |
| 05 Nov 2008: | v 1.19, fixes for (nex)twiki/trunk |
| 20 Aug 2008: | v 1.18, added Integrated SEARCH, SearchEngineKinoSearchPlugin, restHandlers, updated code and tests -- Foswiki:Main.SvenDowideit |
| 6 Aug 2008: | v 1.17, TWikibug:Item5717: persist use form choices, TWikibug:Item5647: cope better with attachment problems -- Foswiki:Main.SvenDowideit |
| 4 Jun 2008: | v 1.16, TWikibug:Item5646: Problem with attachments with capital letter suffix |
| 12 May 2008: | v 1.15, TWikibug:Item5579, TWikibug:Item5580, TWikibug:Item5619: Problem with ALLOWWEBVIEW and Forms fixed |
| 23 Apr 2008: | v 1.14, TWikibug:Item5273, TWikibug:Item5546, TWikibug:Item5550, TWikibug:Item5552: Use current user in search script |
| 27 Jan 2008: | v 1.13, TWikibug:Item5271: Option "show locked topics" now works |
| 19 Jan 2008: | v 1.12, TWikibug:Item5270: Enhancement of stringifiers |
| 19 Dec 2007: | v 1.11, Additions on stringifiers, modification of output format |
| 17 Nov 2007: | v 1.10, PPT stringifier added |
| 11 Nov 2007: | v 1.09, Some bugfixing |
| 3 Nov 2007: | v 1.08, Some bugfixing |
| 7 Oct 2007: | v 1.07, Some bugfixing |
| 6 Oct 2007: | v 1.06, Upgrade for 4.1, Release with Foswiki:Extensions.BuildContrib |
| 29 Sep 2007: | v 1.05, Indexing of form fields |
| 16 Sep 2007: | v 1.04, Stringifier plugins for doc, xls and html |
| 13 Sep 2007: | v 1.03, Indexing of PDF and TXT attachments |
| 08 Sep 2007: | v 1.02, Index and update script enhanced |
| 24 Aug 2007: | v 1.01, Update script included, Result uses highlighter |
| 14 Aug 2007: | Initial version (v1.000) |
| Dependencies: |
| Name | Version | Description |
|---|---|---|
| KinoSearch | >0 | Required |
| File::MMagic | >0 | Required |
| Module::Pluggable | >0 | Required |
| HTML::TreeBuilder | >0 | Required |
| Spreadsheet::ParseExcel | >0 | Required for .xls files |
| Spreadsheet::XLSX | >0 | Required for .xlsx files |
| CharsetDetector | >0 | Required |
| Encode | >0 | Required |
| Error | >0 | Required |
| ppthtml | >0 | Required for indexing .ppt files. Part of xlhtml |
| pdftotext | >0 | Required for indexing .pdf. Part of xpdf-utils |
| antiword | >0 | One of antiword, abiword or wvWare is required for .doc files |
| abiword | >0 | One of antiword, abiword or wvWare is required for .doc files |
| wvWare | >0 | One of antiword, abiword or wvWare is required for .doc files |
| docx2txt | >0 | Required for .docx files. Available from http://sourceforge.net/projects/docx2txt/ |
| pptx2txt | >0 | Required for .pptx files. Available from http://sourceforge.net/projects/pptx2txt/ |
| Add-on Home: | http://foswiki.org/Extensions/KinoSearchContrib |
| Support: | http://foswiki.org/Support/KinoSearchContrib |