
4.5 Setting variables in the Gatherer configuration file

   

       

In addition to customizing the steps described in Section 4.4.4, you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into RootNodes and LeafNodes) from which to collect indexing information. Section 4 shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file.

Each variable name starts in the first column and ends with a colon, and is followed by its value. The following table shows the supported variables:

                             

        Data-Directory:         Directory where GDBM database is written.
        Errorlog-File:          File for logging errors.
        Essence-Options:        Any extra options to pass to Essence.
        Gatherd-Inetd:          Denotes that gatherd is run from inetd.
        Gatherer-Host:          Full hostname where the Gatherer is run.
        Gatherer-Name:          A unique name for the Gatherer.
        Gatherer-Options:       Extra options for the Gatherer.
        Gatherer-Port:          Port number for gatherd.
        Gatherer-Version:       Version string for the Gatherer.
        HTTP-Proxy:             host:port of your HTTP proxy.
        Lib-Directory:          Directory where configuration files live.
        Local-Mapping:          Mapping information for local gathering.
        Log-File:               File for logging progress.
        Top-Directory:          Top-level directory for the Gatherer.
        Working-Directory:      Directory for tmp files and local disk cache.
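
The line format described above (name in the first column, terminated by a colon, followed by the value) can be illustrated with a small parser. This is only a sketch and not part of Harvest: the function name parse_variables and every value in SAMPLE are hypothetical, and the parser assumes that any line which does not begin with one of the variable names from the table can simply be skipped.

```python
import re

# Variable names copied from the table above.
SUPPORTED = {
    "Data-Directory", "Errorlog-File", "Essence-Options", "Gatherd-Inetd",
    "Gatherer-Host", "Gatherer-Name", "Gatherer-Options", "Gatherer-Port",
    "Gatherer-Version", "HTTP-Proxy", "Lib-Directory", "Local-Mapping",
    "Log-File", "Top-Directory", "Working-Directory",
}

# A name starts in the first column and ends at the first colon;
# everything after the colon (minus surrounding blanks) is the value.
LINE = re.compile(r"^([A-Za-z][A-Za-z-]*):[ \t]*(.*)$")


def parse_variables(text):
    """Return {name: value} for the supported variables found in text."""
    settings = {}
    for line in text.splitlines():
        match = LINE.match(line)
        # Checking against SUPPORTED keeps lines such as "http://..."
        # from being mistaken for a variable named "http".
        if match and match.group(1) in SUPPORTED:
            settings[match.group(1)] = match.group(2).rstrip()
    return settings


# Hypothetical values, for illustration only.
SAMPLE = """\
Gatherer-Name:    Example Gatherer
Gatherer-Host:    gatherer.example.com
Gatherer-Port:    8500
HTTP-Proxy:       proxy.example.com:3128
Top-Directory:    /tmp/example-gatherer
  Log-File:       skipped because the name is indented
"""

if __name__ == "__main__":
    for name, value in parse_variables(SAMPLE).items():
        print(f"{name} = {value}")
```

Only the first colon on a line ends the name, so a host:port value such as the one given for HTTP-Proxy is kept whole.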

Notes:

                       

The Essence options are:

Option                  Meaning
--------------------------------------------------------------------
--allowlist filename    File with list of types to allow
--fake-md5s             Generates MD5s for SOIF objects from a .unnest program
--full-text             Use entire file instead of summarizing.  Alternatively,
                        you can perform full text indexing of individual file
                        types by using the FullText.sum summarizer (see
                        Section 4.4.4 for details).
--max-deletions n       Number of GDBM deletions before reorganization
--minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
--no-access             Do not read contents of objects
--no-keywords           Do not automatically generate keywords
--stoplist filename     File with list of types to remove
--type-only             Only type data; do not summarize objects
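
As a rough illustration of how these options might be combined into an Essence-Options setting, the following sketch builds such a line and rejects flags that are not in the table above. It is not part of Harvest: the helper name essence_options_line is hypothetical, and it assumes that the options are written on a single line separated by spaces, as on a command line.

```python
# Flags copied from the table above; True means the flag takes an argument.
ESSENCE_FLAGS = {
    "--allowlist": True,
    "--fake-md5s": False,
    "--full-text": False,
    "--max-deletions": True,
    "--minimal-bookkeeping": False,
    "--no-access": False,
    "--no-keywords": False,
    "--stoplist": True,
    "--type-only": False,
}


def essence_options_line(options):
    """Build an Essence-Options line from (flag, argument) pairs.

    Use None as the argument for flags that take none.
    """
    parts = []
    for flag, argument in options:
        if flag not in ESSENCE_FLAGS:
            raise ValueError(f"unknown Essence option: {flag}")
        takes_argument = ESSENCE_FLAGS[flag]
        if takes_argument and argument is None:
            raise ValueError(f"{flag} requires an argument")
        if not takes_argument and argument is not None:
            raise ValueError(f"{flag} takes no argument")
        parts.append(flag if argument is None else f"{flag} {argument}")
    return "Essence-Options: " + " ".join(parts)


if __name__ == "__main__":
    # The deletion count of 10 is an arbitrary example value.
    print(essence_options_line([("--no-keywords", None),
                                ("--max-deletions", 10)]))
    # prints: Essence-Options: --no-keywords --max-deletions 10
```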

 

A note on full-text summarizing: the Essence --full-text option causes files to bypass the Essence content extraction mechanism; their entire content is included in the SOIF summary stream instead. In some cases this produces unwanted results. For example, a PostScript document is included as raw PostScript rather than first being passed through a PostScript-to-text extractor, which yields few searchable terms and large SOIF objects. The individual file type summarizing mechanism described in Section 4.4.4 works better in this regard, but requires you to specify how data are extracted for each file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.

   






Darren Hardy
Mon Apr 3 15:22:37 MDT 1995