The contents of a crawl directory

Directory /var/opt/sitevision-crawler/<crawl_directory>/conf

This directory contains the configuration files used by the crawler. The following configuration files can be found here:

Configuration file containing the domains to be indexed. For the crawler to index an address, the address's domain must be listed in this file; otherwise, the address will not be indexed.
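As an illustration, a domain file of this kind typically lists one domain per line. The exact file name and matching rules depend on the underlying Nutch URL-filter configuration, so treat this as a hedged sketch with example domains:

```
# One domain per line; a listed domain also covers hosts within it,
# e.g. example.com matches www.example.com.
example.com
intranet.example.org
```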

Configuration file containing the address and login details for Sitevision.

Configuration file for adjusting the settings that determine how indexing is performed.
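As a sketch, assuming this file follows the standard Nutch nutch-site.xml format, a setting could be overridden like this. The property fetcher.server.delay is a standard Nutch property controlling the pause between requests to the same server; whether this particular crawler honours it is an assumption:

```xml
<configuration>
  <!-- Seconds to wait between requests to the same server
       (standard Nutch property; shown here as an example). -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```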

Configuration file containing rules for which addresses are to be indexed.
Here you can, for example, specify which protocols should be indexed, or exclude certain addresses based on regular expressions.
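In stock Nutch, rules of this kind use the regex-urlfilter format: rules are applied top to bottom, the first match wins, a leading '-' excludes matching addresses and '+' includes them. A hedged sketch with hypothetical addresses:

```
# Skip non-http(s) protocols.
-^(ftp|mailto|file):

# Exclude a section of the site (hypothetical example).
-^https?://www\.example\.com/archive/

# Accept everything else.
+.
```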

Directory /var/opt/sitevision-crawler/<crawl_directory>/conf/seed

The urls.txt file, located in this directory, contains the addresses that indexing is based on when indexing is started for the first time.
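A seed file of this kind typically lists one start address per line; the addresses below are hypothetical examples:

```
# One start address (seed) per line.
https://www.example.com/
https://intranet.example.com/start
```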

Directory /var/opt/sitevision-crawler/<crawl_directory>/data

This directory stores all the information retrieved while addresses are indexed.

Emptying this directory causes indexing to start over from the beginning. Please note that the data stored in the search index in Sitevision must also be removed for the search hits to disappear from the search modules.

Command to empty the data directory and remove all search hits from the search index:


Directory /var/opt/sitevision-crawler/<crawl_directory>/logs

The directory contains hadoop.log, cron.log and sitemaps-cron.log, which contain the log messages generated during the indexing process.

Directory /var/opt/sitevision-crawler/<crawl_directory>/plugins

Directory where any extensions to the crawler, such as plugins, are placed.
The current underlying version of Nutch is 1.17; any extensions should therefore be compiled against this version.

Directory /var/opt/sitevision-crawler/<crawl_directory>/conf/sitemap

The sitemaps.txt file, located in this directory, lists the addresses of the sitemaps that indexing will be based on. The content of these sitemaps is then used as a starting point for indexing. Based on the resources found via the listed sitemaps, the crawler extracts links and works its way through the structure.

Sitemaps can be seen as an extended version of seeds.
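Like the seed file, sitemaps.txt typically lists one address per line; the address below is a hypothetical example:

```
# One sitemap address per line.
https://www.example.com/sitemap.xml
```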

Keep in mind that, by default, the crawler performs strict checks on sitemaps; make sure that the sitemaps follow the sitemap specification.
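A sitemap that passes such checks follows the sitemaps.org 0.9 schema. As a minimal sketch (with a hypothetical URL), the shape of a valid sitemap, and how its addresses can be read out with Python's standard library:

```python
import xml.etree.ElementTree as ET

# Minimal sitemap following the sitemaps.org 0.9 schema (illustrative URL).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
  </url>
</urlset>"""

# Element names in a sitemap are namespaced; map a prefix for lookups.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> values listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

print(sitemap_urls(SITEMAP))  # ['https://www.example.com/']
```

Note that a strict validator will also reject sitemaps whose XML is well-formed but whose elements lack the namespace declaration shown above.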