Installation of Sitevision Crawler

First, download the installation file for Sitevision Crawler. It is available from My pages, at this link:

https://minasidor.sitevision.se/sjalvservice/hamta-sitevision/hamta-sitevision-crawler

This function requires the license feature "Search Enterprise".

Install the distribution on your Linux server with the following command:
rpm -i sitevision-crawler-1.6.0.x86_64.rpm
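
To verify that the package was installed, you can query the RPM database. The package name sitevision-crawler is assumed here from the file name of the distribution:
rpm -q sitevision-crawler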

Executable files

The crawler's executable files are located under the directory /opt/sitevision-crawler/bin/.

Data, logging and configuration

In the deployment, the executable files are separated from the data, logs, and configuration used for indexing.

The latter are stored in "crawl directories" under /var/opt/sitevision-crawler. Each crawl directory is separate from the others, which means that different indexing jobs can have completely different configuration files and logs.

During installation, a "crawler_user" (svcusr) is created that can be used to execute the indexing itself. If another user is to run the indexing, it is important that this user has full access to the "crawl_directory".
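
As an illustration, if a different user is to run the indexing, ownership of an existing crawl directory can be handed over with standard Linux commands. Both the user name crawlops and the path below are placeholders, not names created by the installation:
chown -R crawlops:crawlops /var/opt/sitevision-crawler/mycrawl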

Command to create a new "crawl directory":

/opt/sitevision-crawler/bin/svc-create-crawl-directory <crawl_directory> <crawler_user>
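
For example, to create a crawl directory for an intranet index owned by the default crawler user (this sketch assumes that <crawl_directory> is given as a full path under /var/opt/sitevision-crawler; the name intranet is just an example):
/opt/sitevision-crawler/bin/svc-create-crawl-directory /var/opt/sitevision-crawler/intranet svcusr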

Manual indexing

Note: It is very important that you perform all manual indexing as the intended crawler user, by default svcusr.
Otherwise you risk file permission problems later, especially if the manual operations were performed as root.

Command to perform a manual indexing for the <crawl_directory>:
/opt/sitevision-crawler/bin/svc-crawl <crawl_directory>
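
For example, to run a manual indexing as the default crawler user without logging in as that user, sudo can be used (the crawl directory path is the example from above):
sudo -u svcusr /opt/sitevision-crawler/bin/svc-crawl /var/opt/sitevision-crawler/intranet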

At each "indexing round", all outgoing links are extracted and recorded in the crawler. On subsequent executions of the 'svc-crawl', these links are also visited and the process continues.

Command to perform a manual indexing for <crawl_directory>, but based on one or more sitemaps:
/opt/sitevision-crawler/bin/svc-crawl-sitemaps <crawl_directory>

Note: If you are indexing a website whose sitemap files are gzip compressed, you must allow the crawler to download files with the .gz file extension. See the configuration file regex-urlfilter.txt.
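
As a sketch of what such a change might look like: regex-urlfilter.txt typically contains an exclusion pattern for file suffixes that should not be fetched. The exact line in your installation may differ, but the idea is to remove gz from that pattern so that gzip-compressed sitemaps can be downloaded:
# Example exclusion line with gz blocked (illustration only):
-\.(gif|jpg|png|css|js|zip|gz)$
# The same line with gz removed, allowing .gz sitemaps to be fetched:
-\.(gif|jpg|png|css|js|zip)$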

At each "indexing round", newly added pages are fetched from the specified sitemaps. After that, the execution continues
in the same way as when using svc-crawl.

If desired, search hits can be removed from the crawler's database and from the Sitevision search index using the following commands.
Delete all search hits:
/opt/sitevision-crawler/bin/svc-delete-all <crawl_directory>

Delete a single search hit:
/opt/sitevision-crawler/bin/svc-url <crawl_directory> <url_to_be_deleted>
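
For example, to remove a single page from the example crawl directory and from the search index, running as the crawler user (both the path and the URL are placeholders):
sudo -u svcusr /opt/sitevision-crawler/bin/svc-url /var/opt/sitevision-crawler/intranet https://www.example.com/old-page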

Repeated indexing with CRON

The crawler contains logic that, over time, learns how often different pages change and thus need to be re-indexed.

However, for re-indexing to occur, the crawler must be executed regularly at a periodic interval. This is done by registering
the crawler in cron, which will then execute the indexing automatically.

Repeated indexing starting from seed

Registering scheduled indexing in cron
/opt/sitevision-crawler/bin/svc-cron-schedule-crawl <crawl_directory> <crawler_user>


Unregistering scheduled indexing in cron
/opt/sitevision-crawler/bin/svc-cron-unschedule-crawl <crawl_directory> <crawler_user>
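
For example, to schedule the example crawl directory from earlier for the default crawler user, and then check that the entry exists (this assumes the schedule is registered in the crawler user's crontab; if it does not show up there, check /etc/cron.d as well):
/opt/sitevision-crawler/bin/svc-cron-schedule-crawl /var/opt/sitevision-crawler/intranet svcusr
crontab -u svcusr -l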

Repeated indexing based on sitemaps

Registering scheduled indexing in cron
/opt/sitevision-crawler/bin/svc-cron-schedule-sitemaps-crawl <crawl_directory> <crawler_user>


Unregistering scheduled indexing in cron
/opt/sitevision-crawler/bin/svc-cron-unschedule-sitemaps-crawl <crawl_directory> <crawler_user>
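
For example, to switch the example crawl directory from seed-based to sitemap-based scheduling, using only the commands above (the path is the same placeholder as in the earlier examples):
/opt/sitevision-crawler/bin/svc-cron-unschedule-crawl /var/opt/sitevision-crawler/intranet svcusr
/opt/sitevision-crawler/bin/svc-cron-schedule-sitemaps-crawl /var/opt/sitevision-crawler/intranet svcusr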