Filtering in the context of indexing

When indexing web pages, not all of the information on a page should necessarily be searchable. This is usually the case for menus and other navigation elements, because they are the same on every page. If they were indexed, a search for a term that appears in a menu would match every page on the site.

By default, the crawler will ignore the content of the HTML5 tags <nav>, <header> and <footer>.

Links inside these tags are still followed, but the text inside them is not indexed. To change this behavior, adjust the following setting in nutch-site.xml:

<property>
   <name>parser.html.text.ignore_tags</name>
   <value>nav,header,footer</value>
   <description>Comma separated list of HTML tags which should be ignored when extracting text content</description>
</property>
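
To illustrate the effect of this setting, here is a minimal Python sketch of the behavior described above. It is not the crawler's actual implementation: the class name TextAndLinkExtractor and the helper extract are hypothetical and use only Python's standard html.parser module. Text inside the ignored tags is skipped, while links are collected from the whole page.

```python
from html.parser import HTMLParser

# Assumption: mirrors the <value> of parser.html.text.ignore_tags shown above.
IGNORED_TAGS = {"nav", "header", "footer"}


class TextAndLinkExtractor(HTMLParser):
    """Hypothetical helper: collects indexable text and all links."""

    def __init__(self, ignored_tags=IGNORED_TAGS):
        super().__init__()
        self.ignored_tags = set(ignored_tags)
        self.depth = 0   # number of ignored tags we are currently inside
        self.text = []   # text fragments that would be indexed
        self.links = []  # hrefs that would be followed

    def handle_starttag(self, tag, attrs):
        if tag in self.ignored_tags:
            self.depth += 1
        # Links are recorded even inside ignored tags.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.ignored_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every ignored tag.
        if self.depth == 0 and data.strip():
            self.text.append(data.strip())


def extract(html):
    """Return (indexable_text, links) for an HTML string."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.links
```

For a page whose menu sits in a <nav> element, extract returns the body text without the menu labels, but the menu links still appear in the list of links to follow.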

You can also exclude parts of a web page from indexing by wrapping them in the following HTML comments in the page's HTML source:

<!--sv-noindex-->
This text between the comments will be ignored and not indexed.
<!--/sv-noindex-->
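
The effect of these comments can be sketched in a few lines of Python. This is only an illustration under the assumption that everything between the opening and closing comment is dropped before text extraction. The function name strip_noindex is hypothetical, and the crawler may handle nesting or unclosed comments differently.

```python
import re

# Matches one <!--sv-noindex--> ... <!--/sv-noindex--> block, non-greedy,
# spanning line breaks (re.DOTALL).
NOINDEX_BLOCK = re.compile(
    r"<!--sv-noindex-->.*?<!--/sv-noindex-->", re.DOTALL
)


def strip_noindex(html):
    """Return the HTML with every sv-noindex block removed."""
    return NOINDEX_BLOCK.sub("", html)
```

With this sketch, content placed between the two comments is removed, while everything before and after the block is left unchanged.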