Indexing sources that require login

Sitevision Crawler indexes sources that require a login in exactly the same way as a standard version of Nutch, with one exception.

As of version 1.1 of Sitevision Crawler, it is possible to authenticate against systems that use form-based login where the login form lacks an id attribute. In that case the form is identified by the value of its action attribute instead.

Example of httpclient-auth.xml using this feature:

<auth-configuration>
   <credentials authMethod="formAuth"
                loginUrl="https://www.somesite.com/path/to/loginpage"
                loginFormActionUrl="Value of action attribute in the login form"
                loginRedirect="false">
      <loginPostData>
         <field name="username"
                value="crawler-user"/>
         <field name="password"
                value="crawler-password"/>
      </loginPostData>
      <additionalPostHeaders>
         <field name="User-Agent"
                value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1" />
      </additionalPostHeaders>
      <loginCookie>
         <policy>BROWSER_COMPATIBILITY</policy>
      </loginCookie>
   </credentials>
</auth-configuration>
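
As a rough illustration of how the loginFormActionUrl placeholder might be filled in, assume a hypothetical login page at https://intranet.example.com/login whose form is declared as <form action="/auth/do-login" method="post"> with input fields named "user" and "pass". The URL, action value, field names and credentials below are invented for this sketch. It is also an assumption that the action value is copied verbatim from the form and that the field names must match the names of the form's input fields:

<auth-configuration>
   <!-- Hypothetical values; replace with those of the real login page -->
   <credentials authMethod="formAuth"
                loginUrl="https://intranet.example.com/login"
                loginFormActionUrl="/auth/do-login"
                loginRedirect="false">
      <loginPostData>
         <!-- Assumed to match the name attributes of the form's input fields -->
         <field name="user"
                value="crawler-user"/>
         <field name="pass"
                value="crawler-password"/>
      </loginPostData>
      <additionalPostHeaders>
         <field name="User-Agent"
                value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"/>
      </additionalPostHeaders>
      <loginCookie>
         <policy>BROWSER_COMPATIBILITY</policy>
      </loginCookie>
   </credentials>
</auth-configuration>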