Indexing sources that require login
Sitevision Crawler indexes sources that require a login in exactly the same way as a standard version of Nutch, with one exception.
As of version 1.1 of Sitevision Crawler, it is possible to authenticate against systems that use form-based login where the login form lacks an id attribute.
Example of httpclient-auth.xml using the above feature:
<auth-configuration>
  <credentials authMethod="formAuth"
               loginUrl="https://www.somesite.com/path/to/loginpage"
               loginFormActionUrl="Value of action attribute in the login form"
               loginRedirect="false">
    <loginPostData>
      <field name="username" value="crawler-user"/>
      <field name="password" value="crawler-password"/>
    </loginPostData>
    <additionalPostHeaders>
      <field name="User-Agent"
             value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"/>
    </additionalPostHeaders>
    <loginCookie>
      <policy>BROWSER_COMPATIBILITY</policy>
    </loginCookie>
  </credentials>
</auth-configuration>
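
The loginFormActionUrl value in the example is taken from the action attribute of the login form. One way to look it up is sketched below in Python, using only the standard library. This is an illustrative helper and not part of Sitevision Crawler or Nutch: the names FormActionCollector, find_form_actions and SAMPLE_LOGIN_PAGE, as well as the sample markup, are assumptions made for the example.

```python
from html.parser import HTMLParser


class FormActionCollector(HTMLParser):
    """Collects the id and action attributes of every <form> on a page."""

    def __init__(self):
        super().__init__()
        # List of (id, action) tuples; id is None when the form has no id.
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "form":
            attributes = dict(attrs)
            self.forms.append((attributes.get("id"), attributes.get("action")))


def find_form_actions(html):
    """Return (id, action) for each form found in the given HTML text."""
    collector = FormActionCollector()
    collector.feed(html)
    return collector.forms


# Hypothetical login page markup: the form has an action but no id.
SAMPLE_LOGIN_PAGE = """
<html><body>
  <form method="post" action="/path/to/login-handler">
    <input type="text" name="username"/>
    <input type="password" name="password"/>
    <input type="submit" value="Log in"/>
  </form>
</body></html>
"""

if __name__ == "__main__":
    for form_id, action in find_form_actions(SAMPLE_LOGIN_PAGE):
        print(f"id={form_id!r} action={action!r}")
```

For a form without an id, the first element of the tuple is None and the second element is the candidate value for loginFormActionUrl. The names of the input fields in the same form correspond to the field entries under loginPostData.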