
Indexing sources that require login

Indexing of sources that require a login works in Sitevision Crawler exactly the same way as in a standard version of Nutch, with one exception.

As of version 1.1 of Sitevision Crawler, it is possible to authenticate against systems that use form-based login where the login form lacks an id attribute.

Example of an httpclient-auth.xml that uses this feature:

xml
<auth-configuration>
  <credentials authMethod="formAuth"
               loginUrl="https://www.somesite.com/path/to/loginpage"
               loginFormActionUrl="Value of action attribute in the login form"
               loginRedirect="false">
    <loginPostData>
      <field name="username" value="crawler-user"/>
      <field name="password" value="crawler-password"/>
    </loginPostData>
    <additionalPostHeaders>
      <field name="User-Agent" value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1" />
    </additionalPostHeaders>
    <loginCookie>
      <policy>BROWSER_COMPATIBILITY</policy>
    </loginCookie>
  </credentials>
</auth-configuration>
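To make the placeholder in loginFormActionUrl concrete, here is a sketch of what a login form without an id attribute might look like in the source of the login page. The markup is hypothetical: the action path and the input names are assumptions chosen to line up with the configuration example, not taken from any real site.

```xml
<!-- Hypothetical login form as it might appear on the page at loginUrl. -->
<!-- Note that the form element has no id attribute. -->
<form action="/path/to/login-handler" method="post">
  <!-- The input names are assumed to match the field names in loginPostData. -->
  <input type="text" name="username" />
  <input type="password" name="password" />
  <input type="submit" value="Log in" />
</form>
```

For a form like this, loginFormActionUrl would presumably be set to the value of the action attribute as it appears in the page source (here "/path/to/login-handler"), and each field in loginPostData would use the name of the matching input element. The page does not say whether a relative or an absolute action value is expected, so check this against the actual login page.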