This tutorial shows how to crawl data from popular online shopping portals (Amazon, Flipkart, Naaptol and Jabong) and index the crawled data into Apache Solr. You will also learn how to crawl Ajax-enabled and secured (https) sites with Apache Nutch.

Please go through the previous tutorial to set up Apache Nutch 2.2 with MySQL.

Update the nutch-site.xml:
cd ${APACHE_NUTCH_HOME}/runtime/local/conf

Edit the nutch-site.xml to enable crawling through secure https:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>http.agent.name</name>
<value>DemoWebCrawler</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field. This allows selecting a non-English language as the default one to retrieve. It is a useful setting for search engines built for a certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information is available </description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>


<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>
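A quick way to confirm the edit is to read the plugin.includes value back out of the file. As far as I know, the part that matters for https is protocol-httpclient in place of the stock protocol-http. The sketch below is only a sanity check: the function names plugin_includes and has_plugin are made up for illustration, and it assumes the <name> and <value> tags sit on separate lines as in the file above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Nutch.

# Print the <value> that follows <name>plugin.includes</name> in the file $1.
plugin_includes() {
  awk '/<name>plugin.includes<\/name>/ { found = 1; next }
       found && /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); print; exit }' "$1"
}

# Succeed if plugin id $2 appears as a top-level entry of plugin.includes in $1.
# Grouped entries such as parse-(html|tika) are not expanded by this check.
has_plugin() {
  plugin_includes "$1" | tr '|' '\n' | grep -Fqx -- "$2"
}
```

From ${APACHE_NUTCH_HOME}/runtime/local, something like `has_plugin conf/nutch-site.xml protocol-httpclient && echo "protocol-httpclient is enabled"` should then report whether the plugin is listed.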

Update regex-urlfilter.txt so that URLs with query strings are not blocked:

By default, regex-urlfilter.txt blocks URLs that have query-string parameters:

# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

Modify that file so that URLs with query-string parameters are crawled:

# skip URLs containing certain characters as probable queries, etc.

-[*!@]

In other words, comment out the original rule (using #) and replace it with the relaxed one.
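To see why this change matters for the seed URLs used in this tutorial, here is a rough emulation of the skip rule with grep -E. It is only an approximation: Nutch applies Java regular expressions and evaluates the rules in the file from top to bottom, while this sketch tests a single character class. The helper name is_skipped is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if URL $2 contains any character from the
# bracket expression $1 (the part of a "-[...]" rule after the minus sign).
is_skipped() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Compare the default rule with the relaxed one on two of the seed URLs.
for rule in '[?*!@=]' '[*!@]'; do
  echo "rule: -$rule"
  for url in \
    'http://www.naaptol.com/brands/nokia/mobile-phones.html' \
    'http://www.jabong.com/men/shoes/sports-shoes/?source=topnav_men'
  do
    if is_skipped "$rule" "$url"; then
      echo "  skip $url"
    else
      echo "  keep $url"
    fi
  done
done
```

With the default rule the Jabong seed is skipped because it contains ? and =, while the relaxed rule keeps both URLs.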

Now let's crawl data from Naaptol (mobile phones), Amazon (books), Flipkart (sports shoes) and Jabong (sports shoes).

Edit your seed.txt and paste the following:

http://www.naaptol.com/brands/nokia/mobile-phones.html

http://www.flipkart.com/mens-footwear/shoes/sports-shoes/pr?sid=osp,cil,nit,1cu&otracker=hp_nmenu_sub_men_0_Sports%20Shoes

http://www.amazon.in/s/ref=nb_sb_noss_2/278-5129563-3057638?url=search-alias%3Daps&field-keywords=machine%20learning

http://www.jabong.com/men/shoes/sports-shoes/?source=topnav_men

Start crawling by typing the following into the command line:

bin/nutch inject urls
bin/nutch generate -topN 20
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

Repeat the last four commands (generate, fetch, parse and updatedb) once more so that the links discovered in the first round are crawled as well.
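If you would rather not retype the cycle, it can be wrapped in a small loop. This is just a sketch built from the commands above: the function name crawl_rounds and the NUTCH variable are my own, and it assumes you run it from ${APACHE_NUTCH_HOME}/runtime/local after the inject step.

```shell
#!/bin/sh
# NUTCH is a hypothetical override; by default it points at the bin/nutch
# script used in the commands above.
NUTCH="${NUTCH:-bin/nutch}"

# Run the generate/fetch/parse/updatedb cycle $1 times with -topN $2.
# Stops at the first step that fails.
crawl_rounds() {
  rounds="$1"
  top_n="$2"
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "round $i of $rounds"
    "$NUTCH" generate -topN "$top_n" || return 1
    "$NUTCH" fetch -all || return 1
    "$NUTCH" parse -all || return 1
    "$NUTCH" updatedb || return 1
    i=$((i + 1))
  done
}
```

Calling `crawl_rounds 2 20` corresponds to running the four commands twice with the same -topN value as above.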

Set up and index with Solr
Use the latest version of Solr 4 (I'm using 4.9); other 4.x versions will work fine too. Untar it to $HOME/apachesolr4.X.XXX. This folder will now be referred to as ${APACHE_SOLR_HOME}.

Download the schema.xml from this link and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

Start Solr from the terminal:

cd ${APACHE_SOLR_HOME}/example

java -jar start.jar

You can check that it is running by opening http://localhost:8983/solr in your web browser. Select collection1 from the core selector.

Leave that terminal running and from a different terminal type the following:

cd ${APACHE_NUTCH_HOME}/runtime/local/

bin/nutch solrindex http://localhost:8983/solr/ -reindex

You can now run Solr queries against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have already crawled the above websites, type a search into the input box titled "q" or "fq", for example:

content:jabong OR content:nokia
(Similarly, try other terms such as shoes, books, etc.)

You should see the matching crawled documents in the response.
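The same kind of search can also be sent from the terminal with curl against the select handler of collection1. Treat this as a sketch under a few assumptions: the function name solr_query and the SOLR_URL variable are my own, and the url and title fields requested via fl are assumed to exist in the schema.xml you installed.

```shell
#!/bin/sh
# SOLR_URL is a hypothetical override for the core's base URL.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/collection1}"

# Send query $1 to Solr and print the JSON response.
# $2 is the number of rows to return (defaults to 5).
solr_query() {
  curl -sG "$SOLR_URL/select" \
    --data-urlencode "q=$1" \
    --data-urlencode "rows=${2:-5}" \
    --data-urlencode "fl=url,title" \
    --data-urlencode "wt=json"
}
```

For example, `solr_query 'content:jabong OR content:nokia'` sends the query from above; curl takes care of URL-encoding the spaces and colons.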

Congratulations :) on building your first small search engine from the crawl data of popular online shopping websites.

Big Data Analytics and Machine Learning

Saturday 24 January 2015

Web Crawling Naptol, Flipkart, Amazon, Jabong with Apache Nutch and Apache Solr
