Apache Nutch 2, Solr 5, HBase Setup Tutorial

This is a step-by-step tutorial based on the official Apache Nutch 2 tutorial.

First, download each component you will need, and pay attention to the versions.
Nutch: 2.3.1
HBase: 0.98.8
Solr: 5.5.2

Adjust the Nutch configuration files:

nutch-site.xml

Customize your crawl properties (conf/nutch-site.xml)

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

cd to your working directory and create the seed list:
mkdir -p urls
cd urls
touch seed.txt
Add this line to seed.txt: http://nutch.apache.org/
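
The seed-list steps above can also be done in one go. This is a minimal sketch, assuming it runs from the directory where urls/ should live; note that it overwrites any existing urls/seed.txt.

```shell
# Create the seed directory and write a single seed URL (one URL per line).
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show what the injector will read.
cat urls/seed.txt
```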

(Optional) Configure Regular Expression Filters

Edit conf/regex-urlfilter.txt and replace

# accept anything else
+.

with

 +^http://([a-z0-9]*\.)*nutch.apache.org/

or any other regular expression you want.
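
To get a feel for what such a rule admits, you can try the expression against a few URLs with grep. This is only a sketch: grep -E approximates the Java regular expressions Nutch uses, the sample URLs are made up, and the real filter applies all rules in the file in order.

```shell
# The pattern from the rule above, without the leading "+" (which means "accept").
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Hypothetical sample URLs, one per line.
urls='http://nutch.apache.org/
http://wiki.nutch.apache.org/page
http://example.com/
https://nutch.apache.org/'

# Keep only the URLs the pattern matches, then print them.
accepted=$(printf '%s\n' "$urls" | grep -E "$pattern")
printf '%s\n' "$accepted"
```

The https:// URL is not accepted, because the pattern only allows the http:// scheme.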

Specify the GORA backend in $NUTCH_HOME/conf/nutch-site.xml

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

 

ivy.xml

Ensure the HBase gora-hbase dependency is available in $NUTCH_HOME/ivy/ivy.xml

    <!-- Uncomment this to use HBase as Gora backend. -->
    <dependency org="org.apache.gora" name="gora-hbase" 
         rev="0.6.1" conf="*->default" />

Add one more dependency:

<dependency org="org.apache.hbase" name="hbase-common" rev="0.98.8-hadoop2" conf="*->default" />

 

gora.properties

Add a property to conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
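
A quick way to confirm the property is in place before building is to grep for it. This is a small sketch; check_gora_backend is a hypothetical helper name, and the example call assumes you are in the Nutch home.

```shell
# check_gora_backend FILE: succeed if FILE sets HBaseStore as the default
# datastore on an uncommented line.
check_gora_backend() {
  grep -q '^gora\.datastore\.default=org\.apache\.gora\.hbase\.store\.HBaseStore[[:space:]]*$' "$1"
}

# Example call; conf/gora.properties is relative to the Nutch home.
if [ -f conf/gora.properties ]; then
  if check_gora_backend conf/gora.properties; then
    echo "HBase backend configured"
  else
    echo "gora.datastore.default is not set to HBaseStore" >&2
  fi
fi
```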

 

Build Nutch source code:
The Nutch website provides source code, so we have to build it ourselves (with ant).
In the Nutch home, run

ant runtime

 

On success you will see BUILD SUCCESSFUL, and a runtime/ directory will appear under the Nutch home.
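
If you script the build, a check like the following can confirm the result. It is a sketch that assumes the local runtime ends up in runtime/local with the nutch script under bin/; check_nutch_build is a hypothetical helper, and NUTCH_HOME falls back to the current directory when unset.

```shell
# check_nutch_build NUTCH_HOME: succeed if the local runtime contains the nutch script.
check_nutch_build() {
  [ -f "$1/runtime/local/bin/nutch" ]
}

if check_nutch_build "${NUTCH_HOME:-.}"; then
  echo "runtime/local is ready"
else
  echo "runtime/local/bin/nutch not found; did 'ant runtime' finish?" >&2
fi
```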


First Crawl

Configure HBase:
Edit conf/hbase-site.xml under the HBase home

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/testuser/zookeeper</value>
  </property>
</configuration>

Note: change both values to your own paths. The second one, the ZooKeeper path, can be anything you like.

cd to the HBase home and run bin/start-hbase.sh

Once it is running, you can check the web UI: http://localhost:60010/master-status
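
HBase can take a few seconds to come up, so a script may want to wait for the status page before continuing. A minimal sketch, assuming curl is installed; wait_for_hbase_ui is a hypothetical helper, and the default URL is the one given above.

```shell
# wait_for_hbase_ui TRIES [URL]: poll the master status page once per second
# until it answers or TRIES attempts have been used up.
wait_for_hbase_ui() {
  tries=$1
  url=${2:-http://localhost:60010/master-status}
  while [ "$tries" -gt 0 ]; do
    # -s: silent, -f: treat HTTP errors as failure, -o: discard the page body
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

For example, `wait_for_hbase_ui 30 || echo "HBase UI did not come up"` waits up to about 30 seconds.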

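Go back to the Nutch directory (with a source build, bin/nutch is under runtime/local).

Run bin/nutch inject urls -crawlId test.first.generate (injects the seed URLs into HBase)

Run bin/nutch generate -crawlId test.first.generate -topN 1 (only one URL the first time; this produces the fetch list and also a batch id)

Run bin/nutch fetch -all -crawlId test.first.generate

Run bin/nutch parse -all -crawlId test.first.generate

Run bin/nutch updatedb -all -crawlId test.first.generate

The generate, fetch, parse, and updatedb steps above can be repeated round after round (for example, change topN to 100 on the second round). The flow follows the Nutch 1.x tutorial.

Preparing inverted links: Nutch 2.x does not appear to need the invertlinks step.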
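
Repeating the cycle by hand gets tedious, so it can be wrapped in a loop. This is a sketch rather than a replacement for Nutch's own crawl script: crawl_rounds and the NUTCH variable are hypothetical names, the commands are the ones used above, and it assumes inject has already been run.

```shell
# crawl_rounds CRAWL_ID ROUNDS TOPN: repeat generate -> fetch -> parse -> updatedb.
# NUTCH is a hypothetical variable naming the nutch script; it defaults to bin/nutch.
crawl_rounds() {
  crawl_id=$1
  rounds=$2
  top_n=$3
  nutch=${NUTCH:-bin/nutch}
  round=1
  while [ "$round" -le "$rounds" ]; do
    echo "round $round of $rounds"
    # Stop at the first failing step so later steps do not run on stale data.
    "$nutch" generate -crawlId "$crawl_id" -topN "$top_n" || return 1
    "$nutch" fetch -all -crawlId "$crawl_id" || return 1
    "$nutch" parse -all -crawlId "$crawl_id" || return 1
    "$nutch" updatedb -all -crawlId "$crawl_id" || return 1
    round=$((round + 1))
  done
}
```

For example, `crawl_rounds test.first.generate 2 100` would run two rounds with topN set to 100.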


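Configure Solr

Back up the original schema.xml first (this path follows the official tutorial, which uses the Solr 4.x example layout; adjust it to wherever your Solr 5 installation keeps the core's configuration):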

mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org

 

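Start Solr

cd to the Solr home,

and run bin/solr start -e cloud -noprompt

This starts two Solr instances.

And then we can finally begin the last step: indexing to Solr!

Unfortunately, if your earlier steps match mine and you then follow the official Apache tutorial, you will run into: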

1.

<h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr /><i><small>Powered by Jetty://</small></i>

Just like this person: http://stackoverflow.com/questions/24654957/error-404-prob-accessing-solr-update-reason-not-found

Fortunately John, in the answers there, kindly explained it for us beginners:

just add the core name to the Solr URL (mine is gettingstarted_shard1_replica2)

2.

Next comes

No IndexWriters activated

 

This appears to be caused by a missing plugin (Ref: http://stackoverflow.com/questions/17649567/nutch-message-no-indexwriters-activated-while-loading-to-solr/25945844#25945844),

so add the following to ${NUTCH_HOME}/conf/nutch-site.xml

<property>
 <name>plugin.includes</name>
 <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
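
Since plugin.includes is a regular expression over plugin ids, it can help to check which ids a value actually matches before rebuilding. The sketch below uses a shortened, hypothetical value and grep -Ex (whole-line match) as an approximation of how Nutch matches the expression against each plugin id; plugin_included is a hypothetical helper.

```shell
# A shortened, hypothetical plugin.includes value for illustration.
includes='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr'

# plugin_included ID: succeed if ID matches the whole expression (-x).
plugin_included() {
  printf '%s\n' "$1" | grep -Eqx "$includes"
}

for id in indexer-solr parse-tika index-more; do
  if plugin_included "$id"; then
    echo "$id: included"
  else
    echo "$id: NOT included"
  fi
done
```

A missing | between two alternatives silently drops both plugins, so a check like this is a cheap way to catch typos in a long value.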

So my complete indexing command is:

bin/nutch solrindex http://localhost:8983/solr/gettingstarted_shard1_replica2 -all -crawlId test.first.generate

After that, run a query in Solr, and you can weep tears of joy at finally seeing search results!
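
The query can also be run from the command line. A minimal sketch, assuming curl is installed and Solr's standard select handler is enabled on the core; solr_select_url and solr_query are hypothetical helpers, and the base URL and core name are the ones used above.

```shell
# solr_select_url BASE CORE: print a match-all query URL for the given core.
# rows=0 asks for the hit count only, wt=json for a JSON response.
solr_select_url() {
  printf '%s/%s/select?q=*:*&rows=0&wt=json\n' "$1" "$2"
}

# solr_query BASE CORE: fetch the response; its numFound field is the number
# of indexed documents.
solr_query() {
  curl -s "$(solr_select_url "$1" "$2")"
}
```

For example, `solr_query http://localhost:8983/solr gettingstarted_shard1_replica2` prints the JSON response for that core.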


References:

Apache Nutch: http://wiki.apache.org/nutch/Nutch2Tutorial

Apache Nutch Command Options: https://wiki.apache.org/nutch/CommandLineOptions

Apache Solr: http://lucene.apache.org/solr/4_10_3/tutorial.html

HBase: http://hbase.apache.org/book.html#quickstart
