Apache Nutch 1 + Apache Solr 3 Setup

This post was written on 2016/02/18. Although Nutch 2.x and Solr 5.x are already available, I am using Nutch 1.11 and Solr 3.6.2.

Reasons:

  • Apache Nutch: the official Apache Nutch 2 tutorial states:

    It is assumed that you have a working knowledge of configuring Nutch 1.X, as currently configuration in 2.X is more complex. It is important to take this in to consideration before progressing any further. We therefore strongly advise that you check out the Nutch 1.X tutorial.

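  • Apache Solr: the Apache Nutch 1 tutorial does not say which Solr version it expects. I first tried 5.x, but its configuration and project layout differ considerably from what the tutorial describes, and the commands threw an IOException for reasons I could not determine. There are also discussions online reporting that Solr 4.x is not compatible with Nutch 1.x.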

 

The steps for setting up Nutch + Solr are as follows:

  1. Download Apache Nutch and Apache Solr from their official sites.
  2. Follow the steps in NutchTutorial to verify and test Apache Nutch.
  3. Start Solr as described in SolrTutorial
    (user:~/solr/example$ java -jar start.jar) and confirm that it runs properly. (That tutorial is written for 4.x, but Solr starts the same way, so it will have to do…)

    • [Screenshot: Solr running]
  4. As the last step, once Nutch has finished crawling the pages, send the results to Solr for indexing.
  5. Done!
    Successful run:
    [Screenshot: successful run]
    Query:
    [Screenshot: query]
    Result:
    [Screenshot: query result]
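
Steps 3 and 4 can be sketched roughly as the script below. It is only a sketch under assumptions, not the exact commands I ran: the install path in NUTCH_HOME, the seed directory `urls`, the crawl directory `crawl`, and the number of rounds are all made-up values, and the Solr URL is simply the default of the example server started with `java -jar start.jar`. The `bin/crawl -i -D solr.server.url=...` form is the one described in NutchTutorial for recent 1.x releases, so check it against your own version. The Nutch tutorial also describes copying Nutch's conf/schema.xml into Solr's conf directory before indexing; that is not shown here. By default the script only prints the commands; set RUN=1 to actually run them.

```shell
#!/bin/sh
# Sketch of "crawl with Nutch, then index into Solr". All values are assumptions.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.11}"  # assumed install location
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"   # default URL of the Solr example server
SEED_DIR="${SEED_DIR:-urls}"                         # directory containing the seed list (assumed name)
CRAWL_DIR="${CRAWL_DIR:-crawl}"                      # output directory for the crawl data (assumed name)
ROUNDS="${ROUNDS:-2}"                                # number of crawl rounds (assumed)

# Print a command; execute it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

# Step 4: crawl the seed URLs and index the result into Solr (-i).
run "$NUTCH_HOME/bin/crawl" -i -D "solr.server.url=$SOLR_URL" "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS"

# Check: ask Solr for a few indexed documents.
run curl -s "$SOLR_URL/select?q=*:*&wt=json&rows=5"
```

Solr must already be running (step 3) before this is run with RUN=1, otherwise the indexing step has nothing to connect to.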