Apache Nutch 2 and MySQL Integration Tutorial

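This tutorial covers:

  1. How to integrate Apache Nutch with MySQL
  2. How to use Nutch to store crawled data in MySQL

Nutch version: 2.2.1

Why 2.2.1?

Nutch's storage layer depends on Gora, which can adapt to many kinds of databases,

e.g. HBase, Cassandra, MySQL, etc.

However, Nutch's dependency file (ivy/ivy.xml) notes that using MySQL with Nutch requires downgrading gora-core to 0.2.1.

If you do that on the current latest release, 2.3.1, you get this error: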

InjectorJob: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation

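According to this post, the cause is a Hadoop version conflict, apparently introduced by the Ant build. Deleting hadoop-core-1.0.1.jar from the dependencies then triggers a different error saying that distributed processing is not supported.

Both errors are most likely API compatibility problems between the Nutch, Gora, and Hadoop versions,

so in the end I downgraded Nutch to 2.2.1, after which Nutch ran successfully.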

 

Installation and execution steps:

Most of the steps are the same as in the previous post,

APACHE NUTCH 2 AND SOLR 5 AND HBASE 0.98 SETUP TUTORIAL

MYSQL SETUP

Create the database:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

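Create the table (optional). If you create the table yourself, its name must be derived from the crawlId, following the pattern crawlId_webpage. If the Nutch database account has sufficient privileges, Nutch can create the table automatically and you can skip the steps below.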

use nutch;
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;
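
The naming rule mentioned above (crawlId_webpage) can be sketched in Python. The helper name `webpage_table_name` is hypothetical and not part of Nutch; it only illustrates which table name to expect for a given crawl ID, assuming that an empty crawl ID maps to the plain `webpage` table created above.

```python
def webpage_table_name(crawl_id: str = "") -> str:
    """Return the table name expected for a given crawl ID.

    Assumption: with no crawl ID the table is simply `webpage`;
    otherwise the crawl ID is prepended as `<crawlId>_webpage`.
    """
    crawl_id = crawl_id.strip()
    return f"{crawl_id}_webpage" if crawl_id else "webpage"


if __name__ == "__main__":
    print(webpage_table_name("test"))
    print(webpage_table_name())
```

If you pre-create the table by hand, the name in the CREATE TABLE statement should match what this rule yields for the crawlId you pass on the command line.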

The files that need to be changed are:

IVY.XML

In the <!-- Gora artifacts --> section, enable only the following three dependencies, and make sure the versions match exactly:

 <!--================-->
 <!-- Gora artifacts -->
 <!--================-->
 <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
 <!-- Uncomment this to use SQL as Gora backend. It should be noted that the
 gora-sql 0.1.1-incubating artifact is NOT compatable with gora-core 0.3. Users should
 downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
 <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
 <!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
 <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
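
Because the versions have to match exactly, a quick check of ivy.xml before building may save a failed build. The following is a minimal sketch, not part of Nutch: the function name `check_ivy_dependencies` and the `SAMPLE` document are made up for illustration, and it assumes the `dependency` elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Revisions taken from the ivy.xml fragment above.
REQUIRED = {
    ("org.apache.gora", "gora-core"): "0.2.1",
    ("org.apache.gora", "gora-sql"): "0.1.1-incubating",
    ("mysql", "mysql-connector-java"): "5.1.18",
}


def check_ivy_dependencies(ivy_xml: str) -> list:
    """Return a list of human-readable problems found in the ivy.xml text."""
    root = ET.fromstring(ivy_xml)
    # Commented-out dependencies are skipped by the parser, so they count as missing.
    found = {
        (dep.get("org"), dep.get("name")): dep.get("rev")
        for dep in root.iter("dependency")
    }
    problems = []
    for (org, name), rev in REQUIRED.items():
        actual = found.get((org, name))
        if actual is None:
            problems.append(f"{name}: missing or still commented out")
        elif actual != rev:
            problems.append(f"{name}: expected rev {rev}, found {actual}")
    return problems


# Illustrative input only: gora-core not downgraded, gora-sql still commented out.
SAMPLE = """<ivy-module version="1.0">
  <dependencies>
    <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
    <!-- <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/> -->
    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
  </dependencies>
</ivy-module>"""

if __name__ == "__main__":
    for problem in check_ivy_dependencies(SAMPLE):
        print(problem)
```

To run it against a real file, read ivy/ivy.xml into a string and pass it to the function; an empty list means all three dependencies are enabled with the expected revisions.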

GORA.PROPERTIES

Add the following section and comment out everything else:

# MySQL properties
################################

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true
gora.sqlstore.jdbc.user=username
gora.sqlstore.jdbc.password=password
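
The JDBC URL packs the host, port, database name, and connection options into one line, which makes typos easy to miss. As a rough sketch using only the Python standard library, the properties can be parsed and the URL split into its parts. The helper names `parse_properties` and `describe_jdbc_url` are hypothetical, and the parser handles only plain key=value lines, not the full Java properties syntax.

```python
from urllib.parse import parse_qs, urlsplit


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first "=" only; the JDBC URL itself contains more.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def describe_jdbc_url(jdbc_url: str) -> dict:
    """Split a jdbc:mysql:// URL into host, port, database and options."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL")
    parts = urlsplit(jdbc_url[len("jdbc:"):])
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "options": options,
    }


SAMPLE = """
# MySQL properties
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true
gora.sqlstore.jdbc.user=username
gora.sqlstore.jdbc.password=password
"""

if __name__ == "__main__":
    props = parse_properties(SAMPLE)
    print(describe_jdbc_url(props["gora.sqlstore.jdbc.url"]))
```

The database name reported here should be the one created in the MYSQL step (nutch).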

gora-sql-mapping.xml

Under the <!-- parse fields --> comment, change the text field to:
<field name="text" column="text" length="32000" jdbc-type="text"/>

 

Otherwise an error occurs. Most tutorials do not mention this; without the change, MySQL throws an exception.
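
Since this change is easy to forget, it can be checked before crawling. The sketch below looks up the jdbc-type of the field named "text" in a mapping document. The function name `text_field_jdbc_type` is hypothetical, and `SAMPLE` is a reduced, assumed layout for illustration; the real gora-sql-mapping.xml contains many more fields.

```python
import xml.etree.ElementTree as ET


def text_field_jdbc_type(mapping_xml: str):
    """Return the jdbc-type of the field named "text", or None if absent."""
    root = ET.fromstring(mapping_xml)
    for field in root.iter("field"):
        if field.get("name") == "text":
            return field.get("jdbc-type")
    return None


# Reduced, illustrative mapping containing only the field discussed above.
SAMPLE = """<gora-orm>
  <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
    <!-- parse fields -->
    <field name="text" column="text" length="32000" jdbc-type="text"/>
  </class>
</gora-orm>"""

if __name__ == "__main__":
    print(text_field_jdbc_type(SAMPLE))
```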

nutch-site.xml

<configuration>
 <property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
 </property>
 <property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
 </property>
 <property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
   Currently the following stores are available: ….
  </description>
 </property>
</configuration>
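
The plugin.includes value is a regular expression, so a small slip in it can silently disable plugins. Here is a minimal sketch for inspecting a nutch-site.xml. It assumes that a plugin is enabled only when the whole plugin id matches the expression; the helper names `load_properties` and `plugin_enabled` and the shortened `SAMPLE` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def load_properties(site_xml: str) -> dict:
    """Collect <property> name/value pairs from a Hadoop-style config file."""
    root = ET.fromstring(site_xml)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def plugin_enabled(plugin_includes: str, plugin_id: str) -> bool:
    """Assumption: the whole plugin id must match the plugin.includes regex."""
    return re.fullmatch(plugin_includes, plugin_id) is not None


# Shortened, illustrative configuration.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    props = load_properties(SAMPLE)
    print(props["storage.data.store.class"])
    for plugin in ("parse-html", "parse-pdf", "urlnormalizer-basic"):
        print(plugin, plugin_enabled(props["plugin.includes"], plugin))
```

Under the same assumption, a missing `|` between two alternatives, such as `urlnormalizer-(pass|basic)protocol-http`, matches neither `urlnormalizer-basic` nor `protocol-http`.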

Command

All commands are the same as in APACHE NUTCH 2 AND SOLR 5 AND HBASE 0.98 SETUP TUTORIAL, except that the last step,

bin/nutch updatedb -all -crawlId test.first.generate

becomes

bin/nutch updatedb -crawlId test.first.generate
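
If the crawl steps are driven from a script, the modified command can be built as an argument list. This is only a sketch: the function name `updatedb_command` is hypothetical, and the code prints the command rather than running it.

```python
import shlex


def updatedb_command(crawl_id: str, nutch_bin: str = "bin/nutch") -> list:
    """Build the argument list for the updatedb step without the -all flag."""
    if not crawl_id:
        raise ValueError("crawl_id must not be empty")
    return [nutch_bin, "updatedb", "-crawlId", crawl_id]


if __name__ == "__main__":
    cmd = updatedb_command("test.first.generate")
    # To execute it, pass `cmd` to subprocess.run(cmd, check=True)
    # from the directory that contains bin/nutch.
    print(shlex.join(cmd))
```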

 

Main reference:

https://anil.io/blog/apache/nutch/apache-nutch-2-2-mysql-and-solr-5-2-1-tutorial/

Official link:

http://www.solutions.asia/?p=180

 

Related error reports:

https://issues.apache.org/jira/browse/NUTCH-1473

https://issues.apache.org/jira/browse/NUTCH-1497
