[Doc] Compass Search Engine Study Notes

Author: TonyQ (沉默是金)

Board: java

Title: [Doc] Compass Search Engine Study Notes

Time: Tue Jul 8 11:36:47 2008

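Background: Compass is a search engine interface built on top of Lucene.
It covers the usual search needs: by building index files ahead of time,
it brings the cost of full-text search down to an acceptable level.

What follows are the conclusions I wrote down while reading. Stray thoughts
and asides are left in exactly as they came. I hope it works as an
introduction for anyone with similar needs... -.-;;

This is NOT a step-by-step guideline.
It is not a "Next, Next, Next" walkthrough, just notes on the things I needed.
If something is hard to follow, it may be my poor writing,
or you may need to read up on some background first.

──────────────────────────────────
What I have concluded so far
──────────────────────────────────

1. Compass's API deliberately resembles Hibernate's.
   It has the same trio: Factory, Session, Transaction.

   http://0rz.tw/8b4nh

Quoted from that page:

Compass API

As you will learn in this chapter, Compass high level API looks strangely
familiar. If you used an ORM framework (Hibernate, JDO or JPA), you should
feel right at home. This is of-course, intentional. The aim is to let the
developer learn as less as possible in terms of interaction with Compass.
Also, there are so many design patterns of integrating this type of API
with different applications models, that it is a shame that they won't be
used with Compass as well.

For Hibernate users, Compass maps to SessionFactory, CompassSession maps to
Session, and CompassTransaction maps to Transaction.

For usage examples, see the original document.

Hibernate opens a connection to a database, so my question at this point
was: where does the "database" come from here? For performance reasons it
clearly cannot be reading the original DB. As far as I knew, Compass builds
index files, so I expected the following chapters to answer this.

──────────────────────────────────
Sure enough, the next thing the guide talks about is Connection!
It still is not about the index files themselves, though, but about where
they are read from.

http://0rz.tw/864n5

This chapter is mostly about how to write the connection settings.
It is long and covers several approaches, and the Jdbc part is longer still
because it has to deal with all the database settings.
Pick the sections you need. What I need here is File System Store,
but here is the full outline anyway:

4.1. File System Store
4.2. RAM Store
4.3. Jdbc Store
    4.3.1. Managed Environment
    4.3.2. Data Source Provider
        4.3.2.1. Driver Manager
        4.3.2.2. Jakarta Commons DBCP
        4.3.2.3. c3p0
        4.3.2.4. JNDI
        4.3.2.5. External
    4.3.3. File Entry Handler
    4.3.4. DDL
4.4. Lock Factory
4.5. Local Directory Cache
4.6. Lucene Directory Wrapper
    4.6.1. SyncMemoryMirrorDirectoryWrapperProvider
    4.6.2. AsyncMemoryMirrorDirectoryWrapperProvider

What gets defined here is how and where the index files are stored,
and where they are read from.

──────────────────────────────────
Now let's move quickly to the next step...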
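Before moving on: the choice between File System Store and RAM Store comes down to a single connection setting. Here is a minimal sketch of that idea using only java.util.Properties. The key compass.engine.connection and the file:// and ram:// prefixes are written from my memory of the Compass reference, so check them against your version. The class name, method names, and index paths are made up for illustration.

```java
import java.util.Properties;

final class ConnectionSettingsSketch {

    // Setting key as I recall it from the Compass reference (assumption: verify for your version).
    static final String CONNECTION_KEY = "compass.engine.connection";

    /** Settings for an index kept on disk under the given (hypothetical) directory. */
    static Properties fileSystemStore(String indexDir) {
        Properties settings = new Properties();
        settings.setProperty(CONNECTION_KEY, "file://" + indexDir);
        return settings;
    }

    /** Settings for an in-memory index; nothing survives a restart, which suits tests. */
    static Properties ramStore(String name) {
        Properties settings = new Properties();
        settings.setProperty(CONNECTION_KEY, "ram://" + name);
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(fileSystemStore("target/test-index"));
        System.out.println(ramStore("test-index"));
    }
}
```

The point is only that the store type is a configuration decision, not something the searching code has to know about. How these settings are actually handed to Compass is covered in the chapter linked above.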
──────────────────────────────────
Ladies and gentlemen!!! Four aces!!! (gets slapped)
(Oops, picked up the wrong script... back to the actual topic...)

Just from the chapter title you can tell we are finally getting to the
search engine itself. Nothing to skip in this chapter... grind through it...
it is mostly explanation of principles...
──────────────────────────────────

Chapter 5. Search Engine
http://0rz.tw/6b4oU

----------------------------------------------------
5.1. Introduction

Compass Core provides an abstraction layer on top of the wonderful Lucene
Search Engine. Compass also provides several additional features on top of
Lucene, like two phase transaction management, fast updates, and optimizers.
When trying to explain how Compass works with the Search Engine, first we
need to understand the Search Engine domain model.

In my words: before explaining how Compass works with the underlying Lucene,
we have to understand the search engine's domain model.

----------------------------------------------------
5.2. Alias, Resource and Property

Resource represents a collection of properties. You can think about it as a
virtual document - a chunk of data, such as a web page, an e-mail message,
or a serialization of the Author object. A Resource is always associated
with a single Alias and several Resources can have the same Alias. The alias
acts as the connection between a Resource and its mapping definitions
(OSEM/XSEM/RSEM). A Property is just a place holder for a name and value
(both strings). A Property within a Resource represents some kind of
meta-data that is associated with the Resource like the author name.

In my words: a Resource is a set of properties. Think of it as a virtual
document, a chunk of data such as a web page, an e-mail message, or a
serialized Author object. A Resource always belongs to exactly one Alias,
and many Resources can share the same Alias. The Alias is the link between
a Resource and its mapping* definition. A Property holds a name and a value
(both strings).

*TonyQ's note: the mappings are partly background knowledge, so I won't
 dwell on them here. I found a blog that explains some of the terms:

 http://blogger.org.cn/blog/more.asp?name=lhwork&id=18505

 A Property is comparable to a Java system property (System.getProperty):
 a string name with a string value.

 OSEM = Object / Search Engine Mapping
 XSEM = XML / Search Engine Mapping
 RSEM = Resource / Search Engine Mapping

Every Resource is associated with one or more id properties.
They are required for Compass to manage Resource loading based on ids and
Resource updates (a well known difficulty when using Lucene directly).
Id properties are defined either explicitly in RSEM definitions or
implicitly in OSEM/XSEM definitions.

In my words: every Resource has one or more id properties. Compass requires
them because it uses ids to manage loading and updating Resources.
Id properties are declared explicitly in RSEM and implicitly in OSEM/XSEM.

For Lucene users, Compass Resource maps to Lucene Document and Compass
Property maps to Lucene Field.

In my words: for Lucene users, a Compass Resource corresponds to a Lucene
Document, and a Compass Property corresponds to a Lucene Field.

----------------------------------------------------
5.2.1. Using Resource/Property

When working with RSEM, resources acts as your prime data model. They are
used to construct searchable content, as well as manipulate it. When
performing a search, resources be used to display the search results.

In my words: with RSEM, resources are your primary data model (the unit in
which data is stored). They are used both to build searchable content and
to manipulate it. When you run a search, resources are what you use to
display the results.

Another important place where resources can be used, which is often ignored,
is with OSEM/XSEM. When manipulating search content through the use of the
application domain model (in case of OSEM), or through the use of xml data
structures (in case of XSEM), resources are rarely used. They can be used
when performing search operations. Based on your mapping definition, the
semantic model could be accessed in a uniformed way through resources and
properties.

In my words: another place where resources are useful, and often overlooked,
is OSEM/XSEM. When you manipulate search content through the application
domain model* (OSEM) or through XML data structures (XSEM), you rarely touch
resources directly. They are still available when searching, though, and
depending on your mapping they give uniform access to the semantic model
through resources and properties.

*TonyQ's note: "application domain model" presumably means the program's
 runtime object model.

Lets simplify this statement by using an example. If our application has two
object types, Recipe and Ingredient, we can map both recipe title and
ingredient title into the same semantic meta-data name, title (Resource
Property name). This will allow us when searching to display the search
results (hits) only on the Resource level, presenting the value of the
property title from the list of resources returned.
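The Recipe/Ingredient example can be sketched in plain Java. This is not the Compass API: Resource here is a hypothetical stand-in class that only models "one alias plus name/value string pairs", and the sample titles and ids are invented. A real Resource can hold several properties under the same name, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ResourceModelSketch {

    /** Hypothetical stand-in for a Resource: one alias plus name/value string pairs. */
    static final class Resource {
        final String alias;
        private final Map<String, String> properties = new LinkedHashMap<>();

        Resource(String alias) {
            this.alias = alias;
        }

        Resource with(String name, String value) {
            properties.put(name, value);
            return this;
        }

        String value(String name) {
            return properties.get(name);
        }
    }

    /** Two hits of different types; both expose their title under the same property name. */
    static List<Resource> sampleHits() {
        return Arrays.asList(
                new Resource("recipe").with("id", "1").with("title", "Tomato soup"),
                new Resource("ingredient").with("id", "7").with("title", "Tomato"));
    }

    /** Reads "title" from every hit without caring which alias (object type) it has. */
    static List<String> titles(List<Resource> hits) {
        List<String> out = new ArrayList<>();
        for (Resource hit : hits) {
            out.add(hit.value("title"));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Resource hit : sampleHits()) {
            System.out.println(hit.alias + ": " + hit.value("title"));
        }
    }
}
```

The display code in titles() never needs a Recipe or Ingredient class. That is the practical benefit of mapping both titles to one meta-data name.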
In my words: let's simplify with an example. If our application has two
object types, Recipe and Ingredient, we can map both the recipe title and
the ingredient title to the same semantic meta-data name, "title".
Then, when displaying search results (hits), we can stay at the Resource
level without knowing the underlying data source, and simply show the value
of the "title" property for each resource in the returned list.

----------------------------------------------------
5.3. Analyzers

Analyzers are components that pre-process input text. They are also used
when searching (the search string has to be processed the same way that the
indexed text was processed). Therefore, it is usually important to use the
same Analyzer for both indexing and searching.

In my words: analyzers are components that pre-process input text. They are
also used at search time (the search string must be processed the same way
the indexed text was*). So it usually matters that you use the same analyzer
for indexing and for searching.

*TonyQ's note: for example, "hello hi" may need to be split into the two
 keywords hello and hi.

Analyzer is a Lucene class (which qualifies to
org.apache.lucene.analysis.Analyzer class). Lucene core itself comes with
several Analyzers and you can configure Compass to work with either one of
them. If we take the following sentence: "The quick brown fox jumped over
the lazy dogs", we can see how the different Analyzers handle it:

In my words: Analyzer is a Lucene class (fully qualified name
org.apache.lucene.analysis.Analyzer). Lucene core ships with several
analyzers, and Compass can be configured to use any of them. Here is how
the different analyzers handle the sentence above:

whitespace (org.apache.lucene.analysis.WhitespaceAnalyzer):
    [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

simple (org.apache.lucene.analysis.SimpleAnalyzer):
    [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

stop (org.apache.lucene.analysis.StopAnalyzer):
    [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

standard (org.apache.lucene.analysis.standard.StandardAnalyzer):
    [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

TonyQ's comment: the obvious differences are that stop and standard drop
both occurrences of "the", while simple keeps them but ignores the case
difference between The and the (everything is lowercased).

Lucene also comes with an extension library, holding many more analyzer
implementations (including language specific analyzers). Compass can be
configured to work with all of them as well.
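To make the analyzer table concrete, here is a rough imitation of the whitespace, simple, and stop behaviours using only the JDK. It is not Lucene's implementation and does not call Lucene at all. The stop-word list is a tiny made-up subset, and the standard analyzer is left out because its rules are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalyzerSketch {

    // Tiny illustrative stop list (assumption: the real list is longer).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    /** Splits on whitespace only and keeps the original case. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Splits on anything that is not a letter and lowercases every token. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Same as simple(), then drops the stop words. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "The quick brown fox jumped over the lazy dogs";
        System.out.println("whitespace: " + whitespace(sentence));
        System.out.println("simple:     " + simple(sentence));
        System.out.println("stop:       " + stop(sentence));
    }
}
```

This also shows why indexing and searching should share an analyzer. If the index was built with simple() it contains "the", so a query for "The" that goes through whitespace() looks for a token that was never stored.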
In my words: Lucene also has an extension library with many more analyzer
implementations (including language-specific ones), and Compass can be
configured to use those too.

----------------------------------------------------
Please read the following sections on your own... they are mostly
configuration... skipping them for now.

5.3.1. Configuring Analyzers
5.3.2. Analyzer Filter
5.3.3. Handling Synonyms
5.4. Query Parser
5.5. Index Structure
5.6. Transaction (goes into some details of transactions)
    5.6.1. Locking
    5.6.2. Isolation
        5.6.2.1. read_committed
        5.6.2.2. serializable
        5.6.2.3. lucene
        5.6.3.2. FS Transaction Log
5.7. All Support (the catch-all "all" property, which lets a query search
     across every property at once)
5.8. Sub Index Hashing (how data is distributed across sub-indexes)
    5.8.1. Constant Sub Index Hashing
    5.8.2. Modulo Sub Index Hashing
    5.8.3. Custom Sub Index Hashing
5.9. Optimizers
    5.9.1. Scheduled Optimizers
    5.9.2. Aggressive Optimizer
    5.9.3. Adaptive Optimizer
    5.9.4. Null Optimizer
5.10. Merge
    5.10.1. Merge Policy
    5.10.2. Merge Scheduler
5.11. Index Deletion Policy
5.12. Spell Check / Did You Mean
    5.12.1. Spell Index

    Spell check is a feature worth a few words. Turning it on gives you
    suggested results, like Google's "Did you mean: xxxx" hint.
    If you are interested, look up this section in the original...
    btw the version I use does not seem to support it; my guess is that
    it was only added after 2.0..

5.13. Direct Lucene
    5.13.1. Wrappers
    5.13.2. Searcher And IndexReader

----------------------------------------------------
To sum up, chapter 5 explains how searching works and how to tune search
results, and explains to Lucene users how Compass relates to Lucene..

──────────────────────────────────
Chapters 6, 7 and 8 cover how to work with OSEM, XSEM and RSEM respectively.
What I need is mainly OSEM (because Hibernate already turns my data into
objects), so from here on I only cover OSEM
(in other words, how to write .cpm.xml).

-------------------------------------------------------
From here, follow http://0rz.tw/514pS and write the mapping as if it were
an hbm file.

To tell different kinds of things apart you can add a meta constant to each
mapping, then filter on that meta-data in the query string, for example
(+type:xxxx +(keyword)). Lucene query syntax is field:value.

One thing to watch out for (I had it backwards at first):

    <class name="Category" alias="category" root="false">
        <id name="sid">
                  ^^^ field name in the class
            <meta-data>cat_sid</meta-data>
                       ^^^^^^^ name of the index property
        </id>
        ..
        ..
    </class>

If you need sorted results, consider the following
(the field you sort on has to be prepared in the index beforehand,
of course!!):

    compass.openSession().queryBuilder().queryString("test").toQuery().
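        addSort("description", CompassQuery.SortPropertyType.STRING).hits();

Query string details: http://0rz.tw/514pR

--
This ends a bit abruptly, but basically a cpm file just declares the id and
the fields you want to search, plus whatever constants or extra fields you
might need. After that comes building the index. I hand that part to Gps.
Compass also has decent Spring integration (the docs call it
Compass::Spring), so I am leaving that out too; please see the original
documentation.

I have about used up the time I had for reading documentation.
If you are interested in digging deeper, let's discuss.
Corrections, criticism and discussion are all welcome.

--
I am a person, and I am always thinking .
Thinking in love , Thinking in life ,
Thinking in why , Thinking in worth.
I can't believe any of what ,
I am just thinking then thinking ,
but worst of all , most of mine is thinking not actioning...

--
※ Origin: 批踢踢實業坊 (ptt.cc)
◆ From: 220.128.219.202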