Lucene in Action notes: term vectors

xiaoxiao2024-01-25 165

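Leveraging term vectors

A term vector is a vector of term frequencies built for one text field of a document, such as title or body. Each distinct term is one dimension, and the value along that dimension is the number of times the term occurs in that field.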
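To make the definition concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name TermVectorSketch and the whitespace tokenizer are assumptions made for illustration; a real Lucene analyzer tokenizes and normalizes text differently.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorSketch {

    /**
     * Builds a term -> frequency map for one field's text.
     * Assumption: tokens are lower-cased and split on whitespace only.
     */
    static Map<String, Integer> termVector(String fieldText) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                // Each distinct term is one dimension; its value is the count.
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // prints {explained=1, extreme=2, programming=1}
        System.out.println(termVector("extreme programming explained extreme"));
    }
}
```

The map is sparse: only terms that actually occur in the field get an entry, which is also how the examples later in these notes represent a term vector.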

To use term vectors, you must enable the term vector option for that field at indexing time:

Field options for term vectors

TermVector.YES – record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.

TermVector.WITH_POSITIONS – record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.

TermVector.WITH_OFFSETS – record the unique terms and their counts, with the offsets (start & end character position) of each occurrence of every term, but no positions.

TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their counts, along with positions and offsets.

TermVector.NO – do not store any term vector information. If Index.NO is specified for a field, then you must also specify TermVector.NO.

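After indexing, given a document id and a field name, we can read the term vector from the IndexReader (provided term vectors were enabled for that field at indexing time):

    TermFreqVector termFreqVector = reader.getTermFreqVector(id, "subject");

You can iterate over the TermFreqVector to get each term and its frequency. If you chose to store offsets and positions at indexing time, you can retrieve them here as well. With term vectors we can build some interesting applications.

1) Books like this

The goal is to decide whether two books are similar. Each book is modeled as a document with author and subject fields, and similarity is computed from those two fields. The author field is multi-valued, that is, a book can have several authors, so the first step is to check whether the authors match:

    String[] authors = doc.getValues("author");
    BooleanQuery authorQuery = new BooleanQuery();
    // One optional (SHOULD) clause per author of the book
    for (int i = 0; i < authors.length; i++) {
        String author = authors[i];
        authorQuery.add(new TermQuery(new Term("author", author)), BooleanClause.Occur.SHOULD);
    }
    authorQuery.setBoost(2.0f);

Finally, the boost of this query is raised to mark the condition as important: if the authors are the same, the books are very likely similar.

The second step uses the term vector, in a very simple way: it only checks whether the terms in the subject field's term vector also occur in other books:

    TermFreqVector vector = reader.getTermFreqVector(id, "subject");
    BooleanQuery subjectQuery = new BooleanQuery();
    // One optional (SHOULD) clause per term in the subject term vector
    for (int j = 0; j < vector.size(); j++) {
        TermQuery tq = new TermQuery(new Term("subject", vector.getTerms()[j]));
        subjectQuery.add(tq, BooleanClause.Occur.SHOULD);
    }

2) What category?

This example is a bit more advanced: classification. Again we have a term vector for the subject field of each document. For two documents we can compare the angle between their term vectors in the vector space; the smaller the angle, the more similar the documents.

Since this is classification, there is a training step: we must build a term vector for each category to serve as the reference against which other documents are compared. Here a term vector is implemented as a map of (term, frequency) pairs, where a map with n entries represents n dimensions. One such term vector is built per category, and a second map links each category to its term vector. To build a category's term vector, iterate over every document in the category, fetch the document's term vector, and add it to the category's term vector: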
    private void addTermFreqToMap(Map vectorMap, TermFreqVector termFreqVector) {
        String[] terms = termFreqVector.getTerms();
        int[] freqs = termFreqVector.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            String term = terms[i];
            if (vectorMap.containsKey(term)) {
                // Term already seen in this category: accumulate its frequency
                Integer value = (Integer) vectorMap.get(term);
                vectorMap.put(term, new Integer(value.intValue() + freqs[i]));
            } else {
                vectorMap.put(term, new Integer(freqs[i]));
            }
        }
    }

This first reads the term and frequency arrays from the document's term vector, then for each term adds the document's frequency to the matching entry in the category's term vector.

With a term vector for each category, we can compute the angle between a document and a category:

    cos θ = (A · B) / (|A| |B|)

A · B is the dot product: multiply the two vectors dimension by dimension and sum the results. To simplify the computation, assume each term frequency in the document is either 0 or 1, meaning absent or present:

    private double computeAngle(String[] words, String category) {
        // assume words are unique and only occur once
        Map vectorMap = (Map) categoryMap.get(category);
        int dotProduct = 0;
        int sumOfSquares = 0;
        for (int i = 0; i < words.length; i++) {
            String word = words[i];
            int categoryWordFreq = 0;
            if (vectorMap.containsKey(word)) {
                categoryWordFreq = ((Integer) vectorMap.get(word)).intValue();
            }
            dotProduct += categoryWordFreq; // optimized because we assume frequency in words is 1
            sumOfSquares += categoryWordFreq * categoryWordFreq;
        }
        double denominator;
        if (sumOfSquares == words.length) {
            // avoid precision issues for special case
            denominator = sumOfSquares; // sqrt x * sqrt x = x
        } else {
            denominator = Math.sqrt(sumOfSquares) * Math.sqrt(words.length);
        }
        double ratio = dotProduct / denominator;
        return Math.acos(ratio);
    }

This function is a fairly direct implementation of the formula above. Note that it sums squares only over the category terms that appear in words, and that when none of the words occurs in the category the denominator is 0, so the result is NaN.
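For comparison, here is a sketch of the same cosine formula without the 0/1 simplification, using the real frequencies on both sides and the full length of each vector. It is plain Java with no Lucene dependency. The names CosineSketch, angle, and bestCategory are hypothetical, and returning π/2 for an empty vector is a choice made here, not something taken from the book's code.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    /** Dot product over the terms that both sparse vectors contain. */
    static double dot(Map<String, Integer> a, Map<String, Integer> b) {
        double sum = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                sum += (double) e.getValue() * other;
            }
        }
        return sum;
    }

    /** Euclidean length |v| over all terms of the vector. */
    static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0;
        for (int f : v.values()) {
            sumOfSquares += (double) f * f;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Angle in radians between two term vectors; pi/2 if either is empty (assumption). */
    static double angle(Map<String, Integer> a, Map<String, Integer> b) {
        double denominator = norm(a) * norm(b);
        if (denominator == 0) {
            return Math.PI / 2;
        }
        // Clamp to 1.0 so rounding error cannot push acos out of its domain.
        return Math.acos(Math.min(1.0, dot(a, b) / denominator));
    }

    /** Returns the category whose vector has the smallest angle to doc, or null if there are none. */
    static String bestCategory(Map<String, Integer> doc, Map<String, Map<String, Integer>> categories) {
        String best = null;
        double bestAngle = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : categories.entrySet()) {
            double current = angle(doc, e.getValue());
            if (current < bestAngle) {
                bestAngle = current;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new HashMap<>();
        doc.put("java", 2);
        doc.put("lucene", 1);
        Map<String, Integer> tech = new HashMap<>();
        tech.put("java", 10);
        tech.put("lucene", 5);
        Map<String, Map<String, Integer>> categories = new HashMap<>();
        categories.put("technology", tech);
        System.out.println(bestCategory(doc, categories));
    }
}
```

Using the full norm of the category vector means a category with many unrelated terms is penalized, which the simplified version does not do.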

3) MoreLikeThis

For finding similar documents, Lucene also provides a more efficient API, MoreLikeThis:

http://lucene.apache.org/java/1_9_1/api/org/apache/lucene/search/similar/MoreLikeThis.html

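With the approach above, we could compute the cosine between every pair of documents, sort by cosine, and pick the most similar ones. The main problem is the computational cost: with a large number of documents it becomes almost unacceptable. There are specialized techniques for optimizing the cosine method that greatly reduce the computation; that approach is accurate, but it is harder to implement.

The idea behind MoreLikeThis is simple: from a document, extract only the interesting terms (terms with a high tf×idf), then use Lucene to search for documents containing the same terms and treat them as similar. The advantage is efficiency; the drawback is that it is less accurate. The class exposes many parameters that you can configure to control how interesting terms are selected.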

    MoreLikeThis mlt = new MoreLikeThis(ir);
    // orig source of doc you want to find similarities to
    Reader target = ...
    Query query = mlt.like(target);
    Hits hits = is.search(query);

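Usage is simple, and this gives you the similar documents.

The API is also flexible: instead of calling like directly, you can call retrieveInterestingTerms(Reader r) to get the interesting terms, and then process them however you need.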
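The notion of an "interesting term" can be sketched in plain Java as ranking a document's terms by tf×idf. This is only an illustration of the idea, not MoreLikeThis's actual implementation: the class InterestingTermsSketch, its parameters, and the idf formula log(numDocs / (docFreq + 1)) + 1 are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InterestingTermsSketch {

    /**
     * Returns up to maxTerms terms of one document, ordered by descending tf*idf.
     * termFreqs: term -> frequency in the document.
     * docFreqs:  term -> number of documents in the collection containing the term.
     */
    static List<String> interestingTerms(Map<String, Integer> termFreqs,
                                         Map<String, Integer> docFreqs,
                                         int numDocs, int maxTerms) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            // Assumed idf formula: rare terms score higher than common ones.
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            scores.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        // Highest score first; ties are broken alphabetically for a stable order.
        terms.sort((x, y) -> {
            int byScore = Double.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("lucene", 3);
        tf.put("the", 5);
        tf.put("search", 2);
        Map<String, Integer> df = new HashMap<>();
        df.put("lucene", 10);
        df.put("the", 1000);
        df.put("search", 50);
        // prints [lucene, search]
        System.out.println(interestingTerms(tf, df, 1000, 2));
    }
}
```

In the example, "the" has the highest raw frequency but appears in almost every document, so its idf is close to 1 and it ranks last. The selected terms could then be combined into a query of optional clauses, as in the "Books like this" example.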

This article is excerpted from cnblogs (博客园); original publication date: 2011-07-04