Lucene classes related to tokenization

    xiaoxiao  2026-05-02

    TokenStream

    org.apache.lucene.analysis.TokenStream

    An abstract class. A TokenStream enumerates a sequence of tokens, either from the Fields of a Document or from query text.

    TokenStream org.apache.lucene.analysis.Analyzer.tokenStream(String fieldName, Reader reader)
    Creates a TokenStream which tokenizes all the text in the provided Reader, using this Analyzer.

    void org.apache.lucene.analysis.TokenStream.reset() throws IOException
    Resets this stream to the beginning, moving its cursor back to the initial position.

    boolean org.apache.lucene.analysis.TokenStream.incrementToken() throws IOException
    Consumers (for example, IndexWriter) use this method to advance the stream to the next token. It returns false once the stream is exhausted.

    org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    The term text of a Token.

    <CharTermAttribute> CharTermAttribute org.apache.lucene.util.AttributeSource.getAttribute(Class<CharTermAttribute> attClass)
    Gets the specified Attribute. The caller must pass in a Class<? extends Attribute> value. Returns the instance of the passed-in Attribute contained in this AttributeSource, and throws IllegalArgumentException if the AttributeSource does not contain that Attribute.
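    The reset / incrementToken / getAttribute workflow above can be sketched without Lucene on the classpath. What follows is a simplified stand-in, not Lucene's implementation: TermHolder, WhitespaceStream, termAttribute and collect are hypothetical names invented for illustration. It is meant to show one point: the attribute instance is fetched once and then overwritten in place on every incrementToken() call, so the consumer copies the term text out on each iteration.

```java
import java.util.ArrayList;
import java.util.List;

final class ConsumeLoopSketch {

    /** Hypothetical stand-in for CharTermAttribute: one mutable holder, reused for every token. */
    static final class TermHolder {
        private String term = "";
        void set(String value) { term = value; }
        @Override public String toString() { return term; }
    }

    /** Hypothetical stand-in for a TokenStream over whitespace-separated text. */
    static final class WhitespaceStream {
        private final String[] words;
        private final TermHolder termAtt = new TermHolder();
        private int next = 0;

        WhitespaceStream(String text) {
            String trimmed = text.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        /** Plays the role of getAttribute: always returns the same holder instance. */
        TermHolder termAttribute() { return termAtt; }

        /** Plays the role of reset(): rewind to the first token. */
        void reset() { next = 0; }

        /** Plays the role of incrementToken(): load the next term, or report exhaustion. */
        boolean incrementToken() {
            if (next >= words.length) return false;
            termAtt.set(words[next++]);
            return true;
        }
    }

    /** Consumer loop: fetch the attribute once, reset, then advance until false. */
    static List<String> collect(String text) {
        WhitespaceStream stream = new WhitespaceStream(text);
        TermHolder term = stream.termAttribute(); // fetched once, before the loop
        List<String> out = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            out.add(term.toString()); // copy now: the holder is overwritten on the next call
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect("Lucene makes search easy"));
    }
}
```

    With the real classes the loop has the same shape: obtain the CharTermAttribute from the TokenStream once, call reset(), then loop on incrementToken() and read the attribute inside the loop.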

    Tokenizer

    org.apache.lucene.analysis.Tokenizer
    A Tokenizer is a TokenStream whose input is a Reader.

    TokenFilter

    org.apache.lucene.analysis.TokenFilter
    A TokenFilter is a TokenStream whose input is another TokenStream. It is used to filter or modify the tokens produced by the stream it wraps.

    org.apache.lucene.analysis.LowerCaseFilter
    Normalizes token text to lower case.

    org.apache.lucene.analysis.StopFilter
    Removes stop words from a token stream.
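    Because a TokenFilter wraps another TokenStream, filters stack like decorators on top of a Tokenizer. The sketch below models that chaining with plain Java only. It is a toy, not Lucene's API: TermStream, whitespaceSource, lowerCase, dropStopWords and the three-word stop set are all hypothetical, chosen only to mirror the roles of Tokenizer, LowerCaseFilter and StopFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class FilterChainSketch {

    /** Hypothetical minimal stream: next() returns the next term, or null when exhausted. */
    interface TermStream {
        String next();
    }

    /** Source stage (Tokenizer role): splits raw text on whitespace. */
    static TermStream whitespaceSource(String text) {
        String trimmed = text.trim();
        List<String> words = trimmed.isEmpty()
                ? new ArrayList<String>()
                : Arrays.asList(trimmed.split("\\s+"));
        final Iterator<String> it = words.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Filter stage (LowerCaseFilter role): rewrites each term from the wrapped stream. */
    static TermStream lowerCase(TermStream input) {
        return () -> {
            String term = input.next();
            return term == null ? null : term.toLowerCase(Locale.ROOT);
        };
    }

    /** Filter stage (StopFilter role): skips terms that appear in the stop set. */
    static TermStream dropStopWords(TermStream input, Set<String> stopWords) {
        return () -> {
            String term;
            while ((term = input.next()) != null) {
                if (!stopWords.contains(term)) return term;
            }
            return null;
        };
    }

    /** Builds the chain source -> lowerCase -> dropStopWords and drains it. */
    static List<String> analyze(String text) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        // Order matters: lowercase first so that "The" matches the stop word "the".
        TermStream chain = dropStopWords(lowerCase(whitespaceSource(text)), stopWords);
        List<String> out = new ArrayList<>();
        for (String term = chain.next(); term != null; term = chain.next()) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Lucene"));
    }
}
```

    Each stage only knows about the stream it wraps, which is why the order of wrapping decides the result: placing the stop-word stage before the lowercase stage would let "The" through.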

    Analyzer

    org.apache.lucene.analysis.KeywordAnalyzer
    "Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.

    org.apache.lucene.analysis.ReusableAnalyzerBase
    A convenience subclass of Analyzer that makes it easy to implement TokenStream reuse.