A custom comma tokenizer for solr-4.10.0
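Today I needed to index a comma-separated field and could not find a ready-made comma tokenizer, so I looked at how the whitespace tokenizer factory, WhitespaceTokenizerFactory, is written in the source and modeled a comma tokenizer on it. The original idea comes from http://www.xuebuyuan.com/580106.html, but that article targets an older version whose base classes differ from the current ones, so I decompiled the jar and found the WhitespaceTokenizerFactory class in lucene-analyzers-common-4.10.0.jar.

CommaTokenizerFactory.java

// Package matches the class name referenced in schema.xml.
package org.apache.lucene.analysis.core;

import java.io.Reader;
import java.util.Map;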
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;
/**
 * @Function: custom comma tokenizer factory
 * @Class Name: CommaTokenizerFactory
 * @Author: zhangZhiPeng
 * @Date: October 21, 2014
 * @Modifications:
 * @Modifier Name; Date; The Reason for Modifying
 */
public class CommaTokenizerFactory extends TokenizerFactory {

    public CommaTokenizerFactory(Map<String, String> args) {
        super(args);
        // This factory accepts no parameters of its own.
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public CommaTokenizer create(AttributeFactory factory, Reader input) {
        // Mirrors WhitespaceTokenizerFactory, with CommaTokenizer swapped in.
        if (this.luceneMatchVersion == null) {
            return new CommaTokenizer(factory, input);
        }
        return new CommaTokenizer(this.luceneMatchVersion, factory, input);
    }
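}

CommaTokenizer.java

// Package matches the class name referenced in schema.xml.
package org.apache.lucene.analysis.core;

import java.io.Reader;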
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.AttributeFactory;
import org.apache.lucene.util.Version;
/**
 * @Function: custom comma tokenizer
 * @Class Name: CommaTokenizer
 * @Author: zhangZhiPeng
 * @Date: October 21, 2014
 * @Modifications:
 * @Modifier Name; Date; The Reason for Modifying
 */
public class CommaTokenizer extends CharTokenizer {

    public CommaTokenizer(Reader in) {
        super(in);
    }

    @Deprecated
    public CommaTokenizer(Version matchVersion, Reader in) {
        super(matchVersion, in);
    }

    public CommaTokenizer(AttributeFactory factory, Reader in) {
        super(factory, in);
    }

    @Deprecated
    public CommaTokenizer(Version matchVersion, AttributeFactory factory, Reader in) {
        super(matchVersion, factory, in);
    }

    @Override
    protected boolean isTokenChar(int c) {
        // 44 is the code of ','; a comma is a delimiter, everything else belongs to a token.
        return c != ',';
    }
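}

isTokenChar checks whether the character equals 44: it returns false if it does and true otherwise. Returning false marks the character as a delimiter, so the text is split at that point. 44 is the ASCII code of the comma; for example, the ASCII code of a is 97. If you do not know the value for a character, you can look it up like this:

char[] c = new char[]{'a',',','b'};
Character.codePointAt(c, 1);

This returns the code of the character at index 1 of the char array.

Then pack the compiled classes into lucene-analyzers-common-4.10.0.jar, in the org.apache.lucene.analysis.core package so that they match the class name used in the schema.

Add the following to schema.xml:

<fieldType name="text_comma" class="solr.TextField" positionIncrementGap="100">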
<analyzer>
<tokenizer class="org.apache.lucene.analysis.core.CommaTokenizerFactory"/>
</analyzer>
</fieldType>
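
As a small aside on the character-code lookup described above, here is a minimal, runnable sketch in plain Java. The class name CodePointLookup and the helper codePointOf are made up for illustration and are not part of Lucene or Solr; only Character.codePointAt comes from the JDK.

```java
public class CodePointLookup {

    // Hypothetical helper: returns the code point of the character
    // at the given index of the array, via Character.codePointAt.
    public static int codePointOf(char[] chars, int index) {
        return Character.codePointAt(chars, index);
    }

    public static void main(String[] args) {
        char[] c = new char[] {'a', ',', 'b'};
        System.out.println(codePointOf(c, 0)); // prints 97, the code of 'a'
        System.out.println(codePointOf(c, 1)); // prints 44, the code of ','
        // A plain cast gives the same value for a single char.
        System.out.println((int) ',');         // prints 44
    }
}
```

Since a char literal can be compared with an int directly, writing c != ',' in isTokenChar is equivalent to comparing against 44 and avoids having to look the number up.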
Restart Solr and test:
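
To get a feel for which tokens to expect in that test, the splitting rule can be approximated outside Solr. The following is only a sketch in plain Java under stated assumptions: CommaSplitSketch and tokenize are hypothetical names, the loop imitates the general idea of a character-based tokenizer (collect runs of token characters, cut at delimiters), and it ignores details such as offsets and any maximum token length. It is not the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaSplitSketch {

    // Same rule as CommaTokenizer.isTokenChar: everything except ',' is part of a token.
    public static boolean isTokenChar(int c) {
        return c != ',';
    }

    // Hypothetical helper: collects maximal runs of token characters
    // and drops the delimiters between them. Empty runs produce no token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Delimiter reached: close the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [red, green blue, yellow]
        System.out.println(tokenize("red,green blue,,yellow"));
    }
}
```

Note that spaces stay inside a token ("green blue"), because only the comma is treated as a delimiter, and that two commas in a row do not yield an empty token in this sketch.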
Learned something new.