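Today I needed to index a comma-separated field, but I could not find a ready-made tokenizer that splits on commas, so I looked at how the whitespace tokenizer factory, WhitespaceTokenizerFactory, is written in the source and wrote a comma tokenizer modeled on it. However, that version was too old and the classes it extends differ from those in the current version, so I decompiled the current code and found the WhitespaceTokenizerFactory class in lucene-analyzers-common-4.10.0.jar.

CommaTokenizerFactory.java
- import java.io.Reader;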
- import java.util.Map;
- import org.apache.lucene.analysis.util.TokenizerFactory;
- import org.apache.lucene.util.AttributeFactory;
- /**
- *@Function: Custom comma tokenizer
- *@Class Name: CommaTokenizerFactory
- *@Author: zhangZhiPeng
- *@Date: 2014-10-21
- *@Modifications:
- *@Modifier Name; Date; The Reason for Modifying
- *
- */
- public class CommaTokenizerFactory extends TokenizerFactory {
- public CommaTokenizerFactory(Map<String, String> args)
- {
- super(args);
- // This factory takes no parameters of its own, so anything left over is an error.
- if (!args.isEmpty())
- throw new IllegalArgumentException("Unknown parameters: " + args);
- }
- @Override
- public CommaTokenizer create(AttributeFactory factory, Reader input)
- {
- // Same logic as WhitespaceTokenizerFactory.create(), with CommaTokenizer swapped in for WhitespaceTokenizer.
- if (this.luceneMatchVersion == null) {
- return new CommaTokenizer(factory, input);
- }
- return new CommaTokenizer(this.luceneMatchVersion, factory, input);
- }
- }

CommaTokenizer.java
- import java.io.Reader;
- import org.apache.lucene.analysis.util.CharTokenizer;
- import org.apache.lucene.util.AttributeFactory;
- import org.apache.lucene.util.Version;
- /**
- *@Function: Custom comma tokenizer
- *@Class Name: CommaTokenizer
- *@Author: zhangZhiPeng
- *@Date: 2014-10-21
- *@Modifications:
- *@Modifier Name; Date; The Reason for Modifying
- *
- */
- public class CommaTokenizer extends CharTokenizer{
- public CommaTokenizer(Reader in)
- {
- super(in);
- }
- @Deprecated
- public CommaTokenizer(Version matchVersion, Reader in)
- {
- super(matchVersion, in);
- }
- public CommaTokenizer(AttributeFactory factory, Reader in)
- {
- super(factory, in);
- }
- @Deprecated
- public CommaTokenizer(Version matchVersion, AttributeFactory factory, Reader in)
- {
- super(matchVersion, factory, in);
- }
- @Override
- protected boolean isTokenChar(int c)
- {
- // WhitespaceTokenizer uses: return !Character.isWhitespace(c);
- // 44 is the code of the comma; every other character is part of a token.
- return c != 44;
- }
-
- }
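The literal 44 in isTokenChar is the code of the comma character. A minimal standalone sketch of how to confirm such a value with plain Java follows; the class name CodePointCheck and the helper codePointOf are made up for illustration and are not part of Lucene or Solr.

```java
class CodePointCheck {
    // Hypothetical helper: returns the code point of the character at the given array index.
    static int codePointOf(char[] chars, int index) {
        return Character.codePointAt(chars, index);
    }

    public static void main(String[] args) {
        char[] c = new char[] {'a', ',', 'b'};
        System.out.println(codePointOf(c, 0)); // code point of 'a'
        System.out.println(codePointOf(c, 1)); // code point of ','
        // A char literal can be compared with an int directly,
        // so "c != ','" would behave the same as "c != 44".
        System.out.println(',' == 44);
    }
}
```

Writing the comparison with the literal ',' instead of 44 is a readability choice only; both compile to the same check.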
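isTokenChar checks whether the character equals 44: if it does, it returns false, otherwise it returns true. Returning false marks the character as a delimiter, so the text is split at that point. 44 is the ASCII code of the comma (for comparison, the ASCII code of a is 97). If you do not know the value for a given character, you can look it up like this: char[] c = new char[]{'a',',','b'}; Character.codePointAt(c, 1); which returns the code of the character at index 1 of the char array.

Then package the compiled classes into lucene-analyzers-common-4.10.0.jar and add the following to schema.xml. The class attribute is a fully qualified class name, so this assumes both classes are declared in the package org.apache.lucene.analysis.core.
- <fieldType name="text_comma" class="solr.TextField" positionIncrementGap="100">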
- <analyzer>
- <tokenizer class="org.apache.lucene.analysis.core.CommaTokenizerFactory"/>
- </analyzer>
- </fieldType>
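Before restarting Solr, the splitting behavior this field type should produce can be approximated without Lucene on the classpath. The sketch below is only an approximation under stated assumptions: CommaSplitSketch and tokenize are hypothetical names, the loop imitates the idea of collecting runs of characters for which isTokenChar returns true, and it ignores details of the real CharTokenizer such as offsets and its maximum token length.

```java
import java.util.ArrayList;
import java.util.List;

class CommaSplitSketch {
    // Same predicate as CommaTokenizer.isTokenChar: everything except ',' (44) is a token character.
    static boolean isTokenChar(int c) {
        return c != 44;
    }

    // Collects maximal runs of token characters; empty runs (e.g. between ",,") produce no token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("solr,lucene,,comma tokenizer"));
        // prints [solr, lucene, comma tokenizer]
    }
}
```

Since the analyzer in text_comma contains only the tokenizer and no filters, spaces inside a value are kept and nothing is trimmed or lowercased: in this sketch "a, b" yields "a" and " b", leading space included.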
Restart Solr and test the new field type.