org.apache.lucene.analysis.miscellaneous.WordDelimiterFilterFactory Java Examples
The following examples show how to use
org.apache.lucene.analysis.miscellaneous.WordDelimiterFilterFactory.
Example #1
Source File: TestWordDelimiterFilterFactory.java (from lucene-solr, Apache License 2.0)
@Test
public void testCustomTypes() throws Exception {
  String testText = "I borrowed $5,400.00 at 25% interest-rate";
  ResourceLoader loader = new SolrResourceLoader(TEST_PATH().resolve("collection1"));
  Map<String,String> args = new HashMap<>();
  args.put("luceneMatchVersion", Version.LATEST.toString());
  args.put("generateWordParts", "1");
  args.put("generateNumberParts", "1");
  args.put("catenateWords", "1");
  args.put("catenateNumbers", "1");
  args.put("catenateAll", "0");
  args.put("splitOnCaseChange", "1");

  /* default behavior */
  WordDelimiterFilterFactory factoryDefault = new WordDelimiterFilterFactory(args);
  factoryDefault.inform(loader);

  TokenStream ts = factoryDefault.create(whitespaceMockTokenizer(testText));
  BaseTokenStreamTestCase.assertTokenStreamContents(ts,
      new String[] { "I", "borrowed", "5", "540000", "400", "00", "at", "25", "interest", "interestrate", "rate" });

  ts = factoryDefault.create(whitespaceMockTokenizer("foo\u200Dbar"));
  BaseTokenStreamTestCase.assertTokenStreamContents(ts,
      new String[] { "foo", "foobar", "bar" });

  /* custom behavior */
  args = new HashMap<>();
  // use a custom type mapping
  args.put("luceneMatchVersion", Version.LATEST.toString());
  args.put("generateWordParts", "1");
  args.put("generateNumberParts", "1");
  args.put("catenateWords", "1");
  args.put("catenateNumbers", "1");
  args.put("catenateAll", "0");
  args.put("splitOnCaseChange", "1");
  args.put("types", "wdftypes.txt");
  WordDelimiterFilterFactory factoryCustom = new WordDelimiterFilterFactory(args);
  factoryCustom.inform(loader);

  ts = factoryCustom.create(whitespaceMockTokenizer(testText));
  BaseTokenStreamTestCase.assertTokenStreamContents(ts,
      new String[] { "I", "borrowed", "$5,400.00", "at", "25%", "interest", "interestrate", "rate" });

  /* test custom behavior with a char > 0x7F, because we had to make a larger byte[] */
  ts = factoryCustom.create(whitespaceMockTokenizer("foo\u200Dbar"));
  BaseTokenStreamTestCase.assertTokenStreamContents(ts,
      new String[] { "foo\u200Dbar" });
}
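The second half of this test depends on the wdftypes.txt resource, whose contents are not shown in the snippet. As a rough illustration (an assumption reconstructed from the asserted output, not the actual test file), a type-mapping file for the "types" attribute assigns individual characters to one of the filter's built-in character classes so they stop acting as delimiters:

  # hypothetical mapping, consistent with the expected tokens above
  $ => DIGIT
  % => DIGIT
  . => DIGIT
  \u002C => DIGIT
  \u200D => ALPHANUM

Mapping '$', '%', '.', and ',' to DIGIT makes the filter see "$5,400.00" and "25%" as single uninterrupted tokens, and mapping the zero-width joiner to ALPHANUM keeps "foo\u200Dbar" whole instead of splitting it into "foo" and "bar".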
Example #2
Source File: LindenWordDelimiterAnalyzer.java (from linden, Apache License 2.0)
/**
 * generateWordParts
 *   Causes parts of words to be generated:
 *   "PowerShot" => "Power" "Shot"
 *
 * generateNumberParts
 *   Causes number subwords to be generated:
 *   "500-42" => "500" "42"
 *
 * catenateWords
 *   Causes maximum runs of word parts to be catenated:
 *   "wi-fi" => "wifi"
 *
 * catenateNumbers
 *   Causes maximum runs of number parts to be catenated:
 *   "500-42" => "50042"
 *
 * catenateAll
 *   Causes all subword parts to be catenated:
 *   "wi-fi-4000" => "wifi4000"
 *
 * preserveOriginal
 *   Causes the original token to be preserved and added to the subword list (defaults to false):
 *   "500-42" => "500" "42" "500-42"
 *
 * splitOnCaseChange
 *   If not set, causes case changes to be ignored (subwords will only be generated
 *   given SUBWORD_DELIM tokens).
 *
 * splitOnNumerics
 *   If not set, causes numeric changes to be ignored (subwords will only be generated
 *   given SUBWORD_DELIM tokens).
 *
 * stemEnglishPossessive
 *   Causes trailing "'s" to be removed for each subword:
 *   "O'Neil's" => "O", "Neil"
 */
public LindenWordDelimiterAnalyzer(Map<String, String> params) {
  if (params.containsKey(SET_STOP_WORDS)) {
    this.setStopWords = Boolean.parseBoolean(params.get(SET_STOP_WORDS));
    params.remove(SET_STOP_WORDS);
  }
  if (params.containsKey(TO_LOWER_CASE)) {
    this.toLowerCase = Boolean.parseBoolean(params.get(TO_LOWER_CASE));
    params.remove(TO_LOWER_CASE);
  }
  factoryDefault = new WordDelimiterFilterFactory(params);
}
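To see the options documented above in action outside of Linden, here is a minimal, self-contained sketch that configures WordDelimiterFilterFactory directly with a few of those flags and prints the resulting tokens. The class name, the sample input, and the choice of WhitespaceTokenizer are illustrative assumptions; Linden's own SET_STOP_WORDS / TO_LOWER_CASE handling is left out.

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilterFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WordDelimiterDemo {
  public static void main(String[] args) throws Exception {
    Map<String, String> params = new HashMap<>();
    params.put("luceneMatchVersion", Version.LATEST.toString());
    params.put("generateWordParts", "1");   // "PowerShot" => "Power", "Shot"
    params.put("generateNumberParts", "1"); // "500-42" => "500", "42"
    params.put("catenateWords", "1");       // "wi-fi" => "wifi"
    params.put("splitOnCaseChange", "1");
    params.put("preserveOriginal", "1");    // also emit the original, unsplit token

    // No "types" or "protected" resource files are configured in this sketch,
    // so we assume inform(ResourceLoader) is not required here (the test above calls it).
    WordDelimiterFilterFactory factory = new WordDelimiterFilterFactory(params);

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("PowerShot wi-fi 500-42"));

    try (TokenStream ts = factory.create(tokenizer)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}

Running this prints both the preserved original tokens and the generated subwords, which is a quick way to check how a given flag combination will split or catenate your input before wiring the factory into an analyzer.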