[Dovecot] solr substring schema.xml
I'm trying a modified schema.xml with solr - it appears I now have substring searches!
I took the schema.xml file shipped with Dovecot, and modified the text field definition to be:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Switching from the deprecated "EnglishPorter" filter to the newer "SnowballPorter" is probably minor; the magic is the "NGramFilterFactory" in the index analyzer. 3 and 15 seemed like reasonable defaults for the minimum and maximum gram size to search on.
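To make the n-gram idea concrete, here is a rough sketch of what an n-gram filter does to one token at index time: it emits every substring whose length lies between the minimum and maximum gram size. The query analyzer above has no n-gram filter, so a search term matches when it equals one of those indexed grams. The helper name ngrams is made up for illustration, and the output order does not claim to match what Solr produces.

```shell
#!/bin/sh
# Illustration only: ngrams is a hypothetical helper, not part of Solr or
# Dovecot. It prints every substring of TOKEN with length MIN..MAX, which is
# roughly the set of terms an n-gram filter adds to the index for that token.
# Usage: ngrams TOKEN MIN MAX
ngrams() {
    awk -v s="$1" -v min="$2" -v max="$3" 'BEGIN {
        for (n = min; n <= max; n++)
            for (i = 1; i + n - 1 <= length(s); i++)
                print substr(s, i, n)
    }'
}

# With the 3/15 settings from the schema, "dovecot" yields grams such as
# "dov", "vec", "ecot" and finally "dovecot" itself.
ngrams dovecot 3 15
```

Under this sketch a token shorter than three characters produces no grams at all, and a search term longer than 15 characters has no gram to match, so both cases are worth testing against a real index.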
Daniel
On 6/3/2011 5:54 AM, Timo Sirainen wrote:
On Sun, 2011-05-29 at 02:09 -0700, Daniel Miller wrote:
I'm trying a modified schema.xml with solr - it appears I now have substring searches!

How large are your indexes compared to mailbox size?
du -c -b /var/mail/domain = 4913315733
du -c -b /var/mail/attachments = 29672490629
du -c -b /var/mail/solr = 12809981456
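As a quick way to read those byte counts as ratios, the following sketch divides the Solr index size by the mail store size, with and without the attachments directory. Treating /var/mail/attachments as part of the mailbox data is an assumption on my part, and ratio is a hypothetical helper name.

```shell
#!/bin/sh
# Byte counts copied from the du output above.
mail=4913315733
attachments=29672490629
solr=12809981456

# ratio is a hypothetical helper: prints A / (B + C) rounded to two decimals.
# C is optional and defaults to 0.
ratio() {
    awk -v a="$1" -v b="$2" -v c="${3:-0}" 'BEGIN { printf "%.2f\n", a / (b + c) }'
}

echo "index / mail store:                 $(ratio "$solr" "$mail")"
echo "index / (mail store + attachments): $(ratio "$solr" "$mail" "$attachments")"
```

With these figures the index is roughly 2.6 times the size of the mail store alone, or a bit over a third of the mail store and attachments combined.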
At the moment, I have an hourly cronjob:

doveadm search -A text zyxabcxyz > /dev/null
java -Ddata=args -jar /raid/mail/solr/exampledocs/post.jar \
  '<commit waitFlush="false" waitSearcher="false" expungeDeletes="true"/>' \
  > /dev/null
java -Ddata=args -jar /raid/mail/solr/exampledocs/post.jar \
  '<optimize waitFlush="false" waitSearcher="false"/>' > /dev/null

Daniel L. Miller, VP - Engineering, SET
AM Fire & Electronic Services, Inc. [AMFES]
dmiller@amfes.com 702-312-5276