Item14807: Solr language detection does not work
Priority: Normal
Current State: Closed
Released In: n/a
Target Release: n/a
Applies To: Extension
Component: SolrPlugin
Branches: master
--
UlrichLeodolter - 28 Dec 2018
Solr language detection does not work for me.
I tried to set CONTENT_LANGUAGE in SitePreferences, but after a full reindex almost all documents still have language en, except documents which have CONTENT_LANGUAGE explicitly set in their meta preferences.
* Set CONTENT_LANGUAGE = de
* Set CONTENT_LANGUAGE = detect
What could be the reason?
The main reason I found is that langid.whitelist in solrconfig.xml does not allow whitespace between its comma-separated entries.
The second reason is that only title is a solr.StrField; summary and text are solr.TextField.
This is the DEBUG output after changing typeClass to solr.TextField and removing title from langid.fl:
2018-12-28 09:51:24.830 DEBUG (qtp99747242-14) [ x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Language fallback to value en
2018-12-28 09:51:24.830 DEBUG (qtp99747242-14) [ x:foswiki] o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Appending field summary
2018-12-28 09:51:24.830 DEBUG (qtp99747242-14) [ x:foswiki] o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Appending field text
2018-12-28 09:51:24.841 DEBUG (qtp99747242-14) [ x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Detected a language not in whitelist (de), using fallback en
2018-12-28 09:51:24.842 DEBUG (qtp99747242-14) [ x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Detected main document language from fields [summary, text]: en
2018-12-28 09:51:24.842 DEBUG (qtp99747242-14) [ x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Mapping field summary using document global language en
2018-12-28 09:51:24.842 DEBUG (qtp99747242-14) [ x:foswiki] o.a.s.u.p.LanguageIdentifierUpdateProcessor Doing mapping from summary with language en to field summary_en
These are the final changes I made; after that, language detection works fine.
diff -c solrconfig.xml.orig solrconfig.xml
*** solrconfig.xml.orig 2018-10-11 16:02:03.000000000 +0200
--- solrconfig.xml 2018-12-28 10:22:05.875754466 +0100
***************
*** 1741,1756 ****
<updateRequestProcessorChain name="foswiki_chain">
<processor class="solr.TruncateFieldUpdateProcessorFactory">
! <str name="typeClass">solr.StrField</str>
<int name="maxLength">32764</int>
</processor>
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
<!-- processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory"-->
<lst name="defaults">
! <str name="langid.fl">title,summary,text</str>
<str name="langid.langField">language</str>
<!-- languages we've got a tokenizer for - minus da as it brings down accuracies for the other languages (wtf) -->
! <str name="langid.whitelist">ar, bg, ca, cjk, ckb, cz, de, el, en, es, eu, fa, fi, fr, ga, gl, hi, hu, hy, id, it, ja, lv, nl, no, pt, ro, ru, sv, th, tr</str>
<str name="langid.overwrite">false</str>
<str name="langid.threshold">0.7</str>
<str name="langid.fallback">en</str>
--- 1741,1756 ----
<updateRequestProcessorChain name="foswiki_chain">
<processor class="solr.TruncateFieldUpdateProcessorFactory">
! <str name="typeClass">solr.TextField</str>
<int name="maxLength">32764</int>
</processor>
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
<!-- processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory"-->
<lst name="defaults">
! <str name="langid.fl">summary,text</str>
<str name="langid.langField">language</str>
<!-- languages we've got a tokenizer for - minus da as it brings down accuracies for the other languages (wtf) -->
! <str name="langid.whitelist">ar,bg,ca,cjk,ckb,cz,de,el,en,es,eu,fa,fi,fr,ga,gl,hi,hu,hy,id,it,ja,lv,nl,no,pt,ro,ru,sv,th,tr</str>
<str name="langid.overwrite">false</str>
<str name="langid.threshold">0.7</str>
<str name="langid.fallback">en</str>
--
UlrichLeodolter - 28 Dec 2018
I'd still like the title to be part of the language detection: how about adding the field title_std to langid.fl?
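As a minimal sketch, the suggestion would be a one-line change in the langid defaults. It assumes that title_std is defined in the schema as a text-typed variant of title, which is not verified here; the rest of the processor configuration stays as in the diff above.

```xml
<!-- sketch only: title_std is assumed to be a solr.TextField variant of title -->
<str name="langid.fl">title_std,summary,text</str>
```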
--
MichaelDaum - 29 Dec 2018
I did not look into this, but title_std seems to be a good choice. I have added it to our solrconfig.xml.
--
UlrichLeodolter - 30 Dec 2018
Thanks! I'll add your improvements to the next release.
--
MichaelDaum - 30 Dec 2018
I experimented with these settings. It definitely works now, but I get so many false positives that automatic language detection is almost useless.
Now testing with langid.fl=catchall as well as langid.threshold=0.8.
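In solrconfig.xml the settings under test would look roughly like the fragment below. This is a sketch rather than the committed configuration: it assumes a field named catchall is already present on the document when the language identifier runs, and it leaves all other langid parameters as they were.

```xml
<lst name="defaults">
  <!-- detect on a single combined field; catchall is assumed to be populated at this point -->
  <str name="langid.fl">catchall</str>
  <str name="langid.langField">language</str>
  <!-- raised from 0.7 to cut down on false positives -->
  <str name="langid.threshold">0.8</str>
  <str name="langid.fallback">en</str>
</lst>
```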
--
MichaelDaum - 03 Jan 2019
For me language detection works pretty well (using Solr 5.5.5). In our internal wiki we have about 3950 documents (excluding the System web and Web topics), most of them German. Below is the language facet (from the Solr admin search):
"language": [
  "en", 7800,
  "de", 3789,
  "it", 10,
  "fr", 9,
  "es", 4,
  "id", 4,
  "no", 4,
  "fi", 2,
  "hu", 2,
  "nl", 2,
  "pt", 1,
  "sv", 1
]
--
UlrichLeodolter - 03 Jan 2019