Opened 11 months ago
#1085 new enhancement
BibIndex: add possibility to transliterate phrases
| Reported by: | simko | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | BibIndex | Version: | |
| Keywords: | | Cc: | |
Description
1) It would be useful to be able to transliterate phrases at indexing time, especially for author names. This would read the string stored in the DB and optionally generate additional terms to index, much like stemming does, depending on the index configuration.
2) We may want to put the generated terms into separate indexes, though, in case one would like to search for the exact value as opposed to the 'fuzzy' transliterated value. This is similar to what Xapian does with its Z forms for stemming, so that people can search for either the stemmed or the non-stemmed version. This could also be applied to stemming; we have had such requests in the past.
3) We could use unidecode for this. Theodoros writes:
Of all(?) the packages, I found that Unidecode (http://pypi.python.org/pypi/Unidecode) supports most of the languages that could be transliterated (although for Greek it does not support the standard ISO 843 but a 'custom' one, which is not very good practice. More details on the official transliteration standards for Greek here: http://transliteration.eki.ee/pdf/Greek.pdf). The usage is very simple, and I ran an example with the following VERY complex Unicode string (with Hebrew, Hindi, Chinese and Greek):
---------------
The decomposition mapping is <츠, U+11B8>, and not <0x110E, ᅳ, 11B8>.
<p>The title says פעילות הבינאום, W3C in Hebrew</p>
abcáßçकखी國際𐎄𐎔𐎘 Ελληνικά
---------------
and is converted to:
---------------
The decomposition mapping is <ceu, b>, and not <c, eu, 11B8>.
<p>The title says p`ylvt hbynvm, W3C in Hebrew</p>
abcassckkhiiGuo Ji Ellenika
---------------
4) As for per-index configuration of the transliteration, see also ticket:852.
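To make the idea above a bit more concrete, here is a rough sketch of how a phrase could be turned into an exact term plus an optional transliterated term kept in a separate bucket. This is not existing BibIndex code: the names `transliterate`, `terms_for_indexing` and `transliterate_enabled` are hypothetical, the 'exact'/'translit' split only stands in for the separate indexes discussed above, and the standard-library fallback (used when Unidecode is not installed) is an assumption that merely strips accents and is much weaker than Unidecode.

```python
import unicodedata

try:
    # Third-party package suggested in this ticket; optional for this sketch.
    from unidecode import unidecode as _to_ascii
except ImportError:
    def _to_ascii(text):
        # Fallback (assumption): decompose with NFKD, strip combining marks,
        # and drop anything that is still not ASCII. Unlike Unidecode, this
        # does not transliterate non-Latin scripts at all.
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(ch for ch in decomposed
                           if not unicodedata.combining(ch))
        return stripped.encode("ascii", "ignore").decode("ascii")


def transliterate(phrase):
    """Return an ASCII approximation of the given phrase."""
    return _to_ascii(phrase).strip()


def terms_for_indexing(phrase, transliterate_enabled=True):
    """Return the terms to index for one phrase (hypothetical helper).

    The exact phrase always goes into 'exact'. A transliterated form is
    added to 'translit' only when the option is enabled and the result is
    non-empty and differs from the original phrase.
    """
    terms = {"exact": [phrase], "translit": []}
    if transliterate_enabled:
        ascii_form = transliterate(phrase)
        if ascii_form and ascii_form != phrase:
            terms["translit"].append(ascii_form)
    return terms
```

For example, `terms_for_indexing("Müller, J")` keeps the original phrase under 'exact' and puts `"Muller, J"` under 'translit', so a search could target either bucket. The `transliterate_enabled` flag is where a per-index setting (see ticket:852) would plug in.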
