Opened 11 months ago

#1085 new enhancement

BibIndex: add possibility to transliterate phrases

Reported by: simko Owned by:
Priority: major Milestone:
Component: BibIndex Version:
Keywords: Cc:

Description

1) It would be useful to add the possibility of transliterating phrases at indexing time, especially for author names. This would read the string stored in the DB and optionally generate additional terms to index, much like stemming does, depending on the index configuration.

2) We may want to put the generated terms into separate indexes, though, in case one would like to search for the exact value as opposed to the 'fuzzy' transliterated value. This is similar to what Xapian does with its Z forms for stemming, so that people can search for either the stemmed or the non-stemmed version. This could also be applied to stemming; we have had such requests in the past.

2) We could use unidecode for this. Theodoros writes:

Of all(?) the packages, I found that Unidecode
(http://pypi.python.org/pypi/Unidecode) supports most of the languages
that could be transliterated (although for Greek it does not follow
the ISO 843 standard but a 'custom' scheme, which is not good
practice. More details on the official transliteration standards for
Greek are here: http://transliteration.eki.ee/pdf/Greek.pdf).

The usage is very simple, and I ran an example with the following VERY
complex Unicode string (with Hebrew, Hindi, Chinese and Greek):
---------------
The decomposition mapping is <츠, U+11B8>, and not <0x110E, ᅳ, 11B8>.
<p>The title says ‫פעילות הבינאום, W3C‬ in Hebrew</p>
abcáßçकखी國際𐎄𐎔𐎘
Ελληνικά
---------------

and is converted to:

---------------
The decomposition mapping is <ceu, b>, and not <c, eu, 11B8>.
<p>The title says p`ylvt hbynvm, W3C in Hebrew</p>
abcassckkhiiGuo Ji
Ellenika
---------------
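
For reference, a minimal sketch of the kind of call described above. The helper name transliterate_phrase and the HAVE_UNIDECODE flag are made up for illustration and are not part of BibIndex. Unidecode is a third-party package, so the sketch treats it as optional and returns the phrase unchanged when it is not installed.

```python
# Optional dependency: Unidecode (third-party, "pip install Unidecode").
try:
    from unidecode import unidecode
    HAVE_UNIDECODE = True
except ImportError:
    HAVE_UNIDECODE = False


def transliterate_phrase(phrase):
    """Return an ASCII approximation of phrase.

    Hypothetical helper: if Unidecode is missing, the phrase is returned
    unchanged, so transliteration stays optional.
    """
    if not HAVE_UNIDECODE:
        return phrase
    return unidecode(phrase)


print(transliterate_phrase("Ελληνικά"))
```

With Unidecode installed, this should print the same "Ellenika" reported in the output above.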

3) As for per-index configuration of the transliteration, see also ticket:852.
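
To illustrate the first two points (extra terms generated at indexing time, kept apart from the exact value), here is a rough sketch. The index names author_exact and author_translit and both function names are hypothetical. As a stand-in for real transliteration it only strips diacritics using the standard unicodedata module, which does not handle non-Latin scripts; any other callable, such as Unidecode's, could be passed in instead.

```python
import unicodedata


def strip_diacritics(text):
    """Crude stand-in for transliteration: drop combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def terms_for_indexes(phrase, transliterate=strip_diacritics):
    """Map each (hypothetical) index name to the terms it should receive.

    The exact index gets the stored value as is; the transliterated index
    always gets the folded form, so a folded query can match every record.
    """
    return {
        "author_exact": [phrase],
        "author_translit": [transliterate(phrase)],
    }


print(terms_for_indexes("Šimko"))
```

Keeping the two term lists separate lets a search choose between the exact value and the 'fuzzy' one, instead of mixing both in a single index.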

Change History (0)