Opened 22 months ago

Last modified 5 months ago

#1085 new enhancement

BibIndex: add possibility to transliterate phrases

Reported by: simko
Owned by:
Priority: major
Milestone:
Component: BibIndex
Version:
Keywords:
Cc:

Description

1) It would be useful to be able to transliterate phrases at indexing time, especially for author names. This would read the string stored in the DB and optionally generate more terms to index, much like stemming does, depending on the index configuration.

2) We may want to put the generated terms into separate indexes, though, in case one would like to search for the exact value as opposed to the 'fuzzy' transliterated value. This is similar to what Xapian does with its Z forms for stemming, so that people can search for either the stemmed or the non-stemmed version. The same could also be applied to stemming; we have had such requests in the past.

2) We could use Unidecode for this. Theodoros writes:

Of all(?) the packages, I found that Unidecode
(http://pypi.python.org/pypi/Unidecode) supports most of the languages
that could be transliterated (although for Greek it does not follow
the standard ISO 843 but a 'custom' scheme, which is not very good
practice. More details on the official transliteration standards for
Greek are here: http://transliteration.eki.ee/pdf/Greek.pdf).

The usage is very simple, and I ran an example with the following VERY
complex Unicode string (with Hebrew, Hindi, Chinese and Greek):
---------------
The decomposition mapping is <츠, ᆸ>, and not <ᄎ, ᅳ, 11B8>.
<p>The title says ‫פעילות הבינאום, W3C‬ in Hebrew</p>
abcáßçकखी國際𐎄𐎔𐎘
Ελληνικά
---------------

and is converted to:

---------------
The decomposition mapping is <ceu, b>, and not <c, eu, 11B8>.
<p>The title says p`ylvt hbynvm, W3C in Hebrew</p>
abcassckkhiiGuo Ji
Ellenika
---------------
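
For reference, the conversion above amounts to a single call. The following is a minimal sketch, assuming the Unidecode package is installed. The `to_ascii` wrapper and its standard-library fallback are illustrative additions, not part of the package; the fallback only strips combining accents and is much weaker than Unidecode.

```python
import unicodedata

try:
    # Third-party package discussed above (pip install Unidecode).
    from unidecode import unidecode
except ImportError:
    unidecode = None


def to_ascii(text):
    """Return an ASCII approximation of ``text`` (hypothetical helper)."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback: decompose, drop combining marks, then discard whatever
    # is still non-ASCII. Non-Latin scripts are simply lost here.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Müller"))
```

With Unidecode available, `to_ascii("Ελληνικά")` should give the `Ellenika` shown in the example above; the fallback path cannot handle Greek at all, which is the main reason to prefer the package.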

3) As for per-index configuration of the transliteration, see also ticket:852.
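
Taken together, the points above could be sketched roughly as follows. This is only a toy illustration, not BibIndex code: `INDEX_CONFIG`, `PhraseIndex` and `transliterate` are hypothetical names, the index lives in memory, and the transliterator is a standard-library stand-in that only strips combining accents.

```python
import unicodedata
from collections import defaultdict

# Hypothetical per-index settings; the names are illustrative only.
INDEX_CONFIG = {
    "author": {"transliterate": True},
    "title": {"transliterate": False},
}


def transliterate(term):
    """Stand-in transliterator: strips combining accents only."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


class PhraseIndex:
    """Toy inverted index keeping exact and transliterated terms apart."""

    def __init__(self, name):
        self.transliterate = INDEX_CONFIG[name]["transliterate"]
        self.exact = defaultdict(set)  # phrase as stored in the DB
        self.fuzzy = defaultdict(set)  # generated transliterated phrase

    def add(self, recid, phrase):
        self.exact[phrase].add(recid)
        if self.transliterate:
            self.fuzzy[transliterate(phrase)].add(recid)

    def search(self, phrase, fuzzy=False):
        # Fall back to exact matching when the index has no fuzzy terms.
        if not fuzzy or not self.transliterate:
            return set(self.exact.get(phrase, ()))
        # The query is transliterated too, so both spellings match.
        return set(self.fuzzy.get(transliterate(phrase), ()))


authors = PhraseIndex("author")
authors.add(1, "Müller, Hans")
authors.add(2, "Muller, Hans")
print(authors.search("Muller, Hans"))              # {2}
print(authors.search("Muller, Hans", fuzzy=True))  # {1, 2}
```

Keeping the generated terms in their own mapping is what lets the exact search stay exact; the same split would work for stemmed versus non-stemmed terms.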

Change History (1)

comment:1 Changed 5 months ago by arwagner

In the case of authority records, one might get a bunch of such "transliterations" via the "additional name forms" in 400% and friends. Say you have

1001_ $a Müller, Hans
4001_ $a Muller, H
4001_ $a Mueller, H

This would signify that Muller and Mueller are "known names" for Müller.

Note that this also handles the case where Müller, Hans changes his name, like

4001_ $a Schmidt, H

so "Schmidt, H" would also be a valid form of the name "Müller, Hans". Think of marriage, divorce, pseudonyms, etc. For an elaborate example of this issue, check out

http://viaf.org/viaf/24602065/#Goethe,_Johann_Wolfgang_%CB%9Cvon%C5%93_1749-1832

so this should be taken into account here as well.
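
As a rough illustration of the idea, the name forms of an authority record could be collected into a lookup that the indexer consults. This is a sketch under assumptions: the dict layout of the record, the function names, and the reading of "400% and friends" as "any 4XX field" are all made up for the example.

```python
from collections import defaultdict

# Toy authority record: MARC tag plus indicators -> list of $a values.
# The dict layout is an assumption made for this example.
AUTHORITY_RECORD = {
    "1001_": ["Müller, Hans"],
    "4001_": ["Muller, H", "Mueller, H", "Schmidt, H"],
}


def name_forms(record):
    """Return (main heading, all known forms including the heading)."""
    heading = record["1001_"][0]
    variants = [
        value
        for tag, values in record.items()
        if tag.startswith("4")  # assumed: every 4XX field is a name form
        for value in values
    ]
    return heading, [heading] + variants


def build_lookup(records):
    """Map every known name form to the main headings it stands for."""
    lookup = defaultdict(set)
    for record in records:
        heading, forms = name_forms(record)
        for form in forms:
            lookup[form].add(heading)
    return lookup


lookup = build_lookup([AUTHORITY_RECORD])
print(lookup["Schmidt, H"])
```

A form such as "Schmidt, H" cannot be derived by any transliteration rule, so a lookup of this kind would complement the generated terms rather than replace them.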
