Opened 3 years ago

Last modified 3 years ago

#827 new defect

Error when PyStemmer is not installed and stemming is still enabled

Reported by: rchyla
Owned by:
Priority: major
Milestone:
Component: BibIndex
Version:
Keywords:
Cc:

Description

I don't know the cause yet, but when the global index (in the demo site) has stemming enabled and PyStemmer is not installed, the index contains only stemmed values, not the original tokens.

To reproduce:

  1. load demo records
  2. search for ellis [0 hits]
  3. search for elli [11 hits]

Or:

  1. load demo records
  2. change configuration of global index (deactivate stemming)
  3. search for ellis [11 hits]

Installing PyStemmer solves the issue, but PyStemmer is only recommended, not required.

I still have to find out what is doing the stemming instead, and why the original words are not indexed as well.

Change History (2)

comment:1 Changed 3 years ago by skaplun

  • Status changed from new to infoneeded_new

When PyStemmer is not installed, Invenio falls back on a pure-Python implementation of the Porter stemming algorithm for English, and still applies stemming if required.

When stemming is enabled on an index, only the stemmed word is stored, not the original one. Therefore I don't see anything going wrong in what you mention above.

If you actually disable stemming (regardless of whether PyStemmer is installed), then the original term will be stored in the index instead...
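
The two cases can be sketched with a toy inverted index. This is an illustration under stated assumptions, not the BibIndex implementation: `toy_stem`, `build_index` and the two sample records are all made up for the example.

```python
from collections import defaultdict


def toy_stem(word):
    # Hypothetical stemmer: strips a trailing "s" (but not "ss").
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_index(records, stemming):
    """Map each term to the set of record IDs that contain it."""
    index = defaultdict(set)
    for recid, text in records.items():
        for token in text.lower().split():
            # With stemming on, only the stem is stored, never the original token.
            term = toy_stem(token) if stemming else token
            index[term].add(recid)
    return index


records = {1: "Ellis J", 2: "Ellis N"}  # made-up sample data

stemmed = build_index(records, stemming=True)
plain = build_index(records, stemming=False)

print(sorted(stemmed))  # terms stored when stemming is enabled
print(sorted(plain))    # terms stored when stemming is disabled
```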

Cheers!

Sam

comment:2 Changed 3 years ago by skaplun

  • Status changed from infoneeded_new to new

Actually, Ludmila pointed out to me that you are doing high-level searching (when replying to you I had in mind just low-level fiddling with the indexing tables). So yes, what you describe suggests there might be a mismatch between what the indexing engine does with respect to stemming when PyStemmer is not installed and what the search engine does with respect to stemming in the same situation. It might well be that the search engine simply assumes no stemming when PyStemmer is missing, which is wrong with respect to the indexing layer.

I will look more into this...
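
The suspected mismatch from comment:2 can be sketched as follows. This only illustrates the hypothesis and is not a confirmed diagnosis: `toy_stem`, `INDEX`, `search` and the record IDs are hypothetical, and the `stem_query` flag stands in for whatever the search engine decides when PyStemmer is absent.

```python
def toy_stem(word):
    # Hypothetical stemmer: strips a trailing "s" (but not "ss").
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


# Index as built with stemming enabled: only the stem "elli" is stored.
INDEX = {"elli": {1, 2, 3}}  # made-up record IDs


def search(term, stem_query):
    """Look up a term, stemming the query only if stem_query is true."""
    key = term.lower()
    if stem_query:
        key = toy_stem(key)
    return INDEX.get(key, set())


# Search side wrongly assumes "no PyStemmer means no stemming":
print(len(search("ellis", stem_query=False)))  # unstemmed query misses the stemmed index
print(len(search("elli", stem_query=False)))   # the raw stem happens to match

# Search side stems the query the same way the indexer did:
print(len(search("ellis", stem_query=True)))
```

If this is what happens, the fix would be to make the query side apply the same stemmer as the indexing side, whichever implementation is available.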
