Opened 4 years ago

Last modified 3 years ago

#71 new enhancement

WebSearch: new option to include title words in the record URLs

Reported by: simko Owned by:
Priority: minor Milestone:
Component: WebSearch Version:
Keywords: Cc:

Description

It may help moderately with search engine rankings if the URLs of
detailed record pages contain important words from the title. That is,
in addition to setting up the page title and the meta header section,
as we already do, we could also put the title into the URL.

To achieve this, we can introduce a new config variable, named something
like CFG_WEBSTYLE_DETAILED_RECORD_LINKS, that would take values such as:

  • 0 (=normal style):

http://site.com/record/32
http://site.com/record/32/holdings/

  • 1 (=embed titles in URLs):

http://site.com/record/32-basic-nuclear-electronics
http://site.com/record/32-basic-nuclear-electronics/holdings/

Here, for simplicity, the URL dispatcher can still treat only the record
ID as significant when deciding about the dispatch, so it could ignore
any text coming after the record ID and a dash. Alternatively, it could
use that text to fuzzy-check the title. The latter may be interesting
for the "let's provide meaningful URLs" use case discussed elsewhere
(e.g. DOI instead of recID).
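A minimal sketch of the simpler dispatch variant, where only the leading record ID is significant and anything after the first dash is carried along but not interpreted. The function name `parse_record_segment` and the exact pattern are assumptions made for illustration; they are not part of the existing dispatcher.

```python
import re

# Hypothetical pattern: a numeric record ID, optionally followed by
# a dash and arbitrary slug text (e.g. "32-basic-nuclear-electronics").
_RECORD_SEGMENT = re.compile(r"([0-9]+)(?:-(.*))?")


def parse_record_segment(segment):
    """Split the path component after /record/ into (recid, slug).

    Returns None when the component does not start with a record ID.
    The slug is returned as-is and may be empty.
    """
    match = _RECORD_SEGMENT.fullmatch(segment)
    if match is None:
        return None
    recid = int(match.group(1))
    slug = match.group(2) or ""
    return recid, slug


if __name__ == "__main__":
    print(parse_record_segment("32"))
    print(parse_record_segment("32-basic-nuclear-electronics"))
```

With this approach, both URL styles listed above resolve to the same record, so switching the config value would not break existing links.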

Change History (2)

comment:1 Changed 4 years ago by skaplun

Indeed, as discussed IRL, we should probably raise a 404 when the title used is wrong (to avoid misuse, e.g. for spam purposes).

This can be implemented via a tmpl_ function so that the final admin would be able to use whatever algorithm to produce the semantic part.

A possible default implementation might be to take the 4 longest words in the title and use them in order of appearance (e.g.

"Search for the minimal universal extra dimension model at the LHC with √s = 7 TeV"

would become

"search-minimal-universal-dimension"

)

Moreover, a function should be implemented to check that all these words are actually part of the title. A problem would arise if the record has been modified. In that case, previous versions of the record should be checked as well. This would be computationally heavy, but would happen rarely.
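One way such a check might look, as a sketch only: the caller passes the current title first, followed by any earlier titles retrieved from the record history. The name `slug_matches_any_title` is hypothetical, and how the historical titles are fetched is deliberately left out.

```python
import re


def slug_matches_any_title(slug, titles):
    """Return True if every word of the slug occurs in at least one title.

    `titles` is an iterable of title strings, e.g. the current title
    followed by titles from previous versions of the record.
    """
    slug_words = {word for word in slug.lower().split("-") if word}
    for title in titles:
        # Assumption: same tokenisation as used when building slugs.
        title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
        if slug_words <= title_words:
            return True
    return False


if __name__ == "__main__":
    history = ["Advanced nuclear electronics", "Basic nuclear electronics"]
    print(slug_matches_any_title("basic-nuclear-electronics", history))
    print(slug_matches_any_title("cheap-pills-online", history))
```

Because the iteration stops at the first matching title, the expensive look-up of older versions is only needed when the current title does not match.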

comment:2 Changed 3 years ago by jcaffaro

As discussed IRL, raising 404 when title does not match is problematic for cases where the title has been updated. If we still want to resolve previous URLs to the record (but avoid misuses) we would have to a) check the titles in the history of the record or b) keep a list of resolved URLs. Alternatively we can c) accept any string (even bad ones) but immediately redirect to the canonical URL so that misuses are less visible.
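Option c) could be sketched as follows: any slug is accepted, but a request whose slug differs from the canonical one is answered with a redirect. The function name `resolve_record_url`, the `/record/` path layout, and the choice of status code 301 are assumptions for illustration.

```python
def resolve_record_url(recid, requested_slug, canonical_slug, subpath=""):
    """Decide how to answer a request for /record/<recid>-<slug><subpath>.

    Returns a (status, url) pair: 200 when the requested slug is already
    canonical, otherwise 301 with the canonical URL to redirect to.
    """
    if canonical_slug:
        canonical = "/record/%d-%s" % (recid, canonical_slug)
    else:
        canonical = "/record/%d" % recid
    canonical += subpath
    if requested_slug == canonical_slug:
        return 200, canonical
    # Assumption: a permanent redirect is used for non-canonical slugs.
    return 301, canonical


if __name__ == "__main__":
    print(resolve_record_url(32, "buy-cheap-pills", "basic-nuclear-electronics"))
```

The bad string still resolves, but it never stays in the address bar and is not the URL a crawler ends up on, which is what makes the misuse less visible.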
