Opened 3 years ago

Last modified 3 months ago

#711 in_work enhancement

System Collections

Reported by: skaplun Owned by: skaplun
Priority: major Milestone: v1.2
Component: WebSearch Version: master
Keywords: SCOAP3 Cc:

Description

For general repository health monitoring, and to refactor certain computationally intensive algorithms that are spread around the Invenio codebase by pre-computing special collections, it would be great to enhance WebColl, and the WebSearch module in general, to support a new type of collection called System Collections. These collections would behave like normal collections in everything but their definition, which cannot be expressed by a normal query and must therefore be specified directly in the code base. These System Collections are:

Empty Records
containing all the records that have an ID but nothing else (i.e. no XM)
Deleted Records
containing all the records that have DELETED in 980__% (which is the convention in Invenio to mark a record as deleted).
Restricted Records
containing all the records that belong to at least one restricted collection (this would greatly speed up the runtime computation for checking authorizations)
Classified Records
containing all the records that belong to at least one real collection (a record that does not belong to any such collection will surely not be available to anyone but its owner or a superadmin)
Unclassified Records
this is the counterpart of Classified Records, and will contain all the records that do not belong to any collection and are therefore accessible only to their owners or to a superadmin
Existing Records
this is the union of the Classified and Unclassified Records collections
Public Records
this will be a sort of alias for the Home collection, as it will contain all the records that are searchable from Home and are a priori discoverable by a crawler.
Note that a new record will initially not belong to any of the above collections (as webcoll still needs to be run). Once webcoll has classified it, it will belong either to the Classified Records collection or to the Unclassified Records collection.
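
To illustrate how the definitions above relate to one another, here is a minimal sketch in plain Python sets. It is not the actual implementation: the function name, its parameters and the input data are hypothetical, and it assumes that Unclassified Records means "records with content that are in no real collection", which the description does not spell out.

```python
def system_collections(allocated, with_content, deleted,
                       real_collections, restricted, public):
    """Return a dict mapping system collection name -> set of recids.

    All parameters are hypothetical inputs for this sketch:
    allocated        -- every record ID that has been allocated
    with_content     -- record IDs that have a stored XM format
    deleted          -- record IDs that have DELETED in 980__%
    real_collections -- dict of real collection name -> set of recids
    restricted       -- names of the real collections that are restricted
    public           -- record IDs searchable from the Home collection
    """
    # Records that belong to at least one real collection.
    classified = set().union(*real_collections.values())
    # Records that belong to at least one restricted collection.
    restricted_recs = set().union(
        *(real_collections[name] for name in restricted))
    # Assumption: the counterpart is taken among records that have content.
    unclassified = set(with_content) - classified
    return {
        "Empty Records": set(allocated) - set(with_content),
        "Deleted Records": set(deleted),
        "Restricted Records": restricted_recs,
        "Classified Records": classified,
        "Unclassified Records": unclassified,
        "Existing Records": classified | unclassified,
        "Public Records": set(public),
    }


# Example with made-up record IDs.
example = system_collections(
    allocated={1, 2, 3, 4, 5, 6},
    with_content={1, 2, 3, 4, 5},
    deleted={5},
    real_collections={"Articles": {1, 2}, "Theses": {2, 3}},
    restricted=["Theses"],
    public={1, 2},
)
```

Each collection is obtained with a handful of whole-set operations rather than by looking at records one at a time.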


In order to make these collections safe, they will actually be given an improbable name such as "System Collection -- Empty Records" and be treated specially both by WebColl and by the WebSearch Admin Interface (e.g. it should not be possible to delete such a collection, and if an admin attaches these collections as real children of real collections, webcoll must ignore them when computing the real collection).
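
The special treatment could look roughly like the sketch below. The name prefix is taken from the example name above; the function names and the shape of the data are hypothetical and do not reflect the actual WebColl or WebSearch Admin code.

```python
# Prefix taken from the example name in this ticket.
SYSTEM_PREFIX = "System Collection -- "


def is_system_collection(name):
    """Tell whether a collection name denotes a system collection."""
    return name.startswith(SYSTEM_PREFIX)


def can_delete(name):
    """Hypothetical admin-side guard: system collections are not deletable."""
    return not is_system_collection(name)


def real_collection_recids(own_recids, real_children):
    """Compute the recids of a real collection from its real children.

    real_children is a hypothetical dict of child name -> set of recids.
    System collections attached as real children are ignored.
    """
    recids = set(own_recids)
    for child_name, child_recids in real_children.items():
        if is_system_collection(child_name):
            continue  # wrongly attached system collection: skip it
        recids |= child_recids
    return recids
```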


tabcreate.sql can come with a default configuration in which there is an unattached System Collection with all of the above collections attached to it as virtual collections.

Change History (11)

comment:1 Changed 3 years ago by simko

One more special collection would be Merged records that will list all the records that used to be independent but that cataloguers merged as dupes in BibMerge, via 970__d. This is a special category of deleted records that may be useful to single out. See also ticket:514.

Speaking of terminology, "classified records" ("unclassified records") may create an unwanted link to BibClassify, so perhaps "alive records" ("zombie records") or "attributed records" ("unattributed records") would be better, as we mused about originally.

comment:2 follow-up: Changed 3 years ago by jblayloc

I think this ticket is a really good idea.

Perhaps if we want the system collections to be unlikely to have namespace conflicts, we should set and fetch their names via CFG variables (which probably shouldn't be in the normal place), and the names could be even less likely to produce conflicts. A SHA1 of the system time and CFG variable name, for example.

comment:3 Changed 3 years ago by jcaffaro

I can also imagine (if possible and useful) that the following "system" collections would be good candidates:

Authority Records
containing all the authority records (dunno the criteria yet)
Bibliographic Records
containing all the bibliographic records (dunno the criteria yet)

(Just throwing it there, following some IRL musings with wiki:Team/ChristopherDickinson on authority records, though the authority collections might be handled in a slightly more flexible way, and might be out of the scope of this ticket)

comment:4 in reply to: ↑ 2 Changed 3 years ago by skaplun

Hi Joe,

Replying to jblayloc:

I think this ticket is a really good idea.

Perhaps if we want the system collections to be unlikely to have namespace conflicts, we should set and fetch their names via CFG variables (which probably shouldn't be in the normal place), and the names could be even less likely to produce conflicts. A SHA1 of the system time and CFG variable name, for example.

Well, these collections will be defined in Invenio from the beginning, so each of them will have a well-defined name that doesn't need to change all the time. In principle, calling them System Collection -- FOO Records should be geeky enough to avoid any conflict with admins, but yes, a CFG_WEBSEARCH_SYSTEM_COLLECTIONS variable (statically stored in search_engine_config.py) will make it easy to write checks in the Admin interface to prevent admins from using these special names (which is definitely unlikely :-) ).
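
A rough sketch of such a check, just to make the idea concrete. Only the variable name comes from this comment; its contents (a tuple of full names) and the validate_collection_name helper are assumptions, not existing Invenio code.

```python
# Assumed contents: the names follow the "System Collection -- FOO Records"
# pattern and the collections listed in the ticket description.
CFG_WEBSEARCH_SYSTEM_COLLECTIONS = (
    "System Collection -- Empty Records",
    "System Collection -- Deleted Records",
    "System Collection -- Restricted Records",
    "System Collection -- Classified Records",
    "System Collection -- Unclassified Records",
    "System Collection -- Existing Records",
    "System Collection -- Public Records",
)


def validate_collection_name(name):
    """Hypothetical admin-interface check for a new collection name.

    Returns the stripped name, or raises ValueError when the name is
    reserved for a system collection.
    """
    cleaned = name.strip()
    if cleaned in CFG_WEBSEARCH_SYSTEM_COLLECTIONS:
        raise ValueError("%r is reserved for a system collection" % cleaned)
    return cleaned
```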

comment:5 Changed 2 years ago by rajimene

  • Owner set to rajimene
  • Status changed from new to assigned

comment:6 Changed 2 years ago by rajimene

  • Status changed from assigned to in_merge

comment:7 Changed 2 years ago by simko

  • Milestone v1.1 deleted


comment:8 Changed 15 months ago by skaplun

  • Milestone set to v1.2

comment:9 Changed 11 months ago by simko

  • Owner changed from rajimene to simko
  • Status changed from in_merge to in_review

comment:10 Changed 3 months ago by skaplun

  • Owner changed from simko to skaplun
  • Status changed from in_review to in_work

The current implementation suffers from performance issues (due to assigning recids to system collections one by one rather than using set theory and intbitset).

I have rebased the current implementation against the latest master and will work towards making it production ready. (See: sam/711-system-collections)

comment:11 Changed 3 months ago by skaplun

  • Keywords SCOAP3 added
  • Priority changed from minor to major
  • Version set to master