Opened 2 years ago
Last modified 2 weeks ago
#711 in_review enhancement
System Collections
| Reported by: | skaplun | Owned by: | simko |
|---|---|---|---|
| Priority: | minor | Milestone: | v1.2 |
| Component: | WebSearch | Version: | |
| Keywords: | | Cc: | |
Description
For general repository health monitoring, and to refactor certain computationally intensive algorithms that are spread around the Invenio codebase by pre-computing special collections, it would be great to enhance WebColl, and the WebSearch module in general, to support a new type of collection called System Collections. These collections would behave like normal collections in everything but their definition, which cannot be expressed by a normal query and must therefore be specified directly in the code base. These System Collections are:
- Empty Records
- containing all the records that have an ID but nothing else (i.e. no XM)
- Deleted Records
- containing all the records that have DELETED in 980__% (which is the convention in Invenio to mark a record as deleted).
- Restricted Records
- containing all the records that belong to at least one restricted collection (this would greatly speed up the runtime computation for checking authorizations)
- Classified Records
- containing all the records that belong to at least one real collection (a record that does not belong to any such collection will surely not be available to anyone but its owner or superadmin)
- Unclassified Records
- this is the counterpart of Classified Records, and will contain all the records that do not belong to any collection and are therefore accessible only to their owners or to superadmin
- Existing Records
- this is the union of the Classified and Unclassified Records collections
- Public Records
- this will be a sort of alias for the Home collection, as it will contain all the records that are searchable from the Home collection and are a priori discoverable by a crawler.
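The set relationships among the membership-based collections above can be sketched with plain Python sets. This is only an illustration: the function name, its parameters, and the notion of a caller-supplied set of candidate records are assumptions and not part of WebColl.

```python
def derive_system_collections(candidate_recids, real_collections, restricted_names):
    """Derive the membership-based system collections.

    candidate_recids: set of record IDs considered for classification
        (assumption: the caller decides beforehand whether empty or
        deleted records are excluded).
    real_collections: dict mapping collection name -> set of record IDs.
    restricted_names: set of names of the restricted collections.
    """
    classified = set()
    restricted = set()
    for name, recids in real_collections.items():
        members = recids & candidate_recids
        classified |= members
        if name in restricted_names:
            restricted |= members
    # Unclassified is the counterpart of Classified within the candidates.
    unclassified = candidate_recids - classified
    return {
        "Classified Records": classified,
        "Unclassified Records": unclassified,
        "Restricted Records": restricted,
        "Existing Records": classified | unclassified,
    }
```

With these definitions, Existing Records always equals the candidate set, which is why the choice of candidates matters.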
To keep these collections safe, they will actually be given improbable names such as "System Collection -- Empty Records" and will be treated specially by both WebColl and the WebSearch Admin Interface (e.g. it should not be possible to delete such a collection, and if an admin attaches these collections as real children of real collections, webcoll must ignore them when computing the real collection).
tabcreate.sql can come with a default configuration in which there is an unattached System Collection with all of the above collections attached to it as virtual collections.
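The two safeguards mentioned above (no deletion, and ignoring system collections attached as real children) could look roughly like this. It is a minimal sketch: the helper names are hypothetical, and recognising system collections by a name prefix is an assumption based on the example name given in this ticket.

```python
# Prefix taken from the example name "System Collection -- Empty Records".
SYSTEM_COLLECTION_PREFIX = "System Collection -- "


def is_system_collection(name):
    """Return True if the collection name marks a system collection."""
    return name.startswith(SYSTEM_COLLECTION_PREFIX)


def can_delete_collection(name):
    """The admin interface should refuse to delete system collections."""
    return not is_system_collection(name)


def real_children(child_names):
    """Children to use when computing a real collection.

    System collections attached as real children are silently skipped.
    """
    return [name for name in child_names if not is_system_collection(name)]
```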
Change History (9)
comment:1 Changed 2 years ago by simko
comment:2 follow-up: ↓ 4 Changed 2 years ago by jblayloc
I think this ticket is a really good idea.
Perhaps if we want the system collections to be unlikely to have namespace conflicts, we should set and fetch their names via CFG variables (which probably shouldn't be in the normal place), and the names could be even less likely to produce conflicts. A SHA1 of the system time and CFG variable name, for example.
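As a rough illustration of the hashing idea, the sketch below derives a name from a CFG variable name and a timestamp using the standard `hashlib` module. The function name and the example variable name in the usage note are hypothetical.

```python
import hashlib
import time


def make_system_collection_name(cfg_variable_name, timestamp=None):
    """Return a SHA1 hex digest built from the CFG variable name and a time.

    If no timestamp is given, the current system time is used, so the
    result differs on every call.
    """
    if timestamp is None:
        timestamp = time.time()
    payload = "%s:%r" % (cfg_variable_name, timestamp)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

For example, `make_system_collection_name("CFG_WEBSEARCH_SYSCOLL_EMPTY")` yields a 40-character hexadecimal string. Because the name depends on the time of generation, it would have to be generated once and stored.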
comment:3 Changed 2 years ago by jcaffaro
I can also imagine (if possible and useful) that the following "system" collections would be good candidates:
- Authority Records
- containing all the authority records (dunno the criteria yet)
- Bibliographic Records
- containing all the bibliographic records (dunno the criteria yet)
(Just throwing it there, following some IRL musings with wiki:Team/ChristopherDickinson on authority records, though the authority collections might be handled in a slightly more flexible way, and might be out of the scope of this ticket)
comment:4 in reply to: ↑ 2 Changed 2 years ago by skaplun
Hi Joe,
Replying to jblayloc:
I think this ticket is a really good idea.
Perhaps if we want the system collections to be unlikely to have namespace conflicts, we should set and fetch their names via CFG variables (which probably shouldn't be in the normal place), and the names could be even less likely to produce conflicts. A SHA1 of the system time and CFG variable name, for example.
Well, these collections will be defined from the beginning in Invenio, so each of them will have a well-defined name that doesn't need to change all the time. In principle, calling them "System Collection -- FOO Records" should be geeky enough to avoid any conflict with admins, but yes, a CFG_WEBSEARCH_SYSTEM_COLLECTIONS variable (statically stored in search_engine_config.py) will make it easy to write checks in the Admin interface that prevent admins from using these special names (which is definitely unlikely :-) ).
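A minimal sketch of such a check follows. Only the variable name CFG_WEBSEARCH_SYSTEM_COLLECTIONS comes from this comment; its contents (built from the collections listed in the description) and the validation function are assumptions for illustration.

```python
# Hypothetical contents of the variable in search_engine_config.py.
CFG_WEBSEARCH_SYSTEM_COLLECTIONS = tuple(
    "System Collection -- %s Records" % kind
    for kind in (
        "Empty",
        "Deleted",
        "Restricted",
        "Classified",
        "Unclassified",
        "Existing",
        "Public",
    )
)


def validate_admin_collection_name(name):
    """Return the name unchanged, or raise ValueError if it is reserved.

    Intended to be called when an admin creates or renames a collection.
    """
    if name.strip() in CFG_WEBSEARCH_SYSTEM_COLLECTIONS:
        raise ValueError("'%s' is reserved for a system collection" % name)
    return name
```

Keeping the names in one static tuple means the admin interface and webcoll could share the same list.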
comment:5 Changed 19 months ago by rajimene
- Owner set to rajimene
- Status changed from new to assigned
comment:6 Changed 18 months ago by rajimene
- Status changed from assigned to in_merge
comment:8 Changed 5 months ago by skaplun
- Milestone set to v1.2
comment:9 Changed 2 weeks ago by simko
- Owner changed from rajimene to simko
- Status changed from in_merge to in_review

One more special collection would be Merged Records, which would list all the records that used to be independent but that cataloguers merged as dupes in BibMerge, via 970__d. This is a special category of deleted records that may be useful to single out. See also ticket:514.
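To show how merged records could be told apart from other deleted ones, here is a small sketch based on the two MARC conventions mentioned in this ticket (DELETED in 980__% and the presence of 970__d). The record representation, a list of `(tag, ind1, ind2, subfield_code, value)` tuples with `_` for a blank indicator, and both function names are hypothetical.

```python
def is_deleted(record):
    """True if any subfield of 980 with blank indicators holds DELETED."""
    return any(
        tag == "980" and ind1 == "_" and ind2 == "_" and value == "DELETED"
        for tag, ind1, ind2, code, value in record
    )


def is_merged(record):
    """True if the record carries a 970__d subfield."""
    return any(
        tag == "970" and ind1 == "_" and ind2 == "_" and code == "d"
        for tag, ind1, ind2, code, value in record
    )
```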
Speaking of terminology, "classified records" ("unclassified records") may create an unwanted link to BibClassify, so perhaps "alive records" ("zombie records") or "attributed records" ("unattributed records") would be better, as we mused about originally.