BibUpload: optional use of bibxxx tables
Reported by: simko | Owned by: simko
During bibupload, the incoming record is broken up according to MARC tags
into many bibxxx tables (bib10x, bib11x, etc.), which results in
many SQL queries being issued by bibupload. The advantage of doing so is
that end users can then simply search in any MARC
tag. The disadvantage is that the uploading step takes time,
and that we are preparing indexes that may not even be used by
the end users at all. (Since they typically search in logical field
indexes, say firstauthor:ellis, not in physical MARC tags, say
In certain situations, it would be better not to create these indexes
at upload time, but to defer handling them until indexing time.
(Especially when using an external indexer such as Solr for the record
For this, it would be good to introduce a new configuration option
called, say, CFG_BIBUPLOAD_USE_BIBXXX that would be True by default
but that could optionally be set to False on a per-site basis. When
set to False, stage 4 of bibupload (the filling of bibxxx tables)
would not be executed.
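The proposed switch could be sketched roughly as follows. This is only an illustration of the idea, not bibupload code: upload_record, fill_bibxxx_tables and the log of simulated table writes are hypothetical names, and the other stages are reduced to placeholders.

```python
# Hypothetical sketch: none of these function bodies come from bibupload itself.
CFG_BIBUPLOAD_USE_BIBXXX = True  # proposed default; a site could set it to False


def fill_bibxxx_tables(rec_id, record, log):
    """Stand-in for stage 4: one simulated insert per MARC field value."""
    for tag, values in sorted(record.items()):
        for value in values:
            # Assumption: the table name follows the first two tag digits
            # (100__a -> bib10x), matching the bib10x, bib11x naming above.
            log.append(("bib%sx" % tag[:2], rec_id, tag, value))


def upload_record(rec_id, record, use_bibxxx=None):
    """Run a toy upload and return the list of simulated table writes."""
    if use_bibxxx is None:
        use_bibxxx = CFG_BIBUPLOAD_USE_BIBXXX
    log = [("bibrec", rec_id)]   # placeholder for the stages that always run
    if use_bibxxx:
        fill_bibxxx_tables(rec_id, record, log)   # stage 4, now optional
    log.append(("bibfmt", rec_id))   # placeholder: MARCXML copy is still stored
    return log
```

With the flag set to False, the toy upload touches only the placeholder tables; the per-field bibxxx writes disappear entirely.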
This would result in bibupload speed-ups that can be illustrated by
the following example taken from an INSPIRE-sized database (1M of
- example record CERN-TH-6002-91 from INSPIRE TEST (record ID 315385)
- timings to replace it, stage 4 enabled:
  ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
       1    0.006    0.006    4.112    4.112  bibupload.py:162(bibupload)
     256    0.003    0.000    4.095    0.016  dbquery.py:141(run_sql)
       1    0.001    0.001    2.632    2.632  bibupload.py:1550(update_database_with_metadata)
     109    0.001    0.000    2.605    0.024  bibupload.py:822(insert_record_bibxxx)
       1    0.000    0.000    1.255    1.255  bibupload.py:1780(delete_bibrec_bibxxx)
- timings to replace it, stage 4 disabled:
  ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
       1    0.000    0.000    0.020    0.020  bibupload.py:162(bibupload)
      37    0.001    0.000    0.017    0.000  dbquery.py:141(run_sql)
       1    0.000    0.000    0.006    0.006  bibupload.py:1780(delete_bibrec_bibxxx)
As can be seen, the upload is roughly 200 times faster (4.112 s versus
0.020 s cumulative, with 37 SQL queries instead of 256), i.e. about two
orders of magnitude, since we are not pre-creating those huge and
possibly unused bibxxx indexes.
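Timing tables in this format can be collected with Python's standard cProfile and pstats modules. The following is a minimal sketch under that assumption; toy_upload is a hypothetical stand-in for a real bibupload run, and the last line only recomputes the ratio of the two cumulative timings quoted above.

```python
import cProfile
import io
import pstats


def toy_upload(n_queries):
    """Hypothetical stand-in for an upload that issues n_queries SQL calls."""
    return sum(len("query %d" % i) for i in range(n_queries))


def profile_report(func, *args):
    """Profile func(*args); return the stats table sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)   # top 5 entries only
    return out.getvalue()


report = profile_report(toy_upload, 256)

# Ratio of the two cumulative timings from the tables above (in seconds).
speedup = 4.112 / 0.020
```

Printing report shows the same ncalls/tottime/percall/cumtime columns as in the tables above, which makes before/after comparisons straightforward.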
Important note: while it is simple to introduce such a
CFG_BIBUPLOAD_USE_BIBXXX variable for record uploading processes,
this variable should be propagated to other Invenio modules such as
searcher/indexer that should read record metadata from pre-stored
MARCXML formats (see table bibfmt) rather than from bibxxx tables.
When bibxxx tables are not in use, other Invenio modules can no
longer rely on the existence of bibxxx tables. So this
task is really bigger than it may seem. The setting of
CFG_BIBUPLOAD_USE_BIBXXX should therefore be progressively
propagated to all the Invenio modules that take the existence of
bibxxx for granted, starting with the most important modules
(indexer, searcher, editor, check for deleted records, etc).
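To illustrate what reading metadata from the stored MARCXML rather than from bibxxx could look like in such a module, here is a minimal sketch. The helper name get_fieldvalues_from_marcxml and the sample record are invented for illustration (in particular, placing the report number in tag 037 is an assumption), and fetching the MARCXML from the bibfmt table is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC21 "slim" XML schema.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Illustrative sample only; a real module would load this from bibfmt.
SAMPLE_MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">315385</controlfield>
  <datafield tag="037" ind1=" " ind2=" ">
    <subfield code="a">CERN-TH-6002-91</subfield>
  </datafield>
</record>"""


def get_fieldvalues_from_marcxml(marcxml, tag, code):
    """Return all subfield values for the given MARC tag and subfield code."""
    root = ET.fromstring(marcxml)
    values = []
    for field in root.iter(MARC_NS + "datafield"):
        if field.get("tag") != tag:
            continue
        for subfield in field.iter(MARC_NS + "subfield"):
            if subfield.get("code") == code:
                values.append(subfield.text)
    return values
```

A module that honours CFG_BIBUPLOAD_USE_BIBXXX could call such a helper whenever the flag is False and keep its existing bibxxx queries otherwise, which would allow the propagation to proceed one module at a time.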