Opened 3 years ago

Last modified 11 months ago

#671 in_review enhancement

BibUpload: optional use of bibxxx tables

Reported by: simko Owned by: simko
Priority: major Milestone: v1.2
Component: BibUpload Version:
Keywords: Cc:

Description

During bibupload, the incoming record is broken up by MARC tag
into many bibxxx tables (bib10x, bib11x, etc.), which results in
many SQL queries being issued by bibupload. The advantage of doing so is
that end users can then simply search in any MARC
tag. The disadvantage is that the uploading step takes time,
and that we are preparing indexes that may never be used by
the end users at all. (They typically search in logical field
indexes, say firstauthor:ellis, not in physical MARC tags, say
100__a:/ellis/.)

In certain situations, it would be better not to create these indexes
at upload time, but to defer handling them to indexing time.
(Especially when using an external indexer such as Solr for the record
metadata.)

For this, it would be good to introduce a new configuration option
called say CFG_BIBUPLOAD_USE_BIBXXX that would be True by default
but that could optionally be set to False on a per-site basis. When
set to False, the stage 4 of bibupload (=filling of bibxxx tables)
would not be executed.
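
A minimal sketch of how such a switch could gate stage 4. Only the option name CFG_BIBUPLOAD_USE_BIBXXX and the fact that stage 4 fills the bibxxx tables come from this ticket; the run_stages helper and the placeholder stage list are hypothetical and do not reflect the actual bibupload code.

```python
# Proposed option from this ticket; True keeps the current behaviour.
CFG_BIBUPLOAD_USE_BIBXXX = True

# Stage number that fills the bibxxx tables, as stated in the ticket.
BIBXXX_STAGE = 4


def run_stages(stages, use_bibxxx=CFG_BIBUPLOAD_USE_BIBXXX):
    """Run (number, callable) pairs in order and return the numbers executed.

    Hypothetical helper: when use_bibxxx is False, the stage that fills
    the bibxxx tables is skipped and every other stage runs unchanged.
    """
    executed = []
    for number, action in stages:
        if number == BIBXXX_STAGE and not use_bibxxx:
            continue
        action()
        executed.append(number)
    return executed


# Placeholder stages that do nothing; the real stages are not modelled here.
placeholder_stages = [(number, lambda: None) for number in range(1, 6)]
```

Calling run_stages(placeholder_stages, use_bibxxx=False) runs every placeholder stage except number 4, while the default call runs all of them.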

This would result in bibupload speed-ups that can be illustrated by
the following example taken from an INSPIRE-sized database (1M
records):

  • example record CERN-TH-6002-91 from INSPIRE TEST (record ID 315385)
  • timings to replace it, stage 4 enabled:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006    4.112    4.112 bibupload.py:162(bibupload)
      256    0.003    0.000    4.095    0.016 dbquery.py:141(run_sql)
        1    0.001    0.001    2.632    2.632 bibupload.py:1550(update_database_with_metadata)
      109    0.001    0.000    2.605    0.024 bibupload.py:822(insert_record_bibxxx)
        1    0.000    0.000    1.255    1.255 bibupload.py:1780(delete_bibrec_bibxxx)
  • timings to replace it, stage 4 disabled:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.020    0.020 bibupload.py:162(bibupload)
       37    0.001    0.000    0.017    0.000 dbquery.py:141(run_sql)
        1    0.000    0.000    0.006    0.006 bibupload.py:1780(delete_bibrec_bibxxx)

As can be seen, the upload is roughly 200 times faster (4.112 s
versus 0.020 s cumulative time), since we are not pre-creating those
huge and possibly non-useful bibxxx indexes.
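
To put a number on the speed-up, the figures from the two profiles can be compared directly. The sketch below only does arithmetic on values copied from the listings above; the variable and function names are made up for illustration.

```python
# Cumulative times (cumtime column, in seconds) for
# bibupload.py:162(bibupload), copied from the two profiles above.
CUMTIME_STAGE4_ENABLED = 4.112
CUMTIME_STAGE4_DISABLED = 0.020

# Number of run_sql calls (ncalls column) in each profile.
RUN_SQL_CALLS_ENABLED = 256
RUN_SQL_CALLS_DISABLED = 37


def speedup(before, after):
    """Return how many times faster `after` is compared to `before`."""
    return before / after


upload_speedup = speedup(CUMTIME_STAGE4_ENABLED, CUMTIME_STAGE4_DISABLED)
saved_queries = RUN_SQL_CALLS_ENABLED - RUN_SQL_CALLS_DISABLED

print("speed-up factor: about %.0f" % upload_speedup)
print("run_sql calls avoided: %d" % saved_queries)
```

This is a single record on a single test system, so the ratio is an illustration and not a benchmark.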

Important note: while it is simple to introduce such a
CFG_BIBUPLOAD_USE_BIBXXX variable for record uploading processes,
this variable should be propagated to other Invenio modules such as
searcher/indexer that should read record metadata from pre-stored
MARCXML formats (see table bibfmt) rather than from bibxxx tables.
When bibxxx tables are not in use, other Invenio modules can no
longer rely on their existence. So this
task is really bigger than it may seem. The setting of
CFG_BIBUPLOAD_USE_BIBXXX should therefore be progressively
propagated to all the Invenio modules that take the existence of
bibxxx for granted, starting with the most important modules
(indexer, searcher, editor, check for deleted records, etc).
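
As a rough illustration of what reading metadata from stored MARCXML could look like for such a module, here is a sketch that pulls subfield values out of a MARCXML string using only the Python standard library. The function name and the sample record are invented for this example; how the MARCXML is fetched from bibfmt is left out, and this is not existing Invenio code.

```python
import xml.etree.ElementTree as ET

# Made-up sample record, only for illustration.
SAMPLE_MARCXML = """
<record>
  <controlfield tag="001">1</controlfield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Ellis, J</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">ARTICLE</subfield>
  </datafield>
</record>
"""


def _local_name(element):
    """Return the element name without any XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def get_subfield_values(marcxml, tag, ind1, ind2, code):
    """Return all values of the given subfield, e.g. 100__a.

    Blank indicators are passed as a single space.
    """
    values = []
    for field in ET.fromstring(marcxml).iter():
        if _local_name(field) != "datafield":
            continue
        indicators = (field.get("ind1", " "), field.get("ind2", " "))
        if field.get("tag") != tag or indicators != (ind1, ind2):
            continue
        for subfield in field:
            if subfield.get("code") == code and subfield.text:
                values.append(subfield.text)
    return values
```

For example, get_subfield_values(SAMPLE_MARCXML, "100", " ", " ", "a") returns the author value from the sample without touching any bibxxx table.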

Change History (7)

comment:1 Changed 3 years ago by simko

I forgot to add a possibly obvious point: in order to propagate the elimination of bibxxx tables to other Invenio modules faster, we can keep some of the most important ones (bib03x, bib97x, bib98x), so that the filtering of incoming records and the handling of deleted records, collections and so on would not require any codebase change and could stay as it is now. (So we would keep "small and useful" bibxxx tables, while we would eliminate only "big and not-so-useful" bibxxx tables, so to speak.)
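
To make the table names concrete, here is a small sketch of the tag-to-table mapping implied by the names used in this ticket (bib10x, bib03x, bib97x, bib98x): the first two digits of the MARC tag select the table. The helper name and the KEPT_TABLES constant are hypothetical.

```python
# Tables suggested above as worth keeping even in a "light" setup.
KEPT_TABLES = frozenset(["bib03x", "bib97x", "bib98x"])


def bibxxx_table_for_tag(tag):
    """Return the bibxxx table name for a MARC tag such as '035' or '100__a'."""
    prefix = tag[:2]
    if len(prefix) != 2 or not prefix.isdigit():
        raise ValueError("not a MARC tag: %r" % (tag,))
    return "bib%sx" % prefix


def is_kept(tag):
    """Return True if the tag lives in one of the tables that would be kept."""
    return bibxxx_table_for_tag(tag) in KEPT_TABLES
```

With this mapping, is_kept("980__a") is true while is_kept("100__a") is false, which matches the "small and useful" versus "big and not-so-useful" split.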

comment:2 Changed 3 years ago by bthiell

  • Status changed from new to assigned

As discussed in the videoconf we had today, I implemented a light bibupload. I made the implementation more configurable than what is described here as we might still need to populate some tables for bibupload to run correctly. For example it might be a good idea to keep populating the tables that contain CFG_BIBUPLOAD_EXTERNAL_SYSNO_TAG and CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG so that bibupload can decide if a record is going to be overwritten.

The implementation relies on a new configuration variable which I called CFG_BIBUPLOAD_BIBXXX_TAGS and that accepts a comma-separated list of MARC tags. If left empty (default) then all tags will be stored and Invenio will run as normal.

Commit on Github

comment:3 follow-up: Changed 3 years ago by bthiell

  • Status changed from assigned to in_merge

First version of this is available on my Github: https://github.com/badzil/Invenio/tree/light_bibupload

The configuration variable CFG_BIBUPLOAD_BIBXXX_TAGS is a comma-separated list of tags which are handled at upload time. It is recommended to keep storing 035, 037, 970 and 980 to allow bibupload and webcoll to run correctly. Depending on the collections' dbqueries, other tags might be necessary. If this variable is left empty, then Invenio will run normally, i.e. store all tags. This should remain the default behavior.
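
The semantics described above can be sketched as follows. This is an illustration and not the code from the branch: the two helper functions are hypothetical, and only the variable name, the comma-separated format and the "empty means store everything" rule come from this comment.

```python
# Example value using the tags recommended above; the default is empty.
CFG_BIBUPLOAD_BIBXXX_TAGS = "035,037,970,980"


def parse_bibxxx_tags(value):
    """Turn a comma-separated string of MARC tags into a set of tags."""
    return frozenset(tag.strip() for tag in value.split(",") if tag.strip())


def store_at_upload_time(field, configured_tags):
    """Decide whether a field such as '980__a' is stored during upload.

    An empty configuration means every tag is stored, i.e. regular Invenio.
    """
    if not configured_tags:
        return True
    return field[:3] in configured_tags


configured = parse_bibxxx_tags(CFG_BIBUPLOAD_BIBXXX_TAGS)
```

With the example value, store_at_upload_time("980__a", configured) is true and store_at_upload_time("100__a", configured) is false; with an empty string both are true.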

At index time, bibindex first populates the bibxxx tables and then continues with its regular business. I've added my code to bibindex.bibindex_bibxxx_manager and only a call to this in bibindex.bibindex_engine.

I tested the regular Invenio and the fast upload and the results are consistent:

  • regular upload took 16 minutes.
  • fast upload took 4 minutes and populating the bibxxx tables took 12 minutes.

Same total time but the shorter upload time allows us to move quicker with the initial upload while indexing our metadata in Solr.

Comments are welcome.

comment:4 in reply to: ↑ 3 Changed 3 years ago by skaplun

Hi Benoit,

Replying to bthiell:

It is recommended to keep storing 035, 037, 970 and 980 to allow bibupload and webcoll to run correctly.

I think you should also add the OAI-related fields (those depend on invenio.conf). By default these are 909COo and 909COp, and I would like to propose (in a future branch to come) 909COq.

Cheers!

Sam

comment:5 Changed 3 years ago by bthiell

For now the default is to store everything at upload time: see https://github.com/badzil/Invenio/blob/light_bibupload/config/invenio.conf#L1173.

And by the way, I've limited the tag description to the first 3 digits (i.e. 909 and not 909COo) as in my opinion it doesn't really make sense to populate the bibxxx tables both at upload and index time.

Cheers.

comment:6 Changed 15 months ago by skaplun

  • Milestone set to v1.2

I think it would be cool to have this one in 1.2

comment:7 Changed 11 months ago by simko

  • Owner changed from bthiell to simko
  • Status changed from in_merge to in_review