Opened 2 years ago

Closed 2 years ago

Last modified 21 months ago

#856 closed enhancement (fixed)

BibSched: tasks not halting the queue on failure

Reported by: jlavik
Owned by: Alessio Deiana <alessio.deiana@…>
Priority: critical
Milestone:
Component: BibSched
Version:
Keywords:
Cc: alessio.deiana@…

Description

We all know that BibTasks sometimes fail, be it a dbdump failing or an oaiharvest timing out, causing the queue to exit automatic mode until human operators "arrive on the scene". This often happens in the middle of the night or at other times when human operators are far away from "the scene". For some production systems, having the BibSched queue halt for several hours can seriously harm the flow of execution and the service, even in the middle of the night, because of nightly tasks such as harvesting. Some of these failures are more harmless than others, but no matter the cause, the queue stops.

Now there are two ways of attacking this problem, besides having human operators more readily available. One way could be to add a configurable option to scheduled BibTasks to not stop the queue on failures. For example, a dbdump task failing can be a serious matter in itself, but it does not harm the running service per se. Of course, operators should still be made aware of the issue via the normal channels, but the queue should move on as usual.

A second, possibly additional, option would be to look into all the different BibTasks, define more precisely which errors are more significant than others, and have the less significant errors fail "silently", without stopping the queue.

Whatever the option, it should be easily configurable per instance which tasks can or cannot cause the BibSched queue to halt.

Attachments (1)

0001-BibSched-lets-each-task-decide-if-stopping-the-queue.patch (9.7 KB) - added by adeiana 2 years ago.
Per-task handling of the queue


Change History (20)

comment:1 Changed 2 years ago by jcaffaro

To be partially implemented for oaiharvest task. See #853.

comment:2 Changed 2 years ago by skaplun

  • Priority changed from major to critical

comment:3 Changed 2 years ago by adeiana

I reworked the patch in order to be able to specify a list of modules that will not stop the queue. Here is what I came up with:
http://invenio-software.org/repo/personal/invenio-adeiana/commit/?id=c0926ac254517707d2a5d34972fa4af532b237ce

comment:4 Changed 2 years ago by skaplun

  • Status changed from new to in_merge

I had a look at your branch and it nicely implements what we talked about on the mailing list.
I think it can be merged into master.

P.S. You might want to change the log message by adding a (closes #TICKETNUM) statement to the commit log, to automatically close this ticket when it is merged.

Cheers!

Sam

comment:5 Changed 2 years ago by simko

  • Status changed from in_merge to assigned

This feature is obviously very good to have, but instead of treating
errors coming from some tasks as blocking the queue, and errors
coming from other tasks as non-blocking, I think it is
better to look at this problem not from a task-specific point of view,
but rather from an error-specific point of view. This is because
certain types of errors may need queue blockage and may occur for any
bibtask, say when the DB is down.

Borrowing a terminology from the Lisp world, the errors can be roughly
classified into two types: "fatal errors" that would stop the queue,
and "continuable errors" that would not stop the queue, for other
fellow tasks.

Say that BibIndex cannot index certain records due to a UTF-8 bug. It
would then emit a continuable error (CERROR), which will not prevent other
waiting tasks such as BibRank from being launched by the BibSched daemon.
But when BibIndex is awakened next time, it should refuse to run anew,
because the last time it ended up in the CERROR state.

Say that BibIndex cannot index certain records because DB is down or
disk is full. Thus it would emit fatal error (ERROR) that should
cause the queue to stop. There should be no need for the BibSched
daemon to wake up also BibRank and other waiting tasks, only to
discover that they crash in their turns.

Thus, if a continuable error occurs, only tasks of the same nature
would refuse to continue, while others can go. If a fatal error
occurs, everything stops.
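
The behaviour described above can be sketched as a toy queue. This is only an illustration under stated assumptions: `run_queue`, the `SKIPPED` and `DONE` markers, and the use of plain callables as tasks are hypothetical and do not come from the BibSched code base; only the ERROR/CERROR distinction is taken from this discussion.

```python
# Hypothetical sketch: none of these names come from the BibSched code base.
FATAL = "ERROR"          # fatal error: stops the whole queue
CONTINUABLE = "CERROR"   # continuable error: blocks only tasks of the same kind
DONE = "DONE"


def run_queue(tasks, blocked=None):
    """Run (name, callable) pairs in order; return (log, blocked task names).

    Each callable returns DONE, CONTINUABLE or FATAL.
    """
    blocked = set(blocked or ())
    log = []
    for name, run in tasks:
        if name in blocked:
            # An earlier run of this task type ended in CERROR: refuse to rerun.
            log.append((name, "SKIPPED"))
            continue
        status = run()
        log.append((name, status))
        if status == FATAL:
            break  # everything stops until an operator intervenes
        if status == CONTINUABLE:
            blocked.add(name)  # other task types keep going
    return log, blocked


if __name__ == "__main__":
    queue = [
        ("bibindex", lambda: CONTINUABLE),  # e.g. a UTF-8 problem in a record
        ("bibrank", lambda: DONE),
        ("bibindex", lambda: DONE),         # skipped: bibindex is in CERROR
        ("webcoll", lambda: FATAL),         # e.g. the DB is down
        ("bibrank", lambda: DONE),          # never reached
    ]
    for entry in run_queue(queue)[0]:
        print(entry)
```

In this sketch a CERROR only blocks later tasks with the same name, while an ERROR ends the loop for every task.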

Applying this point of view to the current code base, we would need
to introduce a new continuable error type CERROR, and our current
ERROR would stay a fatal error, but most of our bibtasks could be
changed to emit mostly CERRORs almost everywhere, except in
places such as BibUpload and friends. So, instead of white-listing
certain tasks for continuable errors, such as refextract, as this patch
does, I think we could be more aggressive and start changing ERROR
into CERROR for most tasks, kind of like black-listing
certain error situations while white-listing most others. We can
progressively change all the tasks to distinguish between emitting
ERROR and CERROR, as time permits. My main concern here is
that we should start by treating these cases on a per-error basis,
not on a per-task basis.

Please tell me what you think.

comment:6 Changed 2 years ago by jlavik

This sounds like the ideal approach. As experienced recently, a full disk will make most tasks puke errors, and especially in the case of BibUpload the error types differ a lot, with harmless ones being the usual case.

comment:7 follow-up: ↓ 8 Changed 2 years ago by adeiana

Introducing recoverable errors in the tasks themselves is what I was thinking of too as the next step.
However I see it as complementary, this is what I would like:
bibindex could throw a CERROR on utf-8 problems but an unexpected error would stop the queue
refextract should work the other way around where if it detects a database error it should throw an ERROR but throw a CERROR on an unexpected error

To achieve it, I see two ways:

  • We can add a new parameter when creating a task that defines what kind of errors are thrown by default. refextract would set CERROR by default. bibindex would set ERROR by default
  • Or we can handle it with that patch above and handle the behavior in bibsched by introducing yet another error type to force stopping the queue (that could be used by refextract on a db error).

comment:8 in reply to: ↑ 7 Changed 2 years ago by simko

Replying to adeiana:

However I see it as complementary, this is what I would like:
bibindex could throw a CERROR on utf-8 problems but an unexpected error would stop the queue
refextract should work the other way around where if it detects a database error it should throw an ERROR but throw a CERROR on an unexpected error

It may be the task itself that catches expected and/or unexpected errors and raises the proper error up to bibsched accordingly. E.g. to exit bibindex, we currently use:

task_update_status("ERROR")
self.put_into_db()
sys.exit(1)

while we could rather switch to using task_update_status("CERROR") there everywhere.

In order to catch "unexpected" errors differently by different tasks, it could be the task itself that would have an all-encompassing try/except around its run() function, so it would raise either global ERROR or global CERROR in its except clause, or something similar. Although it would be better to specify each "unexpected" exception explicitly, of course.
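
A minimal sketch of such an all-encompassing try/except, assuming a hypothetical helper `run_guarded` that returns the final status as a string instead of calling `task_update_status()` and `sys.exit()` as a real task would:

```python
# Hypothetical helper; a real task would report the status to bibsched
# instead of returning it.
def run_guarded(run, continuable=(), fatal=(), unexpected="ERROR"):
    """Call run() and map any exception to a final task status.

    continuable / fatal: tuples of exception classes listed explicitly.
    unexpected: status reported for anything not listed.
    """
    try:
        run()
    except fatal:
        return "ERROR"
    except continuable:
        return "CERROR"
    except Exception:
        return unexpected
    return "DONE"


def index_broken_record():
    # Stand-in for an indexing step hitting an invalid UTF-8 byte sequence.
    b"\xff".decode("utf-8")


if __name__ == "__main__":
    print(run_guarded(index_broken_record, continuable=(UnicodeError,)))
```

Listing the expected exceptions explicitly in `continuable` and `fatal` keeps the catch-all `except Exception` as a last resort only.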

(In any case, we should not need to introduce any per-task specific CFG_BIBSCHED_CONTINUABLE_TASKS, once we rely on two different ERROR/CERROR types, I think.)

comment:9 Changed 2 years ago by adeiana

  • Cc alessio.deiana@… added

comment:10 Changed 2 years ago by Jerome Caffaro <jerome.caffaro@…>

In [c79800b85f5a7821d333453baa9b9f799ffbd080]:

BibHarvest: better handling of timeouts

  • When OAI source times out, keep trying a few times. (closes #853)
  • Adds CFG_OAI_FAILED_HARVESTING_STOP_QUEUE variable to let admin configure if failed harvesting should stop BibSched queue in cases where the execution can be fully recovered at next run of the task. CFG_OAI_FAILED_HARVESTING_EMAILS_ADMIN also lets admin configure if an email should be sent in these cases. (addresses #856)
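
The retry-then-decide behaviour described in this commit message might be sketched as follows. Only the variable name CFG_OAI_FAILED_HARVESTING_STOP_QUEUE comes from the commit message; its boolean type, the `harvest_with_retries` helper, the retry count, and the mapping onto ERROR/CERROR statuses are assumptions made for illustration.

```python
import time

# Name taken from the commit message; the boolean type is an assumption.
CFG_OAI_FAILED_HARVESTING_STOP_QUEUE = False


def harvest_with_retries(fetch, attempts=3, delay=0.0,
                         stop_queue=CFG_OAI_FAILED_HARVESTING_STOP_QUEUE):
    """Call fetch() up to `attempts` times, retrying on TimeoutError.

    Returns (status, result). Hypothetical helper, not the BibHarvest code.
    """
    for attempt in range(attempts):
        try:
            return "DONE", fetch()
        except TimeoutError:
            if attempt < attempts - 1:
                time.sleep(delay)  # wait before the next try
    # All attempts timed out: let the configuration decide the severity.
    return ("ERROR" if stop_queue else "CERROR"), None


if __name__ == "__main__":
    def always_times_out():
        raise TimeoutError("OAI source did not answer")

    print(harvest_with_retries(always_times_out))
```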

comment:11 Changed 2 years ago by adeiana

I attached a patch that introduces CERROR, allowing us to choose per task whether we want to stop the queue.

  • I had to keep CERRORs out of bibupload because we want all bibuploads to execute in order.
  • This patch does little to change each module error handling as I am not familiar with all the modules.

Changed 2 years ago by adeiana

Per-task handling of the queue

comment:12 Changed 2 years ago by skaplun

  • Status changed from assigned to infoneeded

At sam/bibsched-nostop there is a squashed version of this branch on top of master.

To me, the failure of a task as a whole could fall into either CERROR or ERROR status depending on two things:

  1. if the task is introducing changes that cannot be replayed from the current status of the system (i.e. BibUpload). From this point of view, bibreformat, webcoll, bibindex, bibclassify, refextract, oai_repository_update etc. can all recover from the current state of the system (e.g. at worst bibindex can be re-indexed).
  2. if the admin decided that a given bibtask is a requirement for the QoS of his repository (say oaiharvest should never fail!)

For that reason I would propose that:

  1. by default only bibupload finishes with an ERROR, and every other task with a CERROR.
  2. a new CLI argument is introduced to bibtask in order to let the admin specify whether a given task is critical or not (e.g. --stop-queue-on-error). Such a flag should also be understood by task_low_level_submission.
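
The two proposals above can be sketched with `argparse`. The flag name `--stop-queue-on-error` is the example given in this comment; `default_error_status`, `parse_args`, and the `bibtask-sketch` program name are hypothetical and not taken from the bibtask code.

```python
import argparse


def default_error_status(task_name, stop_queue_on_error=False):
    """Return the status a failing task should end with.

    Only bibupload is fatal by default; the admin can mark any other
    task as critical with the flag.
    """
    if stop_queue_on_error or task_name == "bibupload":
        return "ERROR"
    return "CERROR"


def parse_args(argv):
    """Parse a task name plus the proposed flag (hypothetical CLI)."""
    parser = argparse.ArgumentParser(prog="bibtask-sketch")
    parser.add_argument("task", help="name of the bibtask, e.g. bibindex")
    parser.add_argument("--stop-queue-on-error", action="store_true",
                        help="treat any failure of this task as fatal")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["oaiharvest", "--stop-queue-on-error"])
    print(default_error_status(args.task, args.stop_queue_on_error))
```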

comment:13 Changed 2 years ago by skaplun

  • Status changed from infoneeded to assigned

Ok this is ready to be merged in the branch sam/bibsched-nostop

comment:14 Changed 2 years ago by skaplun

  • Status changed from assigned to in_merge

comment:15 Changed 2 years ago by Alessio Deiana <alessio.deiana@…>

  • Owner set to Alessio Deiana <alessio.deiana@…>
  • Resolution set to fixed
  • Status changed from in_merge to closed

In [d74fa91f3f3d679300cdf27784b38ce333d07cc1]:

BibSched: new continuable error status for tasks

  • New --stop-on-error/--continue-on-error CLI parameter for bibtasks.
  • Handles CERROR and ERROR. CERROR is a continuable error that does not stop the queue. ERROR is a fatal error that stops the queue.
  • BibUpload will decide whether to issue an ERROR or a CERROR. (closes #856)
  • Always send emergency notification to CFG_SITE_ADMIN_EMAIL.

Co-authored-by: Samuele Kaplun <samuele.kaplun@…>

comment:16 Changed 21 months ago by Jerome Caffaro <jerome.caffaro@…>

In c79800b85f5a7821d333453baa9b9f799ffbd080:

BibHarvest: better handling of timeouts

  • When OAI source times out, keep trying a few times. (closes #853)
  • Adds CFG_OAI_FAILED_HARVESTING_STOP_QUEUE variable to let admin configure if failed harvesting should stop BibSched queue in cases where the execution can be fully recovered at next run of the task. CFG_OAI_FAILED_HARVESTING_EMAILS_ADMIN also lets admin configure if an email should be sent in these cases. (addresses #856)

comment:18 Changed 21 months ago by Alessio Deiana <alessio.deiana@…>

In d74fa91f3f3d679300cdf27784b38ce333d07cc1:

BibSched: new continuable error status for tasks

  • New --stop-on-error/--continue-on-error CLI parameter for bibtasks.
  • Handles CERROR and ERROR. CERROR is a continuable error that does not stop the queue. ERROR is a fatal error that stops the queue.
  • BibUpload will decide whether to issue an ERROR or a CERROR. (closes #856)
  • Always send emergency notification to CFG_SITE_ADMIN_EMAIL.

Co-authored-by: Samuele Kaplun <samuele.kaplun@…>
