Opened 2 years ago

Closed 3 months ago

#991 closed defect (fixed)

Allow tasks to execute on the host set in "host" field

Reported by: adeiana
Owned by: skaplun
Priority: major
Component: BibSched

Description

Before running a task, we bind it to one host. Previously, however, when trying to run it on that same host, the task was also excluded.

Attachments (2)

Change History (17)

comment:1 Changed 2 years ago by adeiana

  • Status changed from new to in_merge

comment:2 Changed 2 years ago by skaplun

Hi Alessio,

Can you detail this ticket a bit? What exactly is it trying to solve? What do you mean by "previously"?

Cheers!

Sam


comment:3 Changed 2 years ago by skaplun

  • Status changed from in_merge to infoneeded

comment:4 Changed 2 years ago by adeiana

  • Status changed from infoneeded to assigned

self.tie_task_to_host() returns False when the host field is not empty.
As a result, at line 1071, where it is called, we never enter the code path where
we use os.system(COMMAND). Thus the command is never run when host is not
empty (the local hostname should be a valid value).

comment:5 Changed 2 years ago by skaplun

  • Status changed from assigned to infoneeded
  • What version of Invenio do you refer to? (since bibsched is being highly modified). Do you refer to latest master?
  • I think what you report as a bug is instead a feature. tie_task_to_host will explicitly check that the given task hasn't already been tied to a host, and is hence free to be scheduled on the current host. Otherwise, if the host variable was already set to a given value, this means that the task has already been tied, and we should hence avoid executing it twice (either on the current host, if host has the value of the current host, or on a different host).

Why do you think the command is never run?
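
The check described in the second point can be sketched as a conditional update: the task is claimed only if its host column is still empty. This is a minimal illustration under stated assumptions, not the bibsched implementation: it uses an in-memory SQLite table as a stand-in for schTASK, and the names claim_task and make_demo_queue, as well as the host name "host-a", are hypothetical.

```python
import sqlite3


def claim_task(conn, task_id, hostname):
    """Tie a task to hostname only if no host has claimed it yet.

    Hypothetical helper: returns True when this call claimed the task,
    False when the host column was already set (to any host).
    """
    cur = conn.execute(
        "UPDATE schTASK SET host = ? WHERE id = ? AND host = ''",
        (hostname, task_id),
    )
    conn.commit()
    # rowcount is 1 only if the row existed and its host was still empty.
    return cur.rowcount == 1


def make_demo_queue():
    """Build an in-memory stand-in for schTASK with two waiting tasks."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE schTASK (id INTEGER PRIMARY KEY, proc TEXT, host TEXT)"
    )
    conn.executemany(
        "INSERT INTO schTASK (id, proc, host) VALUES (?, ?, ?)",
        [(1, "bibindex", ""), (2, "bibrank", "host-a")],
    )
    conn.commit()
    return conn
```

With this sketch, the first claim of task 1 succeeds and any later claim fails. Task 2, whose host column is already set, is never claimed again, even when the caller is "host-a" itself, which matches the situation reported in this ticket.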

comment:6 Changed 2 years ago by adeiana

  • Status changed from infoneeded to assigned

To be more explicit, here is the current state of my queue as stored in schTASK:

+-----+-------------------+---------------------+---------+----------+-----------+------------+
| id  | proc              | runtime             | status  | priority | host      | sequenceid |
+-----+-------------------+---------------------+---------+----------+-----------+------------+
| 476 | bibauthorid       | 2012-04-02 18:48:24 | WAITING |        0 | aso.local |       NULL |
| 475 | inveniogc         | 2012-04-02 18:48:30 | WAITING |        0 |           |       NULL |
| 486 | bibreformat       | 2012-04-02 18:48:30 | WAITING |        0 |           |       NULL |
| 468 | bibindex          | 2012-04-02 18:48:36 | WAITING |        0 |           |       NULL |
| 471 | bibindex:fulltext | 2012-04-02 18:48:40 | WAITING |        0 |           |       NULL |
| 469 | webcoll           | 2012-04-02 18:48:54 | WAITING |        0 |           |       NULL |
| 472 | bibindex:author   | 2012-04-02 18:48:56 | WAITING |        0 |           |       NULL |
| 470 | bibindex:global   | 2012-04-02 18:49:00 | WAITING |        0 |           |       NULL |
| 473 | bibrank           | 2012-04-02 18:49:06 | WAITING |        0 |           |       NULL |
| 500 | bibreformat       | 2012-04-02 21:30:28 | WAITING |        0 |           |       NULL |
+-----+-------------------+---------------------+---------+----------+-----------+------------+

This first task does not run.

Yes this problem is present on current master.

comment:7 Changed 2 years ago by adeiana

After your explanation, I understand that this state can only be reached after triggering issue #943.

comment:8 Changed 2 years ago by skaplun

I see!

Indeed, I was about to write to you that this state should never be reached :-) I guess that by catching the SystemExit exception (as happens in the other ticket), this situation shouldn't be reachable anymore, since tasks will cleanly set the host value back to an empty string.

So we can close this ticket as invalid, I guess :-)

comment:9 Changed 2 years ago by adeiana

I am afraid we still need to address this.

Actually, I encountered this problem again with an invalid interpreter for bibauthorid.
It seems like there are many ways a task can fail.
I am attaching a different way to tackle the problem.

comment:10 Changed 2 years ago by skaplun

Hi Alessio,

unfortunately, you can't really have bibsched declare the task as being in ERROR state, because most of the time it's simply that the task is taking too long to start (e.g. because it creates an enormous data structure upon start, as happens with citation dictionaries).

Maybe we can still go in this direction (i.e. to have bibsched decide the task has failed, rather than the task), by first inspecting the task pid file and trying to ping the task with a UNIX signal. If no pid file exists, or if the number in the pid file does not correspond to a live task, then indeed bibsched can declare the task as ERROR as you propose. This can be implemented by exploiting the already existing get_task_pid function.
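
The pid-file check proposed here can be sketched as follows. This is a minimal illustration assuming a UNIX host and a pid file that contains a single integer. The helper names read_pid, is_process_alive and task_looks_dead are hypothetical and are not the existing get_task_pid function. Sending signal 0 with os.kill performs the existence and permission checks without delivering a signal.

```python
import errno
import os


def read_pid(pid_path):
    """Return the pid stored in pid_path, or None if it is missing or invalid."""
    try:
        with open(pid_path) as pid_file:
            return int(pid_file.read().strip())
    except (IOError, OSError, ValueError):
        return None


def is_process_alive(pid):
    """Ping a process with signal 0 (UNIX only); nothing is delivered to it."""
    try:
        os.kill(pid, 0)
    except OSError as err:
        # EPERM: the process exists but belongs to another user.
        # ESRCH (or anything else): treat the process as gone.
        return err.errno == errno.EPERM
    return True


def task_looks_dead(pid_path):
    """True if there is no usable pid file or the recorded pid is not alive."""
    pid = read_pid(pid_path)
    return pid is None or not is_process_alive(pid)
```

A scheduler could call task_looks_dead on the pid file of a task stuck in SCHEDULED before deciding to mark it as ERROR. A slow-starting task that is still alive would not be flagged.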

comment:11 Changed 2 years ago by adeiana

Let's take this further,

I see multiple reasons, not dependent on our code, that would make a task take long to start.
In particular, an unresponsive AFS leaving the processes in D state for several seconds.
Or there is a problem on startup: we never reach the startup and enter some deadlock.

We have the same exact problem with RUNNING tasks.
If they were killed by the OOM killer, they remain in RUNNING status. I have already had this happen too.

Depending on our need to differentiate a starting task from a running task.
If needed, we can add a STARTING status; if not, we mark the task as RUNNING.
All tasks in STARTING status can behave like RUNNING ones; as a result, the queue will enter the same deadlock we currently have.

In a separate ticket we handle pinging tasks regularly to check that they are not dead.

Viable solution?

comment:12 in reply to: ↑ 11 Changed 2 years ago by skaplun

Replying to adeiana:

I see multiple reasons, not dependent on our code, that would make a task take long to start.
In particular, an unresponsive AFS leaving the processes in D state for several seconds.
Or there is a problem on startup: we never reach the startup and enter some deadlock.

Yep!

We have the same exact problem with RUNNING tasks.
If they were killed by the OOM killer, they remain in RUNNING status. I have already had this happen too.

Depending on our need to differentiate a starting task from a running task.

This is currently already the case: right before executing a task, bibsched changes its status from WAITING to SCHEDULED. It's then the responsibility of the task to change it from SCHEDULED to RUNNING (in order to prove it's in a healthy state).

If needed, we can add a STARTING status; if not, we mark the task as RUNNING.
All tasks in STARTING status can behave like RUNNING ones; as a result, the queue will enter the same deadlock we currently have.

In a separate ticket we handle pinging tasks regularly to check that they are not dead.

Ouch, I thought we were already doing this in bibsched (and probably this was the case in old versions, but the code somehow is no longer there).

Viable solution?

Regularly pinging the running tasks, as in the above case, is definitely a viable solution. Special care must be taken, however: if the task is in status RUNNING and we fail to ping it, we have to check that, meanwhile, it has not changed its status to DONE or something else, as the task might end at the very moment we are pinging it :-)
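
The race described above can be handled by reading the status again after a failed ping, as in the following sketch. It is only an illustration: find_crashed_tasks is a hypothetical name, and get_status and is_alive are assumed callables supplied by the caller (for example, a database lookup and a pid-based ping), not existing bibsched functions.

```python
def find_crashed_tasks(task_ids, get_status, is_alive):
    """Return the ids of tasks that claim to be RUNNING but cannot be pinged.

    get_status(task_id) -> str and is_alive(task_id) -> bool are assumed
    callables provided by the caller.
    """
    crashed = []
    for task_id in task_ids:
        if get_status(task_id) != "RUNNING":
            continue  # only RUNNING tasks are expected to answer a ping
        if is_alive(task_id):
            continue  # healthy task
        # The task may have finished while we were pinging it, so read the
        # status once more before declaring it crashed.
        if get_status(task_id) == "RUNNING":
            crashed.append(task_id)
    return crashed
```

A task that moves to DONE between the first status read and the failed ping is not reported, because the second read no longer sees RUNNING.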

comment:13 Changed 2 years ago by adeiana

Indeed the SCHEDULED status is exactly that. So that is handled.
The reason I reach that deadlock status is that the SCHEDULED status is lost as soon as I re-enable the automatic queue without acknowledging that task first.
I don't think that behavior is nice. Should we change it?

comment:14 Changed 2 years ago by skaplun

  • Owner set to skaplun
  • Status changed from assigned to in_work

Indeed, this would become a leftover now that we have implemented the nice switch to ERROR (when using task_low_level_submission with wrong args), and once we implement the regular pinging of tasks to check whether they are alive.

I will prepare a patch with all the ideas we gathered in this useful ticket :-)

comment:15 Changed 3 months ago by adeiana

  • Resolution set to fixed
  • Status changed from in_work to closed

In 826e8d5068b01ff48f0a8dc11361b3ff36ff2c86/invenio:

BibSched: many improvements

  • Displays a Yes/No box to make sure you don't delete tasks by mistake.
  • In the bibsched daemon, every 50 cycles, check for local tasks that have crashed.
  • Store debug mode in database so that we can switch it on and off without restarting the bibsched daemons.
  • Fixes a bug that would mark a task as crashed because the pid of that task no longer existed, when in fact the task had completed properly.
  • Press B to load bibsched.log in your pager.
  • Fixes the task options panel when a task has a very long list of arguments. (closes #1177)
  • Tasks in about to stop are not going to sleep as soon as possible anymore. This proved annoying because instead of stopping they would just wait in SLEEPING and you would have to wake them up manually in order to make them stop.
  • Adds a help panel in the bibsched monitor accessed via the "h" keystroke.
  • Limits the progress column char length to match the database schema
  • Adds the username of person doing the action in bibsched when running a task manually or editing the motd
  • Adds --host, which allows forcing the execution of a task on a certain host (closes #991)
  • Prevents non-concurrent tasks from waking up too early and preventing higher-priority tasks from running.
  • Flushes logs after writing each message. This can be useful when using a filesystem that buffers your writes, like AFS, and you want to check the logs from a different server than the one the task is running on.
  • Confirmation dialog before deleting periodic tasks
  • Bind signal USR2 to starting foo remote console to debug running bibtasks.
  • If you ask a task to stop, (status is set to "ABOUT TO STOP") and then you lower the priority of the task, say to -11, the scheduler changes the status of the task to "ABOUT TO SLEEP", ignoring the previous status.
  • When --fixed-time is set and a task is postponed, we use the regular sleeptime (to respect the fixed time) instead of running as soon as possible (in this case, at the beginning of the allowed times given by --limit). E.g. a task is scheduled to run between Monday and Friday, sleeps 24 hours, and is supposed to run at 7am. Old behavior: on Saturday at 7am, it is postponed to run on Monday morning at midnight. New behavior: it is postponed to run on Sunday at 7am; on Sunday it is postponed to Monday at 7am.

  • Fixes a bug sleeping a monotask that needs to run instead of ourselves.
  • Adds STOPPED to displayed status in default bibsched view
  • Removes the ability to force run manually tasks via the bibsched monitor and out of their time limit (specified via -L 00:40-05:00)

Signed-off-by: Alessio Deiana <alessio.deiana@…>
Reviewed-by: Samuele Kaplun <samuele.kaplun@…>
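
The --fixed-time example from the commit message can be worked through with a small date calculation. This is a sketch of the arithmetic only, not the bibsched code: the allowed window is simplified to whole weekdays (Monday to Friday), and the names is_allowed, postpone_old and postpone_new are hypothetical.

```python
from datetime import datetime, timedelta

ALLOWED_WEEKDAYS = {0, 1, 2, 3, 4}  # Monday..Friday, as in datetime.weekday()
SATURDAY_7AM = datetime(2012, 4, 7, 7, 0)  # 2012-04-07 was a Saturday


def is_allowed(when):
    """True if the task may run on the weekday of `when`."""
    return when.weekday() in ALLOWED_WEEKDAYS


def postpone_old(when):
    """Old behavior: jump to midnight at the start of the next allowed day."""
    candidate = (when + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    while not is_allowed(candidate):
        candidate += timedelta(days=1)
    return candidate


def postpone_new(when, sleeptime):
    """New behavior with --fixed-time: postpone by one regular sleeptime."""
    return when + sleeptime
```

Starting from SATURDAY_7AM, postpone_old returns Monday at 00:00, whereas applying postpone_new twice with a 24-hour sleeptime gives Sunday at 07:00 and then Monday at 07:00, which is the first time inside the allowed window.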
