Opened 14 months ago

Last modified 14 months ago

#991 in_work defect

Allow tasks to execute on the host set in "host" field

Reported by: adeiana Owned by: skaplun
Priority: major Milestone:
Component: BibSched Version:
Keywords: Cc:

Description

Before running a task, we bind it to one host; however, previously, when trying to run it on that same host, the task was also excluded.

Attachments (2)

Change History (16)

comment:1 Changed 14 months ago by adeiana

  • Status changed from new to in_merge

comment:2 Changed 14 months ago by skaplun

Hi Alessio,

can you detail a bit this ticket? What exactly is it trying to solve? What do you mean by "previously"?

Cheers!

Sam

Last edited 14 months ago by skaplun

comment:3 Changed 14 months ago by skaplun

  • Status changed from in_merge to infoneeded

comment:4 Changed 14 months ago by adeiana

  • Status changed from infoneeded to assigned

self.tie_task_to_host() returns False when the host field is not empty.
As a result, at line 1071, where it is called, we never enter the code path where
we use os.system(COMMAND). Thus the command is never run when host is not
empty (the local hostname should be a valid value).

comment:5 Changed 14 months ago by skaplun

  • Status changed from assigned to infoneeded
  • What version of Invenio do you refer to? (since bibsched is being highly modified). Do you refer to latest master?
  • I think what you report as a bug is instead a feature. tie_task_to_host explicitly checks that the given task hasn't already been tied to a host, and is hence free to be scheduled on the current host. Otherwise, if the host variable was already set to a given value, this means that the task has already been tied, and we should hence avoid executing it twice (either on the current host, if host has the value of the current host, or on a different host).

Why do you think the command is never run?
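
The claim semantics described above can be illustrated with a minimal sketch. It is not the Invenio implementation: the function name tie_task_to_host_sketch, the reduced two-column schTASK table and the in-memory SQLite database standing in for the real one are all assumptions made for illustration.

```python
import sqlite3


def tie_task_to_host_sketch(conn, task_id, hostname):
    """Claim a task for `hostname`; return True only if it was still unclaimed.

    Hypothetical stand-in for the behaviour described in this comment,
    not the actual bibsched code.
    """
    cur = conn.execute(
        "UPDATE schTASK SET host = ? WHERE id = ? AND host = ''",
        (hostname, task_id),
    )
    conn.commit()
    # rowcount is 1 only when the row matched, i.e. host was still empty.
    return cur.rowcount == 1


# Reduced, assumed schema: only the columns needed for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schTASK (id INTEGER PRIMARY KEY, host TEXT NOT NULL DEFAULT '')"
)
conn.execute("INSERT INTO schTASK (id, host) VALUES (475, ''), (476, 'aso.local')")

first_claim = tie_task_to_host_sketch(conn, 475, "aso.local")   # host was empty
second_claim = tie_task_to_host_sketch(conn, 475, "aso.local")  # already tied
stale_claim = tie_task_to_host_sketch(conn, 476, "aso.local")   # host already set
```

Under these assumptions only the first claim succeeds. A task whose host is already set is refused, even when the value is the local hostname.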

comment:6 Changed 14 months ago by adeiana

  • Status changed from infoneeded to assigned

To be more explicit, this is the current state of my queue as stored in schTASK:

+-----+-------------------+---------------------+---------+----------+-----------+------------+
| id  | proc              | runtime             | status  | priority | host      | sequenceid |
+-----+-------------------+---------------------+---------+----------+-----------+------------+
| 476 | bibauthorid       | 2012-04-02 18:48:24 | WAITING |        0 | aso.local |       NULL |
| 475 | inveniogc         | 2012-04-02 18:48:30 | WAITING |        0 |           |       NULL |
| 486 | bibreformat       | 2012-04-02 18:48:30 | WAITING |        0 |           |       NULL |
| 468 | bibindex          | 2012-04-02 18:48:36 | WAITING |        0 |           |       NULL |
| 471 | bibindex:fulltext | 2012-04-02 18:48:40 | WAITING |        0 |           |       NULL |
| 469 | webcoll           | 2012-04-02 18:48:54 | WAITING |        0 |           |       NULL |
| 472 | bibindex:author   | 2012-04-02 18:48:56 | WAITING |        0 |           |       NULL |
| 470 | bibindex:global   | 2012-04-02 18:49:00 | WAITING |        0 |           |       NULL |
| 473 | bibrank           | 2012-04-02 18:49:06 | WAITING |        0 |           |       NULL |
| 500 | bibreformat       | 2012-04-02 21:30:28 | WAITING |        0 |           |       NULL |
+-----+-------------------+---------------------+---------+----------+-----------+------------+

The first task (id 476) does not run.

Yes, this problem is present on current master.

comment:7 Changed 14 months ago by adeiana

After your explanation I understand that this state can only be reached after triggering issue #943.

comment:8 Changed 14 months ago by skaplun

I see!

Indeed, I was about to write to you that this state should never be reached :-) I guess that by catching the SystemExit exception (as happens in the other ticket), this situation can no longer be reached, since tasks will cleanly set the host value back to the empty string.

So we can close this ticket as invalid, I guess :-)

comment:9 Changed 14 months ago by adeiana

I am afraid we still need to address this.

Actually, I encountered this problem again with an invalid interpreter for bibauthorid.
It seems like there are many ways a task can fail.
I am attaching a different way to tackle the problem.

comment:10 Changed 14 months ago by skaplun

Hi Alessio,

unfortunately, you can't really have bibsched declare the task as being in ERROR state, because most of the time the task is simply taking too long to start (e.g. because it creates enormous data structures upon start, as happens with citation dictionaries).

Maybe we can still go in this direction (i.e. to have bibsched decide the task has failed, rather than the task), by first inspecting the task pid file and trying to ping the task with a UNIX signal. If no pid file exists, or if the number in the pid file does not correspond to a live task, then indeed bibsched can declare the task as ERROR as you propose. This can be implemented by exploiting the already existing get_task_pid function.
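
A minimal sketch of such a liveness check follows, assuming a POSIX system and Python 3. The helper name pid_file_points_to_live_process and the idea of passing the pid file path directly are assumptions for illustration; the sketch does not call the existing get_task_pid function, whose signature is not shown in this ticket.

```python
import os


def pid_file_points_to_live_process(pid_path):
    """Return True if `pid_path` holds the pid of a process that still exists.

    Hypothetical helper, not part of bibsched.
    """
    try:
        with open(pid_path) as pid_file:
            pid = int(pid_file.read().strip())
    except (OSError, ValueError):
        # No pid file, unreadable file, or content that is not a number.
        return False
    if pid <= 0:
        # 0 and negative values address process groups, not a single task.
        return False
    try:
        # Signal 0 delivers nothing; it only checks existence and permissions.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

If this returns False for a task that bibsched started some time ago, bibsched could declare the task as ERROR. A True result only shows that some process with that pid exists, so it cannot rule out pid reuse.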

comment:11 follow-up: Changed 14 months ago by adeiana

Let's take this further,

I see multiple reasons, independent of our code, that would make a task slow to start.
In particular, an unresponsive AFS can leave the processes in D state for several seconds,
or there is a problem on startup, so we never get past it and enter some deadlock.

We have the same exact problem with RUNNING tasks.
If they are killed by the OOM killer, they remain in RUNNING status. I have already had this happen too.

Depending on our need to differentiate a starting task from a running task.
If needed, we can add a STARTING status; if not, we mark the task as RUNNING.
All tasks in STARTING status can behave like RUNNING ones; as a result, the queue will still enter the deadlock we currently have.

In a separate ticket we handle pinging tasks regularly to check that they are not dead.

Viable solution?

comment:12 in reply to: ↑ 11 Changed 14 months ago by skaplun

Replying to adeiana:

I see multiple reasons, independent of our code, that would make a task slow to start.
In particular, an unresponsive AFS can leave the processes in D state for several seconds,
or there is a problem on startup, so we never get past it and enter some deadlock.

Yep!

We have the same exact problem with RUNNING tasks.
If they are killed by the OOM killer, they remain in RUNNING status. I have already had this happen too.

Depending on our need to differentiate a starting task from a running task.

This is already the case: right before executing a task, bibsched changes its status from WAITING to SCHEDULED. It is then the responsibility of the task to change it from SCHEDULED to RUNNING (in order to prove it's in a healthy state).

If needed, we can add a STARTING status; if not, we mark the task as RUNNING.
All tasks in STARTING status can behave like RUNNING ones; as a result, the queue will still enter the deadlock we currently have.

In a separate ticket we handle pinging tasks regularly to check that they are not dead.

Ouch, I thought we were already doing this in bibsched (and probably this was the case in old versions, but the code somehow is no longer there).

Viable solution?

To regularly ping the running tasks as in the above case is definitely a viable solution. Special care must be taken, however: if the task is in status RUNNING and we fail to ping it, we have to check that it has not meanwhile changed its status to DONE or something else, as the task might end at the very moment we are pinging it :-)
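
One way to get that check is to make the switch to ERROR conditional on the status still being RUNNING. The following is only a sketch under stated assumptions: the names mark_error_if_still_running and task_status, the is_alive callback, the reduced schTASK table and the in-memory SQLite database are all hypothetical and stand in for whatever bibsched actually uses.

```python
import sqlite3


def mark_error_if_still_running(conn, task_id, is_alive):
    """Flag a task as ERROR only if the ping fails and it is still RUNNING."""
    if is_alive(task_id):
        return False
    # Conditional update: if the task finished while we were pinging it,
    # its status is no longer RUNNING and no row is touched.
    cur = conn.execute(
        "UPDATE schTASK SET status = 'ERROR' WHERE id = ? AND status = 'RUNNING'",
        (task_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def task_status(conn, task_id):
    """Read the current status of a task."""
    row = conn.execute(
        "SELECT status FROM schTASK WHERE id = ?", (task_id,)
    ).fetchone()
    return row[0]


# Reduced, assumed schema with two tasks that both look RUNNING.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schTASK (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO schTASK (id, status) VALUES (1, 'RUNNING'), (2, 'RUNNING')")


def ping_fails(task_id):
    # The process is gone, e.g. it was killed.
    return False


def ping_fails_because_task_just_finished(task_id):
    # Simulates the race: the task ends normally during the ping.
    conn.execute("UPDATE schTASK SET status = 'DONE' WHERE id = ?", (task_id,))
    return False


killed = mark_error_if_still_running(conn, 1, ping_fails)
raced = mark_error_if_still_running(conn, 2, ping_fails_because_task_just_finished)
```

In this sketch task 1 ends up in ERROR, while task 2 keeps the DONE status it set for itself, because the update no longer matches its row.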

comment:13 Changed 14 months ago by adeiana

Indeed the SCHEDULED status is exactly that. So that is handled.
The reason I reach that deadlock state is that the SCHEDULED status is lost as soon as I re-enable the automatic queue without acknowledging that task first.
I don't think that behavior is nice. Should we change it?

comment:14 Changed 14 months ago by skaplun

  • Owner set to skaplun
  • Status changed from assigned to in_work

Indeed, this would become a leftover now that we have implemented the nice switch to ERROR (when using task_low_level_submission with wrong args), and once we implement the regular pinging of tasks to check whether they are alive.

I will prepare a patch with all the ideas we gathered in this useful ticket :-)
