[adelie-devel] Re: abuildd design considerations

From: Max Rees <maxcrees_at_me.com>
Date: Thu, 06 Sep 2018 01:07:35 -0400

On Sep 05 09:00 PM, A. Wilcox wrote:
> On 09/05/18 16:10, Max Rees wrote:
> > I think another thing that needs fleshing out in this area is how
> > multiple build servers of the same architecture will be handled -
> > perhaps a build server can post a message so that it can "claim" a
> > task before the other same-architecture build servers get to it.
> >
> > Max
>
> My suggestion would be to have the agent daemon be 'smart enough' to know:
>
> - that a build server is available
> - that all build servers are busy
> - that a build server has become available
>
> If "busy" and "available" are MQTT messages, then I would assume that
> this would not be difficult.
>
> In this design, the agent would be the one that chooses which server to
> use for an architecture, and send the job directly to the server which
> is first available.
>
> There could likely be an operator-maintained order of servers to choose
> when all or multiple are available (sorted by CPU core count or so on).
>
> I am not in any way attached to this proposal; better ones are welcome.

As a terminology nitpick - the name "agent" is already used by
"abuildd-agentd", which is the process that runs on the build servers
and acts as the liaison between MQTT and abuild/pbuild. What you are
actually referring to would be a new, separate program. Right now, the
webhook imports the functions it needs from enqueue in order to add
jobs. This way, the webhook would have a long-lived MQTT client
connection, and in the future there would also be a standalone
command-line version of enqueue that would use the same code but with a
short-lived MQTT client connection, so that the command returns as soon
as it's done. I would like to keep the number of long-lived processes we
must maintain small. As it stands, we need to manage an agentd process
for each build server; the webhook server; the MQTT broker; the
PostgreSQL server; and more in the future (at least collection and the
status webpage, unless we choose to integrate these two more tightly
with the other components).

Perhaps this can be done purely using MQTT. We can publish retained
messages (i.e. messages that the broker stores, one per topic, and
delivers immediately to each new subscriber) to topics like
"servers/<arch>/<name>" when abuildd-agentd starts on the build server,
with a payload of e.g. number of cores and availability status (idle or
busy; generally idle on startup). The webhook or enqueue CLI subscribes
to "servers/#" (any topic under servers) on *its* startup and thus
receives the current status of every known build server. It can then
prioritize which build server to delegate the task to by publishing to
e.g. "tasks/available/<arch>/<name>/<id>" (which abuildd-agentd will
have subscribed to) for some build server that is currently idle.
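
A rough sketch of the bookkeeping this implies on the webhook / enqueue
side, with no broker involved: it only models what would be done with
the retained "servers/<arch>/<name>" messages once they arrive. The JSON
payload layout, the ServerRegistry class, and the "most cores first"
choice are assumptions made for illustration, not existing abuildd code.

```python
import json


class ServerRegistry:
    """Tracks the latest retained status seen for each build server."""

    def __init__(self):
        # (arch, name) -> {"nprocs": int, "status": "idle"|"busy"|"offline"}
        self.servers = {}

    def on_message(self, topic, payload):
        """Record one message received on "servers/<arch>/<name>"."""
        parts = topic.split("/")
        if len(parts) != 3 or parts[0] != "servers":
            return  # not a server status topic; ignore it
        _, arch, name = parts
        self.servers[(arch, name)] = json.loads(payload)

    def pick_idle(self, arch):
        """Return the name of the idle server with the most cores, or None."""
        idle = [
            (info["nprocs"], name)
            for (a, name), info in self.servers.items()
            if a == arch and info["status"] == "idle"
        ]
        if not idle:
            return None
        return max(idle)[1]


def task_topic(arch, name, task_id):
    """Topic on which a task is handed to one specific build server."""
    return f"tasks/available/{arch}/{name}/{task_id}"
```

In a real client, on_message would be fed from the MQTT library's
message callback after subscribing to "servers/#".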

When abuildd-agentd receives a new task, it will change the retained
message on "servers/<arch>/<name>" to {nprocs, busy} and begin the task.
Once it's done, it'll publish the result of the task somewhere (perhaps
"tasks/done/<id>") with a payload of return code and error message, if
applicable. Then abuildd-agentd can update "servers/<arch>/<name>" to
be idle again.
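
The busy / done / idle sequence above could look roughly like this on
the build server. It is only a sketch: it assumes a paho-mqtt-style
client object whose publish() accepts topic, payload, qos and retain,
and the JSON payloads, handle_task, and the run_build callable (standing
in for the abuild/pbuild invocation) are hypothetical.

```python
import json


def status_payload(nprocs, status):
    """Assumed payload layout for "servers/<arch>/<name>"."""
    return json.dumps({"nprocs": nprocs, "status": status})


def handle_task(client, arch, name, nprocs, task_id, run_build):
    """Mark the server busy, run one task, report it, mark the server idle."""
    status_topic = f"servers/{arch}/{name}"
    # Retained, so late subscribers still see that this server is busy.
    client.publish(status_topic, status_payload(nprocs, "busy"),
                   qos=1, retain=True)
    try:
        returncode, error = run_build(task_id)
    except Exception as exc:
        # Treat a crash of the build wrapper itself as a failed task.
        returncode, error = 1, str(exc)
    result = json.dumps({"returncode": returncode, "error": error})
    # The result is a one-off event, so it is not retained here.
    client.publish(f"tasks/done/{task_id}", result, qos=1)
    client.publish(status_topic, status_payload(nprocs, "idle"),
                   qos=1, retain=True)
    return returncode
```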

When abuildd-agentd shuts down gracefully, it can publish a {nprocs,
offline} retained message at "servers/<arch>/<name>".

We can also set up what MQTT calls a "last will and testament" for each
build server. This allows the client to tell the broker what to do in
case the client's connection dies without sending a proper DISCONNECT
message. In our case, the last will and testament would be a retained
message on the "servers/<arch>/<name>" topic with a payload of {nprocs,
offline}. This way, if the broker detects that the MQTT client of the
build server has unexpectedly disconnected, it will automatically
publish an offline message for it.
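
As a sketch of how the will and the graceful shutdown fit together:
the code below assumes a paho-mqtt-style client with will_set(),
connect(), publish() and disconnect() methods. The function names and
the JSON payload layout are made up for illustration.

```python
import json


def _status(nprocs, status):
    """Assumed payload layout for "servers/<arch>/<name>"."""
    return json.dumps({"nprocs": nprocs, "status": status})


def connect_with_will(client, host, arch, name, nprocs, port=1883):
    """Register the offline will, then connect. Returns the status topic."""
    topic = f"servers/{arch}/{name}"
    # The will travels inside the CONNECT packet, so it has to be set
    # before connect() is called.
    client.will_set(topic, _status(nprocs, "offline"), qos=1, retain=True)
    client.connect(host, port)
    return topic


def shutdown_gracefully(client, topic, nprocs):
    """Publish the offline status ourselves, then disconnect cleanly.

    A clean DISCONNECT makes the broker discard the will, so the
    offline message has to be published explicitly here.
    """
    client.publish(topic, _status(nprocs, "offline"), qos=1, retain=True)
    client.disconnect()
```

With a real client library one would probably also wait for the final
publish to be flushed before disconnecting.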

What if all applicable build servers are unavailable (busy or offline)
when we want to delegate a new task? For the webhook, this should be
easy to get around since it's built to be asynchronous, so we can just
keep polling the list of servers until something's available. In case of
a webhook crash we have the SQL database as a backup. For the enqueue
CLI we would probably just fail by default and have an option to keep
polling for X time or indefinitely.
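
For the enqueue CLI, the "fail by default, optionally keep polling"
behaviour might be sketched like this. The wait_for_idle name and its
parameters are hypothetical; pick_idle stands for whatever function
returns the name of an idle server, or None if there is none right now.

```python
import time


def wait_for_idle(pick_idle, timeout=0.0, interval=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll pick_idle() until it returns a server name or time runs out.

    timeout=0 tries once and gives up (the proposed CLI default);
    timeout=None keeps polling indefinitely.
    Returns the server name, or None if nothing became available.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        name = pick_idle()
        if name is not None:
            return name
        if deadline is not None and clock() >= deadline:
            return None
        sleep(interval)
```

The webhook would not block like this; being asynchronous, it could
simply retry whenever a new "servers/#" message arrives.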

This is all based on about 30 minutes of reading MQTT documentation late
at night, so hopefully it all works.

Max
Received on Thu Sep 06 2018 - 04:54:21 UTC
