Supervisor Behaviour
This section should be read together with the supervisor(3) manual page in STDLIB, where all details about the supervisor behaviour are described.
Supervision Principles
A supervisor is responsible for starting, stopping, and monitoring its child processes. The basic idea of a supervisor is to keep its child processes alive by restarting them when necessary.
Which child processes to start and monitor is specified by a list of child specifications. The supervisor starts the child processes in the order given in this list, and terminates them in the reverse order.
Example
The callback module for a supervisor that starts the server from the gen_server behaviour example can look as follows:
-module(ch_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(ch_sup, []).

init(_Args) ->
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    ChildSpecs = [#{id => ch3,
                    start => {ch3, start_link, []},
                    restart => permanent,
                    shutdown => brutal_kill,
                    type => worker,
                    modules => [ch3]}],
    {ok, {SupFlags, ChildSpecs}}.
In the return value of init/1, SupFlags is the supervisor flags, and ChildSpecs is a list of child specifications.
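As a usage sketch (assuming the ch3 module from the gen_server chapter, which registers itself locally under the name ch3; the pids shown are illustrative), starting the supervisor from the shell looks like this:

1> {ok, Sup} = ch_sup:start_link().
{ok,<0.68.0>}
2> whereis(ch3).
<0.69.0>
3> supervisor:which_children(Sup).
[{ch3,<0.69.0>,worker,[ch3]}]

If ch3 crashes, the supervisor restarts it, and whereis(ch3) returns a new pid.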
Supervisor Flags
Restart Strategy
The restart strategy is specified by the strategy key in the supervisor flags map returned by the callback function init:
SupFlags = #{strategy => Strategy, ...}
The strategy key is optional in the supervisor flags map. If it is not given, it defaults to one_for_one.
one_for_one
If a child process terminates, only that process is restarted.
one_for_all
If a child process terminates, all other child processes are terminated, and then all child processes, including the terminated one, are restarted.
rest_for_one
If a child process terminates, the child processes started after it are terminated as well. Then the terminated child process and the child processes started after it are restarted.
simple_one_for_one
A simplified one_for_one supervisor, where all child processes are dynamically added instances of the same kind of process. See the supervisor(3) manual page for details.
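A simple_one_for_one supervisor takes a single child specification that acts as a template: no child is started during initialization, and children are added dynamically with supervisor:start_child/2. A minimal sketch, where call_handler is a hypothetical worker module:

init(_Args) ->
    SupFlags = #{strategy => simple_one_for_one},
    ChildSpecs = [#{id => call,
                    %% call_handler is a hypothetical module; the arguments
                    %% passed to supervisor:start_child/2 are appended to
                    %% the argument list given here
                    start => {call_handler, start_link, []},
                    shutdown => brutal_kill}],
    {ok, {SupFlags, ChildSpecs}}.

Calling supervisor:start_child(Sup, [Arg]) then starts a child by calling call_handler:start_link(Arg).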
Maximum Restart Intensity
The supervisors have a built-in mechanism to limit the number of restarts which can occur in a given time interval. This is specified by the two keys intensity and period in the supervisor flags map returned by the callback function init:
SupFlags = #{intensity => MaxR, period => MaxT, ...}
If more than MaxR number of restarts occur in the last MaxT seconds, the supervisor terminates all the child processes and then itself. The termination reason for the supervisor itself in that case will be shutdown.
When the supervisor terminates, the next higher-level supervisor takes some action. It either restarts the terminated supervisor or terminates itself.
The intention of the restart mechanism is to prevent a situation where a process repeatedly dies for the same reason, only to be restarted again.
The keys intensity and period are optional in the supervisor flags map. If they are not given, they default to 1 and 5, respectively.
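For example, the following flags (a sketch; the strategy key is incidental here) allow at most three restarts within any ten-second window. A fourth restart inside such a window makes the supervisor terminate all its child processes and then itself with reason shutdown:

SupFlags = #{strategy => one_for_one,
             intensity => 3,   % MaxR: at most 3 restarts ...
             period => 10}     % MaxT: ... within any 10-second window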
Tuning the intensity and period
The default values are 1 restart per 5 seconds. This was chosen to be safe for most systems, even with deep supervision hierarchies, but you will probably want to tune the settings for your particular use case.
First, the intensity decides how big bursts of restarts you want to tolerate. For example, you might want to accept a burst of at most 5 or 10 attempts, even within the same second, if it results in a successful restart.
Second, you need to consider the sustained failure rate, if crashes keep happening but not often enough to make the supervisor give up. If you set intensity to 10 and set the period as low as 1, the supervisor will allow child processes to keep restarting up to 10 times per second, forever, filling your logs with crash reports until someone intervenes manually.
You should therefore set the period to be long enough that you can accept that the supervisor keeps going at that rate. For example, if you have picked an intensity value of 5, then setting the period to 30 seconds will give you at most one restart per 6 seconds for any longer period of time, which means that your logs won't fill up too quickly, and you will have a chance to observe the failures and apply a fix.
These choices depend a lot on your problem domain. If you do not have real-time monitoring and the ability to fix problems quickly, for example in an embedded system, you might want to accept at most one restart per minute before the supervisor gives up and escalates to the next level to try to clear the error automatically. On the other hand, if it is more important that you keep trying even at a high failure rate, you might want a sustained rate of as much as 1-2 restarts per second.
Avoiding common mistakes:
- Do not forget to consider the burst rate. If you set intensity to 1 and period to 6, it gives the same sustained error rate as 5/30 or 10/60, but will not allow even 2 restart attempts in quick succession. This is probably not what you wanted.
- Do not set the period to a very high value if you want to tolerate bursts. If you set intensity to 5 and period to 3600 (one hour), the supervisor will allow a short burst of 5 restarts, but then gives up if it sees another single restart almost an hour later. You probably want to regard those crashes as separate incidents, so setting the period to 5 or 10 minutes will be more reasonable.
- If your application has multiple levels of supervision, then do not simply set the restart intensities to the same values on all levels. Keep in mind that the total number of restarts (before the top level supervisor gives up and terminates the application) will be the product of the intensity values of all the supervisors above the failing child process.
For example, if the top level allows 10 restarts, and the next level also allows 10, a crashing child below that level will be restarted 100 times, which is probably excessive. Allowing at most 3 restarts for the top level supervisor might be a better choice in this case.
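A sketch of such a two-level setup, using a single callback module for both levels (my_worker is a placeholder for a real worker module): with these values, the mid-level supervisor tolerates up to 5 restarts of the worker per 60 seconds before giving up, and the top level then restarts the mid-level supervisor up to 3 times per 300 seconds, so the worker may be restarted roughly 3 * 5 = 15 times before the whole tree terminates.

-module(tuned_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(?MODULE, top).

%% Top level: tolerate 3 restarts of the mid-level supervisor per 300 s.
init(top) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 300},
    ChildSpecs = [#{id => mid,
                    start => {supervisor, start_link, [?MODULE, mid]},
                    type => supervisor}],
    {ok, {SupFlags, ChildSpecs}};
%% Mid level: tolerate 5 restarts of the worker per 60 s.
init(mid) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 60},
    ChildSpecs = [#{id => worker1,
                    %% my_worker is a placeholder for some real worker module
                    start => {my_worker, start_link, []}}],
    {ok, {SupFlags, ChildSpecs}}.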
Child Specification
The type definition for a child specification is as follows:
child_spec() = #{id => child_id(),       % mandatory
                 start => mfargs(),      % mandatory
                 restart => restart(),   % optional
                 shutdown => shutdown(), % optional
                 type => worker(),       % optional
                 modules => modules()}   % optional
child_id() = term()
mfargs() = {M :: module(), F :: atom(), A :: [term()]}
modules() = [module()] | dynamic
restart() = permanent | transient | temporary
shutdown() = brutal_kill | timeout()
worker() = worker | supervisor
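Since only the id and start keys are mandatory, the child specification from the example at the beginning of this section can be written minimally. When a key is omitted, restart defaults to permanent, type to worker, shutdown to 5000 for a worker (infinity for a supervisor), and modules to a list containing the module from the start tuple:

ChildSpecs = [#{id => ch3,
                start => {ch3, start_link, []}}]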
To be continued…