[RFC] simple_lmk: Introduce Simple Low Memory Killer for Android

Fri Mar 15 17:17:49 UTC 2019

On Fri, Mar 15, 2019 at 9:43 AM Steven Rostedt <rostedt at goodmis.org> wrote:
>
> On Thu, 14 Mar 2019 21:36:43 -0700
> Daniel Colascione <dancol at google.com> wrote:
>
> > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt <rostedt at goodmis.org> wrote:
> > >
> > > On Thu, 14 Mar 2019 13:49:11 -0700
> > > Sultan Alsawaf <sultan at kerneltoast.com> wrote:
> > >
> > > > Perhaps I'm missing something, but if you want to know when a process has died
> > > > after sending a SIGKILL to it, then why not just make the SIGKILL optionally
> > > > block until the process has died completely? It'd be rather trivial to just
> > > > store a pointer to an onstack completion inside the victim process' task_struct,
> > > > and then complete it in free_task().
> > >
> > > How would you implement such a method in userspace? kill() doesn't take
> > > any parameters but the pid of the process you want to send a signal to,
> > > and the signal to send. This would require a new system call, and be
> > > quite a bit of work.
> >
> > That's what the pidfd work is for. Please read the original threads
> > about the motivation and design of that facility.
>
> I wasn't Cc'd on the original work, so I haven't read them.
>
> >
> > > If you can solve this with an ebpf program, I
> > > strongly suggest you do that instead.
> >
>
>
>
> > We do want killed processes to die promptly. That's why I support
> > boosting a process's priority somehow when lmkd is about to kill it.
> > The precise way in which we do that --- involving not only actual
> > priority, but scheduler knobs, cgroup assignment, core affinity, and
> > so on --- is a complex topic best left to userspace. lmkd already has
> > all the knobs it needs to implement whatever priority boosting policy
> > it wants.
> >
> > Hell, once we add a pidfd_wait --- which I plan to work on, assuming
> > nobody beats me to it, after pidfd_send_signal lands --- you can
> > imagine a general-purpose priority inheritance mechanism expediting
> > process death when a high-priority process waits on a pidfd_wait
> > handle for a condemned process. You know you're on the right track
> > design-wise when you start seeing this kind of elegant constructive
> > interference between seemingly-unrelated features. What we don't need
> > is some kind of blocking SIGKILL alternative or backdoor event
> > delivery system.
> >
> > We definitely don't want to have to wait for a process's parent to
> > reap it. Instead, we want to wait for it to become a zombie. That's
> > why I designed my original exithand patch to fire death notification
> > upon transition to the zombie state, not upon process table removal,
> > and I expect pidfd_wait (or whatever we call it) to act the same way.
> >
> > In any case, there's a clear path forward here --- general-purpose,
> > cheap, and elegant --- and we should just focus on doing that instead
> > of more complex proposals with few advantages.
>
> If you add new pidfd systemcalls then making a new way to send a signal
> and block till it does die or whatever is

Right. And we shouldn't couple the killing and the waiting: while we
now have a good race-free way to kill processes using
pidfd_send_signal, but we still have no good facility for waiting for
the death of a process that isn't a child of the waiter. Any kind of
unified "kill and wait for death" primitive precludes the killing
thread waiting for things other than death at the same time! Instead,
if we allow waiting for an arbitrary process's death using
general-purpose wait primitives like select/poll/epoll/io_submit/etc.,
then synchronous killing becomes just another sleep that composes in
useful and predictable ways.

> more acceptable than adding a
> new signal that changes the semantics of sending signals, which is what
> I was against.

Agreed. Even if it were possible to easily add signals without
breaking everyone, a special kind of signal with delivery semantics
different from those of existing signals is a bad idea, and not really
a signal at all, but just a new system call in disguise.

> I do agree with Joel about bloating task_struct too. If anything, have
> a wait queue you add, where you can allocate a descriptor with the task
> dieing and task killing, and just search this queue on dying. We could
> add a TIF flag to the task as well to let the exiting of this task know
> it should do such an operation.

That's my basic plan. I think we need one link from struct signal or
something so we don't end up doing some kind of *global* search on
process death, but let's see how it goes.