[PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

Baoquan He bhe at redhat.com
Thu Jul 19 15:17:53 UTC 2018


Hi Andrew,

On 07/18/18 at 03:33pm, Andrew Morton wrote:
> On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe at redhat.com> wrote:
> 
> > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> > is used to load kernel/initrd/purgatory is supposed to be allocated from
> > top to down. This is what we have been doing all along in the old kexec
> > loading interface and the kexec loading is still default setting in some
> > distributions. However, the current kexec_file loading interface doesn't
> > do like this. The function arch_kexec_walk_mem() it calls ignores checking
> > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> > all resources of System RAM from bottom to up, to try to find memory region
> > which can contain the specific kexec buffer, then call locate_mem_hole_callback()
> > to allocate memory in that found memory region from top to down. This brings
> > confusion especially when KASLR is widely supported , users have to make clear
> > why kexec/kdump kernel loading position is different between these two
> > interfaces in order to exclude unnecessary noises. Hence these two interfaces
> > need be unified on behaviour.
> 
> As far as I can tell, the above is the whole reason for the patchset,
> yes?  To avoid confusing users.


In fact, it's not just trying to avoid confusing users. Kexec loading
and kexec_file loading are just do the same thing in essence. Just we
need do kernel image verification on uefi system, have to port kexec
loading code to kernel. 

Kexec has been a formal feature in our distro, and customers owning
those kind of very large machine can make use of this feature to speed
up the reboot process. On uefi machine, the kexec_file loading will
search place to put kernel under 4G from top to down. As we know, the
1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
it. It may have possibility to not be able to find a usable space for
kernel/initrd. From the top down of the whole memory space, we don't
have this worry. 

And at the first post, I just posted below with AKASHI's
walk_system_ram_res_rev() version. Later you suggested to use
list_head to link child sibling of resource, see what the code change
looks like.
http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com

Then I posted v2
http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
Rob Herring mentioned that other components which has this tree struct
have planned to do the same thing, replacing the singly linked list with
list_head to link resource child sibling. Just quote Rob's words as
below. I think this could be another reason.

~~~~~ From Rob
The DT struct device_node also has the same tree structure with
parent, child, sibling pointers and converting to list_head had been
on the todo list for a while. ACPI also has some tree walking
functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
common tree struct and helpers defined either on top of list_head or a
~~~~~
new struct if that saves some size.

> 
> Is that sufficient?  Can we instead simplify their lives by providing
> better documentation or informative printks or better Kconfig text,
> etc?
> 
> And who *are* the people who are performing this configuration?  Random
> system administrators?  Linux distro engineers?  If the latter then
> they presumably aren't easily confused!

Kexec was invented for kernel developer to speed up their kernel
rebooting. Now high end sever admin, kernel developer and QE are also
keen to use it to reboot large box for faster feature testing, bug
debugging. Kernel dev could know this well, about kernel loading
position, admin or QE might not be aware of it very well. 

> 
> In other words, I'm trying to understand how much benefit this patchset
> will provide to our users as a whole.

Understood. The list_head replacing patch truly involes too many code
changes, it's risky. I am willing to try any idea from reviewers, won't
persuit they have to be accepted finally. If don't have a try, we don't
know what it looks like, and what impact it may have. I am fine to take
AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even
though it could be a little bit low efficient.

Thanks
Baoquan


More information about the devel mailing list