[PATCHv9 4/8] zswap: add to mm/

Wed Apr 17 17:46:17 UTC 2013

On Sun, Apr 14, 2013 at 01:45:06AM +0100, Mel Gorman wrote:
> On Wed, Apr 10, 2013 at 01:18:56PM -0500, Seth Jennings wrote:
> > zswap is a thin compression backend for frontswap. It receives
> > pages from frontswap and attempts to store them in a compressed
> > memory pool, resulting in an effective partial memory reclaim and
> > dramatically reduced swap device I/O.
> > 
> > Additionally, in most cases, pages can be retrieved from this
> > compressed store much more quickly than reading from tradition
> > swap devices resulting in faster performance for many workloads.
> > 
> 
> Except in the case where the zswap pool is externally fragmented, occupies
> its maximum configured size and a workload that would otherwise have fit
> in memory gets pushed to swap.
> 
> Yes, it's a corner case but the changelog portrays zswap as an unconditional
> win and while it certainly is going to help some cases, it won't help
> them all.

Yes, and I (and others) are already working on improvements to the allocator to
make this corner case even smaller.

> 
> > This patch adds the zswap driver to mm/
> > 
> > Signed-off-by: Seth Jennings <sjenning at linux.vnet.ibm.com>
> > ---
> >  mm/Kconfig  |  15 ++
> >  mm/Makefile |   1 +
> >  mm/zswap.c  | 665 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 681 insertions(+)
> >  create mode 100644 mm/zswap.c
> > 
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index aa054fc..36d93b0 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -495,3 +495,18 @@ config PGTABLE_MAPPING
> >  
> >  	  You can check speed with zsmalloc benchmark[1].
> >  	  [1] https://github.com/spartacus06/zsmalloc
> > +
> > +config ZSWAP
> > +	bool "In-kernel swap page compression"
> > +	depends on FRONTSWAP && CRYPTO
> > +	select CRYPTO_LZO
> > +	select ZSMALLOC
> > +	default n
> > +	help
> > +	  Zswap is a backend for the frontswap mechanism in the VMM.
> > +	  It receives pages from frontswap and attempts to store them
> > +	  in a compressed memory pool, resulting in an effective
> > +	  partial memory reclaim.  In addition, pages and be retrieved
> > +	  from this compressed store much faster than most tradition
> > +	  swap devices resulting in reduced I/O and faster performance
> > +	  for many workloads.
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 0f6ef0a..1e0198f 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -32,6 +32,7 @@ obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
> >  obj-$(CONFIG_BOUNCE)	+= bounce.o
> >  obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
> >  obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
> > +obj-$(CONFIG_ZSWAP)	+= zswap.o
> >  obj-$(CONFIG_HAS_DMA)	+= dmapool.o
> >  obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
> >  obj-$(CONFIG_NUMA) 	+= mempolicy.o
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > new file mode 100644
> > index 0000000..db283c4
> > --- /dev/null
> > +++ b/mm/zswap.c
> > @@ -0,0 +1,665 @@
> > +/*
> > + * zswap.c - zswap driver file
> > + *
> > + * zswap is a backend for frontswap that takes pages that are in the
> > + * process of being swapped out and attempts to compress them and store
> > + * them in a RAM-based memory pool.  This results in a significant I/O
> > + * reduction on the real swap device and, in the case of a slow swap
> > + * device, can also improve workload performance.
> > + *
> > + * Copyright (C) 2012  Seth Jennings <sjenning at linux.vnet.ibm.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > +*/
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +#include <linux/module.h>
> > +#include <linux/cpu.h>
> > +#include <linux/highmem.h>
> > +#include <linux/slab.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/types.h>
> > +#include <linux/atomic.h>
> > +#include <linux/frontswap.h>
> > +#include <linux/rbtree.h>
> > +#include <linux/swap.h>
> > +#include <linux/crypto.h>
> > +#include <linux/mempool.h>
> > +#include <linux/zsmalloc.h>
> > +
> > +/*********************************
> > +* statistics
> > +**********************************/
> > +/* Number of memory pages used by the compressed pool */
> > +static atomic_t zswap_pool_pages = ATOMIC_INIT(0);
> > +/* The number of compressed pages currently stored in zswap */
> > +static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> > +
> > +/*
> > + * The statistics below are not protected from concurrent access for
> > + * performance reasons so they may not be a 100% accurate.  However,
> > + * they do provide useful information on roughly how many times a
> > + * certain event is occurring.
> > +*/
> > +static u64 zswap_pool_limit_hit;
> > +static u64 zswap_reject_compress_poor;
> > +static u64 zswap_reject_zsmalloc_fail;
> > +static u64 zswap_reject_kmemcache_fail;
> > +static u64 zswap_duplicate_entry;
> > +
> 
> Ok. Initially I thought "vmstat" but it would be overkill in this case
> and the fact zswap can be a module would be a problem.

Agreed.

> 
> > +/*********************************
> > +* tunables
> > +**********************************/
> > +/* Enable/disable zswap (disabled by default, fixed at boot for now) */
> > +static bool zswap_enabled;
> > +module_param_named(enabled, zswap_enabled, bool, 0);
> > +
> > +/* Compressor to be used by zswap (fixed at boot for now) */
> > +#define ZSWAP_COMPRESSOR_DEFAULT "lzo"
> > +static char *zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT;
> > +module_param_named(compressor, zswap_compressor, charp, 0);
> > +
> > +/* The maximum percentage of memory that the compressed pool can occupy */
> > +static unsigned int zswap_max_pool_percent = 20;
> > +module_param_named(max_pool_percent,
> > +			zswap_max_pool_percent, uint, 0644);
> > +
> 
> This has some potentially interesting NUMA characteristics.
> The location of the allocated pages will depend on the process that first
> allocated the page. As the pages can then be used by remote processes,
> there may be increased remote accesses when accessing zswap.
> Furthermore, if zone_reclaim_mode is enabled and allowed to swap it
> could setup a weird situation whereby a process pushes itself fully into
> zswap trying to reclaim local memory and instead pushing itself into
> zswap on the local node.

Yes, there are some NUMA ramifications to flush out here. None of them
completely detrimental I don't think, but sub-optimal yes.

> 
> If this is every reported as a problem then a workaround is to always
> allocate zswap pages round-robin between online nodes.

Even the round-robin approach could have a caveat if the nodes are not
identically sized.  I guess you could do a weighted round-robin proportional to
node size in that case.

> 
> > +/*
> > + * Maximum compression ratio, as as percentage, for an acceptable
> 
> s/as as/as a/

Yes.

> 
> > + * compressed page. Any pages that do not compress by at least
> > + * this ratio will be rejected.
> > +*/
> > +static unsigned int zswap_max_compression_ratio = 80;
> > +module_param_named(max_compression_ratio,
> > +			zswap_max_compression_ratio, uint, 0644);
> > +
> > +/*********************************
> > +* compression functions
> > +**********************************/
> > +/* per-cpu compression transforms */
> > +static struct crypto_comp * __percpu *zswap_comp_pcpu_tfms;
> > +
> > +enum comp_op {
> > +	ZSWAP_COMPOP_COMPRESS,
> > +	ZSWAP_COMPOP_DECOMPRESS
> > +};
> > +
> > +static int zswap_comp_op(enum comp_op op, const u8 *src, unsigned int slen,
> > +				u8 *dst, unsigned int *dlen)
> > +{
> > +	struct crypto_comp *tfm;
> > +	int ret;
> > +
> > +	tfm = *per_cpu_ptr(zswap_comp_pcpu_tfms, get_cpu());
> 
> It's always the local CPU so why not get_cpu_var()?

Hmm... I'm not sure :)  Let me take another look.

> 
> > +	switch (op) {
> > +	case ZSWAP_COMPOP_COMPRESS:
> > +		ret = crypto_comp_compress(tfm, src, slen, dst, dlen);
> > +		break;
> > +	case ZSWAP_COMPOP_DECOMPRESS:
> > +		ret = crypto_comp_decompress(tfm, src, slen, dst, dlen);
> > +		break;
> > +	default:
> > +		ret = -EINVAL;
> > +	}
> > +
> > +	put_cpu();
> > +	return ret;
> > +}
> > +
> > +static int __init zswap_comp_init(void)
> > +{
> > +	if (!crypto_has_comp(zswap_compressor, 0, 0)) {
> > +		pr_info("%s compressor not available\n", zswap_compressor);
> > +		/* fall back to default compressor */
> > +		zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT;
> > +		if (!crypto_has_comp(zswap_compressor, 0, 0))
> > +			/* can't even load the default compressor */
> > +			return -ENODEV;
> > +	}
> > +	pr_info("using %s compressor\n", zswap_compressor);
> > +
> > +	/* alloc percpu transforms */
> > +	zswap_comp_pcpu_tfms = alloc_percpu(struct crypto_comp *);
> > +	if (!zswap_comp_pcpu_tfms)
> > +		return -ENOMEM;
> > +	return 0;
> > +}
> > +
> > +static void zswap_comp_exit(void)
> > +{
> > +	/* free percpu transforms */
> > +	if (zswap_comp_pcpu_tfms)
> > +		free_percpu(zswap_comp_pcpu_tfms);
> > +}
> > +
> > +/*********************************
> > +* data structures
> > +**********************************/
> > +struct zswap_entry {
> > +	struct rb_node rbnode;
> > +	unsigned type;
> > +	pgoff_t offset;
> > +	unsigned long handle;
> > +	unsigned int length;
> > +};
> 
> Document that the types and offset are from frontswap and the handle is
> an opaque type from zsmalloc. This indicates that zswap is hard-coded
> against zsmalloc but so far I do not believe I have seen anything that
> forces it to be and allow either zbud or zsmalloc to be pluggable.

Ok.

> 
> > +
> > +struct zswap_tree {
> > +	struct rb_root rbroot;
> > +	spinlock_t lock;
> > +	struct zs_pool *pool;
> > +};
> > +
> > +static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> > +
> > +/*********************************
> > +* zswap entry functions
> > +**********************************/
> > +#define ZSWAP_KMEM_CACHE_NAME "zswap_entry_cache"
> 
> heh, it's only used once and it's not exactly a magic number. Seems
> overkill for a #define

Yes.

> 
> > +static struct kmem_cache *zswap_entry_cache;
> > +
> > +static inline int zswap_entry_cache_create(void)
> > +{
> 
> No need to declare it inline, compiler will figure it out. Also should
> return bool.

Yes. 

> 
> > +	zswap_entry_cache =
> > +		kmem_cache_create(ZSWAP_KMEM_CACHE_NAME,
> > +			sizeof(struct zswap_entry), 0, 0, NULL);
> > +	return (zswap_entry_cache == NULL);
> > +}
> > +
> > +static inline void zswap_entry_cache_destory(void)
> > +{
> > +	kmem_cache_destroy(zswap_entry_cache);
> > +}
> > +
> > +static inline struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
> > +{
> > +	struct zswap_entry *entry;
> > +	entry = kmem_cache_alloc(zswap_entry_cache, gfp);
> > +	if (!entry)
> > +		return NULL;
> 
> No need to check !entry, just return it.

Yes.

> 
> > +	return entry;
> > +}
> > +
> > +static inline void zswap_entry_cache_free(struct zswap_entry *entry)
> > +{
> > +	kmem_cache_free(zswap_entry_cache, entry);
> > +}
> > +
> > +/*********************************
> > +* rbtree functions
> > +**********************************/
> > +static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> > +{
> > +	struct rb_node *node = root->rb_node;
> > +	struct zswap_entry *entry;
> > +
> > +	while (node) {
> > +		entry = rb_entry(node, struct zswap_entry, rbnode);
> > +		if (entry->offset > offset)
> > +			node = node->rb_left;
> > +		else if (entry->offset < offset)
> > +			node = node->rb_right;
> > +		else
> > +			return entry;
> > +	}
> > +	return NULL;
> > +}
> > +
> > +/*
> > + * In the case that a entry with the same offset is found, it a pointer to
> > + * the existing entry is stored in dupentry and the function returns -EEXIST
> > +*/
> > +static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> > +			struct zswap_entry **dupentry)
> > +{
> > +	struct rb_node **link = &root->rb_node, *parent = NULL;
> > +	struct zswap_entry *myentry;
> > +
> > +	while (*link) {
> > +		parent = *link;
> > +		myentry = rb_entry(parent, struct zswap_entry, rbnode);
> > +		if (myentry->offset > entry->offset)
> > +			link = &(*link)->rb_left;
> > +		else if (myentry->offset < entry->offset)
> > +			link = &(*link)->rb_right;
> > +		else {
> > +			*dupentry = myentry;
> > +			return -EEXIST;
> > +		}
> > +	}
> > +	rb_link_node(&entry->rbnode, parent, link);
> > +	rb_insert_color(&entry->rbnode, root);
> > +	return 0;
> > +}
> > +
> > +/*********************************
> > +* per-cpu code
> > +**********************************/
> > +static DEFINE_PER_CPU(u8 *, zswap_dstmem);
> > +
> 
> That deserves a comment. It per-cpu buffers for compressing or
> decompressing data.

Will do.

> 
> 
> > +static int __zswap_cpu_notifier(unsigned long action, unsigned long cpu)
> > +{
> > +	struct crypto_comp *tfm;
> > +	u8 *dst;
> > +
> > +	switch (action) {
> > +	case CPU_UP_PREPARE:
> > +		tfm = crypto_alloc_comp(zswap_compressor, 0, 0);
> > +		if (IS_ERR(tfm)) {
> > +			pr_err("can't allocate compressor transform\n");
> > +			return NOTIFY_BAD;
> > +		}
> > +		*per_cpu_ptr(zswap_comp_pcpu_tfms, cpu) = tfm;
> > +		dst = kmalloc(PAGE_SIZE * 2, GFP_KERNEL);
> > +		if (!dst) {
> > +			pr_err("can't allocate compressor buffer\n");
> > +			crypto_free_comp(tfm);
> > +			*per_cpu_ptr(zswap_comp_pcpu_tfms, cpu) = NULL;
> > +			return NOTIFY_BAD;
> > +		}
> > +		per_cpu(zswap_dstmem, cpu) = dst;
> > +		break;
> > +	case CPU_DEAD:
> > +	case CPU_UP_CANCELED:
> > +		tfm = *per_cpu_ptr(zswap_comp_pcpu_tfms, cpu);
> > +		if (tfm) {
> > +			crypto_free_comp(tfm);
> > +			*per_cpu_ptr(zswap_comp_pcpu_tfms, cpu) = NULL;
> > +		}
> > +		dst = per_cpu(zswap_dstmem, cpu);
> > +		if (dst) {
> > +			kfree(dst);
> > +			per_cpu(zswap_dstmem, cpu) = NULL;
> > +		}
> > +		break;
> > +	default:
> > +		break;
> > +	}
> > +	return NOTIFY_OK;
> > +}
> > +
> > +static int zswap_cpu_notifier(struct notifier_block *nb,
> > +				unsigned long action, void *pcpu)
> > +{
> > +	unsigned long cpu = (unsigned long)pcpu;
> > +	return __zswap_cpu_notifier(action, cpu);
> > +}
> > +
> > +static struct notifier_block zswap_cpu_notifier_block = {
> > +	.notifier_call = zswap_cpu_notifier
> > +};
> > +
> > +static int zswap_cpu_init(void)
> > +{
> > +	unsigned long cpu;
> > +
> > +	get_online_cpus();
> > +	for_each_online_cpu(cpu)
> > +		if (__zswap_cpu_notifier(CPU_UP_PREPARE, cpu) != NOTIFY_OK)
> > +			goto cleanup;
> > +	register_cpu_notifier(&zswap_cpu_notifier_block);
> > +	put_online_cpus();
> > +	return 0;
> > +
> > +cleanup:
> > +	for_each_online_cpu(cpu)
> > +		__zswap_cpu_notifier(CPU_UP_CANCELED, cpu);
> > +	put_online_cpus();
> > +	return -ENOMEM;
> > +}
> > +
> > +/*********************************
> > +* zsmalloc callbacks
> > +**********************************/
> > +static mempool_t *zswap_page_pool;
> > +
> > +static inline unsigned int zswap_max_pool_pages(void)
> > +{
> > +	return zswap_max_pool_percent * totalram_pages / 100;
> > +}
> > +
> > +static inline int zswap_page_pool_create(void)
> > +{
> > +	/* TODO: dynamically size mempool */
> > +	zswap_page_pool = mempool_create_page_pool(256, 0);
> > +	if (!zswap_page_pool)
> > +		return -ENOMEM;
> > +	return 0;
> > +}
> > +
> > +static inline void zswap_page_pool_destroy(void)
> > +{
> > +	mempool_destroy(zswap_page_pool);
> > +}
> > +
> > +static struct page *zswap_alloc_page(gfp_t flags)
> > +{
> > +	struct page *page;
> > +
> > +	if (atomic_read(&zswap_pool_pages) >= zswap_max_pool_pages()) {
> > +		zswap_pool_limit_hit++;
> > +		return NULL;
> > +	}
> > +	page = mempool_alloc(zswap_page_pool, flags);
> > +	if (page)
> > +		atomic_inc(&zswap_pool_pages);
> > +	return page;
> > +}
> > +
> > +static void zswap_free_page(struct page *page)
> > +{
> > +	if (!page)
> > +		return;
> > +	mempool_free(page, zswap_page_pool);
> > +	atomic_dec(&zswap_pool_pages);
> > +}
> 
> Again I find it odd that the mempool is here instead of within zsmalloc
> itself. It's also not superclear why you used mempool instead of just
> alloc_page/free_page

I am reexamining this approach since the removal of __GFP_NOMEMALLOC from the
mask has bascially done away with the issue of zsmalloc being able to allocate
pages for the pool.

This could potentially be done away with.

> 
> 
> > +
> > +static struct zs_ops zswap_zs_ops = {
> > +	.alloc = zswap_alloc_page,
> > +	.free = zswap_free_page
> > +};
> > +
> > +/*********************************
> > +* frontswap hooks
> > +**********************************/
> > +/* attempts to compress and store an single page */
> > +static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > +				struct page *page)
> > +{
> > +	struct zswap_tree *tree = zswap_trees[type];
> > +	struct zswap_entry *entry, *dupentry;
> > +	int ret;
> > +	unsigned int dlen = PAGE_SIZE;
> > +	unsigned long handle;
> > +	char *buf;
> > +	u8 *src, *dst;
> > +
> > +	if (!tree) {
> > +		ret = -ENODEV;
> > +		goto reject;
> > +	}
> > +
> > +	/* allocate entry */
> > +	entry = zswap_entry_cache_alloc(GFP_KERNEL);
> > +	if (!entry) {
> > +		zswap_reject_kmemcache_fail++;
> > +		ret = -ENOMEM;
> > +		goto reject;
> > +	}
> > +
> > +	/* compress */
> > +	dst = get_cpu_var(zswap_dstmem);
> > +	src = kmap_atomic(page);
> 
> Why kmap_atomic? We do not appear to be in an atomic context here and
> you've already disabled preempt for the compression op.

To support swapping out highmem pages in 32-bit systems.  Is this not needed
for that?

> 
> > +	ret = zswap_comp_op(ZSWAP_COMPOP_COMPRESS, src, PAGE_SIZE, dst, &dlen);
> > +	kunmap_atomic(src);
> > +	if (ret) {
> > +		ret = -EINVAL;
> > +		goto putcpu;
> > +	}
> > +	if ((dlen * 100 / PAGE_SIZE) > zswap_max_compression_ratio) {
> > +		zswap_reject_compress_poor++;
> > +		ret = -E2BIG;
> > +		goto putcpu;
> > +	}
> > +
> > +	/* store */
> > +	handle = zs_malloc(tree->pool, dlen,
> > +		__GFP_NORETRY | __GFP_HIGHMEM | __GFP_NOMEMALLOC |
> > +			__GFP_NOWARN);
> > +	if (!handle) {
> > +		zswap_reject_zsmalloc_fail++;
> > +		ret = -ENOMEM;
> > +		goto putcpu;
> > +	}
> > +
> 
> This is an aging inversion problem. Once zswap is full, the newest pages
> are written to swap instead of old zswap pages and freeing up some
> space. This means two things

Regarding the age inversion, this is addressed, except in the case where zswap
is full, by writeback.

> 
> 1. Once zswap is full, it's performance degrades immediately to just
>    being even worst than traditional swap except we're doing all the
>    swapping but have 20% (by default) less physical memory to work with.

zswap can still help in this case but yes, the performance will converge on the
normal off-the-cliff thrashing performance.

> 
> 2. zswap is vunerable to a DOS by a process starting, allocating a
>    buffer that is RAM + max zswap size to fill zswap, freeing its remaining
>    in-core pages and then sleeping forever

I think writeback addresses this issue.  If it doesn't let me know.

Additionally, I'm working on code to introduce "periodic writeback" where, if
zpages are not accessed in zswap within a certain time, they are written back
to free up memory.  This prevents zswap from occupying memory forever on
systems that use swap in the perferred manner: to contain pages that are
unlikely to swapped back in any time soon.

> 
> zswap pages should also be maintained on a LRU with old zswap pages written
> to backing storage when it's full. A logical follow-on then would be that
> the size of the zswap pool can be dynamically shrunk to free physical RAM
> if the refault rate between zswap and normal RAM is low.

I think this is fully addressed by the writeback patch.

> 
> > +	buf = zs_map_object(tree->pool, handle, ZS_MM_WO);
> > +	memcpy(buf, dst, dlen);
> > +	zs_unmap_object(tree->pool, handle);
> > +	put_cpu_var(zswap_dstmem);
> > +
> > +	/* populate entry */
> > +	entry->type = type;
> > +	entry->offset = offset;
> > +	entry->handle = handle;
> > +	entry->length = dlen;
> > +
> > +	/* map */
> > +	spin_lock(&tree->lock);
> > +	do {
> > +		ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry);
> > +		if (ret == -EEXIST) {
> > +			zswap_duplicate_entry++;
> > +
> > +			/* remove from rbtree */
> > +			rb_erase(&dupentry->rbnode, &tree->rbroot);
> > +
> > +			/* free */
> > +			zs_free(tree->pool, dupentry->handle);
> > +			zswap_entry_cache_free(dupentry);
> > +			atomic_dec(&zswap_stored_pages);
> > +		}
> > +	} while (ret == -EEXIST);
> > +	spin_unlock(&tree->lock);
> > +
> > +	/* update stats */
> > +	atomic_inc(&zswap_stored_pages);
> > +
> > +	return 0;
> > +
> > +putcpu:
> > +	put_cpu_var(zswap_dstmem);
> > +	zswap_entry_cache_free(entry);
> > +reject:
> > +	return ret;
> > +}
> > +
> > +/*
> > + * returns 0 if the page was successfully decompressed
> > + * return -1 on entry not found or error
> > +*/
> > +static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> > +				struct page *page)
> > +{
> > +	struct zswap_tree *tree = zswap_trees[type];
> > +	struct zswap_entry *entry;
> > +	u8 *src, *dst;
> > +	unsigned int dlen;
> > +
> > +	/* find */
> > +	spin_lock(&tree->lock);
> > +	entry = zswap_rb_search(&tree->rbroot, offset);
> > +	spin_unlock(&tree->lock);
> > +
> > +	/* decompress */
> > +	dlen = PAGE_SIZE;
> > +	src = zs_map_object(tree->pool, entry->handle, ZS_MM_RO);
> > +	dst = kmap_atomic(page);
> > +	zswap_comp_op(ZSWAP_COMPOP_DECOMPRESS, src, entry->length,
> > +		dst, &dlen);
> > +	kunmap_atomic(dst);
> > +	zs_unmap_object(tree->pool, entry->handle);
> > +
> > +	return 0;
> > +}
> > +
> > +/* invalidates a single page */
> > +static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
> > +{
> > +	struct zswap_tree *tree = zswap_trees[type];
> > +	struct zswap_entry *entry;
> > +
> > +	/* find */
> > +	spin_lock(&tree->lock);
> > +	entry = zswap_rb_search(&tree->rbroot, offset);
> > +
> > +	/* remove from rbtree */
> > +	rb_erase(&entry->rbnode, &tree->rbroot);
> > +	spin_unlock(&tree->lock);
> > +
> > +	/* free */
> > +	zs_free(tree->pool, entry->handle);
> > +	zswap_entry_cache_free(entry);
> > +	atomic_dec(&zswap_stored_pages);
> > +}
> > +
> > +/* invalidates all pages for the given swap type */
> > +static void zswap_frontswap_invalidate_area(unsigned type)
> > +{
> > +	struct zswap_tree *tree = zswap_trees[type];
> > +	struct rb_node *node;
> > +	struct zswap_entry *entry;
> > +
> > +	if (!tree)
> > +		return;
> > +
> > +	/* walk the tree and free everything */
> > +	spin_lock(&tree->lock);
> > +	/*
> > +	 * TODO: Even though this code should not be executed because
> > +	 * the try_to_unuse() in swapoff should have emptied the tree,
> > +	 * it is very wasteful to rebalance the tree after every
> > +	 * removal when we are freeing the whole tree.
> > +	 *
> > +	 * If post-order traversal code is ever added to the rbtree
> > +	 * implementation, it should be used here.
> > +	 */
> > +	while ((node = rb_first(&tree->rbroot))) {
> > +		entry = rb_entry(node, struct zswap_entry, rbnode);
> > +		rb_erase(&entry->rbnode, &tree->rbroot);
> > +		zs_free(tree->pool, entry->handle);
> > +		zswap_entry_cache_free(entry);
> > +	}
> > +	tree->rbroot = RB_ROOT;
> > +	spin_unlock(&tree->lock);
> > +}
> > +
> > +/* NOTE: this is called in atomic context from swapon and must not sleep */
> > +static void zswap_frontswap_init(unsigned type)
> > +{
> > +	struct zswap_tree *tree;
> > +
> > +	tree = kzalloc(sizeof(struct zswap_tree), GFP_ATOMIC);
> > +	if (!tree)
> > +		goto err;
> > +	tree->pool = zs_create_pool(GFP_NOWAIT, &zswap_zs_ops);
> > +	if (!tree->pool)
> > +		goto freetree;
> > +	tree->rbroot = RB_ROOT;
> > +	spin_lock_init(&tree->lock);
> > +	zswap_trees[type] = tree;
> > +	return;
> > +
> 
> Ok I think. I didn't read this as carefully because I assumed that it
> would either work or blow up spectacularly and there was little scope
> for being clever.
> 
> > +freetree:
> > +	kfree(tree);
> > +err:
> > +	pr_err("alloc failed, zswap disabled for swap type %d\n", type);
> > +}
> > +
> > +static struct frontswap_ops zswap_frontswap_ops = {
> > +	.store = zswap_frontswap_store,
> > +	.load = zswap_frontswap_load,
> > +	.invalidate_page = zswap_frontswap_invalidate_page,
> > +	.invalidate_area = zswap_frontswap_invalidate_area,
> > +	.init = zswap_frontswap_init
> > +};
> > +
> > +/*********************************
> > +* debugfs functions
> > +**********************************/
> > +#ifdef CONFIG_DEBUG_FS
> > +#include <linux/debugfs.h>
> > +
> > +static struct dentry *zswap_debugfs_root;
> > +
> > +static int __init zswap_debugfs_init(void)
> > +{
> > +	if (!debugfs_initialized())
> > +		return -ENODEV;
> > +
> > +	zswap_debugfs_root = debugfs_create_dir("zswap", NULL);
> > +	if (!zswap_debugfs_root)
> > +		return -ENOMEM;
> > +
> > +	debugfs_create_u64("pool_limit_hit", S_IRUGO,
> > +			zswap_debugfs_root, &zswap_pool_limit_hit);
> > +	debugfs_create_u64("reject_zsmalloc_fail", S_IRUGO,
> > +			zswap_debugfs_root, &zswap_reject_zsmalloc_fail);
> > +	debugfs_create_u64("reject_kmemcache_fail", S_IRUGO,
> > +			zswap_debugfs_root, &zswap_reject_kmemcache_fail);
> > +	debugfs_create_u64("reject_compress_poor", S_IRUGO,
> > +			zswap_debugfs_root, &zswap_reject_compress_poor);
> > +	debugfs_create_u64("duplicate_entry", S_IRUGO,
> > +			zswap_debugfs_root, &zswap_duplicate_entry);
> > +	debugfs_create_atomic_t("pool_pages", S_IRUGO,
> > +			zswap_debugfs_root, &zswap_pool_pages);
> > +	debugfs_create_atomic_t("stored_pages", S_IRUGO,
> > +			zswap_debugfs_root, &zswap_stored_pages);
> > +
> > +	return 0;
> > +}
> > +
> > +static void __exit zswap_debugfs_exit(void)
> > +{
> > +	debugfs_remove_recursive(zswap_debugfs_root);
> > +}
> > +#else
> > +static inline int __init zswap_debugfs_init(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline void __exit zswap_debugfs_exit(void) { }
> > +#endif
> > +
> > +/*********************************
> > +* module init and exit
> > +**********************************/
> > +static int __init init_zswap(void)
> > +{
> > +	if (!zswap_enabled)
> > +		return 0;
> > +
> > +	pr_info("loading zswap\n");
> > +	if (zswap_entry_cache_create()) {
> > +		pr_err("entry cache creation failed\n");
> > +		goto error;
> > +	}
> > +	if (zswap_page_pool_create()) {
> > +		pr_err("page pool initialization failed\n");
> > +		goto pagepoolfail;
> > +	}
> > +	if (zswap_comp_init()) {
> > +		pr_err("compressor initialization failed\n");
> > +		goto compfail;
> > +	}
> > +	if (zswap_cpu_init()) {
> > +		pr_err("per-cpu initialization failed\n");
> > +		goto pcpufail;
> > +	}
> > +	frontswap_register_ops(&zswap_frontswap_ops);
> > +	if (zswap_debugfs_init())
> > +		pr_warn("debugfs initialization failed\n");
> > +	return 0;
> > +pcpufail:
> > +	zswap_comp_exit();
> > +compfail:
> > +	zswap_page_pool_destroy();
> > +pagepoolfail:
> > +	zswap_entry_cache_destory();
> > +error:
> > +	return -ENOMEM;
> > +}
> > +/* must be late so crypto has time to come up */
> > +late_initcall(init_zswap);
> > +
> > +MODULE_LICENSE("GPL");
> > +MODULE_AUTHOR("Seth Jennings <sjenning at linux.vnet.ibm.com>");
> > +MODULE_DESCRIPTION("Compressed cache for swap pages");
> 
> Ok, so there are some problems in there.  For me, the zsmalloc fragmentation
> issues are potentially the far scarier problem because unpredictable
> performance characteristics tend to generate really painful bug reports
> with difficult (if not impossible) to replicate problems. Those reports
> are so painful in fact that I'm inclined to dig my heels in and make loud
> noises unless an allocator with predictable performance characteritics
> can also be used (presumably zbud) -- as a comparison point if nothing
> else but also to have as a workaround for performance problems in zsmalloc.

Yes, that allocator continues to be a point of discussion, which I find
unfortunate since it is just a means to an end; that end being compressed swap.
However, it is important to get somewhat right from the beginning to avoid
throwing large amounts of code out of the mm/ later.

> 
> It also looks like performance will fall off a cliff when zswap is full
> but at least that's a predictable problem and easily explained to a
> user. An LRU for zswap pages could always be implemented later with
> bonus points if it uses refault rates to judge when the pool can be
> shrunk more agressively to free physical RAM.

Hopefully addressed by writeback patch.

> 
> -- 
> Mel Gorman
> SUSE Labs
>