[Dovecot] Design: Asynchronous I/O for single/multi-dbox
The long term plan is to get all of Dovecot disk I/O asynchronous. The first step in that direction would be to make dbox/mdbox I/O asynchronous. This might also allow mbox/maildir to become partially asynchronous.
http://www.dovecot.org/list/dovecot/2009-December/045481.html already started describing how the lib-storage API could be changed to support high-latency storages. Adding support for all of the parallelism would be nice, but probably not necessary as the first step. Then again, the API changes are necessary, so the parallelism optimizations probably aren't a huge task on top of that.
Besides the API changes described in the above link, another change that's necessary is to add mail_get_nonblocking_stream(). The returned stream can return EAGAIN whenever it would need to block. Whenever sending message data to client, this stream would have to be used. It would also be used internally by all of message parsing/searching code. Luckily all of that already supports nonblocking input or can be easily modified to support it.
Below are some thoughts how to go forward with this. I originally thought about writing a more specific plan, but I think this is good enough now to start coding. The target Dovecot version is v2.1, not v2.0.
Filesystem API
The idea is to first abstract out all POSIX filesystem accessing in dbox/mdbox code. Non-blocking I/O works pretty nicely for socket I/O, so I thought I'd use a similar API for disk I/O as well (a rough C sketch of what this could look like follows the list):
handle = open(path, mode)
- this function can't fail. it's executed asynchronously.
- mode=read-only: no writing to file
- mode=append: appending to file is allowed
handle = create(path, mode, permissions)
- this function can't fail. it's executed asynchronously.
- mode=fail-if-exists: commit() fails if file already exists
- mode=replace-if-exists: commit() replaces file if it already exists
- permissions: i'm not entirely sure how this works yet. but it should contain mode and gid in some format
set_input_callback(handle, callback, context)
set_output_callback(handle, callback, context)
- call the callback when more data can be read/written
ret = pread(handle, buf, size, offset)
- just like pread(), but can fail with EAGAIN if there are no bytes already buffered. so the idea is that the backend implementation would buffer/readahead data, which would be returned by this call. this would require memcpy()ing all data, but it might get too complex/fragile if it was asynchronously written to given buffer.
ret = write(handle, buf, size)
- append data to given file and return how many bytes were actually added to the write buffer. works in a similar way to writing to a socket. data can only be appended to files, there is no support for overwriting data.
- no writes will be visible to reads until commit() is called
ret = commit(handle, [filename])
- commit all previous writes to disk. either returns success/EAGAIN.
- if filename is given and a new file is being created, the filename is changed to the given one instead of using the original path's filename. this is needed because e.g. mdbox saving can write many temp files in a single transaction and only at commit stage it locks the index files and knows what the filenames will be.
rollback(handle)
- rollback all previous writes.
close(handle)
- if file was created and not committed, the temp file will be deleted
- does implicit rollback
ret = try_lock(handle)
- this isn't an asynchronous operation! it assumes that locking state is kept in memory, so that the operation will be fast. if backend doesn't support locking or it's slow, single-dbox should be used (instead of multi-dbox), because it doesn't need locking.
- returns success or "already locked"
- only exclusive locking is possible
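To make this more concrete, here is a minimal C sketch of what such an API could look like as a table of operations. All names (fs_api, fs_handle and so on) are made up for illustration; they're not existing Dovecot symbols:

    #include <stddef.h>
    #include <sys/types.h>

    struct fs_handle;
    struct fs_permissions; /* mode + gid in some format, as described above */
    typedef void fs_callback_t(void *context);

    enum fs_open_mode { FS_OPEN_READONLY, FS_OPEN_APPEND };
    enum fs_create_mode { FS_CREATE_FAIL_IF_EXISTS, FS_CREATE_REPLACE_IF_EXISTS };

    struct fs_api {
        /* can't fail; the real open/create happens asynchronously */
        struct fs_handle *(*open)(const char *path, enum fs_open_mode mode);
        struct fs_handle *(*create)(const char *path, enum fs_create_mode mode,
                                    const struct fs_permissions *perm);

        /* callback is called when more data can be read/written */
        void (*set_input_callback)(struct fs_handle *h,
                                   fs_callback_t *callback, void *context);
        void (*set_output_callback)(struct fs_handle *h,
                                    fs_callback_t *callback, void *context);

        /* returns bytes copied from the readahead buffer, or -1 with
           errno=EAGAIN if nothing is buffered at that offset yet */
        ssize_t (*pread)(struct fs_handle *h, void *buf,
                         size_t size, off_t offset);
        /* append-only; returns how many bytes fit into the write buffer */
        ssize_t (*write)(struct fs_handle *h, const void *buf, size_t size);

        /* 1 = success, 0 = EAGAIN; writes become visible only here */
        int (*commit)(struct fs_handle *h, const char *filename);
        void (*rollback)(struct fs_handle *h);
        void (*close)(struct fs_handle *h);

        /* synchronous by design; 1 = got lock, 0 = already locked */
        int (*try_lock)(struct fs_handle *h);
    };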
Async IO streams
Async input streams are created with FS API handle, so it's possible to start reading from them before the file has even been open()ed. The callers must be aware of this and realize that read() can fail with ENOENT, etc.
Async input streams' read() would work exactly as file_istream works for non-blocking sockets: It would return data that is already buffered in memory. If there's nothing, it returns EAGAIN. The FS API's set_input_callback() can be used to set a callback function that is called whenever there's more data available in the buffer.
Async output streams also work the same as non-blocking file_ostreams: write() returns the number of bytes added to the buffer. When the buffer becomes full, it starts returning EAGAIN. The ostream handles flushing internally the same way as file_ostreams do, although instead of using io_add(IO_WRITE) it uses the FS API's set_output_callback(). If callers need to know when more data can be written or when all of the data has been written, they can override the ostream's flush_callback, just like with file_ostreams.
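For illustration, the caller side of such an ostream could look roughly like the sketch below; async_ostream_write() and struct mail_sender are hypothetical names, not existing Dovecot API:

    #include <errno.h>
    #include <sys/types.h>

    struct async_ostream; /* hypothetical */
    ssize_t async_ostream_write(struct async_ostream *out,
                                const void *data, size_t size);

    struct mail_sender {
        struct async_ostream *output;
        const unsigned char *data;
        size_t size, pos;
    };

    /* Called once directly and then again from the ostream's flush callback
       whenever more data can be written. Returns 1 = done, 0 = would block,
       -1 = error. */
    static int send_more(struct mail_sender *s)
    {
        while (s->pos < s->size) {
            ssize_t ret = async_ostream_write(s->output, s->data + s->pos,
                                              s->size - s->pos);
            if (ret < 0) {
                if (errno == EAGAIN)
                    return 0; /* buffer full; wait for the flush callback */
                return -1;
            }
            s->pos += ret;
        }
        return 1;
    }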
Async IO for FS API backend
So now that all of the APIs have been designed, all that needs to be done is to write a simple FS API implementation using the kernel's async IO API, right? Wrong. There is no usable async IO API in Linux, and probably nowhere else either:
- POSIX AIO isn't supported by the Linux kernel. And even if it were, it only supports async reads/writes, not async open().
- The Linux kernel has its own native AIO implementation! ..But it only works with direct IO, which makes it pretty much useless for almost everyone. There have been many different attempts to get buffered AIO support into Linux, but all of them have failed.
So for now the only practical way is to implement it with threads. There are several libraries that could make this easier.. But all of them enable (and require) full thread safety for libc calls. I don't really like that. Dovecot isn't using threads, I shouldn't pay the penalty of using extra locking when it's really not necessary. So I was thinking about doing the async IO in two ways:
1) For Linux/x86/x86-64 (and maybe others) implement a version that creates threads with clone() and uses lockless queues for communicating between the async io worker threads. The threads won't use malloc() or any other unsafe calls, so this should be pretty nice.
2) Fallback version that uses pthreads with mutexes.
Is 2) going to be noticeably slower than 1)? Probably not.. In both cases there is also the problem of how many worker threads to create. I've really no idea, the kernel would be so much better at deciding that.. I guess it also depends on how many processes Dovecot is configured to run (and how busy they are at the moment). Maybe it could be just a globally configurable number of threads/process, defaulting to something like 10.
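To make option 2) concrete, a rough sketch of the idea: a fixed pool of pthreads pulls read requests off a mutex-protected queue and reports completions back to the single-threaded main loop over a pipe, so the main loop itself never blocks on disk. Names and structure are illustrative only, not Dovecot code:

    #include <pthread.h>
    #include <unistd.h>
    #include <errno.h>
    #include <sys/types.h>

    struct aio_request {
        struct aio_request *next;
        int fd;
        void *buf;
        size_t size;
        off_t offset;
        ssize_t result;
        int error;
    };

    /* completion_pipe is created with pipe() and the workers with
       pthread_create() at startup (omitted here). The main event loop
       watches completion_pipe[0] like any other fd. */
    static struct aio_request *queue_head;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;
    static int completion_pipe[2];

    static void *aio_worker(void *arg)
    {
        struct aio_request *req;

        (void)arg;
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            while (queue_head == NULL)
                pthread_cond_wait(&queue_cond, &queue_lock);
            req = queue_head;
            queue_head = req->next;
            pthread_mutex_unlock(&queue_lock);

            /* the blocking syscall happens here, not in the main loop */
            req->result = pread(req->fd, req->buf, req->size, req->offset);
            req->error = req->result < 0 ? errno : 0;

            /* tell the main loop this request is finished */
            if (write(completion_pipe[1], &req, sizeof(req)) < 0)
                ; /* a real implementation would handle this */
        }
        return NULL;
    }

    /* called from the main loop; never blocks on disk */
    static void aio_queue(struct aio_request *req)
    {
        pthread_mutex_lock(&queue_lock);
        req->next = queue_head; /* simple LIFO, fine for a sketch */
        queue_head = req;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }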
dbox changes
Most of the dbox code shouldn't be too difficult to change to using the FS API. The one exception is writing messages. Currently the format looks like:
The problem is that the message size isn't always known before the message is written. So currently the code just pwrite()s the size afterwards, but this is no longer possible with the new FS API. One possibility would be to buffer the message in memory, but that could waste a lot of memory since messages can be large.
So the only good solution is to change the dbox file format to contain the message size after the message. At the same time the dbox format could be cleaned up from old broken ideas as well. The reading code could support the old version format as well, because reading isn't a problem.
The new dbox format could look like:
The unique separator exists there primarily for fixing corrupted dbox storage, so it can (quite) reliably recognize where a message ends.
Multi-dbox's indexes already contain the message offset + size of (message+metadata), i.e. offset+size = offset of next message. The size is primarily used to figure out if more data can be appended to the file (so it won't grow too large). This size could just be changed to become the message's size, and the check just changed to assume the metadata size would be e.g. 512 bytes. It won't really matter in practice. So then the reading code could get the message's size from the index.
Single-dbox doesn't have such an index. There are two ways to solve the problem there (both would actually work for multi-dbox too; a sketch of the first approach follows the list):
1) The metadata block can be found by reading backwards from the end of the file. For example read the last 512 bytes of the file and find the metadata magic block. Get the message's size from there.
2) Create an istream that simply reads until it reaches the dbox metadata magic and unique separator. And to be absolutely sure it reached the correct end of message, it can also check that the rest of the data in the stream contains only a valid metadata block.
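A small sketch of the first approach; the magic string and block layout here are placeholders, not the real dbox format:

    #include <string.h>
    #include <sys/types.h>

    #define TAIL_SIZE 512
    /* placeholder value; the real dbox metadata magic differs */
    static const char metadata_magic[] = "\nMAGIC\n";

    /* Returns the offset where the metadata block starts, or -1 if not
       found. The tail buffer (last TAIL_SIZE bytes) and file_size would
       come from the (async) FS API. */
    static off_t find_metadata_offset(const unsigned char *tail,
                                      size_t tail_len, off_t file_size)
    {
        /* scan backwards so we find the magic closest to end of file */
        for (size_t i = tail_len; i >= sizeof(metadata_magic) - 1; i--) {
            size_t pos = i - (sizeof(metadata_magic) - 1);
            if (memcmp(tail + pos, metadata_magic,
                       sizeof(metadata_magic) - 1) == 0)
                return file_size - (off_t)tail_len + (off_t)pos;
        }
        return -1;
    }
    /* message size = metadata offset - message start offset */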
On 3/14/10 1:59 AM +0200 Timo Sirainen wrote:
The long term plan is to get all of Dovecot disk I/O asynchronous. The first step in that direction would be to make dbox/mdbox I/O asynchronous. This might also allow mbox/maildir to become partially asynchronous.
Does it really need to be? Sure, for example you might be able to return to a client that (say) a move didn't complete but is that really useful? A single blocked I/O is probably indicative of a systemic problem and the next operation is going to fail as well. A mail server is part of a system and in this case, wouldn't it be easier to let the client handle timeouts?
Just playing devil's advocate since you haven't presented the advantage of async I/O. I don't want to guess at the reasoning, but e.g. I hope you're not planning to return success results before I/O actually completes.
So now that all of the APIs have been designed, all that needs to be done is to write a simple FS API implementation using the kernel's async IO API, right? Wrong. There is no usable async IO API in Linux, and probably nowhere else either: ... So for now the only practical way is to implement it with threads.
Threads are simpler anyway.
There are several libraries that could make this easier.. But all of them enable (and require) full thread safety for libc calls. I don't really like that. Dovecot isn't using threads, I shouldn't pay the penalty of using extra locking when it's really not necessary.
You're pre-optimizing. The overhead of an uncontested lock is essentially zero.
So I was thinking about doing the async IO in two ways: 2) Fallback version that uses pthreads with mutexes.
I would do this version only.
-frank
On 14.3.2010, at 4.40, Frank Cusack wrote:
Just playing devil's advocate since you haven't presented the advantage of async I/O. I don't want to guess at the reasoning, but e.g. I hope you're not planning to return success results before I/O actually completes.
The idea was that a single process could handle tons of connections. Maybe in the end the number of IMAP processes you'd have would be about 1-2 x the number of CPU cores.
And not just that. Also parallelism. Dovecot could issue a lot of I/O requests in parallel and the OS can reorder those so that it gives the best performance by reading them from disk in the right order. And the higher the latency to the disks, the greater the benefit from parallelism (NFS, NoSQL).
On 3/14/10 4:48 AM +0200 Timo Sirainen wrote:
On 14.3.2010, at 4.40, Frank Cusack wrote:
Just playing devil's advocate since you haven't presented the advantage of async I/O. I don't want to guess at the reasoning, but e.g. I hope you're not planning to return success results before I/O actually completes.
The idea was that a single process could handle tons of connections.
And what's the advantage of that? process-per-client already scales really well, at the expense of more memory. In both Linux and Solaris, the cost of a process context switch is nearly the same as a thread context switch. I don't know about *BSD but I imagine it's similar.
There is a distinct disadvantage in complexity.
Maybe in the end the number of IMAP processes you'd have would be about 1-2 x the number of CPU cores.
I'd think many more than that (regardless of client model - multithread, AIO or PPC), since most time is spent with the user idle and the IMAP connection doing nothing. I'd guess (a) you should be able to support at least 100 simultaneous clients per CPU, if not 500, and (b) that simul client support is I/O limited, not CPU limited.
I haven't ever run dovecot with a large user base so I don't have any empirical data to back up that guess. Folks here on the list must though.
And not just that. Also parallelism. Dovecot could issue a lot of I/O requests in parallel and the OS can reorder those so that it gives the best performance by reading them from disk in the right order. And the higher the latency to the disks, the greater the benefit from parallelism (NFS, NoSQL).
The kernel already does that regardless of server implementation.
-frank
On 14.3.2010, at 5.27, Frank Cusack wrote:
On 3/14/10 4:48 AM +0200 Timo Sirainen wrote:
On 14.3.2010, at 4.40, Frank Cusack wrote:
Just playing devil's advocate since you haven't presented the advantage of async I/O. I don't want to guess at the reasoning, but e.g. I hope you're not planning to return success results before I/O actually completes.
The idea was that a single process could handle tons of connections.
And what's the advantage of that? process-per-client already scales really well, at the expense of more memory. In both Linux and Solaris, the cost of a process context switch is nearly the same as a thread context switch. I don't know about *BSD but I imagine it's similar.
Well, you just mentioned the benefits :) Less memory usage, less context switches (of any kind). (You aren't assuming I'd do something like one thread per connection, right? That's not going to happen.)
Maybe in the end the number of IMAP processes you'd have would be about 1-2 x the number of CPU cores.
I'd think many more than that (regardless of client model - multithread, AIO or PPC), since most time is spent with the user idle and the IMAP connection doing nothing. I'd guess (a) you should be able to support at least 100 simultaneous clients per CPU, if not 500, and (b) that simul client support is I/O limited, not CPU limited.
That's kind of the point. You could have just a few IMAP processes where each can handle hundreds of connections. Currently that's not a very good idea, because a single connection that waits on disk I/O blocks all the other connections in the same process from doing anything.
And not just that. Also parallelism. Dovecot could issue a lot of I/O requests in parallel and the OS can reorder those so that it gives the best performance by reading them from disk in the right order. And the higher the latency to the disks, the greater the benefit from parallelism (NFS, NoSQL).
The kernel already does that regardless of server implementation.
Kernel can only parallelize requests from different processes. But AIO allows for example telling kernel that:
- open files 1..10
- read contents of files 1..10
rather than:
- open file 1
- read contents of file 1
- open file 2
- read contents of file 2
- ..etc..
All of this when processing a single client connection's command(s). And while waiting for I/O replies Dovecot can do other work. So user sees results earlier.
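Roughly, using the hypothetical FS API sketched in the first mail, the point is that all ten requests can be in flight at the same time while the process keeps serving other connections:

    struct read_request {
        struct fs_handle *handle;
        /* per-file state would go here */
    };

    void file_readable(void *context); /* continues that file's read */

    /* Illustration only; fs->open()/set_input_callback() are the
       hypothetical operations from the earlier FS API sketch. */
    static void start_reading_all(struct fs_api *fs,
                                  const char *const paths[10],
                                  struct read_request reqs[10])
    {
        unsigned int i;

        /* issue everything up front; none of these calls block */
        for (i = 0; i < 10; i++) {
            reqs[i].handle = fs->open(paths[i], FS_OPEN_READONLY);
            fs->set_input_callback(reqs[i].handle, file_readable, &reqs[i]);
        }
        /* file_readable() then fires as each file's data arrives, in
           whatever order the storage finds fastest. */
    }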
On 14.3.2010, at 13.46, Timo Sirainen wrote:
Well, you just mentioned the benefits :) Less memory usage, less context switches (of any kind). (You aren't assuming I'd do something like one thread per connection, right? That's not going to happen.) .. That's kind of the point. You could have just a few IMAP processes where each can handle hundreds of connections. Currently that's not a very good idea, because a single connection that waits on disk I/O blocks all the other connections in the same process from doing anything.
And for clarification: I don't think these are going to give any huge benefits to most people for now. They should help, but I don't expect any dramatic (order of magnitude) performance improvements. But it's the first step towards supporting higher-latency (NoSQL) databases, and they are going to be great for people who need a lot of cheap HA storage (you know, the people who usually pay me for Dovecot development). And actually I think they're going to be great for other people too, because those databases can support easy and fast replication. I could see myself using local Dovecot servers for my mails in different machines, while Dovecot automatically replicates changes to the master server.
Timo Sirainen put forth on 3/14/2010 6:46 AM:
- open file 1
- read contents of file 1
- open file 2
- read contents of file 2
- ..etc..
All of this when processing a single client connection's command(s). And while waiting for I/O replies Dovecot can do other work. So user sees results earlier.
Concentrate on rewriting imapd into a threaded model, and get it right. Once you've done so I think you will have eliminated the perceived need for AIO.
Regarding memory usage, on Linux anyway, you should really read this Timo. It would be mutually beneficial to Dovecot (and Postfix, and etc). I've not played with it yet, and I don't have a busy server to test it on, but I'm betting it would seriously benefit those running servers with hundreds of concurrent users. I'm going to spin this into my next kernel and play with it.
Docs on KSM:
How to use the Kernel Samepage Merging feature
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
The KSM daemon ksmd periodically scans those areas of user memory which have been registered with it, looking for pages of identical content which can be replaced by a single write-protected page (which is automatically copied if a process later wants to update its content).
KSM was originally developed for use with KVM (where it was known as Kernel Shared Memory), to fit more virtual machines into physical memory, by sharing the data common between them. But it can be useful to any application which generates many instances of the same data.
KSM only merges anonymous (private) pages, never pagecache (file) pages. KSM's merged pages are at present locked into kernel memory for as long as they are shared: so cannot be swapped out like the user pages they replace (but swapping KSM pages should follow soon in a later release).
KSM only operates on those areas of address space which an application has advised to be likely candidates for merging, by using the madvise(2) system call: int madvise(addr, length, MADV_MERGEABLE).
The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel that advice and restore unshared pages: whereupon KSM unmerges whatever it merged in that range. Note: this unmerging call may suddenly require more memory than is available - possibly failing with EAGAIN, but more probably arousing the Out-Of-Memory killer.
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was built with CONFIG_KSM=y, those calls will normally succeed: even if the KSM daemon is not currently running, MADV_MERGEABLE still registers the range for whenever the KSM daemon is started; even if the range cannot contain any pages which KSM could actually merge; even if MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
Like other madvise calls, they are intended for use on mapped areas of the user address space: they will report ENOMEM if the specified range includes unmapped gaps (though working on the intervening mapped areas), and might fail with EAGAIN if not enough memory for internal structures.
Applications should be considerate in their use of MADV_MERGEABLE, restricting its use to areas likely to benefit. KSM's scans may use a lot of processing power, and its kernel-resident pages are a limited resource. Some installations will disable KSM for these reasons.
The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, readable by all but writable only by root:
max_kernel_pages - set to maximum number of kernel pages that KSM may use
                   e.g. "echo 100000 > /sys/kernel/mm/ksm/max_kernel_pages"
                   Value 0 imposes no limit on the kernel pages KSM may use;
                   but note that any process using MADV_MERGEABLE can cause
                   KSM to allocate these pages, unswappable until it exits.
                   Default: quarter of memory (chosen to not pin too much)

pages_to_scan    - how many present pages to scan before ksmd goes to sleep
                   e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan"
                   Default: 100 (chosen for demonstration purposes)

sleep_millisecs  - how many milliseconds ksmd should sleep before next scan
                   e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
                   Default: 20 (chosen for demonstration purposes)

run              - set 0 to stop ksmd from running but keep merged pages,
                   set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
                   set 2 to stop ksmd and unmerge all pages currently merged,
                   but leave mergeable areas registered for next run
                   Default: 0 (must be changed to 1 to activate KSM,
                   except if CONFIG_SYSFS is disabled)
The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
pages_shared   - how many shared unswappable kernel pages KSM is using
pages_sharing  - how many more sites are sharing them i.e. how much saved
pages_unshared - how many pages unique but repeatedly checked for merging
pages_volatile - how many pages changing too fast to be placed in a tree
full_scans     - how many times all mergeable areas have been scanned
A high ratio of pages_sharing to pages_shared indicates good sharing, but a high ratio of pages_unshared to pages_sharing indicates wasted effort. pages_volatile embraces several different kinds of activity, but a high proportion there would also indicate poor use of madvise MADV_MERGEABLE.
Izik Eidus, Hugh Dickins, 24 Sept 2009
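(Not part of the pasted document: as a concrete illustration of the registration step it describes, the madvise() call on an anonymous mapping looks roughly like this, assuming a 2.6.32+ kernel built with CONFIG_KSM=y.)

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stddef.h>

    static void *alloc_mergeable(size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        /* register the range as a merge candidate for ksmd; fails with
           EINVAL if the kernel wasn't built with CONFIG_KSM=y */
        if (madvise(buf, len, MADV_MERGEABLE) < 0)
            perror("madvise(MADV_MERGEABLE)");
        return buf;
    }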
-- Stan
Stan Hoeppner, 2010-03-16 08:45:
Regarding memory usage, on Linux anyway, you should really read this Timo. It would be mutually beneficial to Dovecot (and Postfix, and etc). I've not played with it yet, and I don't have a busy server to test it on, but I'm betting it would seriously benefit those running servers with hundreds of concurrent users. I'm going to spin this into my next kernel and play with it.
AFAIK, KSM is primarily useful for vservers (xen, kvm etc.) running the same software. How would running a single instance of dovecot (or any other daemon) profit by that? AFAICS, "generates many instances of the same data" does not apply to dovecot (or most other daemons).
Jakob Hirsch put forth on 3/16/2010 3:55 AM:
Stan Hoeppner, 2010-03-16 08:45:
Regarding memory usage, on Linux anyway, you should really read this Timo. It would be mutually beneficial to Dovecot (and Postfix, and etc). I've not played with it yet, and I don't have a busy server to test it on, but I'm betting it would seriously benefit those running servers with hundreds of concurrent users. I'm going to spin this into my next kernel and play with it.
AFAIK, KSM is primarily useful for vservers (xen, kvm etc.) running the same software. How would running a single instance of dovecot (or any other daemon) profit by that? AFAICS, "generates many instances of the same data" does not apply to dovecot (or most other daemons).
You didn't read the paste Jakob. Or maybe you don't comprehend the material? Also, it seems you're unaware that Dovecot creates at minimum 1 imap process per client connection. In the case of Thunderbird, it creates 5 imap processes. Each one is over 2MB in size, and all of those imap processes share close to 2MB of memory pages that are identical, because the code is identical. If they're all identical, why have 500 copies in memory instead of one?
KSM was designed during the KVM effort, but it's applicable to any program duplication in memory, and Dovecot is a perfect example of an application which would benefit from KSM due to the large number of identical processes. On a "loaded" Dovecot server you will have hundreds of imap, and possibly dovecot-auth, processes--depends on whether login_process_per_connection = yes. If yes, you'll have one dovecot-auth per client. It appears from my vantage point as a non programmer that each of these processes has about 2MB of duplicate memory. So on a server with 500 concurrent users, assuming 1 imap connection for each (although most MUAs open more than one connection per session--TB opens 5), KSM can potentially save you ~2GB-6GB of memory by de-duplicating the identical page contents of these Dovecot processes.
In the case of TB, which I have quite a bit of experience with, 1 imap process ends up servicing all the requests, while the other 4 sit idle, just eating memory senselessly. KSM would eliminate most of that memory waste. Personally I've changed my TB to use a single connection instead of 5, and there is zero difference in performance. Too bad the team sets the default at 5 instead of 1.
Fast forward to a threaded Dovecot design, and you still benefit from KSM, because all the imap worker threads are duplicate code, probably the same for the auth workers. The benefit won't be as great as with the 1 imap process per connection design of today as some of the code will only be duplicated 2 x nCPUs times, the master processes, whereas right now all the code is duplicated across all imap worker processes.
In summary, if KSM is configured correctly, and Dovecot coded to work with KSM, the memory savings could be huge on large servers.
-- Stan
On Tue, 16 Mar 2010, Stan Hoeppner wrote:
You didn't read the paste Jakob. Or maybe you don't comprehend the material? Also, it seems you're unaware that Dovecot creates at minimum 1 imap process per client connection. In the case of Thunderbird, it creates 5 imap processes. Each one is over 2MB in size, and all of those imap processes share close to 2MB of memory pages that are identical, because the code is identical. If they're all identical, why have 500 copies in memory instead of one?
The "paste" talks about data. You talk about code.
If I display a process map of an active imap process, I have:
mapped: 2656K    writeable/private: 460K    shared: 300K
Essentially you are talking about 460KB of KSM-able data.
IMHO, all modern Unix-alike system just "map" code pages read-only into the process and share them among all processes. No KSM required at all.
Regards,
Steffen Kaiser
On Tue, 2010-03-16 at 11:16 +0100, Steffen Kaiser wrote:
If I display a process map of an active imap process, I have:
mapped: 2656K    writeable/private: 460K    shared: 300K
Essentially you are talking about 460KB of KSM-able data.
IMHO, all modern Unix-alike system just "map" code pages read-only into the process and share them among all processes. No KSM required at all.
Right. Code and mmaped pages (e.g. some of Dovecot index files) are shared already. There's very little (if any) data that KSM could de-duplicate.
Timo Sirainen put forth on 3/16/2010 6:36 AM:
On Tue, 2010-03-16 at 11:16 +0100, Steffen Kaiser wrote:
If I display a process map of an active imap process, I have:
mapped: 2656K    writeable/private: 460K    shared: 300K
Essentially you are talking about 460KB of KSM-able data.
IMHO, all modern Unix-alike system just "map" code pages read-only into the process and share them among all processes. No KSM required at all.
Right. Code and mmaped pages (e.g. some of Dovecot index files) are shared already. There's very little (if any) data that KSM could de-duplicate.
Ok, I've boned up on virtual shared memory and mmap and I now have the knowledge I previously lacked. I also have a far better understanding of RSS than I previously did, mainly that shared pages are included in RSS, even though there is only one copy of a shared page. Thus the sum of all RSS values can be much greater than the physical RAM in the system, which is the opposite of my previous (mis)understanding of RSS. And I also now understand why KSM would have a very limited use case on a non KVM Linux kernel, since virtual shared memory serves the same purpose, and does so much more efficiently.
Feel free to slap me around with some trout for previously speaking to things above my knowledge level at the time. ;)
-- Stan
Steffen Kaiser put forth on 3/16/2010 5:16 AM:
The "paste" talks about data. You talk about code.
The author's use of the word "data" in that sentence was obviously not in the programmatic context but in the bit bucket context. In many contexts, the word "data" can properly describe the entire contents of RAM, ROM, disk, SSD, thumb drive, tape, etc storage without regard for distinction of binary code or application data. Often "data" merely means contents. Look up the word "data" in the dictionary, and only one of the many meanings might state "the contents of a computer program's variables". You're looking at the word "data" too narrowly, in the wrong context.
If I display a process map of an active imap process, I have:
mapped: 2656K    writeable/private: 460K    shared: 300K
Essentially you are talking about 460KB of KSM-able data.
Looking at top and pmap data, at bare minimum, 900KB of code is fully duplicated in each imap process and candidate for KSM de-duplication. I'm guessing there may be more code that's duplicated as well, some of the .so's maybe?
IMHO, all modern Unix-alike system just "map" code pages read-only into the process and share them among all processes. No KSM required at all.
If this were the case with Linux then the KSM project would have never been initiated? You attribute far too much intelligence to the stock *nix kernels WRT memory management, specifically de-duplication. AFAIK none of them in a stock configuration do what your humble opinion suggests.
-- Stan
On 16/03/2010 12:31, Stan Hoeppner wrote:
IMHO, all modern Unix-alike system just "map" code pages read-only into the process and share them among all processes. No KSM required at all.
If this were the case with Linux then the KSM project would have never been initiated? You attribute far too much intelligence to the stock *nix kernels WRT memory management, specifically de-duplication. AFAIK none of them in a stock configuration do what your humble opinion suggests.
I don't think you are correct here?
If KVM guests were commonly implemented using a shared COW filesystem, ie with all binaries linked to the same inodes on all guests then there would be little need for the KSM because normal memory management would de-dup as expected...
The reality is that this is not (normally) how people build guests... Hence why the KSM is desirable and has been implemented.
On virtualisation solutions such as linux-vservers (and perhaps openvz, etc) you have a tool called "vhashify" which literally hardlinks all the shared files together across all guest filesystems and adds a COW hardlink-breaking feature to the kernel. This achieves exactly what we expect: the hardlinked libraries are loaded only once and effectively de-duped. This ^^ exists today and is in production use, and provides the incentive for KSM, which is slightly cleverer and can de-dup without the hint of the hardlinks (and perhaps de-dup a bit more extensively under some circumstances).
In summary I believe that unix-alike systems do largely behave as described in the previous post? This doesn't make KSM useless, but it's a tool which needs to be used carefully and frugally
Regards
Ed W
Stan Hoeppner, 2010-03-16 10:50:
AFAIK, KSM is primarily useful for vservers (xen, kvm etc.) running the same software. How would running a single instance of dovecot (or any other daemon) profit by that? AFAICS, "generates many instances of the same data" does not apply to dovecot (or most other daemons).
You didn't read the paste Jakob. Or maybe you don't comprehend the material? Also, it seems you're unaware that Dovecot creates at minimum 1
Or maybe your understanding of modern OS's memory handling is just naive. You should really learn the basics before you jump about The New Thing.
code is identical. If they're all identical, why have 500 copies in memory instead of one?
You won't, the code pages are just mapped, just like the shared libs that the programm uses. Btw, you can look up the map with pmap. E.g.:
# pmap -d 13465
Address   Kbytes Mode  Offset           Device    Mapping
0061c000     120 r-x-- 0000000000000000 008:00002 ld-2.11.1.so
...
0063e000    1468 r-x-- 0000000000000000 008:00002 libc-2.11.1.so
...
007b0000      12 rw--- 0000000000000000 000:00000   [ anon ]
007e1000      12 r-x-- 0000000000000000 008:00002 libdl-2.11.1.so
...
007e8000      88 r-x-- 0000000000000000 008:00002 libpthread-2.11.1.so
...
00800000       8 rw--- 0000000000000000 000:00000   [ anon ]
00804000      28 r-x-- 0000000000000000 008:00002 librt-2.11.1.so
...
00dd5000       4 r-x-- 0000000000000000 000:00000   [ anon ]
08047000     744 r-x-- 0000000000000000 008:00002 imap
08101000       8 rw--- 00000000000b9000 008:00002 imap
08103000       4 rw--- 0000000000000000 000:00000   [ anon ]
09d27000     456 rw--- 0000000000000000 000:00000   [ anon ]
b780b000     268 r--s- 0000000000000000 008:00005 dovecot.index.cache
b784e000       8 rw--- 0000000000000000 000:00000   [ anon ]
b7859000       4 rw--- 0000000000000000 000:00000   [ anon ]
bfc02000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 3360K    writeable/private: 604K    shared: 268K
Out of the 3360K one of my imap processes has "mapped", only 604K is "writeable/private".
It appears from my vantage point as a non programmer
That's the problem here, I guess.
Fast forward to a threaded Dovecot design, and you still benefit from KSM, because all the imap worker threads are duplicate code, probably the same
The same applies here. Furthermore, also the data pages are shared between processes (or threads, but that's just slightly different) until a process writes to a page (see copy-on-write), so the advantage of KSM would be near zero (or even below).
Jakob Hirsch put forth on 3/16/2010 3:55 AM:
AFAICS, "generates many instances of the same data" does not apply to dovecot (or most other deamons).
You're misreading the meaning of the word "data" in this context. In this context "data" simply means memory page contents, not "program data structures" as in program variable contents (my_name = "Stan"). If you understood the previous paragraph in the doc, this should have been crystal clear.
VMs run as processes on top of the host Linux kernel+KVM. Thus, they are no different from a memory management standpoint than any other process. KSM wasn't specifically designed for memory consolidation of the contents of virtual machines. Again, they are regular Linux processes. Thus, KSM works on non VM processes exactly the same way.
Repeat after me: "A process is a process is a process."
-- Stan
Timo Sirainen put forth on 3/16/2010 6:37 AM:
On Tue, 2010-03-16 at 02:45 -0500, Stan Hoeppner wrote:
Concentrate on rewriting imapd into a threaded model, and get it right.
I could give a lot of reasons for this, but: no.
Ok, what am I missing? Given the current clone/fork imap parallelism architecture, wouldn't spawning imap worker threads from a master imap process be the most straightforward change to accomplish the process count shrink? With the least code changes, and thus be least likely to introduce new bugs?
Maybe I didn't read your previous post correctly. It sure sounded like worker threads were exactly what you were describing. If you're not looking at threads, can you briefly describe the program flow and subroutine layout of the new imap server process(s)? You've got me really curious about this now since you're not looking at threads.
If you don't clone/fork or use threads, how else can you get decent parallel client scalability? Are you looking at nested fast looping serial code, and using AIO to minimize stalling the loops due to blocked I/O?
-- Stan
On Tue, 2010-03-16 at 10:55 -0500, Stan Hoeppner wrote:
Timo Sirainen put forth on 3/16/2010 6:37 AM:
On Tue, 2010-03-16 at 02:45 -0500, Stan Hoeppner wrote:
Concentrate on rewriting imapd into a threaded model, and get it right.
I could give a lot of reasons for this, but: no.
Ok, what am I missing? Given the current clone/fork imap parallelism architecture, wouldn't spawning imap worker threads from a master imap process be the most straightforward change to accomplish the process count shrink? With the least code changes, and thus be least likely to introduce new bugs?
Threads are useful when you need to do CPU intensive work in parallel, and threads need access to shared data structures. Dovecot doesn't do CPU intensive work, so threads do nothing but add extra overhead (scheduling, memory).
Maybe I didn't read your previous post correctly. It sure sounded like worker threads were exactly what you were describing.
I wanted asynchronous disk I/O from kernel. Since kernel doesn't support it yet, I'm forced to use threads until that happens. As soon as kernel supports AIO well, I can get rid of threads.
And anyway there's only going to be 1..n worker AIO threads per process, where n depends on load. So a single process could handle even thousands of IMAP connections, but if the connections are mainly idling it would have only a few AIO worker threads.
If you're not looking at threads, can you briefly describe the program flow and subroutine layout of the new imap server process(s)? You've got me really curious about this now since you're not looking at threads.
It's mostly the same as now, except now some functions return "try again later" if data isn't yet in memory. The first message in this thread tried to explain it and also had a link to previous thoughts about it.
If you don't clone/fork or use threads, how else can you get decent parallel client scalability? Are you looking at nested fast looping serial code, and using AIO to minimize stalling the loops due to blocked I/O?
Yes to looping serial code, but if by nested you mean some kind of recursive looping, no. Recursive looping would just make everything break.
So there's just a single loop that waits for new things to do and then does as much of it as it can, and when it finds that it can't do more without waiting on disk IO, it returns back to the loop to find more stuff to do. Dovecot already does non-blocking network I/O that way, so things already support that well. It's just the disk I/O part that now needs to be made non-blocking. Previously I've thought it would make using the mailbox accessing APIs horribly difficult and ugly, but nowadays I've figured out a plan that doesn't change the API much (described in those two mails).
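A condensed sketch of that pattern, with made-up names; the point is only that a command handler returns "not finished yet" instead of blocking, and gets re-entered later:

    struct command;
    struct client {
        struct command *command; /* currently executing command, if any */
    };

    /* returns 1 when the command finished, 0 when it hit data that isn't
       in memory yet (a callback re-enters us when it is) */
    int command_continue(struct client *client, struct command *cmd);
    struct command *command_dequeue(struct client *client);

    static void client_handle_commands(struct client *client)
    {
        while (client->command != NULL) {
            if (command_continue(client, client->command) == 0) {
                /* waiting on disk I/O: return to the main loop and serve
                   other clients until the I/O completion callback fires */
                return;
            }
            client->command = command_dequeue(client);
        }
    }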
On 3/16/10 6:44 PM +0200 Timo Sirainen wrote:
Threads are useful when you need to do CPU intensive work in parallel, and threads need access to shared data structures.
And when you don't want to block on I/O. Threads are almost always easier than AIO, and especially easy (ie, no scary complexity issues eg deadlock) if you aren't sharing data structures.
It used to be that threads weren't scheduled well, and that if you wanted Linux portability you should avoid them, but in the last 5 years that has changed and usu. it's the best choice now.
You mentioned locking overhead earlier, which is a red herring, but even on Linux if you are running an SMP kernel you have the locking "overhead" even if your process is single threaded.
-frank
On Tue, 2010-03-16 at 13:19 -0400, Frank Cusack wrote:
On 3/16/10 6:44 PM +0200 Timo Sirainen wrote:
Threads are useful when you need to do CPU intensive work in parallel, and threads need access to shared data structures.
And when you don't want to block on I/O. Threads are almost always easier than AIO, and especially easy (ie, no scary complexity issues eg deadlock) if you aren't sharing data structures.
But there are some shared data structures, so I'd need to add locking for them. And threads won't help with the biggest benefit of the whole AIO change: "fast" access to high-latency storage. To access such storage without constant waiting on IO, multiple read requests must be done in parallel during the execution of a single command. So even if I added support for threads, I'd need to process a single command in parallel, which eventually leads to more or less the same kind of AIO processing I'm thinking about now anyway, only with tons of extra locking.
Also Dovecot's file locking rules are simple and short-lived. In v2.0 it's already possible to have a single process serve multiple connections. And Apple's Dovecot does that already with v1.1. I haven't seen/heard of any deadlocking problems with either. And actually my long term plan is to make single-dbox (including index files) completely lockless.
You mentioned locking overhead earlier, which is a red herring, but even on Linux if you are running an SMP kernel you have the locking "overhead" even if your process is single threaded.
Yes, but once threads are enabled there is even more locking overhead from libc. (Yeah, probably small overhead.)
On Tue, Mar 16, 2010 at 01:19:01PM -0400, Frank Cusack wrote:
[...]
And when you don't want to block on I/O. Threads are almost always easier than AIO, and especially easy (ie, no scary complexity issues eg deadlock) if you aren't sharing data structures. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Such a little phrase... ;-)
Yes, if you aren't sharing data structures you may as well use separate processes.
Honestly, I do prefer the explicit complexity of async IO to the implicit complexity of locking. At least in IO-centered server processes. Threads just /seem/ easy.
And as to the latency/performance, watch the small/fast HTTP servers moving to async IO models.
I'm not the one to be taken too seriously on it, but I feel that Timo has the right hunch here.
Regards
-- tomás
On 3/14/10 1:46 PM +0200 Timo Sirainen wrote:
On 14.3.2010, at 5.27, Frank Cusack wrote:
On 3/14/10 4:48 AM +0200 Timo Sirainen wrote:
On 14.3.2010, at 4.40, Frank Cusack wrote:
Just playing devil's advocate since you haven't presented the advantage of async I/O. I don't want to guess at the reasoning, but e.g. I hope you're not planning to return success results before I/O actually completes.
The idea was that a single process could handle tons of connections.
And what's the advantage of that? process-per-client already scales really well, at the expense of more memory. In both Linux and Solaris, the cost of a process context switch is nearly the same as a thread context switch. I don't know about *BSD but I imagine it's similar.
Well, you just mentioned the benefits :) Less memory usage, less context switches (of any kind). (You aren't assuming I'd do something like one thread per connection, right? That's not going to happen.)
But as I stated, dovecot is likely to be I/O limited, not CPU or memory limited. Who cares if there is 10% less memory usage when the limit is still I/O? Feel free to trout-slap me (to borrow from Stan) if you expect dovecot to be memory limited.
That's kind of the point. You could have just a few IMAP processes where each can handle hundreds of connections. Currently that's not a very good idea, because a single connection that waits on disk I/O blocks all the other connections in the same process from doing anything.
Or you could have a whole bunch of IMAP processes, one per connection. All you're doing is multiplexing the I/O at a different part of your system. Each connection doesn't share any state or other data with other "related" connections so it doesn't matter how you multiplex the I/O, since any imagined efficiency from AIO/threads isn't going to be realized due to I/O saturation before CPU saturation.
Kernel can only parallelize requests from different processes.
Right, but again so what? Since each connection is a different process, all I/O is parallelized in dovecot's case. It works well because connections aren't related to each other.
-frank
On Tue, 2010-03-16 at 13:30 -0400, Frank Cusack wrote:
Well, you just mentioned the benefits :) Less memory usage, less context switches (of any kind). (You aren't assuming I'd do something like one thread per connection, right? That's not going to happen.)
But as I stated, dovecot is likely to be I/O limited, not CPU or memory limited. Who cares if there is 10% less memory usage when the limit is still I/O? Feel free to trout-slap me (to borrow from Stan) if you expect dovecot to be memory limited.
It is I/O limited, but less memory usage still can help. Some setups use e.g. NFS storage with multiple Dovecot servers accessing it. They could reduce the number of Dovecot servers if a single server could process more connections.
Also all setups benefit from less memory usage, because that means there's more memory left for disk cache -> less disk I/O.
That's kind of the point. You could have just a few IMAP processes where each can handle hundreds of connections. Currently that's not a very good idea, because a single connection that waits on disk I/O blocks all the other connections in the same process from doing anything.
Or you could have a whole bunch of IMAP processes, one per connection. All you're doing is multiplexing the I/O at a different part of your system. Each connection doesn't share any state or other data with other "related" connections
Actually if multiple connections are accessing the same mailbox in the same process, they can (and already do!) share index data.
Kernel can only parallelize requests from different processes.
Right, but again so what? Since each connection is a different process, all I/O is parallelized in dovecot's case. It works well because connections aren't related to each other.
I think I answered to this in the other mail: access to high-latency storage (which can also reduce command reply latency somewhat for low-latency storage).
handle = open(path, mode)
- this function can't fail. it's executed asynchronously.
Does that mean you can successfully open("/nonexistent", mode); write() to it over and over again and only the commit() fails?
handle = create(path, mode, permissions) [...]
- mode=fail-if-exists: commit() fails if file already exists
- mode=replace-if-exists: commit() replaces file if it already exists
s/commit/create/g
ret = pread(handle, buf, size, offset)
- just like pread(), [...]
Hm, pread() works like pread()? What do I misunderstand?
ret = try_lock(handle)
- this isn't an asynchronous operation! it assumes that locking state is kept in memory, so that the operation will be fast.
So does this only lock against the same process or how is locking state supposed to be in memory?
if backend doesn't support locking or it's slow, single-dbox should be used (instead of multi-dbox), because it doesn't need locking.
- returns success or "already locked"
And if the backend doesn't support locking?
Async input streams' read() would work exactly as file_istream works for non-blocking sockets: It would return data that is already buffered in memory. If there's nothing, it returns EAGAIN. The FS API's set_input_callback() can be used to set a callback function that is called whenever there's more data available in the buffer.
I don't understand what triggers that reading into memory. Is that supposed to happen automatically in the background, or does it have to be initiated by the program?
- POSIX AIO isn't supported by the Linux kernel. And even if it were, it only supports async reads/writes, not async open().
Doesn't help you, but NetBSD 5 does support Posix AIO, I think.
On 15.3.2010, at 21.37, Edgar Fuß wrote:
handle = open(path, mode)
- this function can't fail. it's executed asynchronously.
Does that mean you can successfully open("/nonexistent", mode); write() to it over and over again and only the commit() fails?
Probably. But in mdbox case you lock the file first. Hmm. But I suppose locking can't work until the file is fully opened, so I guess there needs to be a way to set some kind of callback to "open finished" too.
handle = create(path, mode, permissions) [...]
- mode=fail-if-exists: commit() fails if file already exists
- mode=replace-if-exists: commit() replaces file if it already exists
s/commit/create/g
No, commit actually. create() is asynchronous. Only at commit() stage it actually tries to create/replace the file. Writes in general are atomic, they won't be visible until commit() is called.
ret = pread(handle, buf, size, offset)
- just like pread(), [...]
Hm, pread() works like pread()? What do I misunderstand?
It was describing an API, so kind of like io_api->pread() is like pread() syscall.
ret = try_lock(handle)
- this isn't an asynchronous operation! it assumes that locking state is kept in memory, so that the operation will be fast.
So does this only lock against the same process or how is locking state supposed to be in memory?
I mean it's going to use fcntl() or whatever OS locking (as opposed to some slow remote locking with remote storages).
if backend doesn't support locking or it's slow, single-dbox should be used (instead of multi-dbox), because it doesn't need locking.
- returns success or "already locked"
And if the backend doesn't support locking?
It says that above :) If you can't use locking, you can't use multi-dbox.
Async input streams' read() would work exactly as file_istream works for non-blocking sockets: It would return data that is already buffered in memory. If there's nothing, it returns EAGAIN. The FS API's set_input_callback() can be used to set a callback function that is called whenever there's more data available in the buffer.
I don't understand what triggers that reading into memory. Is that supposed to happen automatically in the background, or does it have to be initiated by the program?
The pread() call triggers reading into memory the amount of data that was requested. The first pread() fails with EAGAIN, then it starts reading in the background, and once it finishes, pread() returns the amount of read data. Possibly it could do readahead in the background too, but that's an optimization and more relevant to high-latency remote storage than local disk storage.
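So from the caller's side the sequence would look roughly like this, reusing the hypothetical names from the earlier FS API sketch:

    #include <errno.h>
    #include <sys/types.h>

    void data_arrived(void *context); /* re-runs the same pread() */

    static void read_start(struct fs_api *fs, struct fs_handle *h,
                           void *context)
    {
        unsigned char buf[4096];
        ssize_t ret;

        /* 1st call: nothing buffered yet, so this schedules a background
           read of exactly these bytes and fails immediately */
        ret = fs->pread(h, buf, sizeof(buf), 0);
        if (ret < 0 && errno == EAGAIN) {
            /* once the data is in the buffer, data_arrived() gets called
               and its pread() returns the number of bytes read */
            fs->set_input_callback(h, data_arrived, context);
        }
    }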
- POSIX AIO isn't supported by the Linux kernel. And even if it were, it only supports async reads/writes, not async open().
Doesn't help you, but NetBSD 5 does support Posix AIO, I think.
Yeah, looks like BSDs have it after all. I tried a few Google searches first, but I didn't then find anything specific about it.
No, commit actually. create() is asynchronous.
Silly me.
I mean it's going to use fcntl() or whatever OS locking (as opposed to some slow remote locking with remote storages).
But fcntl() will use NLM if the file is on NFS.
I tried a few Google searches first, but I didn't then find anything specific about it.
http://netbsd.gw.com/cgi-bin/man-cgi?aio_read++NetBSD-5.0
participants (8):
- Ed W
- Edgar Fuß
- Frank Cusack
- Jakob Hirsch
- Stan Hoeppner
- Steffen Kaiser
- Timo Sirainen
- tomas@tuxteam.de