CANCELLED: In-Memory Database Performance Improvements
Abstract:
This paper discusses problems and solutions in designing a high-throughput in-memory database.
Background:
Examples of in-memory databases are Memcached, Redis, Terracotta DB, etc. The workload mainly consists of non-mutation operations, such as reading a key-value pair or searching for an entry, and mutation operations, such as insert, update, and delete. Mutation operations involve writing data to an append-only or commit log before responding to the user. Non-mutation operations involve no disk I/O, so their TPS is relatively higher. All operations involve four components:
- receiving/sending messages on a socket,
- decoding/encoding messages,
- looking up and storing/retrieving the contents in the memory cache, and
- appending the data to the log (mutation operations only).
To design a high-throughput in-memory database, all four of these components must be implemented efficiently. The problems and solutions involved are discussed in the next few sections.
Problem statement:
Designing an in-memory database on Linux is constrained by the limitations of the POSIX API when targeting high throughput, or TPS (transactions per second). The following are some of the bottlenecks in the Linux OS and supporting components that limit TPS:
- Interfacing with the OS to receive/send network messages costs many CPU cycles: on current Linux systems, in-memory databases use the epoll, read, and write calls to receive/send packets. Many CPU cycles are spent executing these system calls for each message, which lowers TPS; the impact of system calls is described in more detail in [1].
- Interfacing with the OS to sync the append log during a mutation operation is synchronous and costs further CPU cycles.
- A multi-threaded program cannot use a multi-core server efficiently.
- Inter-Process Communication (IPC) and Inter-Thread Communication (ITC) carry significant overhead.
Solutions:
The following are four solutions to the above problems. A prototype was built using the Jiny kernel [2]. The Jiny kernel implements the same POSIX API as Linux, so a Linux app can run on it as-is. Non-POSIX APIs were added to the Jiny kernel to solve some of the above problems.
1) Non-POSIX API to sync the append-only log
Write and fsync calls are one way to implement persistence in a database; another implementation uses mmap and fsync. Persistence is achieved with a mechanism such as a WAL, commit log, or append-only log; other types of persistence are not covered here. Data is flushed to these log files before responding to the user, so the speed at which data reaches the disk decides the latency and TPS of mutation operations.
Performance problems with the POSIX API:
- A synchronous fsync is called for every mutation operation, or at the end of a small batch of operations. A system call is expensive, and fsync is additionally a synchronization point.
- The OS gives no notification of what portion of the data has been flushed.
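For contrast, the conventional POSIX persistence path costs one write and one blocking fsync system call per mutation (or per small batch); a minimal sketch:

/* Conventional POSIX path: each mutation pays for a write() and a
 * blocking fsync() system call before the reply can be sent. */
#include <unistd.h>

int append_and_sync(int log_fd, const void *rec, size_t len)
{
    if (write(log_fd, rec, len) != (ssize_t)len)
        return -1;
    return fsync(log_fd);   /* blocks until the data is durable on disk */
}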
The above problems are solved using the non-POSIX API below. This API is suitable only for WAL, commit, or append-only logs.
Non-POSIX API for the append-only log:
A shared memory region is created for every file to communicate between the user-level app and the OS. The region holds the entries below; each entry is updated either by the OS or by the app, never by both. "User to OS" entries are updated by the app and read by the OS; "OS to user" entries are updated by the OS and read by the app. A sketch of a possible layout of this region follows the list:
- User to OS, last-byte-written: the last byte written to the file by the user; the data can be written using the write call or using mmap. This tells the OS to flush the file up to that offset. Because the file is append-only there are no overwrites, so this value only ever increases.
- User to OS, clear-memory-flag: tells the OS to clear the pages after the data is flushed, which avoids calling fadvise.
- OS to user, last-byte-flushed: the last byte flushed to disk by the OS.
- OS to user, fsync-start-flag: turned on when a request has been submitted to the disk; it indicates that the app need not notify the OS again until last-byte-written equals last-byte-flushed.
- OS to user, sync-interval: set by the OS to indicate when the next buffer will be picked up, determined by the block driver speed and timer resolution. If this interval is long, user threads can issue an explicit fsync and can also turn off polling.
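A minimal sketch of the per-file region in C; the field names follow the entries above, while the struct name, types, and sizes are assumptions:

/* Hypothetical layout of the per-file shared memory region. */
#include <stdint.h>

struct appendlog_shm {
    /* User to OS: updated by the app, read by the OS */
    volatile uint64_t last_byte_written;  /* highest valid offset in the file */
    volatile uint32_t clear_memory_flag;  /* ask the OS to drop pages after flush */

    /* OS to user: updated by the OS, read by the app */
    volatile uint64_t last_byte_flushed;  /* highest offset durable on disk */
    volatile uint32_t fsync_start_flag;   /* OS is polling; no syscall needed */
    volatile uint64_t sync_interval_ns;   /* when the OS picks up the next buffer */
};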
New asynchronous fsync syscall: the new fsync call triggers the sync of dirty data to disk if it has not already started. It makes the OS turn on the fsync-start-flag, and it is a non-blocking call. Argument to the new fsync:
- fsync_start_flag: indicates whether to start polling mode; if polling mode is already on, it can be used to switch polling off (a sketch of a possible prototype follows).
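A sketch of what this call's prototype could look like; the name and flag values are assumptions, not an existing Linux or Jiny API:

/* Hypothetical prototype for the asynchronous, non-blocking fsync. */
#include <stdint.h>

#define FSYNC_START_POLLING 1   /* enter polling mode */
#define FSYNC_STOP_POLLING  0   /* leave polling mode if already active */

/* Returns immediately; flush progress is reported through
 * last-byte-flushed in the file's shared memory region. */
int fsync_async(int fd, uint32_t fsync_start_flag);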
Advantages of the non-POSIX API:
- At high load with polling mode on, the user-level thread and the kernel thread work in parallel, both polling the shared memory. This saves the CPU cycles spent on system calls and improves the efficiency of the CPU cores by avoiding the costly switch from user space to kernel space.
- The user-level thread need not block; it can keep populating the file buffer in parallel, via the write system call or via mmap. mmap is strongly recommended over the write system call for the following reasons:
  - no write system call is needed to put the data into the page cache;
  - the costly memory copy from user space to kernel space is avoided;
  - last-byte-written tells the OS up to what point the page cache content is valid/dirty for flushing, and also signals the OS to start flushing if it has not already started.
- The API enables fine-grained flushing of data to the append log without system calls. This helps a) when the mutation operations arriving at the database are not batched, and b) when the database must respond to the user after every mutation operation so that latency stays low.
- The efficiency of the API improves as the speed gap between volatile memory and disk/non-volatile memory narrows, or when a large number of disks is present.
- Early write: with this API, the OS can flush dirty blocks from the page cache to disk as soon as possible, without waiting for a syscall from the app; this yields higher disk write throughput. The poll flag triggers poll mode inside the OS.
With this API, app threads and OS threads execute in parallel, and even the new fsync is an asynchronous call, issued once at the start of a batch of mutation operations.
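Putting these pieces together, a minimal sketch of the mutation write path, reusing the hypothetical appendlog_shm layout and fsync_async call sketched above:

/* Sketch of the mutation write path; log_map is the append-only log
 * mapped with mmap, shm the file's shared memory region (both hypothetical). */
#include <stdint.h>
#include <string.h>

extern struct appendlog_shm *shm;  /* per-file region from the earlier sketch */
extern char *log_map;              /* append-only log mapped via mmap */
extern int   log_fd;

void append_record(const void *rec, uint32_t len)
{
    uint64_t off = shm->last_byte_written;

    /* Populate the page cache directly through the mapping:
     * no write() syscall, no user-to-kernel copy. */
    memcpy(log_map + off, rec, len);

    /* Tell the OS up to where the file content is valid and dirty. */
    shm->last_byte_written = off + len;

    /* Kick off background flushing once; while fsync-start-flag stays on,
     * no further syscalls are needed. */
    if (!shm->fsync_start_flag)
        fsync_async(log_fd, FSYNC_START_POLLING);

    /* The reply to the client is sent once
     * shm->last_byte_flushed >= off + len. */
}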
2) Non-POSIX API for network I/O:
epoll is extended to send as well as receive packets, so that packets are sent/received on multiple sockets with a single system call. The new non-POSIX API merges the functionality of send, recv, and epoll into a single new syscall. The advantages of the new epoll call are as below:
- Avoids multiple recv calls, one per socket: today, epoll is called and then each ready socket is read with repeated read syscalls. With the new syscall those read calls are unnecessary, saving a large number of system calls.
- Avoids multiple send calls: along with the poll, multiple send buffers can be supplied, so the OS internally performs the send on multiple sockets.
- Zero-copy receive: the kernel buffer is mapped directly into user space; an offset and length are returned to the user, who can use the buffer directly.
- Zero-copy send: the user fills a buffer provided by the OS during epoll, but must leave some space at the start of the buffer as per the contract between the OS and the app.
With this, system calls are reduced by a large number, and buffer copies are avoided at the same time.
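A sketch of what such a merged syscall's interface could look like; all names and fields are assumptions based on the description above:

/* Hypothetical interface for the merged epoll/recv/send syscall. */
#include <stdint.h>
#include <stddef.h>

#define IO_RECV 0
#define IO_SEND 1

struct io_desc {
    int      fd;        /* socket */
    uint32_t op;        /* IO_RECV or IO_SEND */
    uint64_t offset;    /* offset into the OS-provided buffer area */
    uint32_t len;       /* bytes to send, or bytes received (out) */
};

/* Waits like epoll, but also completes the listed receives and sends in
 * one call; returns the number of descriptors completed. Received data
 * is exposed zero-copy at (offset, len) in the shared buffer area. */
int epoll_io(int epfd, struct io_desc *descs, size_t ndescs, int timeout_ms);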
3) Shared memory interface
A ring buffer inside shared memory can be used to exchange messages between threads, between processes, or between a process and the OS.
Advantages of the shared memory interface are below:
- A ring buffer between two entities gives better throughput than a system call into the OS, but at low load a notification mechanism between the two entities is needed.
- The method can also be used between two threads in a JVM application; it works user space to user space, or user space to kernel.
- It can be used instead of the non-POSIX network I/O API, which avoids system calls entirely; this gives especially good throughput at high load.
- Test results, socket versus shared memory in Redis: socket TPS = 54k; shared memory TPS = 2000k. More details in [3].
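A minimal sketch of a single-producer/single-consumer ring buffer suitable for placement in shared memory; the capacity must be a power of two, and all names and sizes are illustrative:

/* Minimal single-producer/single-consumer ring buffer. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_CAP 1024  /* number of slots, power of two */

struct ring {
    _Atomic uint64_t head;        /* next slot to read  (consumer) */
    _Atomic uint64_t tail;        /* next slot to write (producer) */
    uint64_t         slot[RING_CAP];
};

/* Producer side: returns false when the ring is full. */
static bool ring_push(struct ring *r, uint64_t msg)
{
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAP)
        return false;                       /* full */
    r->slot[tail & (RING_CAP - 1)] = msg;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_pop(struct ring *r, uint64_t *msg)
{
    uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                       /* empty */
    *msg = r->slot[head & (RING_CAP - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}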
4) Multicore Friendly Design
- NUMA awareness to make the design CPU-cache friendly.
- Avoiding core sleep during the futex syscall.
- CPU spin when load is high.
References:
[1] https://github.com/naredula-jana/Jiny-Kernel/blob/master/doc/Perf_IPC.pdf
[2] https://github.com/naredula-jana/Jiny-Kernel/
[3] https://github.com/naredula-jana/PerfTools/tree/master/SharedMem