Tuning RRDtool for performance
RRDtool minimizes writing to disk by using circular buffers. If you run a small deployment of RRDtool with a few hundred rrd files, you will find RRDtool very fast at updating them. If your installation grows to 50k or 100k or even more rrd files, things may be less speedy.
To understand RRDtool performance tuning, you must know some basics about the RRDtool file format as well as the way the OS accesses files on disk.
RRDtool File Format
+-----------------------------+
| RRD Header                  |
|-----------------------------|
| DS Header (one per DS)      |
|-----------------------------|
| RRA Header (one per RRA)    |
|=============================|  < 1 kByte (normally)
| RRA Data Area (first RRA)   |
...............................  The bulk of the space
| RRA Data Area (last RRA)    |
+-----------------------------+
An rrd file has a tiny header area and a large data area. On update, the header area is first read to determine the properties of the rrd file, then the update information is written to the header area, and finally all RRA data areas whose update interval has passed since the last run are updated.
How the OS accesses the disk
Accessing files on disk is one of the most expensive tasks (time-wise) an operating system has to perform. The OS accesses the disk in 'blocks' of data; normally a disk block is 4 kBytes in size. When the OS writes to disk, it does so by writing a whole 4 kByte block. RRDtool updates are normally much smaller: a single update to an RRA is 8 bytes per Data Source defined in the rrd. Since the OS wants to write 4k at a time, it must first read the relevant block from disk, integrate the changed bytes into the block data, and then write it back to disk. So even though RRDtool writes very little information to disk, the OS always handles 4k bytes at a time. Even worse, since the OS assumes that you are going to need more than just the data you are accessing at the moment, it reads some additional blocks of the file ahead of time, to have them ready when you ask next.
Since accessing the disk is so expensive, the OS designers have thought of many tricks to minimize disk access. The primary one is to use a disk cache. If an rrd file is cached, this means that the OS will not have to do any reads from disk, it just does the writes. This saves a considerable amount of time.
Optimizing RRD Performance by Altering RRDtool
Normally the OS reads a few blocks ahead when accessing a file. The goal is to have the next block ready when the application asks for it, without having to access the disk again. This just does not work well for rrd updates since here we access one block at the very beginning of the rrd file and then one or several blocks somewhere inside the RRD file to update the RRAs. By telling the OS that we are going to access the file-blocks in 'random' order it will stop reading ahead. This has two neat effects:
- The reading will be quicker since less data has to be read
- The file system cache will not get filled with data that is never going to be needed.
Dave Plonka has developed a patch for RRDtool that uses posix_fadvise to tell the OS that we are going to access the file randomly. The patch is in RRDtool 1.2.24 and 1.3.x.
Preserving Buffer Cache
Whenever a block of data is read from a file, it will end up in the file system cache aka buffer cache. Since the buffer cache has a limited size, the OS must evict old data blocks from the cache to make space for the new blocks. In order to have good performance with RRDtool, all the hot blocks of an rrd file must stay in buffer cache. If you are just running rrdtool update all the time, this may work fine, especially combined with the random access idea from the previous paragraph.
When creating a new rrd file, or when drawing a graph, a large amount of data is moved between disk and memory. That data will also be stored in the buffer cache for future use. Since the size of the buffer cache is limited, the OS will evict other data to free up some space. By telling the OS that we do not want it to keep any rrd data in cache except for the header portion and the RRA blocks currently active, we can make sure that there is enough space in cache for many more RRD files. Again the posix_fadvise call comes into play, this time to tell the OS which data we do not need. A patch teaching RRDtool to do this will be in 1.3.x.
If your system is shared by other applications, their operation may affect the buffer cache too. Especially programs that read and write large amounts of data will pollute the buffer cache with their own data. I have developed a patch for rsync that helps with this problem, using fadvise itself.
Using Memory Mapped IO
POSIX-like operating systems support a low-overhead method of file IO called memory mapped IO. RRDtool has been using this technique for the rrdtool update function for a long time now. Recent work by Bernhard Fischer introduces memory mapped IO for all RRDtool functionality. Accordingly, madvise is used instead of fadvise to tell the OS which data blocks it should cache and which ones it can drop. This functionality is being added to the 1.3.x trunk.
OS based Optimization of RRD Performance
As explained above, RRDtool works best when it can keep all the hot file blocks of an RRD file in cache. This means that the memory size of your box plays an important role in RRDtool performance. To keep the header (4k) and the active block of at least one RRA (4k) in memory, you need 8k per RRD file. For a 100k RRD setup you need 800 MBytes of buffer cache:
100'000 * 8kByte per RRD ~ 800 MByte Buffer Cache
Use a Separate File System and possibly Block Device for RRD Files
Having a separate filesystem and block device for RRD storage will allow you to use optimizations that affect the whole file system or even a block device.
File System Level Optimization
- Set the noatime and nodiratime options when mounting the ext3 file system. This will save updates to the directory blocks. (DavePlonka)
- Mount ext3 with data=writeback. This gives some performance boost at the expense of possibly seeing old data after a crash. Check man mount for details. (RyanKubica)
- Use a file system that indexes directories (ext3 with dir_index) and keep all rrd files in a single directory. (DavePlonka)
Block Device Level Optimization
The following optimizations affect the whole block device and are NOT TESTED. Replace 'sda' with the actual name of the physical block device (not a virtual lvm device) ... and make sure you test the performance impact. (TobiOetiker)
- Reduce block-level read-ahead to 2 file system blocks with
blockdev --setra 16 /dev/sda
Two blocks, just to be sure, in case FS blocks and device blocks are not aligned. (16 sectors of 512 bytes equal two 4k file system blocks.)
- Recent Linux kernels come with sophisticated IO schedulers. Make sure the scheduler has something to juggle with by giving it a larger queue:
echo 512 > /sys/block/sda/queue/nr_requests
- Tune the queue depth of any hardware on the data path (HBAs, etc).
In recent Linux kernels, there are several tunables regarding the buffer cache in /proc/sys/vm:
- dirty_background_ratio: the percentage of system memory that may be taken up by 'dirty' buffers before the pdflush process starts writing them back to disk.
- dirty_ratio: the percentage of system memory that may be taken up by 'dirty' buffers before the process writing the data is itself forced to write the data back to disk.
- dirty_expire_centisecs: pdflush will start writing back data that has been dirty for more than the given number of hundredths of a second.
If all rrd data fits into the cache, setting dirty_expire_centisecs to a high value (covering several update rounds) will cause your system to bundle up several rounds of updates before writing the dirty buffers back to disk. The danger is that on a system crash all buffers not yet written back will be lost. You may also want to set dirty_ratio and dirty_background_ratio to reasonably high percentages, so that you can actually use all that buffer cache. (RyanKubica)