Occasionally, rsync is too slow when transferring a very large number of files. This article discusses a workaround.
Recently I had to copy about 10 TByte of data from one server to another. Normally I would use rsync and just let it run for however long it takes, but on this particular system I could only get a transfer speed of 10 - 20 MByte per second. Since both systems were connected via a 1 GBit/s network, I was expecting about 100 MByte per second (1 GBit/s is 125 MByte/s raw; protocol overhead eats the rest).
It turned out that the bottleneck was not the network speed, but the fact that the source system contained a very large number of small files. Rsync does not seem to be the optimal tool in this case. Also, the destination system was empty, so there was no benefit in choosing rsync over scp or tar.
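In such a case, the classic tar-over-ssh pipe is the natural starting point. A minimal sketch, using the same destination host and path as the session further below:

tar cf - * | ssh 10.1.1.207 '(cd /home/backup; tar xf -)'

This streams a tar archive over ssh and unpacks it on the other side. It avoids rsync's per-file overhead, but disk reads and network writes remain coupled: whenever tar is stuck reading many small files, the network link sits idle.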
After some experiments, I found a command that improved the overall file copy performance significantly. It looks like this:
root@s2200:/home/backup# tar cf - * | mbuffer -m 1024M | ssh 10.1.1.207 '(cd /home/backup; tar xf -)'
in @ 121.3 MB/s, out @ 85.4 MB/s, 841.3 GB total, buffer 78% full
Using this method, it is possible to transfer data at a speed close to the full network bandwidth. The trick is the mbuffer command: it allocates a very large buffer of 1024 MByte which sits between the tar command and the ssh command. The second line above is mbuffer's status output: "in" is the rate at which tar fills the buffer, "out" is the rate at which data drains towards ssh.
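Broken apart, the pipeline looks like this (same host and path as the session above; the comments are mine):

# pack everything in the current directory into a tar stream on stdout
tar cf - * |
  # decouple slow disk reads from the network with a 1 GByte RAM buffer
  mbuffer -m 1024M |
  # unpack the stream into /home/backup on the destination
  ssh 10.1.1.207 '(cd /home/backup; tar xf -)'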
When a few large files are being transferred, tar reads the data faster than it can be sent over the network, so the buffer fills up to 100% even though data is being transmitted at full network speed.
However, when there is a directory with a large number of small files, reading them from storage is relatively slow, so the buffer drains faster than tar can refill it. But as long as it is not completely empty, data is still transferred at the maximum network speed.
With a bit of luck there are enough large files to keep the buffer filled. If the buffer is always near 100% full, the bottleneck is the network (or the destination system). In this case it is worth trying the -z option on both tar commands, which compresses the data before transmission and decompresses it on arrival. However, if the buffer is mostly near 0% full, the source system is the bottleneck: data cannot be read from local storage fast enough, and spending more CPU on compression would probably not help.
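The compressed variant is the same pipeline with gzip switched on at both ends (a sketch; -z trades CPU time for bandwidth, so it only pays off while the network side is the limit):

tar czf - * | mbuffer -m 1024M | ssh 10.1.1.207 '(cd /home/backup; tar xzf -)'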
Of course, the command above only makes sense if the destination server is empty. If some of the files already exist in the destination location, rsync would simply skip them (provided they are actually identical). There are two rsync options that can be used to speed up rsync somewhat: (todo)
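For reference, that incremental case would start from a plain archive-mode rsync run, a minimal sketch reusing the host and path from the example above:

rsync -a /home/backup/ 10.1.1.207:/home/backup/

rsync then transfers only the files that are missing or differ on the destination.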