Step-by-Step Guidelines to Optimize Big Data Transfers

This is a summary of our paper on Application-Level Optimization of Big Data Transfers Through Pipelining, Parallelism and Concurrency, which was recently accepted for publication in IEEE Transactions on Cloud Computing (TCC). In this paper, we analyze the effects of the most important application-level transfer parameters used to enhance end-to-end data transfer throughput, and we provide guidelines for setting the best values for these parameters.

Transferring large datasets, especially those with heterogeneous file sizes (i.e., many small and large files together), causes inefficient utilization of the available network bandwidth. Small file transfers may prevent the underlying transfer protocol from reaching full network utilization due to short transfer durations and connection start-up/tear-down overhead, while large file transfers may suffer from protocol inefficiency and end-system limitations.

Application-level TCP tuning parameters such as pipelining, parallelism and concurrency are very effective in removing these bottlenecks, especially when used together and in correct combinations. However, predicting the best combination of these parameters requires highly complicated modeling, since incorrect combinations can lead to overloading of the network, inefficient utilization of resources, or unacceptable prediction overheads. In short, pipelining refers to sending multiple transfer requests over a single data channel without waiting for the “transfer complete” acknowledgement, in order to minimize the delay between individual transfers; parallelism refers to sending different chunks of the same file through different data channels at the same time; and concurrency refers to sending different files through different data channels at the same time. Various factors affect the performance of pipelining, parallelism and concurrency, such as the available network bandwidth, round-trip time (RTT), buffer size, file size, and the number of files to be transferred.
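
To make the three mechanisms concrete, here is a minimal, purely illustrative Python sketch (not code from the paper; all names and the scheduling scheme are hypothetical) showing how a dataset would be divided under given pipelining (pp), parallelism (p), and concurrency (cc) levels:

    def describe_plan(files, pp, p, cc):
        """Print how files/chunks would be assigned to data channels."""
        # Concurrency: cc independent data channels, each assigned a subset of files.
        for channel in range(cc):
            assigned = files[channel::cc]
            # Pipelining: up to pp requests kept in flight back-to-back on a channel.
            print(f"channel {channel}: {len(assigned)} files, pipelining level {pp}")
            for name, size in assigned:
                # Parallelism: each file is split into p chunks sent over p streams.
                print(f"  {name}: {p} streams x ~{size // p} bytes each")

    describe_plan([("a.dat", 10_000_000), ("b.dat", 2_000_000), ("c.dat", 512_000)],
                  pp=4, p=2, cc=2)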

We answer the following questions (and more) in the paper:

Is pipelining necessary for every transfer?

Pipelining is useful when transferring large numbers of small files, but there is a breakpoint where the average file size becomes greater than the bandwidth-delay product (BDP = bandwidth × RTT). Beyond that point, there is no need to use a high level of pipelining. So if we have a dataset of files with varying sizes, it is important to divide the dataset into two parts and focus on the part where file size < BDP, since that is where setting different pipelining levels may affect the throughput. BDP is calculated by taking the bulk TCP disk-to-disk throughput of a single TCP stream as the bandwidth and the average RTT as the delay.
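
As a minimal sketch of this partitioning rule (the helper names are ours; the throughput and RTT values are assumed to be measured for the path in question):

    def bdp_bytes(throughput_bps, rtt_s):
        # BDP: single-stream bulk disk-to-disk throughput (as bandwidth) x average RTT.
        return throughput_bps / 8 * rtt_s           # bits/s -> bytes in flight per RTT

    def split_by_bdp(files, bdp):
        # Tune pipelining only for the files smaller than BDP.
        small = [f for f in files if f[1] < bdp]
        large = [f for f in files if f[1] >= bdp]
        return small, large

    bdp = bdp_bytes(throughput_bps=1e9, rtt_s=0.1)  # e.g. 1 Gbps, 100 ms -> ~12.5 MB
    small, large = split_by_bdp([("a.dat", 1_000_000), ("b.dat", 50_000_000)], bdp)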

How does file size affect the optimal pipelining level?

File size is the dominating factor in setting the optimal pipelining level, especially for long-RTT networks. Transfers at different pipelining levels go through similar slow-start phases regardless of the file size. The crucial point is the highest number of bytes per RTT that a given file size can reach: with pipelining level pp, this is FS × (pp + 1), which matches BDP when FS = BDP / (pp + 1), where BDP is the number of bytes sent/received in one RTT, FS is the file size, and pp is the pipelining level. Of course, this linear increase in the number of bytes with the pipelining level only lasts until it reaches BDP; after that, the increase becomes logarithmic. Therefore, the optimal pipelining level can be calculated as pp_opt = (BDP / FS) − 1. When the file size is greater than the BDP, pipelining does not provide any benefits.
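
The formula translates into a small helper; this is a sketch under the definitions above (rounding up and clamping at zero once the file size exceeds BDP are our choices):

    import math

    def optimal_pipelining(bdp_bytes, file_size_bytes):
        # pp_opt = BDP/FS - 1; no pipelining benefit once FS >= BDP.
        if file_size_bytes >= bdp_bytes:
            return 0
        return max(0, math.ceil(bdp_bytes / file_size_bytes) - 1)

    # 12.5 MB BDP with 1 MB files -> pipelining level of about 12
    print(optimal_pipelining(12_500_000, 1_000_000))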

When is parallelism advantageous?

Parallelism is advantageous when the system buffer size is set to a value smaller than the BDP, which occurs mostly in large-bandwidth, long-RTT networks. It is also advisable to use parallelism for large file transfers. For small files, parallelism may not perform well by itself; however, when used with pipelining, its effect on performance can be significant, as long as it does not cause pipelining to lose its effect by dividing small files into even smaller chunks. This happens when the number of files and the average file size in a chunk are small.
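
A natural starting heuristic, consistent with the observation above (though this exact rule is our illustrative assumption, not a formula from the paper), is to open just enough streams for the aggregate buffer space to cover the BDP:

    import math

    def suggested_parallelism(bdp_bytes, buffer_bytes):
        if buffer_bytes >= bdp_bytes:
            return 1  # a single stream's buffer can already fill the pipe
        return math.ceil(bdp_bytes / buffer_bytes)

    print(suggested_parallelism(bdp_bytes=12_500_000, buffer_bytes=4_000_000))  # -> 4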

How much parallelism is too much?

This is a difficult question to answer. If it were possible to predict when the packet loss rate would start to increase exponentially, it would also be possible to determine how much parallelism is too much. There are two cases to consider in terms of the dataset characteristics. First, when the transfer is of a large file, the parallelism level becomes too much at the point where the network or disk bandwidth capacity is reached and the number of retransmissions starts to increase. In our previous work, we managed to predict the optimal parallelism level by looking at throughput measurements from as few as three past transfer samplings; there is a knee point in the throughput curve as the number of parallel streams increases. In the second case, when the transfer is of a dataset consisting of a large number of small files, parallelism has a negative effect: the per-stream data size becomes smaller as each file is divided across multiple streams, and the window sizes of the individual streams cannot reach their maximum because there is not enough data to send. With the help of pipelining, this bottleneck can be overcome to an extent.
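
For the large-file case, a sketch of the sampling-based prediction might look as follows. It assumes a throughput model of the form Th(n) = n / sqrt(a·n² + b·n + c), as used in our earlier parallelism-prediction work; rearranged as n²/Th² = a·n² + b·n + c, fitting three samples reduces to a 3×3 linear system. The sample values here are made up for illustration:

    import numpy as np

    def fit_parallelism_model(samples):
        """samples: three (n_streams, throughput) pairs from past transfers."""
        A = np.array([[n * n, n, 1.0] for n, _ in samples])
        y = np.array([n * n / (th * th) for n, th in samples])
        return np.linalg.solve(A, y)  # model coefficients a, b, c

    def predict_knee(coeffs, max_streams=64):
        a, b, c = coeffs
        ns = np.arange(1, max_streams + 1)
        th = ns / np.sqrt(a * ns**2 + b * ns + c)
        return int(ns[np.argmax(th)])  # stream count at the knee of the curve

    coeffs = fit_parallelism_model([(1, 100.0), (2, 180.0), (4, 260.0)])
    print(predict_knee(coeffs))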

Is concurrency sufficient by itself without parallelism or pipelining?

In cases where the bandwidth is fully utilized with a small number of concurrent transfers and the number of files is large enough, concurrency used with data channel caching can achieve performance similar to parallelism + pipelining + concurrency. However, the optimal concurrency level can be much higher when concurrency is used alone, and a high number of concurrent transfers means many processes (or threads), which can degrade performance. In this case, using concurrency together with pipelining and parallelism is a good choice.

What advantages does concurrency have over parallelism?

In cases where parallelism deteriorates the performance improvements of pipelining, it is better to use concurrency. In some cases, concurrency with pipelining performs better than using all three functions together under the same settings; this is due to the negative effect of parallelism on pipelining when small files are transferred. For larger files, the negative effect of parallelism diminishes, and when all three functions are used together they can perform better than the concurrency + pipelining case.

How does network capacity affect the optimal parallelism and concurrency levels?

The performance benefits of parallelism and concurrency are best observed in wide-area data transfers. As the number of parallel streams and the concurrency level are increased exponentially, the total throughput first shows a linear increase. However, as these numbers grow and the throughput approaches the network capacity, the increase slows down, and the throughput eventually plateaus into a steady state or even starts to decrease. The most apparent conclusion that can be drawn from these results is that the optimal parallelism and concurrency levels increase as the network capacity increases.

When to use UDT over TCP?

It is better to use UDT in long-RTT (wide-area) networks without additional parallelism, but it performs worse than TCP in short-RTT (local-area or metropolitan-area) networks. Parallel TCP can compete with UDT in both cases; however, it is important to set the correct parallelism level without overwhelming the network.
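
As a toy illustration of this guideline (the 10 ms RTT cutoff is an arbitrary illustrative threshold, not a value from the paper):

    def pick_protocol(rtt_s, parallel_tcp_allowed):
        if parallel_tcp_allowed:
            return "parallel TCP"  # well-tuned parallel TCP competes with UDT
        return "UDT" if rtt_s > 0.010 else "TCP"  # UDT pays off only on long RTTs

    print(pick_protocol(rtt_s=0.080, parallel_tcp_allowed=False))  # -> UDT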

Rules of Thumb for Throughput Optimization

Our paper presents some rules of thumb that should be applied when optimizing the throughput of large data transfers (a sketch consolidating them into a single heuristic follows the list):
  1. Always use pipelining, even if it has very little effect on throughput. It allows a single data channel to be used for sending multiple files, resulting in a continuous increase in the number of bytes sent/received in one RTT. It also overlaps control-channel messages and processing overhead with data channel transfers, removing the idle time between consecutive transfers.
  2. Set different pipelining levels by dividing the dataset into chunks in which the mean file size is less than BDP. The number of bytes sent/received in one RTT cannot exceed the average file size multiplied by the pipelining level, and pipelining can have a huge effect as long as this value is less than BDP.
  3. Keep the chunks as big as possible. It is important to have enough data in a chunk for pipelining to be effective, because transfers at different pipelining levels go through the same slow-start phase.
  4. Use only concurrency with pipelining for small file sizes and small numbers of files. Dividing a small file further with parallelism affects throughput adversely.
  5. Add parallelism to concurrency and pipelining for bigger file sizes, where parallelism does not affect pipelining.
  6. Use parallelism when the number of files is insufficient to apply concurrency.
  7. Use UDT for wide-area transfers only, preferably with a single stream. In cases where parallel stream transfers are allowed, TCP with the optimal stream number can compete with UDT and sometimes outperform it.
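
The following sketch consolidates these rules into a single heuristic. The chunking strategy, thresholds, and concurrency cap are our illustrative assumptions; the paper's algorithms (described next) are considerably more refined:

    import math

    def plan_transfer(files, bdp, buffer_bytes, max_cc=4):
        """Suggest (label, pp, p, cc) per chunk for a list of (name, size) files."""
        plans = []
        small = [f for f in files if f[1] < bdp]       # rule 2: split at BDP
        large = [f for f in files if f[1] >= bdp]
        if small:
            mean = sum(s for _, s in small) / len(small)
            pp = max(1, math.ceil(bdp / mean) - 1)     # rules 1-3: pipeline small files
            cc = min(len(small), max_cc)               # rule 4: concurrency, no parallelism
            plans.append(("small", pp, 1, cc))
        if large:
            p = max(1, math.ceil(bdp / buffer_bytes))  # rule 5: add parallelism
            cc = min(len(large), max_cc)               # rule 6: few files -> low cc
            plans.append(("large", 1, p, cc))          # rule 1: keep pipelining enabled
        return plans
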
Throughput Optimization Algorithms

The paper also presents two novel throughput optimization algorithms. The first algorithm (PCP) uses an adaptive approach and tries to reach the maximum network bandwidth gradually. The second algorithm (MC) follows a more aggressive approach in its use of concurrency. The experiments and the validation of the developed models were performed on high-speed networking testbeds and cloud networks, and the results were compared to the most successful and widely adopted data transfer tools, such as Globus Online and UDT. We observed that our algorithms outperform them in the majority of cases. For more details about these algorithms, as well as our other optimization techniques, please read our paper.
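
To convey the flavor of the adaptive approach, here is a highly simplified sketch of the idea behind PCP (gradually adding concurrent channels while the measured throughput keeps improving). The real algorithm in the paper does considerably more; measure_throughput is a placeholder that would run a short sample transfer with the given concurrency level:

    def adaptive_concurrency(measure_throughput, max_cc=16, min_gain=1.05):
        best_cc, best_th = 1, measure_throughput(1)
        cc = 2
        while cc <= max_cc:
            th = measure_throughput(cc)
            if th < best_th * min_gain:  # stop once the marginal gain vanishes
                break
            best_cc, best_th = cc, th
            cc *= 2                      # probe exponentially toward capacity
        return best_cc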


