Step-by-Step Guidelines to Optimize Big Data Transfers
This is a summary of our paper “Application-Level Optimization of Big Data Transfers Through Pipelining, Parallelism and Concurrency”, which was recently accepted for publication in IEEE Transactions on Cloud Computing (TCC). In the paper, we analyze the effects of the most important application-level transfer parameters used to enhance end-to-end data transfer throughput, and we provide guidelines for setting the best values for these parameters.
Transferring large datasets, especially those with heterogeneous file sizes (i.e., many small and large files together), causes inefficient utilization of the available network bandwidth. Small file transfers may prevent the underlying transfer protocol from reaching full network utilization due to short transfer durations and connection start-up/tear-down overhead, while large file transfers may suffer from protocol inefficiencies and end-system limitations.
Application-level TCP tuning parameters such as pipelining, parallelism, and concurrency are very effective in removing these bottlenecks, especially when used together and in the correct combinations. However, predicting the best combination of these parameters requires highly complicated modeling, since incorrect combinations can lead to overloading of the network, inefficient utilization of resources, or unacceptable prediction overheads. In short, pipelining refers to sending multiple transfer requests over a single data channel without waiting for the “transfer complete” acknowledgement, in order to minimize the delay between individual transfers; parallelism refers to sending different chunks of the same file through different data channels at the same time; and concurrency refers to sending different files through different data channels at the same time. Various factors affect the performance of pipelining, parallelism, and concurrency, such as the available network bandwidth, round-trip time (RTT), buffer size, file size, and the number of files to be transferred.
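To make the distinction between the three parameters concrete, here is a minimal sketch of concurrency: different files moving over different channels at the same time. The `transfer_file` function is a hypothetical placeholder for a real data-channel transfer (e.g., one GridFTP channel), not an API from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def transfer_file(path):
    # Placeholder for a real per-channel transfer operation.
    return f"transferred {path}"

def transfer_concurrently(paths, concurrency=4):
    # Concurrency: each worker moves a *different* file over its own
    # channel. (Parallelism would instead split one file into chunks;
    # pipelining would queue requests on a single channel.)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(transfer_file, paths))
```

`pool.map` preserves input order, so the results come back in the same order as the input paths even though the transfers overlap in time.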
We answer the following questions (and more) in the paper:
Is pipelining necessary for every transfer?
Pipelining is useful when transferring large numbers of small files, but there is a breakpoint where the average file size becomes greater than the bandwidth-delay product (BDP = bandwidth × RTT). Beyond that point, there is no need to use a high level of pipelining. So if we have a dataset of files with varying sizes, it is important to divide the dataset into two parts and focus on the part where file size < BDP, since that is where setting different pipelining levels may affect the throughput. The BDP is calculated by taking the bulk TCP disk-to-disk throughput of a single TCP stream as the bandwidth and the average RTT as the delay.
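The BDP computation and the two-way split described above can be sketched as follows; this is an illustrative helper, not code from the paper, and it assumes throughput is measured in bits per second and RTT in seconds.

```python
def bdp_bytes(throughput_bps, rtt_s):
    # BDP = bandwidth x RTT. Bandwidth is the bulk single-stream TCP
    # disk-to-disk throughput (bits/s); dividing by 8 yields bytes.
    return throughput_bps * rtt_s / 8

def split_by_bdp(file_sizes, bdp):
    # Pipelining tuning only matters for files smaller than the BDP,
    # so separate those from the rest of the dataset.
    small = [s for s in file_sizes if s < bdp]
    large = [s for s in file_sizes if s >= bdp]
    return small, large
```

For example, a 1 Gbps path with 100 ms RTT gives a BDP of 12.5 MB, so only files under 12.5 MB would be candidates for pipelining-level tuning.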
How does file size affect the optimal pipelining level?
File size is the dominating factor in setting the optimal pipelining level, especially on long-RTT networks. Transfers at different pipelining levels go through similar slow-start phases regardless of the file size. The crucial point is the highest number of bytes reached by a specific file size, which satisfies FS = BDP / (pp + 1), where BDP is the number of bytes sent/received in one RTT, FS is the file size, and pp is the pipelining level. Of course, this linear increase in the number of bytes with the pipelining level only lasts until it reaches the BDP; after that, the increase becomes logarithmic. Therefore, the optimal pipelining level can be calculated as pp_opt = (BDP / FS) − 1. When the file size is greater than the BDP, pipelining does not provide any benefit.
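The pp_opt formula above can be turned directly into a small helper. This is a sketch under the assumption that fractional levels are rounded up and negative levels are clamped to zero; the rounding choice is ours, not the paper's.

```python
import math

def optimal_pipelining(bdp, file_size):
    # pp_opt = (BDP / FS) - 1, rounded up and floored at zero.
    # Pipelining provides no benefit once the file size reaches the BDP.
    if file_size >= bdp:
        return 0
    return max(0, math.ceil(bdp / file_size) - 1)
```

For instance, with a 10 MB BDP and 1 MB files, ten requests per RTT are needed to fill the pipe, so the optimal pipelining level is 9.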
When is parallelism advantageous?
Parallelism is advantageous when the system buffer size is set to a value smaller than the BDP. This occurs mostly in large-bandwidth, long-RTT networks. It is also advisable to use parallelism in large file transfers. For small files, parallelism may not give good performance by itself; however, when used with pipelining, its effect on performance can be significant, as long as parallelism's division of small files into chunks does not cause pipelining to lose its effect. This happens when the number of files and the average file size in a chunk are small.
How much parallelism is too much?
This is a difficult question to answer. If it were possible to predict when the packet loss rate would start to increase exponentially, it would also be possible to determine how much parallelism would be too much. There are two cases to consider in terms of the dataset characteristics. First, when the transfer is of a large file, the point at which the network or disk bandwidth capacity is reached and the number of retransmissions starts to increase is the point where the parallelism level becomes too much. In our previous work, we managed to predict the optimal parallelism level by looking at throughput measurements of as few as three past transfer samplings. There is a knee point in the throughput curve as we increase the parallel stream number. In the second case, when the transfer is of a dataset consisting of a large number of small files, parallelism has a negative effect, because the data size becomes smaller as each file is divided into multiple streams, and the window size of each stream cannot reach its maximum because there is not enough data to send. With the help of pipelining, this bottleneck can also be overcome to an extent.
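The knee-point idea above can be sketched as a simple check on successive throughput samples taken at increasing parallelism levels. This is not the paper's prediction model; the 5% relative-gain cutoff is an assumed threshold for illustration.

```python
def knee_reached(samples, threshold=0.05):
    # samples: throughput measured at successively higher parallelism
    # levels. Declare the knee when the relative gain of the latest
    # step drops below `threshold` (5% here, an assumed cutoff).
    if len(samples) < 2:
        return False
    prev, cur = samples[-2], samples[-1]
    return (cur - prev) / prev < threshold
```

With as few as three samplings, e.g. 100, 180, then 185 Mbps, the last step gains under 3%, so adding more streams past that level would be "too much" under this heuristic.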
Is concurrency sufficient by itself without parallelism or pipelining?
In cases where the bandwidth is fully utilized with a small number of concurrent transfers and the number of files is large enough, concurrency used with data channel caching can achieve performance similar to parallelism + pipelining + concurrency. However, the optimal concurrency level can be much higher when concurrency is used alone. A high number of concurrent transfers means many processes (or threads), which can degrade performance. In such cases, using concurrency together with pipelining and parallelism is a good choice.
What advantages does concurrency have over parallelism?
In cases where parallelism deteriorates the performance improvements of pipelining, it is better to use concurrency. In some cases, concurrency with pipelining performs better than using all three functions together with the same settings. This is due to the negative effect of parallelism on pipelining when small files are transferred. For larger files, the negative effect of parallelism diminishes, and when all three functions are used together they can perform better than the concurrency + pipelining case.
How does network capacity affect the optimal parallelism and
concurrency levels?
The performance benefits of parallelism and concurrency are best observed in wide-area data transfers. As the parallel stream number and concurrency level are increased exponentially, the total throughput first shows a linear increase. However, as these numbers grow, the throughput approaches the network capacity; the increase flattens out, and the throughput then starts to decrease or settles into a steady state. The most apparent outcome that can be deduced from these results is that the optimal parallelism and concurrency levels increase as the network capacity increases.
When to use UDT over TCP?
It is better to use UDT in long-RTT (wide-area) networks without additional parallelism, but it performs worse in short-RTT (local-area or metropolitan-area) networks. Parallel TCP can compete with UDT in both cases; however, it is important to set the correct parallelism level without overwhelming the network.
Rules of Thumb for Throughput Optimization
Our paper presents some rules of thumb that should be
applied when optimizing the throughput of large data transfers:
- Always use pipelining, even if it has very little effect on throughput. It allows the use of a single data channel for sending multiple files, resulting in a continuous increase in the number of bytes sent/received in one RTT. It also overlaps control channel messages and processing overhead with data channel transfers, removing idle time between consecutive transfers.
- Set different pipelining levels by dividing the dataset into chunks where the mean file size is less than the BDP. The number of bytes sent/received in one RTT cannot exceed the average file size multiplied by the pipelining level. Pipelining can have a huge effect as long as this value is less than the BDP.
- Keep the chunks as big as possible. It is important to have enough data in a chunk for pipelining to be effective, because transfers at different pipelining levels go through the same slow-start phase.
- Use only concurrency with pipelining for small file sizes and a small number of files. Dividing a small file further with parallelism affects throughput adversely.
- Add parallelism to concurrency and pipelining for bigger file sizes, where parallelism does not affect pipelining.
- Use parallelism when the number of files is insufficient to apply concurrency.
- Use UDT for wide-area transfers only, preferably with a single stream. In cases where parallel stream transfers are allowed, TCP with the optimal stream number can compete with UDT and sometimes outperform it.
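The rules of thumb above can be condensed into a rough parameter-selection sketch. The specific parallelism level of 4 and the `max_concurrency` default are illustrative assumptions of ours, not values from the paper.

```python
import math

def choose_parameters(mean_file_size, num_files, bdp, max_concurrency=4):
    # Decision sketch of the rules of thumb above; thresholds are
    # illustrative assumptions, not the paper's tuned values.
    # Always pipeline; a high level only pays off below the BDP.
    pipelining = (math.ceil(bdp / mean_file_size) - 1
                  if mean_file_size < bdp else 1)
    # Concurrency is bounded by the number of files available.
    concurrency = min(num_files, max_concurrency)
    # Parallelism only for big files, so it does not starve pipelining.
    parallelism = 4 if mean_file_size >= bdp else 1
    return {"pipelining": pipelining,
            "concurrency": concurrency,
            "parallelism": parallelism}
```

For a dataset of many 1 MB files on a path with an 8 MB BDP, this yields a high pipelining level with no parallelism; for a few 100 MB files it keeps minimal pipelining and adds parallel streams instead.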
Throughput Optimization Algorithms
The paper also presents two novel throughput optimization algorithms: the first algorithm (PCP) uses an adaptive approach and tries to reach the maximum network bandwidth gradually, while the second algorithm (MC) follows a more aggressive approach in using concurrency. The experiments and validation of the developed models were performed on high-speed networking testbeds and cloud networks. The results are compared to the most successful and widely adopted data transfer tools, such as Globus Online and UDT. We observed that our algorithms can outperform them in the majority of cases. For more details about these algorithms, as well as our other optimization techniques, you can read our paper.
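The gradual, adaptive spirit of PCP can be sketched as a doubling search that stops when extra concurrency no longer pays off. This is our own simplified sketch, not the paper's PCP algorithm; `measure` is a hypothetical callback that runs a sample transfer at a given concurrency level and returns its throughput, and the 5% minimum-gain cutoff is assumed.

```python
def pcp_like_search(measure, max_level=16, min_gain=0.05):
    # Adaptive sketch in the spirit of PCP: double the concurrency
    # level while each step still yields a relative throughput gain
    # above `min_gain`, approaching the bandwidth limit gradually.
    level, best = 1, measure(1)
    while level * 2 <= max_level:
        t = measure(level * 2)
        if (t - best) / best < min_gain:
            break  # knee reached: extra channels no longer help
        level, best = level * 2, t
    return level, best
```

Against a synthetic network that saturates at 350 Mbps, the search stops at level 4 instead of pushing to the maximum, which is the "gradual" behavior the paper attributes to PCP in contrast to the more aggressive MC.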