Minimizing the Energy Footprint of Global Data Movement with GreenDataFlow

It is estimated that the number of devices connected to the Internet will be four times as high as the world population in 2022, and that global IP traffic will reach 4.8 zettabytes per year. The growing number of users and rising data rates not only require increased network bandwidth and achievable data transfer throughput but also result in an increased energy footprint. At the current rate, the annual electricity consumed by global data movement is estimated at more than 200 terawatt-hours, costing more than 40 billion US dollars per year. According to the same statistics, the US accounts for approximately 20% of this global data movement and of its energy footprint. This has prompted a considerable amount of work on power management and energy efficiency in hardware and software systems, and more recently on power-aware networking. The majority of the existing work on power-aware networking focuses on reducing the power consumption of the networking devices themselves.
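As a rough sanity check, the figures quoted above are internally consistent; the implied unit cost below is derived from those numbers, not an independent estimate:

```python
# Back-of-the-envelope check of the figures quoted above.
global_energy_twh = 200   # annual electricity for global data movement (TWh)
global_cost_usd = 40e9    # annual cost (USD)
us_share = 0.20           # approximate US share

# Implied average electricity price: $40B over 200 TWh (1 TWh = 1e9 kWh)
price_per_kwh = global_cost_usd / (global_energy_twh * 1e9)
print(f"implied price: ${price_per_kwh:.2f}/kWh")   # $0.20/kWh

# US portion of the energy footprint and its cost
us_energy_twh = global_energy_twh * us_share        # 40 TWh
us_cost_usd = global_cost_usd * us_share            # $8B
print(f"US share: {us_energy_twh:.0f} TWh, ${us_cost_usd / 1e9:.0f}B per year")
```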

OneDataShare -- Fast, Scalable, and Flexible Data Sharing Made Easy

With the emergence of complex scientific applications, social media, video over IP, and more recently the Internet of Things (IoT), global data movement requirements have already exceeded the Exabyte scale. Large scientific experiments, such as environmental and coastal hazard prediction, climate modeling, genome mapping, and high-energy physics simulations, generate data volumes reaching several Petabytes per year. Data collected from remote sensors and satellites, dynamic data-driven applications, social networks, digital libraries, and preservation efforts are also producing extremely large datasets for real-time or online processing. This so-called "data deluge" in scientific applications necessitates collaboration and sharing among national and international education and research institutions, which results in frequent large-scale data movement over wide-area networks. We see a similar trend in commercial applications as well.

Step-by-Step Guidelines to Optimize Big Data Transfers

This is a summary of our paper "Application-Level Optimization of Big Data Transfers Through Pipelining, Parallelism and Concurrency," which was recently accepted for publication in IEEE Transactions on Cloud Computing (TCC). In the paper, we analyze the effects of the most important application-level transfer parameters used to enhance end-to-end data transfer throughput, and we provide guidelines for setting the best values for these parameters. Transferring large datasets, especially those with heterogeneous file sizes (i.e., many small and large files together), causes inefficient utilization of the available network bandwidth. Small file transfers may prevent the underlying transfer protocol from reaching full network utilization due to short transfer durations and connection startup/teardown overhead, while large file transfers may suffer from protocol inefficiency and end-system limitations. Application-level TCP tuning parameters such as pipelining, parallelism, and concurrency can mitigate these effects.
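As an illustrative sketch only (this is not the implementation from the paper; the function names and the in-memory stand-in for network reads are hypothetical), the interplay of parallelism (splitting one file across multiple streams) and concurrency (moving several files at once) can be modeled like this:

```python
from concurrent.futures import ThreadPoolExecutor

def transfer_chunk(data: bytes, offset: int, length: int) -> bytes:
    # Stand-in for a ranged network read (one stream of a parallel transfer).
    return data[offset:offset + length]

def parallel_transfer(data: bytes, parallelism: int) -> bytes:
    """Split one 'file' into `parallelism` byte ranges and fetch them concurrently."""
    size = len(data)
    chunk = -(-size // parallelism)  # ceiling division: bytes per stream
    ranges = [(i * chunk, min(chunk, size - i * chunk))
              for i in range(parallelism) if i * chunk < size]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map preserves input order, so the chunks reassemble correctly.
        parts = pool.map(lambda r: transfer_chunk(data, *r), ranges)
        return b"".join(parts)

def concurrent_transfer(files: dict, concurrency: int, parallelism: int = 4) -> dict:
    """Move several 'files' at once (concurrency), each using parallel streams."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = pool.map(lambda name: parallel_transfer(files[name], parallelism),
                           files)
        return dict(zip(files, results))
```

The sketch captures the trade-off discussed above: concurrency helps many small files amortize per-connection overhead, while parallelism helps a single large file fill the available bandwidth.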

Forging Data Alliances -- The Quest for Data Intensive Discovery

This week I attended the NCDS Leadership Summit on "Data to Discovery: Genomes to Health" in Chapel Hill, North Carolina. The National Consortium for Data Science (NCDS) is a new initiative that aims to engage a broad community of data science experts to identify key data challenges, coordinate data science research priorities, and support the development of technical, ethical, and policy standards for data. This week's meeting was, in one sense, the kick-off meeting of the initiative. The initiative is led by Stan Ahalt, professor of Computer Science at UNC-Chapel Hill and Director of the Renaissance Computing Institute (RENCI). I know Stan from the days he led the Ohio Supercomputing Center (OSC) as its executive director. After moving to RENCI, he formed a great team focusing especially on cyberinfrastructure and data management technologies. One of the data management experts on his team is Reagan Moore, the mastermind behind the iRODS data management system.

"Big Data" is Dead. What’s Next?

Over the last couple of years, the term "big data" has been a big hype that you can see everywhere: on the covers of the most serious scientific journals, such as "Science" and "Nature"; in popular magazines such as "The Economist"; in the strategic documents of federal agencies; in the ads of top IT companies; and, of course, all over the web. The emergence of the term is due to a real problem: the increasing volume, velocity, and variety of data make it impossible to process, analyze, and interpret data efficiently using existing tools. And this problem existed long before the term "big data" did. Back in the 1960s, many scientists were already talking about the "explosion of information" and proposing techniques to cope with it. The first collaborative organization to deal with increasing amounts of data was the Committee on Data for Science and Technology (CODATA), established by the International Council of Scientific Unions (ICSU) in 1966.