OneDataShare -- Fast, Scalable, and Flexible Data Sharing Made Easy


With the emergence of complex scientific applications, social media, video over IP, and more recently the Internet of Things (IoT), global data movement requirements have already exceeded the Exabyte scale. Large scientific experiments, such as environmental and coastal hazard prediction, climate modeling, genome mapping, and high-energy physics simulations, generate data volumes reaching several Petabytes per year. Data collected from remote sensors and satellites, dynamic data-driven applications, social networks, and digital libraries and preservation archives are also producing extremely large datasets for real-time or online processing. This so-called “data deluge” in scientific applications necessitates collaboration and sharing among national and international education and research institutions, which results in frequent large-scale data movement over wide-area networks.

We see a similar trend in commercial applications as well. According to a study by Forrester Research, 77% of the 106 large organizations that operate two or more datacenters run regular backup and replication applications among three or more sites. Also, more than 50% of them have over one Petabyte of data in their primary datacenter and expect their inter-datacenter throughput requirements to double or triple over the next couple of years. As a result, Google has deployed a large-scale inter-datacenter copy service, and background traffic has become dominant in Yahoo!’s aggregate inter-datacenter traffic. It is estimated that, in 2017, more IP traffic will traverse global networks than in all prior “Internet years” combined. Global IP traffic will exceed an annual rate of 1.5 Zettabytes, which corresponds to nearly 1 billion DVDs of data transferred per day for the entire year. It is also estimated that more than 20 billion devices will be connected to the Internet during this year.

Several national and regional optical networking initiatives, such as Internet2, XSEDE, and ESnet, provide high-speed network connectivity to their users to mitigate this data bottleneck. In particular, recent developments in networking technology provide scientists with high-speed optical links reaching and exceeding 100 Gbps in capacity, and in the near future we expect Tbps networking capacity to be available to end users. However, the majority of users fail to obtain even a fraction of the theoretical speeds promised by these networks due to issues such as sub-optimal protocol tuning, inefficient end-to-end routing, disk performance bottlenecks on the sending and/or receiving ends, and server processor limitations.

For example, Garfinkel reports that sending a 1 TB forensics dataset from Boston to the Amazon S3 storage system took several weeks. For this reason, many companies still prefer sending their data through a shipment service provider such as UPS or FedEx rather than over wide-area networks (please see the post by Johan Obrink on “When - if ever - will the bandwidth of the Internet surpass that of FedEx?”). This means that having high-speed networks in place is important but not sufficient. Being able to use these high-speed networks effectively is becoming increasingly important for wide-area data exchange as well as for petascale computing in a widely distributed setting.

As data becomes more abundant and data resources more heterogeneous, accessing, sharing, and disseminating these data sets becomes a bigger challenge. Using simple tools to remotely log on to computers and manually transfer data sets between sites is no longer feasible. Managed file transfer (MFT) services such as Globus, PhEDEx, Mover.IO, and B2SHARE have allowed users to do more, but these services still rely on users providing specific details to control the process, and they suffer from shortcomings including low transfer throughput, inflexibility, and restricted protocol support. There is substantial empirical evidence that performance directly impacts revenue. As two well-known examples, Google reported a 20% revenue loss due to an experiment that increased the time to display search results by as little as 500 milliseconds, and Amazon reported a 1% sales decrease for an additional delay of as little as 100 milliseconds.

High-performance and cost-efficient data access and sharing has also been a key component in the strategic planning documents of federal agencies. According to the Strategic Plan for the US Climate Change Science Program (CCSP), one of the main objectives of future research programs should be “Enhancing the data management infrastructure”, since “The users should be able to focus their attention on the information content of the data, rather than how to discover, access, and use it.” The DOE Office of Science report on the ‘Data Management Challenge’ defines data movement and efficient access to data as two key foundations of scientific data management technology. NSF’s Cyberinfrastructure Vision for 21st Century states that “The national data framework must provide for reliable preservation, access, analysis, interoperability, and data movement.” I have always believed that efficient data access and sharing is a fundamental challenge for large-scale distributed systems, and that advances in this area promise to enable a range of new high-impact applications and capabilities, a view closely aligned with the summary of the NSF report on Research Challenges in Distributed Computing Systems. NSF has recently funded our OneDataShare project, which serves these same goals.

Goals of our OneDataShare project: 

OneDataShare aims to make data readily available to researchers and to their applications in the fastest and most efficient way possible. The main goals of OneDataShare include:

(1) Reduce the time to delivery of the data. Large-scale data that can be generated in a few days may presently take weeks to transfer to the next stage of processing or to long-term storage sites, even assuming high-speed interconnects and the availability of resources to store the data. Through OneDataShare’s application-level tuning and optimization of TCP-based data transfer protocols (such as GridFTP, SCP, and HTTP), users will be able to obtain throughput close to the theoretical speeds promised by high-bandwidth networks, and the performance of data movement will no longer be a major bottleneck for data-intensive applications (I summarized some of these optimization techniques in my previous blog post; a simple illustrative sketch also follows this list of goals). The time to delivery of data will be greatly reduced, and the end-to-end performance of data-intensive applications relying on remote data will increase drastically.

(2) Provide interoperation across heterogeneous data resources. In order to meet the specific needs of users (i.e., scientists, engineers, educators, etc.), numerous data storage systems with specialized transfer protocols have been designed, with new ones emerging all the time. Despite the familiar file-system-like architecture that underlies most of these systems, the protocols used to exchange data with them are mutually incompatible and require specialized software to use. The difficulty of accessing heterogeneous data storage servers over incompatible data transfer protocols discourages researchers from drawing on more than a handful of resources in their research, and also prevents them from easily disseminating the data sets they produce. OneDataShare will provide interoperation across heterogeneous data resources (both streaming and at-rest) and on-the-fly translation between different data transfer protocols (see the second sketch after this list). Sharing data between traditionally incompatible data sources will become easy and convenient for scientists and other end users.

(3) Decrease the uncertainty in real-time decision-making processes. The timely completion of some compute and analysis tasks is crucial, especially for mission-critical and real-time decision-making processes. If these tasks depend on the delivery of certain data before they can be processed and completed, then not only the timely delivery of the data but also the ability to predict the time of delivery becomes very important. This allows researchers and users to plan better and to deal with the uncertainties associated with the delivery of data in real-time decision-making processes. OneDataShare’s data throughput and delivery-time prediction service will eliminate possible long delays in the completion of a transfer operation and increase the utilization of end-system and network resources by making it possible to provision these resources in advance with great accuracy (see the third sketch below). It will also enable data schedulers to make better and more precise scheduling decisions by focusing on a specific time frame with a number of requests to be organized and scheduled for the best end-to-end performance.
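
As a rough illustration of goal (1), the sketch below shows one simple heuristic for application-level transfer tuning: choosing the number of parallel streams, concurrent file transfers, and the pipelining depth from the bandwidth-delay product of a path and the average file size. The function, constants, and heuristics here are only illustrative assumptions, not the actual OneDataShare algorithms.

    # Illustrative sketch of application-level transfer tuning (not the actual
    # OneDataShare implementation). It picks parallelism, concurrency, and
    # pipelining levels from the bandwidth-delay product (BDP) of a path.

    def tune_transfer(bandwidth_gbps, rtt_ms, tcp_buffer_mb, avg_file_mb, num_files):
        """Estimate parallel streams, concurrent files, and pipelining depth."""
        # Bandwidth-delay product in MB: how much data is "in flight" on the path.
        bdp_mb = (bandwidth_gbps * 1000 / 8) * (rtt_ms / 1000)
        # Use enough parallel streams that their combined TCP buffers cover the BDP.
        parallelism = max(1, round(bdp_mb / tcp_buffer_mb))
        # Transfer several files concurrently when individual files are small.
        concurrency = max(1, min(num_files, round(bdp_mb / max(avg_file_mb, 1))))
        # Pipeline many small requests to hide per-file protocol overhead.
        pipelining = 1 if avg_file_mb >= bdp_mb else min(32, round(bdp_mb / max(avg_file_mb, 1)))
        return {"parallelism": parallelism, "concurrency": concurrency, "pipelining": pipelining}

    # Example: a 10 Gbps link with 50 ms RTT, 32 MB TCP buffers, many 10 MB files.
    print(tune_transfer(bandwidth_gbps=10, rtt_ms=50, tcp_buffer_mb=32, avg_file_mb=10, num_files=10000))
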
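For goal (2), the following sketch illustrates one way protocol interoperation can be structured: each storage endpoint is wrapped behind a common read/write interface, so a transfer between two otherwise incompatible protocols becomes a streamed copy through that abstraction. The class and method names are illustrative only, not OneDataShare’s actual API.

    # Illustrative sketch of protocol interoperation via a common endpoint
    # abstraction (class and method names are assumptions, not a real API).
    from abc import ABC, abstractmethod
    from typing import Iterator

    class Endpoint(ABC):
        """Uniform view of a storage resource, regardless of its native protocol."""
        @abstractmethod
        def read(self, path: str, chunk_size: int = 4 * 1024 * 1024) -> Iterator[bytes]: ...
        @abstractmethod
        def write(self, path: str, chunks: Iterator[bytes]) -> None: ...

    class HttpEndpoint(Endpoint):
        def read(self, path, chunk_size=4 * 1024 * 1024):
            import urllib.request
            with urllib.request.urlopen(path) as resp:
                while chunk := resp.read(chunk_size):
                    yield chunk
        def write(self, path, chunks):
            raise NotImplementedError("plain HTTP is used only as a source in this sketch")

    class LocalEndpoint(Endpoint):
        def read(self, path, chunk_size=4 * 1024 * 1024):
            with open(path, "rb") as f:
                while chunk := f.read(chunk_size):
                    yield chunk
        def write(self, path, chunks):
            with open(path, "wb") as f:
                for chunk in chunks:
                    f.write(chunk)

    def transfer(src: Endpoint, src_path: str, dst: Endpoint, dst_path: str) -> None:
        """On-the-fly 'translation': stream chunks from one protocol into another."""
        dst.write(dst_path, src.read(src_path))

    # Example: pull a file over HTTP and land it on local disk.
    # transfer(HttpEndpoint(), "https://example.org/data.bin", LocalEndpoint(), "data.bin")
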
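For goal (3), the sketch below shows a deliberately simple form of delivery-time prediction: estimating expected and pessimistic transfer times from the mean and variance of recent throughput samples on a path. A production prediction service would use much richer models and inputs; the numbers and names here are illustrative assumptions.

    # Illustrative sketch of delivery-time prediction from historical throughput
    # samples (in Gbps); not the actual OneDataShare prediction service.
    import statistics

    def predict_delivery(history_gbps, data_size_gb):
        """Return (expected_seconds, pessimistic_seconds) for moving data_size_gb."""
        mean = statistics.mean(history_gbps)
        stdev = statistics.stdev(history_gbps) if len(history_gbps) > 1 else 0.0
        expected = data_size_gb * 8 / mean                            # at average throughput
        pessimistic = data_size_gb * 8 / max(mean - 2 * stdev, 1e-6)  # roughly a worst case
        return expected, pessimistic

    # Example: recent transfers on this path averaged ~7 Gbps; how long for 5 TB?
    exp, worst = predict_delivery([6.8, 7.2, 7.0, 6.5, 7.4], data_size_gb=5000)
    print(f"expected ~{exp / 3600:.1f} h, plan for up to ~{worst / 3600:.1f} h")
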

OneDataShare’s novel data transfer optimization, interoperability, and prediction services will be implemented completely at the application level, requiring no changes to the existing infrastructure or to the low-level networking stack, while drastically increasing the end-to-end performance of data transfers and of the data-intensive applications that depend on data movement. I am really excited about this project.
