Forging Data Alliances -- The Quest for Data Intensive Discovery

This week I attended the NCDS Leadership Summit on "Data to Discovery: Genomes to Health" in Chapel Hill, North Carolina. The National Consortium for Data Science (NCDS) is a new initiative that aims to engage a broad community of data science experts to identify key data challenges, coordinate data science research priorities, and support the development of technical, ethical, and policy standards for data. This week's meeting was, in one sense, the kick-off meeting of the initiative.

The initiative is led by Stan Ahalt, professor of Computer Science at UNC-Chapel Hill and the Director of the Renaissance Computing Institute (RENCI). I know Stan from the days when he was the executive director of the Ohio Supercomputer Center (OSC). Since moving to RENCI, he has formed a great team focused especially on cyberinfrastructure and data management technologies. One of the data management experts on his team is Reagan Moore, the mastermind behind the iRODS and SRB systems, which are widely used in the community. Reagan left SDSC a couple of years ago to join RENCI and currently leads the Data Intensive Cyber Environments Center (DICE) there.

NCDS plans to make three main contributions to the community: 1) a Data Observatory, which will offer a shared, distributed infrastructure to house large sets of research data; 2) a Data Laboratory, which will enable data science researchers to test radically new techniques for storing, sharing, analyzing, and visualizing data; and 3) a Data Fellows program, which will train a new generation of data science experts.

NCDS is not the first organization aiming to form such a consortium of data scientists. There are similar trends in Europe, Asia, and Australia. The European Union recently launched EUDAT, a consortium of data scientists and data service providers in Europe, with the aim of building a collaborative data infrastructure. EUDAT is focusing on five service areas determined by its user communities: safe data replication, data staging, simple store, authentication & authorization, and metadata management. Australia has initiated a similar organization called ANDS, the Australian National Data Service, which aims to make the country's national research data collections manageable, connected, discoverable, and reusable by all of its scientists. China is also making major investments in data science and large-scale data management but has not yet officially announced the formation of a consortium in this area.

In addition to these nation-wide (or continent-level) consortiums that manage large-scale scientific data from different disciplines, a world-wide alliance called the Research Data Alliance (RDA) was formed last year to accelerate data-driven innovation through research data sharing and exchange on a global scale. The US side of this alliance is led by Fran Berman, professor of Computer Science at RPI. Prior to joining RPI, Fran was the director of the San Diego Supercomputer Center (SDSC), and also one of the PIs of the NSF's TeraGrid (now XSEDE) project. RDA replicates the successful model of the Internet Engineering Task Force (IETF), which brought world-wide experts together to develop the standards for the Internet that everybody uses today.

Alliances are being forged and forces combined to tackle the Data Impasse of the 21st century. May the force be with us in this quest towards data intensive scientific discovery.



