A Vision for a National Data and Software Cyberinfrastructure

During my term as an NSF program director in the Office of Advanced Cyberinfrastructure from 2020 to 2022, I had the opportunity to lead the development of NSF’s Blueprint for a National Data and Software Cyberinfrastructure. This blueprint document is publicly available to the community and provides a forward-looking vision for a robust, secure, trusted, performant, scalable, and sustainable data and software cyberinfrastructure (Data and Software CI) ecosystem to enable and accelerate science and engineering research.

This blueprint was prepared based on a comprehensive analysis of existing NSF programs and a wide range of input from the community via advisory bodies, requests for information (such as Data-Focused Cyberinfrastructure Needed to Support Future Data-Intensive Science and Engineering Research and Future Needs for Advanced Cyberinfrastructure to Support Science and Engineering Research), community surveys (such as NSF CSSI Community Survey), and several NSF-funded workshops (such as Community Visioning Workshop on the Future Direction of the CSSI Program and NSF CSSI PI Workshop). 

Key Elements of the Envisioned National Data and Software CI Ecosystem

In this blueprint, the key elements of the envisioned national Data and Software CI ecosystem (as outlined in Figure 1) are listed as follows:

Figure 1: Key elements of the envisioned national Data and Software CI ecosystem to transform data into knowledge and discovery.

Seamless Data Access and Sharing. The report highlights a pressing need for smart data placement and for seamless data access, transfer, streaming, and sharing services that support the computing continuum from the micro level (e.g., edge devices, sensors, IoT) to the macro level (e.g., data centers, clouds, supercomputers) and enable access to data anywhere, anytime, from any device. This could be done either by taking the generated data, filtering/transforming them, and proactively moving them to locations where they will be processed (in-situ or offline), or by moving the computation to where the relevant data are located. Since data are at the core of research and insight for a broad set of academic disciplines, fast and seamless access to data in a usable form is critical for innovative research and educational programs across science and engineering domains.
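
To make these two strategies concrete, here is a minimal Python sketch contrasting them; every name in it (filter_and_forward, ship_computation, the toy sensor data) is hypothetical and purely illustrative, not something prescribed by the blueprint:

```python
# Minimal sketch of the two data-placement strategies described above.
# All names and data here are invented for illustration only.

def filter_and_forward(records, predicate, destination):
    """Strategy 1: reduce data at the source, then move it to the compute site."""
    reduced = [r for r in records if predicate(r)]  # in-situ filtering/transformation
    destination.extend(reduced)                     # proactive data placement
    return len(reduced)

def ship_computation(analysis_fn, data_store):
    """Strategy 2: move the computation to where the data already live."""
    return analysis_fn(data_store)                  # only the result travels back

# Strategy 1: sensor readings are filtered at the edge before transfer.
sensor_readings = [{"id": i, "value": i * 0.5} for i in range(100)]
data_center = []
sent = filter_and_forward(sensor_readings, lambda r: r["value"] > 40.0, data_center)
print(f"transferred {sent} of {len(sensor_readings)} records")

# Strategy 2: a summary statistic is computed in place at the data site.
mean_value = ship_computation(
    lambda store: sum(r["value"] for r in store) / len(store), data_center)
print(f"remote result: {mean_value:.2f}")
```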

Privacy, Security and Integrity. While sharing data and code within and across the public and private sectors is a critical aspect of collaborative scientific discovery, issues of data and code integrity, privacy, and security are paramount in this process. Emerging domains like edge computing, confidential cloud computing, and secure distributed computation introduce new security vulnerabilities and privacy concerns, especially when designed explicitly for, and operated at, extreme scale. These issues must be addressed to ensure controlled and proper dissemination of data and code and to build trust among the various stakeholders.
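
As one small, concrete illustration of the integrity aspect (the workflow below is my own sketch, not a mechanism the blueprint prescribes), a consumer can verify a dataset or code artifact against a published SHA-256 digest before using it, using only Python's standard hashlib module:

```python
# A minimal sketch of one integrity safeguard: verifying an artifact against
# a digest published by its producer before the artifact is used.
import hashlib

def sha256_digest(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def verify_artifact(payload: bytes, expected_digest: str) -> None:
    """Refuse to use an artifact whose content does not match its digest."""
    actual = sha256_digest(payload)
    if actual != expected_digest:
        raise ValueError(f"integrity check failed: {actual} != {expected_digest}")

artifact = b"example dataset contents"
published = sha256_digest(artifact)   # digest published alongside the data
verify_artifact(artifact, published)  # passes; a tampered copy would raise
print("artifact verified")
```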

Integration, Interoperability and Reusability. A broad class of new science and engineering applications must deal with data and software from multiple sources that may be heterogeneous in a variety of ways, such as the type, syntax and semantics of the data, the quality of the data, the platform and interfaces of the software, and the policy regime under which the data and software were produced and by which they can be used. Developing robust, scalable and flexible solutions that would provide interoperability between these intrinsically diverse and disparate data and software components is a key requirement for the science and engineering applications which depend on them. 
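
A toy Python sketch of what such interoperability work looks like at the syntactic level is shown below; the source formats, field names, and unit conventions are all invented for illustration:

```python
# A minimal sketch of schema-level interoperability: records from two
# heterogeneous sources are mapped into one common representation.

def from_source_a(record):
    # Source A reports temperature in Celsius under the key "temp_c".
    return {"station": record["site"], "temp_kelvin": record["temp_c"] + 273.15}

def from_source_b(record):
    # Source B reports temperature in Fahrenheit under the key "temperature_f".
    return {"station": record["id"],
            "temp_kelvin": (record["temperature_f"] - 32) * 5 / 9 + 273.15}

unified = [from_source_a({"site": "A1", "temp_c": 21.0}),
           from_source_b({"id": "B7", "temperature_f": 70.0})]
print(unified)  # both records now share one schema and one unit system
```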

Curation, Provenance and Findability. Significant efforts are needed to curate datasets – to clean, enrich, and standardize the data, record the context as well as semantics associated with the data, and log the analyses performed on the data as well as the code used for those analyses – in order to make them more useful for scientists involved in data discovery and analysis. Effective and proper reuse of data and code demands that the data and code context be appropriately registered and that their semantics be extracted and represented. Research in automated data tagging, metadata generation and registration, semantic representation methods, ontologies, and provenance of data and software will be essential for the discovery and collaborative exploration of all relevant data by researchers across all science and engineering domains.
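
As a toy illustration of provenance capture (the record format below is my own assumption, not a community standard such as W3C PROV), each transformation applied to a dataset can be logged together with the operation that performed it:

```python
# A minimal sketch of provenance logging: every transformation applied to a
# dataset is recorded so the result can later be traced back to its inputs.
import datetime

def apply_with_provenance(data, fn, log):
    log.append({
        "operation": fn.__name__,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "input_size": len(data),
    })
    return fn(data)

def drop_negative(values):
    return [v for v in values if v >= 0]

def normalize(values):
    peak = max(values)
    return [v / peak for v in values]

provenance = []
result = apply_with_provenance([4.0, -1.0, 2.0], drop_negative, provenance)
result = apply_with_provenance(result, normalize, provenance)
print(result)      # the cleaned, normalized data
print(provenance)  # a machine-readable record of how it was produced
```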

Analytics and Visualization. Synthesis of the information content and deriving insight from massive, dynamic, ambiguous, and even conflicting data can be achieved through advanced data analytics and visualization techniques. Transforming data into new knowledge and understanding is a crucial step for advancement in science and engineering, and this can be done using advanced data analysis tools and collaborative visual interfaces. A new generation of data analytics and visualization systems and services will help absorb vast amounts of data and enhance researchers’ ability to interpret and analyze otherwise overwhelming data. In this way, researchers will be able to detect the expected and discover the unexpected, uncovering hidden associations within vast data sets and making new scientific breakthroughs.
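
Even a tiny example shows this step from raw data to interpretable form. The sketch below computes summary statistics and a coarse text histogram using only the Python standard library; it merely stands in for the far richer interactive analysis and visualization services the blueprint envisions:

```python
# A minimal stand-in for data analytics and visualization: summarize a
# sample and render a coarse text histogram so its structure is visible.
import random
import statistics

random.seed(0)
samples = [random.gauss(10, 2) for _ in range(1000)]

print(f"mean={statistics.mean(samples):.2f}  stdev={statistics.stdev(samples):.2f}")

# Bin the data and draw one row of '#' marks per bin.
lo, hi, bins = min(samples), max(samples), 10
width = (hi - lo) / bins
counts = [0] * bins
for s in samples:
    counts[min(int((s - lo) / width), bins - 1)] += 1
for i, count in enumerate(counts):
    print(f"{lo + i * width:6.2f} | {'#' * (count // 10)}")
```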

Software Frameworks, Abstractions and Libraries. New abstractions and programming frameworks will be necessary to simplify the challenges of programming scalable and parallel systems, while achieving maximal performance through exploitation of parallelism for scheduling computation, communication, and output for interactive as well as batch-oriented science and engineering applications. The proper set of abstractions must be provided to enable applications to specify their resource requirements and execute efficiently in an environment with shared resources. The development of domain-specific as well as cross-domain robust, reliable and efficient software frameworks, abstractions, and libraries should continue since they are critical for the rapid advancement of science and engineering.
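
To hint at what such an abstraction might look like (the classes below are hypothetical, not an existing API), an application could declare its resource requirements explicitly and let a scheduler decide whether a shared pool can host it:

```python
# A minimal sketch of a resource-requirement abstraction for shared systems.
# ResourceRequest and ResourcePool are invented names for illustration.
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    cores: int
    memory_gb: int
    gpus: int = 0

@dataclass
class ResourcePool:
    cores: int
    memory_gb: int
    gpus: int

    def can_satisfy(self, req: ResourceRequest) -> bool:
        """A toy admission check; real schedulers also weigh time and fairness."""
        return (req.cores <= self.cores and
                req.memory_gb <= self.memory_gb and
                req.gpus <= self.gpus)

pool = ResourcePool(cores=64, memory_gb=256, gpus=4)
job = ResourceRequest(cores=16, memory_gb=64, gpus=1)
print(pool.can_satisfy(job))  # True: the job fits on this pool
```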

Resource Allocation, Scheduling and End-to-End Workflow Management. End-to-end data processing and analysis is generally performed via data analysis pipelines and workflows. For this reason, many science and engineering communities depend on access to services that enable the creation of robust, reliable, efficient and scalable scientific workflows, and integration of the diverse data, computing, analysis, and monitoring capabilities. Comprehensive tools and best practices are needed to ensure that existing analysis pipelines are efficient, reliable, and scalable and that the results can be replicated at some future point in time if needed. A next generation of resource allocation, task scheduling, and end-to-end workflow management solutions is needed to allow for the efficient and scalable processing, analysis, visualization, and sharing of large datasets generated among highly diverse and interdisciplinary groups.
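
The core structure underlying such workflows is a directed acyclic graph of tasks executed in dependency order. The sketch below uses Python's standard-library graphlib (Python 3.9+) for the ordering; the task names are invented, and production workflow systems of the kind the blueprint envisions add scheduling, monitoring, retries, and distribution on top of this skeleton:

```python
# A minimal end-to-end workflow expressed as a DAG of named tasks.
from graphlib import TopologicalSorter

def ingest():     print("ingest raw data")
def clean():      print("clean and standardize")
def analyze():    print("run analysis")
def visualize():  print("render figures")

tasks = {"ingest": ingest, "clean": clean, "analyze": analyze, "visualize": visualize}
# Each entry maps a task to the set of tasks it depends on.
dag = {"clean": {"ingest"}, "analyze": {"clean"}, "visualize": {"analyze"}}

# static_order() yields tasks with all dependencies satisfied first.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()  # runs ingest -> clean -> analyze -> visualize
```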

NSF’s overarching strategies in building the national Data and Software CI ecosystem

The blueprint lays out NSF’s overarching strategies in building this envisioned national Data and Software CI ecosystem as follows:

1) Support Domain-specific and Customized Data and Software CI: Some science and engineering domains have very specific data and software requirements unique to their disciplines, and these domains need to be well supported with Data and Software CI resources tailored and optimized for their applications. NSF will continue to support domain-specific and customized Data and Software CI solutions for such domains. NSF’s existing programs such as CSSI, HDR, and CDS&E, together with discipline-specific programs offered by NSF, are well-suited for such efforts.

2) Prioritize and Invest in Transdisciplinary Community Data and Software CI: NSF will prioritize and invest in the key elements of a broadly accessible, interoperable, and reusable transdisciplinary community Data and Software CI. The lack of easily reusable capabilities can sometimes result in an ecosystem of duplicated functionality. For this reason, not only the data, but also the CI tools and services should be FAIR – findable, accessible, interoperable, and reusable. Such new CI services and capabilities should allow for seamless integration and interoperability with existing CI; support a wide variety of science and engineering drivers, users, and usage modes; and foster the initiation of future modes of discovery.

3) Close the Gap Between Research, Development and Sustained Production of CI: NSF is developing a strategy that will close the gap between research, development, and sustained production of Data and Software CI. The envisioned Pathways to Production will balance innovation with stability and continuity in production-quality Data and Software CI, while ensuring that there are opportunities to explore innovations that address emerging requirements, novel technologies, and concerns such as reproducibility, privacy, and trust, and to transition these innovations to production when appropriate. It is essential to have a clear plan for scaling Data and Software CI research prototypes and early implementations developed through other NSF programs mentioned above and transitioning them to production in order to increase productivity and ensure sustained scientific innovation across science and engineering domains.

4) Complement Data and Software CI with Other NSF CI Efforts: Data and Software CI needs to be complemented with shared-use computing, networking, and data CI; sophisticated research instruments and platforms; and robust, trustworthy services and data products that are openly, reliably, and pervasively accessible to a broad community of researchers and/or educators. NSF recognizes that such research instrumentation is critical for advances in fundamental science and engineering, and it will continue investing in the creation of this infrastructure through its existing programs such as CICI, CC*, MRI, Mid-scale RI-1 and RI-2, LCCF, and ACCS. The size of these investments can range from small sensor systems for data acquisition to large repositories for storage and high-end computing facilities for the analysis of massive data sets.

5) Promote Data and Software CI Community Building: NSF will continue to invest in developing a broad and diverse Data and Software CI community, promoting coordination and exchange between the CI and research communities, and facilitating the dissemination of best practices for the design, development, and operation of CI resources and capabilities. NSF’s BD Hubs program is a major step towards achieving this goal by serving as a venue for building and fostering local and regional data-related activities in city, county, and state governments, in local industry and non-profits, and in regional academic institutions. NSF’s new investments in CI Centers of Excellence (CoE) take this a step further and aim to facilitate community building and sharing by supporting hubs of expertise and innovation targeting specific areas, aspects, or stakeholder communities of the research CI ecosystem. Supported CI CoEs provide expertise and services related to CI technologies and solutions; gather, develop, and communicate community best practices; and serve as readily available resources for both the research community and the CI community.

6) Invest in Data and Software CI Learning and Workforce Development: All Data and Software CI programs at NSF will continue to include an integrated component focused on the training and professional development of a skilled workforce with expertise ranging from CI research and development to CI deployment and its application to different domains. This is critical in preparing, nurturing, and growing the national scientific research workforce for creating, utilizing, and supporting advanced CI to enable and potentially transform fundamental science and engineering research and contribute to the Nation's overall economic competitiveness and security. In addition to this integrated learning and workforce development (LWD) component across Data and Software CI programs, NSF’s dedicated LWD program in CI, CyberTraining, will continue to invest in innovative and scalable training, education, and professional development activities which will lead to transformative changes in the state of research workforce preparedness for advanced CI-enabled research in the short and long terms.

Data and Software CI Pathways to Production

The blueprint states that NSF investments in Data and Software CI pathways to production will be under three broad categories (as outlined in Figure 2) which are described below.

Figure 2: Data and Software CI pathways to production.

1) Data and Software CI Research. NSF recognizes and supports foundational and translational research to catalyze core Data and Software CI innovations essential to address disruptive changes in applications and technologies as well as the emergence of new concerns (e.g., energy efficiency, privacy, trust, transparency). There are multiple open research issues leading to advances in technologies for storing, accessing, sharing, integrating, and analyzing data, as well as for developing, managing and sustaining complex software. Fundamental understanding is needed not only in modeling and theory but also in designing new architectures, novel visualizations, and the effective utilization and optimization of data, software, computing, and communications resources. Insertion of these advances into the next generation of Data and Software CI needs to occur through close collaboration between the researchers, user communities, developers, and the providers of these systems, tools, and resources. To address new challenges in the data-to-knowledge pipeline, there should be continuous and increasing investments in research on Data and Software CI technologies for large-scale data collection, management, analysis, interpretation, preservation, and security. Machine-learning approaches, including deep learning systems, are also needed to build better data-driven models that can be used to augment human decision-making reliably. NSF’s current programs such as PPoSS, OAC Core, HDR, and CDS&E, and new initiatives such as AI Research Institutes, will continue to support research in novel and innovative techniques in Data and Software CI.

2) Data and Software CI Development. NSF is heavily invested in supporting the development of a robust, secure, trusted, performant, scalable, and sustainable Data and Software CI ecosystem to enable and accelerate science and engineering research. Consistent with the Federal Big Data Research and Development Strategic Plan, NSF recognizes that there are significant data handling challenges common across disciplines, while some challenges are specific to particular disciplines. Some aspects of Data and Software CI may focus on specific application domains, while others are common and shared across multiple research domains. Investments in both categories are critical for creating the envisioned Data and Software CI ecosystem that drives new thinking and transformative discoveries in all areas of research and education. The former is important so that domains with specific and complex data and software challenges can be well supported with resources optimized for those applications; and the latter so that a shared infrastructure can offer access to resources that an individual community alone would not be able to build and sustain. NSF’s current programs such as CSSI will continue to support the development of new Data and Software CI systems and services that are findable, accessible, interoperable, reusable, provenance traceable, and sustainable.

3) Data and Software CI Sustained Production. NSF aims to enable the deployment and operation of sustained production-quality Data and Software CI systems, tools and services. For this reason, NSF is developing a strategy that balances innovation with stability and continuity in production-quality Data and Software CI while ensuring that there are opportunities to explore innovations and to transition these innovations to production when appropriate. It is essential to have a clear plan for scaling Data and Software CI research prototypes and early implementations developed through other NSF programs mentioned above and transitioning them to production in order to increase productivity and ensure sustained scientific innovation across science and engineering domains. NSF is planning new initiatives complementing its existing programs in this area (such as POSE and the CSSI Sustainability track), closing the gap between research, development, and sustained production of Data and Software CI.

CSSI’s new award class on Transition to Sustainability, first introduced in 2021, is a great step towards this vision. This award class targets groups who would like to execute a well-defined sustainability plan for existing CI with demonstrated impact in one or more areas of science and engineering supported by NSF. The sustainability plan should enable new avenues of support for the long-term sustained impact of the CI. The POSE program, introduced in 2022, supports projects to establish a sustainable Open-Source Ecosystem based on a robust open-source product that shows promise in its ability both to meet an emergent societal or national need and to build a community to help develop it.

In sum, the envisioned Pathways to Production will balance innovation with stability and continuity in production-quality Data and Software CI, while ensuring that there are opportunities to explore innovations and to transition them to production when appropriate, in order to increase productivity and ensure sustained scientific innovation across science and engineering domains.

I highly recommend that each of you read the full blueprint document, which is available at this link.
