CMIP7 Participation for Data Managers

1. Installation and configuration

All information on the Earth System Grid Federation (ESGF) can be found here. [Note for editors: check whether this link needs updating.]

[Note for editors: the documentation below still needs (a) information about QC procedures for Data Managers and (b) a graphic of the ESGF NG setup (Forrest/Phil have sent examples to IPO).]

1.1 ESGF Software

Description

The ESGF Data Node software stack enables sites hosting Earth system data to make it available to the community over several transfer protocols, including HTTP(S). ~~Index nodes enable search for hosted data via data publishing to the index, and these nodes include a search API and web frontend.~~ [Note for editors: the struck-through sentence may not be relevant if indexing is deprecated.] Identity nodes manage user accounts. Nodes run as Docker containers and can be deployed via Ansible Playbooks or Helm Charts in a Kubernetes environment.

New and existing installations

For new or existing ESGF node installations, first read the following document on ESGF policies, as this will influence the type of installation you need to deploy. [Note for editors: this document needs a proper link and updating.]

1.2 How to install

Requirements, setup and usage documentation

Software Stack: The ESGF software stack requires Red Hat Enterprise Linux or a Rocky/Alma Linux distribution. Administrators must have either full sudo (root) privileges on the host or access to a Kubernetes cluster. The services are meant to run on webserver-grade hardware. [Note for editors: a practical example with a cost estimate is needed here.] For data-sharing nodes, the storage holding your data must be mounted on the node.
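As a rough pre-installation check, the operating system requirement above could be verified with a short script. This is only a sketch: the function names are hypothetical, and it assumes that the host describes itself in `/etc/os-release` with an `ID` of `rhel`, `rocky` or `almalinux`. It does not check sudo privileges, hardware or storage mounts.

```python
from pathlib import Path

# Assumption: these os-release ID values correspond to the distributions
# named in the requirements (Red Hat Enterprise Linux, Rocky, Alma).
SUPPORTED_IDS = {"rhel", "rocky", "almalinux"}


def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=value lines (os-release style) into a dictionary."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key.strip()] = value.strip().strip('"').strip("'")
    return fields


def is_supported_distribution(fields: dict[str, str]) -> bool:
    """Return True if the ID field names one of the supported distributions."""
    return fields.get("ID", "").lower() in SUPPORTED_IDS


if __name__ == "__main__":
    os_release = Path("/etc/os-release")
    if os_release.exists():
        info = parse_os_release(os_release.read_text())
        verdict = "supported" if is_supported_distribution(info) else "not listed"
        print(f"{info.get('PRETTY_NAME', 'unknown distribution')}: {verdict}")
    else:
        print("No /etc/os-release found; cannot determine the distribution.")
```

The check is deliberately strict and only looks at `ID`; derivative distributions that merely list `rhel` in `ID_LIKE` are reported as not listed.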

ESGF Docker: Instructions and links to any issues can be found here.

Ansible: Legacy documentation is available here. [Note for editors: check whether this is still valid.]

Metagrid user interface: To install the Metagrid UI, which end users use to search for and download data, read the documentation here and see the GitHub repo here. [Note for editors: check whether these links need updating.]

2. Dataset publication

Requirements

Publishers to ESGF must have an existing Data Node installed at their site. Although the publisher software (from v5.x onwards) does not need to run on the Data Node, it does require a data mount so that the software can access the data files.

2.1 Dataset preparation

The ESGF publication process requires robust and effective data management, which can be a burden for data managers. The ESGF esgprep toolbox is standalone software that helps data providers and data node managers prepare their data for publication to an ESGF node according to ESGF best practices. It can be used to fetch required configuration files, apply the Data Reference Syntax on local filesystems and/or generate mapfiles for ESGF publication.

Full details of esgprep and instructions for its use, provided by the team at Institut Pierre-Simon Laplace (IPSL), can be found here.
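To give a feel for what a mapfile entry contains, the sketch below builds a single line for one data file. It is illustrative only and is not a replacement for esgprep. It assumes the pipe-delimited layout commonly seen in ESGF mapfiles (dataset ID and version, file path, size in bytes, modification time, checksum, checksum type); the function name and the dataset ID in the usage note are hypothetical. Check the esgprep documentation for the authoritative format.

```python
import hashlib
from pathlib import Path


def mapfile_line(dataset_id: str, version: str, file_path: Path) -> str:
    """Build one pipe-delimited mapfile-style line for a single file.

    Assumed layout (verify against the esgprep documentation):
    dataset_id#version | path | size | mod_time=... | checksum=... | checksum_type=SHA256
    """
    file_path = Path(file_path).resolve()
    stat = file_path.stat()

    # Hash the file in chunks so large NetCDF files are not read into memory at once.
    sha256 = hashlib.sha256()
    with file_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            sha256.update(chunk)

    fields = [
        f"{dataset_id}#{version}",
        str(file_path),
        str(stat.st_size),
        f"mod_time={stat.st_mtime:.6f}",
        f"checksum={sha256.hexdigest()}",
        "checksum_type=SHA256",
    ]
    return " | ".join(fields)
```

For example, `mapfile_line("CMIP6.CMIP.EXAMPLE-INST.EXAMPLE-MODEL.historical.r1i1p1f1.Amon.tas.gn", "20250101", Path("/data/example/tas.nc"))` would return one such line for a hypothetical file. In practice esgprep derives the dataset ID from the directory structure, which this sketch does not attempt.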

2.2 Publisher introduction

The esg-publisher (or esgcet) Python package contains a collection of command-line utilities to scan, manipulate and push dataset metadata to an ESGF index node. The basic publication process consists of several required steps and some optional ones. Publisher functionality is available via several submodules/classes in the package. Please refer to the user documentation and the GitHub issues page.

2.3 ESG-Publisher software installation

Requirements

  1. A Python environment, created using venv, conda, miniforge/mamba etc.
  2. A mountpoint mapping to the data on the same host as the publisher software installation, so that the publisher scan utility (e.g. autocurator) has access.
  3. Basic dataset information provided in the ESG mapfile format, for example generated using the esgf-prepare/esgmapfile utility.
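The three requirements above can be checked before running the publisher with a small preflight script. This is a minimal sketch under stated assumptions: the function names are hypothetical, conda/mamba environments are detected via the `CONDA_PREFIX` environment variable, and the data mount and mapfile paths are supplied by the caller. It does not validate the contents of the mapfile.

```python
import os
import sys
from pathlib import Path


def in_isolated_environment() -> bool:
    """Return True when running inside a venv or a conda/mamba environment."""
    in_venv = sys.prefix != sys.base_prefix
    # Assumption: an activated conda/mamba environment sets CONDA_PREFIX.
    in_conda = bool(os.environ.get("CONDA_PREFIX"))
    return in_venv or in_conda


def find_problems(data_mount: Path, mapfile: Path) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if not in_isolated_environment():
        problems.append("no venv or conda environment appears to be active")
    if not data_mount.is_dir():
        problems.append(f"data mount {data_mount} is not a directory on this host")
    elif not os.access(data_mount, os.R_OK | os.X_OK):
        problems.append(f"data mount {data_mount} is not readable by this user")
    if not mapfile.is_file():
        problems.append(f"mapfile {mapfile} does not exist")
    return problems
```

A caller might run `find_problems(Path("/data/cmip7"), Path("/home/publisher/example.map"))` (both paths hypothetical) and stop if the returned list is not empty.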

2.4 Dataset publication

Full details of the dataset publication process, including installing esgcet with pip install, can be found here.

3. Dataset retraction

3.1 Retraction process

The esgunpublish command retracts or, if specified, deletes one or more datasets. The output of this command is either a success or failure message accompanied by the ID of the dataset that was retracted. Exercise caution when deleting datasets: if replicas have been made or if you will be republishing, retract rather than delete outright. Follow the instructions here and, for an example, check out the Jupyter notebook.
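The guidance on retracting versus deleting can be written down as a small decision helper, for instance as a guard in a site's own wrapper scripts. This sketch only encodes the rule stated above; the names are hypothetical and it does not call esgunpublish itself.

```python
from enum import Enum


class Action(Enum):
    RETRACT = "retract"
    DELETE = "delete"


def choose_action(
    replicas_exist: bool,
    will_republish: bool,
    delete_requested: bool = False,
) -> Action:
    """Pick retraction or deletion following the guidance in this section."""
    # Replicas or a planned republication always mean retract, even if
    # deletion was explicitly requested.
    if replicas_exist or will_republish:
        return Action.RETRACT
    # Retraction is the default; deletion happens only on explicit request.
    return Action.DELETE if delete_requested else Action.RETRACT
```

With this rule, `choose_action(replicas_exist=True, will_republish=False, delete_requested=True)` still yields `Action.RETRACT`, which mirrors the caution advised for deletions.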