NIRD Research Data Archive - Preservation Plan

V3.1
29 January, 2025

Introduction

The mission of the NIRD Research Data Archive (NIRD RDA) is to ensure research data produced by Norwegian researchers remains widely discoverable, accessible and reusable for at least 10 years after the data has been deposited. Data deposited in the NIRD RDA comes from different disciplines, is stored in a variety of formats and can be of any size. The inherent challenges posed by these characteristics, including size, format, and community diversity, necessitate a comprehensive preservation strategy. This document describes the preservation plan that the NIRD RDA follows to maintain long-term data accessibility during the archival period.

Scope and exclusion

This document describes the preservation plan for the archive to ensure that:

  • Authentic, reliable instances of the datasets are accessible.
  • Integrity, security and quality of the datasets are maintained.
  • Adequate management strategies for the datasets are in place during the archival period.

This plan does not consider the wider National e-Infrastructure for Research Data (NIRD) that is not part of the implementation of the NIRD RDA. The plan takes as its guidelines the OAIS reference model, the FAIR principles and the guidance from the Research Council of Norway.

Characterisation of the archive datasets

The NIRD RDA provides archival and publication services for datasets from all disciplines resulting from Norwegian research activities, provided the data is not sensitive (please refer to the Terms of Use) and there is no requirement to store the data in another archive (for example, a domain-specific one). There is no restriction on the size of a dataset or the number of files it may contain. The archive currently holds datasets spanning a range of sizes, from a few megabytes with a limited number of files to several hundred terabytes comprising hundreds of thousands of files.

Scientific disciplines such as Biology and Geoscience primarily use the NetCDF format to store their data, which ranges from observational and analytical data to simulation-derived data. The format has been adopted by many disciplines as a de facto standard. Disciplines also use a variety of other formats, including PNG and JPEG for images and ASCII files for Markdown and text. In some cases, researchers use the ZIP or TAR format to package and compress their data before archiving. The NIRD RDA encourages researchers to choose open formats for their data as described in the list of open file formats.

Community watch

The NIRD RDA actively follows current or new trends appearing in the wide variety of disciplines served by the archive, by adopting the three approaches outlined below. All three rely on active communication between the archive and the communities to ensure the reuse of data is maximized.

The NIRD RDA requires each dataset to be linked to a Data Manager/Contact Person who acts as the contact person for the archived dataset. Users of the dataset who have queries about the data can contact the Data Manager/Contact Person to resolve their query. The Data Manager/Contact Person may find a need to update metadata for a dataset, or to replace or update an existing dataset (in which case a new version of the dataset can be created with updated metadata). In the context of the archive’s activities, such as identifying datasets that have expired and become candidates for deletion or migration to different storage classes, the Data Manager/Contact Person can be contacted to understand the impact. The Data Manager/Contact Person can also initiate contact with the NIRD RDA in the case when the deposited dataset needs to be migrated to a new format.

Sigma2 also offers researchers Advanced User Support services through which Depositors or stakeholders of archived datasets can work with the archive DevOps team to improve the reuse of archived datasets. To date, such support has resulted in the extraction of metadata from NetCDF files to populate a domain-specific catalog, and in the integration of the archive with domain-specific services and other portals.

Sigma2 regularly meets with communities and conducts yearly surveys to gather feedback on its services including the NIRD RDA. In addition, Sigma2 has regular workshops with the heavy users (several datasets deposited per year, large data volume) to understand issues and future directions.

The NIRD RDA’s principles

  • The NIRD RDA strives to support Open Science, FAIR and the national guideline for sharing and reusing research data by ensuring each phase of the OAIS-based archive addresses the FAIR principles.
  • The NIRD RDA strives to support data from any community, of any size and with any open formats.
  • The NIRD RDA offers long-term preservation of the data.
  • The NIRD RDA strives to support data-driven science and data reuse.

The preservation strategies

The archive adopts the following strategies to ensure datasets are authentic, secure and accessible throughout their lifetime:

  • The Depositor of a dataset agrees to the Terms and Conditions that allow the archive to manage and distribute the datasets.
  • The Depositor is responsible for the integrity of the dataset at the time of deposit, and for the eligibility and compliance of the datasets with the Terms and Conditions and GDPR requirements.
  • The datasets are validated (metadata are checked and data is checked to ensure authenticity) before publication which results in a DOI being issued.
  • The integrity of datasets is checked and maintained throughout the dataset’s lifetime.
  • All metadata and data comply with GDPR guidelines and IPR and copyright regulations of the dataset owner’s institution.
  • Deposited datasets are not deleted before the end of the retention time specified in the Depositor contract (at least 10 years), except in extraordinary circumstances. After the retention time, datasets may be deleted only for compelling technical reasons. The DOI of a deleted dataset will resolve to a tombstone record containing all the public metadata for the dataset.

Roles and Responsibilities

For a list of roles and responsibilities, please refer to the Glossary in Terms of Use.

Sustainability plans and funding

Sigma2 ensures operation of the service for the validity period of Sigma2’s mandate. Sigma2, established in 2015, is mandated by the Research Council of Norway (RCN) and the BOTT universities with a horizon of 10 years. Every 5 years there is a mid-term evaluation which triggers the renewal of the next period. Therefore, the horizon in which Sigma2 operates is always between 5 and 10 years. Although the mandate is based on a collaboration between the RCN and the BOTT universities, the funding to store and operate the NIRD RDA comes solely from the RCN, as these data are considered of public value.

Preservation plan implementation

The archive follows the OAIS reference architecture which divides up the preservation process into functions: ingest, archive storage, data management and access. The preservation plan covers all functions. The functions are implemented as described below. See also the link to the archive workflow.

The Ingest function

The Ingest function covers the deposit of the dataset and related metadata. Potential Depositors are required to request registration with the archive. The Archive Manager assesses the request and approves it if the user is a member of a Norwegian institute. The archive implements an extension of the Dublin Core metadata standard as described here, and Depositors are required to agree to the Terms of Use. This also covers the archive taking responsibility for the management and distribution of the dataset as described in the Depositor Agreement.

The NIRD RDA encourages researchers to deposit datasets in an open format as described in the user-guide. If the data format is not on the supplied list, the Archive Manager will contact the Depositor to understand the reason. If the format is generally accepted by the community and has wide support, it is then added to the list of recommended formats. If it is not, the Archive Manager will suggest that the Depositor migrate the data to one of the recommended formats.

Depositors are also required to select a license for the dataset. The default license is CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). Depositors can ask the Archive Manager for a different license if the default license does not meet their needs. Each request is evaluated by the Archive Manager and the NIRD RDA product owner on a case-by-case basis, weighing the needs of the Depositor against the need to ensure the data are as open as possible.

The metadata supplied by the Depositor and the data form the Submission Information Package (SIP) in the OAIS model. The metadata is supplied via a form which is submitted to the archive and stored in a database. Each dataset gets a unique identifier, internal to the archive, which is assigned to the metadata; the data are stored under a well-defined structure based on this identifier. The ingest process is logged to facilitate troubleshooting of any issues that arise during ingestion.

The Depositor is notified once the data has been ingested and the status of the metadata is visible to the user via the metadata status flag.

The metadata is checked by the Archive Manager before publication. Only rudimentary checks of the metadata are made (to make sure the required metadata contains understandable information at a basic level). More detailed checks of whether the metadata is understandable by the targeted community are the responsibility of the Data Manager/Contact Person and Depositor. Their acknowledgement is implied when they request the dataset to be published. As a secondary check, the Archive Manager confirms by email with the Depositor and Data Manager/Contact Person that all the metadata are sufficient for the targeted community before publishing the dataset (which includes issuing the DOI).

Users can upload new datasets, or create a new version of an existing, published dataset. When the user chooses to create a new version of a dataset, the archive maintains a link between the existing dataset and the new version by using the Dublin Core ‘hasVersion’ and ‘isVersionOf’ terms. The archive creates a copy of the metadata for the new version and the Depositor is free to update the metadata as needed.
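
The version linking described above can be sketched as follows. This is a minimal illustration, not the archive's implementation: the record layout, the create_new_version helper and the dataset identifiers are hypothetical, and only the Dublin Core term names hasVersion and isVersionOf come from the text.

```python
import copy


def create_new_version(records, existing_id, new_id):
    """Copy the metadata of an existing dataset into a new version and link both.

    `records` maps a dataset identifier to a metadata dict (hypothetical layout).
    """
    old = records[existing_id]
    new = copy.deepcopy(old)            # the Depositor may edit this copy freely
    new["identifier"] = new_id
    new["hasVersion"] = []              # the new version has no successors yet
    new["isVersionOf"] = existing_id    # new version points back to its predecessor
    old.setdefault("hasVersion", []).append(new_id)  # predecessor points forward
    records[new_id] = new
    return new


# Hypothetical identifiers, for illustration only.
records = {"ds-1": {"identifier": "ds-1", "title": "Example dataset"}}
v2 = create_new_version(records, "ds-1", "ds-2")
```

Keeping the link in both directions means that a landing page for either version can point a reader to the other one.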

New datasets and new versions of existing datasets then go through a publication process. The Depositor is required to verify that the number of files and the checksums of the data files agree with the values the Depositor computed before upload. The Depositor is also required to verify they have supplied all the mandatory metadata and any optional metadata that may be necessary. The Archive Manager manually checks that the metadata is understandable at a very basic level to ensure the metadata terms are meaningful. If the metadata is not sufficient, the dataset is put back into the preparation phase and the Depositor is notified so that they can update the metadata.

Once the data and metadata have been verified, the dataset is published by requesting a Digital Object Identifier (DOI) from DataCite. Once the DOI has been issued, it is attached as metadata to the dataset and the dataset is made read-only. The publicly accessible DOI resolves to a landing page maintained and operated by the archive that contains metadata information for the dataset and a link to the dataset. If a dataset has been deleted, a metadata record called a tombstone record is kept, which contains metadata for the dataset and a reason for deletion.

The Archive Storage function

The data is copied from the ingest area to the archive area file-by-file, with the hash for each file being computed. The data are stored under a folder with the top-level being the dataset identifier. Metadata for each file are stored in the database table of contents. The dataset metadata supplied by the Depositor and the table of contents generated by the archive (which includes fixity information, file paths, sizes, last-modified times and formats) together form the Archival Information Package (AIP) that is used to manage the dataset in the archive. The Depositors are notified of the success or failure of the archiving. In the case of failure, the Archive Manager works with the Depositor to resolve the issues with the archiving of the data. Once the data is archived, the archive makes a point-in-time copy of the dataset and replicates the dataset to another storage managed by Sigma2.
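
The file-by-file copy described above can be sketched as follows. This is an illustration under stated assumptions rather than the archive's code: the plan does not name the hash algorithm, so SHA-256 is assumed, and the archive_dataset function, the directory layout and the table-of-contents field names are hypothetical.

```python
import hashlib
from pathlib import Path


def archive_dataset(ingest_dir, archive_root, dataset_id, chunk_size=1024 * 1024):
    """Copy every file under `ingest_dir` to `archive_root/dataset_id`.

    The hash is computed while the file is copied, so each file is read once.
    Returns a table of contents with one entry per file.
    """
    ingest_dir = Path(ingest_dir)
    target_root = Path(archive_root) / dataset_id  # top-level folder is the dataset id
    toc = []
    for source in sorted(p for p in ingest_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(ingest_dir)
        target = target_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        digest = hashlib.sha256()  # assumed algorithm
        with open(source, "rb") as src, open(target, "wb") as dst:
            for chunk in iter(lambda: src.read(chunk_size), b""):
                digest.update(chunk)
                dst.write(chunk)
        stat = source.stat()
        toc.append({
            "path": relative.as_posix(),
            "size": stat.st_size,
            "modified": stat.st_mtime,
            # File extension as a crude stand-in for real format identification.
            "format": source.suffix.lstrip(".").lower() or "unknown",
            "sha256": digest.hexdigest(),
        })
    return toc
```

Hashing during the copy avoids a second pass over large datasets; the resulting entries are the kind of per-file metadata a table of contents could hold.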

Once datasets have been published, the datasets are made read-only. The archive possesses the ability to delete datasets if they contravene copyright or if there is a valid reason for deletion. In the latter case, depending on the reason, the data are made inaccessible and marked as eligible for deletion but not actually deleted until dialog with the Depositor has been conducted.

The Data Management function

The data management function covers the metadata management. The NIRD RDA uses a Postgres database hosted on reliable, backed-up SSD storage. All metadata for the SIP, AIP and the Dissemination Information Package (DIP) are stored in the database. The database schema is arranged by dataset where each dataset has both dataset metadata and system metadata. All the information stored is necessary for the Ingest, Archive Storage, Administration, Preservation Planning and Access functions.

The Access function

The Access function covers the means by which users find, view and access the archived datasets. Each published dataset has a DOI that resolves to a landing page that is hosted by the archive system. The page contains all the publicly available metadata as well as the license for using the dataset, a link to the table of contents and a link to the dataset (this is the DIP). Users can anonymously download a complete dataset or choose the subset of files they wish to download.

The archive provides a web-based search function based on the widely used Apache Solr platform. Users can search for datasets of interest based on the exposed metadata (for example, terms in the title, description, subject and creator fields can be searched). Although the archive provides metadata that is generic enough to support a wide variety of disciplines, researchers have used the DOI in their more detailed metadata registries to enable more fine-grained search of the data.

The archive also provides an API that makes use of the Apache Solr platform to provide basic search functionality. In addition, an OAI-PMH interface exists for harvesting by other metadata catalogues. The basic search returns metadata in JSON format with a link to the dataset, which can be accessed via the S3 protocol.
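
As an illustration of how an external catalogue might address the OAI-PMH interface, the sketch below builds a harvesting request URL. The base URL is a placeholder and the oai_pmh_url helper is hypothetical; the verb and metadataPrefix arguments and the oai_dc format are defined by the OAI-PMH protocol itself, not by this plan.

```python
from urllib.parse import urlencode


def oai_pmh_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL; arguments set to None are left out."""
    query = {"verb": verb}
    query.update({key: value for key, value in arguments.items() if value is not None})
    return f"{base_url}?{urlencode(query)}"


# Placeholder endpoint, not the archive's real address.
url = oai_pmh_url(
    "https://archive.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",  # unqualified Dublin Core, which OAI-PMH repositories must support
)
```

Because from is a reserved word in Python, a selective harvest by date would pass it as oai_pmh_url(base, "ListRecords", metadataPrefix="oai_dc", **{"from": "2025-01-01"}).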

The Administration function

The Administration function covers the management and operation of the archive. The NIRD RDA administrators follow up on inquiries from users through a ticketing system, and operations are supported by NRIS, ensuring that the NIRD RDA service and the NIRD infrastructure maintain a high level of availability, security and reliability.

The Preservation Planning function

The preservation planning function ensures that the data remains usable over its lifetime. This function has two parts: ensuring the service continues to provide access to the datasets (bit-level preservation) and ensuring the datasets remain understandable.

Bit-level preservation: this is performed by the archive, which regularly checks datasets to ensure their integrity. The archive team is included in storage capacity and infrastructure planning, and works with NRIS to migrate the data to new infrastructure as seamlessly as possible. To date, the NIRD RDA has undergone three successful migrations since its inception in 2014.

Ensuring data understandability: this applies to both the targeted users and users from new domains. The archive does not have in-house expertise in the domains that deposit data. However, each dataset requires a Data Manager/Contact Person to be assigned who is the contact person for the dataset. The Data Manager/Contact Person is responsible for advising the archive of any changes needed for the dataset (such as creating a new version of a dataset including updated metadata, or advising that data needs to be migrated to a new format).

Retention period: at least 10 years after the data has been deposited (see Dataset End-of-life in the Depositor Agreement).

Datasets reaching the end of their retention period (see Dataset End-of-life in the Depositor Agreement) are reappraised in collaboration with domain experts and the Depositor. If the dataset is considered to still be of value, the retention period will be extended (the period may vary and will be defined on a case-by-case evaluation). If the appraisal is not possible due to unavailability of the Depositor, the datasets are kept unless there is a compelling technical reason to delete them. 

If a dataset is a candidate for deletion after the retention time, the impending deletion will first be announced on the NIRD RDA front page and the dataset’s landing page. This announcement will be visible for a period of one year. During this grace period, anyone who has an interest in maintaining access to the dataset can renew the retention period by contacting the Archive Manager. If a dataset is deleted, the DOI will resolve to a tombstone record containing all the public metadata for the dataset.
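
The timing rules above (a retention period of at least 10 years, followed by a one-year announcement before any deletion) can be expressed as a small date calculation. This is a sketch only: the function names are hypothetical, and treating an announcement made before the end of the retention period as starting on the day retention ends is an assumption, not something the plan states.

```python
from datetime import date


def add_years(day, years):
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def earliest_deletion(deposited, announced, retention_years=10, grace_years=1):
    """Return the first date on which a dataset could be deleted.

    The grace period is counted from the announcement, but never from a date
    earlier than the end of the retention period (assumption).
    """
    retention_end = add_years(deposited, retention_years)
    grace_start = max(announced, retention_end)
    return add_years(grace_start, grace_years)


# Deposited in March 2015, deletion announced in June 2025.
print(earliest_deletion(date(2015, 3, 1), date(2025, 6, 1)))  # 2026-06-01
```

If the retention period is renewed during the grace period, the calculation would simply start again from the new retention end date.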

Sigma2 also operates an ‘advanced user support (AUS)’ program which provides a mechanism for owners or users of the archive to request support to fulfill any need that ensures the data remain usable.

User feedback

Users of the Sigma2/NRIS infrastructure are sent a questionnaire which includes a request for feedback on the NIRD RDA where users can indicate potential future needs. Regular meetings are held with the communities that regularly deposit data to understand if there are any changes that need to be made to ensure the data remain usable.

Scenarios and contingency plan

User has problems reusing a dataset

Users are expected to email the Archive Manager in case of any problems. If the issue is related to incorrect or inaccurate metadata, the Archive Manager will contact the Data Manager to relay the problem (if the problem is complex, the Archive Manager will put the user in contact with the Data Manager). If the issue requires an update of the metadata, the Data Manager or Depositor will then create a new version of the dataset with updated metadata. If the issue is related to the data (missing data, erroneous data or data corrupted before archiving), the Archive Manager will put the user in contact with the Data Manager to understand how to address the issue. If the resolution requires a new version of the dataset, the Data Manager may ask the Depositor to create it (or may do so themselves) with new data and any metadata updates deemed necessary. The Data Manager may require the previous version of the dataset to be deleted. In this case, the Data Manager submits a request to the Archive Manager to delete the dataset. If the Archive Manager accepts the request, the dataset is marked as deleted and a reason for deletion is recorded. The data are made inaccessible, but kept in case users want to access the deleted data.

Data is not accessible

Should data not be searchable or accessible, the incident management procedure (according to the FitSM standard) is initiated, including communication with the end-user and initiation of the recovery plan. As the NIRD RDA is a component of NIRD, recovery from the incident aligns with the NIRD contingency plan.

Data is corrupted

Data are checksummed at ingestion and the checksum values are regularly monitored. The primary copy of the data is regularly replicated onto two secondary storage systems that are physically and logically separated. Corrupted data can be restored from the secondary replicas.
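
A fixity check with restore from a replica might look like the sketch below. It is a simplified illustration, not the archive's procedure: it assumes SHA-256 checksums, a single replica directory reachable as a local path, and a hypothetical mapping from relative file path to expected digest.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_repair(primary_root, replica_root, expected):
    """Check each file against its recorded checksum and restore bad ones.

    `expected` maps a relative path to the checksum recorded at ingestion.
    Returns (restored, unrecoverable) lists of relative paths.
    """
    restored, unrecoverable = [], []
    for relative, checksum in expected.items():
        primary = Path(primary_root) / relative
        if primary.is_file() and sha256_of(primary) == checksum:
            continue  # primary copy is intact
        replica = Path(replica_root) / relative
        # Only restore from a replica that itself matches the recorded checksum.
        if replica.is_file() and sha256_of(replica) == checksum:
            primary.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(replica, primary)
            restored.append(relative)
        else:
            unrecoverable.append(relative)
    return restored, unrecoverable
```

Verifying the replica before copying it prevents a damaged replica from overwriting the primary copy; files listed as unrecoverable would need a further replica or manual intervention.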

Metadata are obsolete, erroneous or no longer valid

It is the Depositor's responsibility to ensure that the metadata are sufficient to be understood and used by the targeted communities at the time of publication of the dataset. Once the dataset has been published, it is the Data Manager/Contact Person’s responsibility to ensure that the metadata remains understandable by the targeted community. If a user of the dataset notices issues with the metadata, they can contact the Archive Manager who will contact the Data Manager/Contact Person to inform them of the issues. The Data Manager/Contact Person can then create a new version of the dataset with corrected metadata. The old dataset, with its obsolete metadata, is kept, and a new metadata record on it points to the newer version. Likewise, the new version contains a metadata record referencing the obsolete one.

Format is obsolete or no longer valid

If the Depositor identifies a need to migrate to a new format, they can notify the NIRD RDA administrators who will work with the Data Manager/Contact Person to create the migration workflow that includes ensuring significant properties of the dataset are maintained. The migration would make use of the existing Sigma2 computing and storage infrastructure. Adequate resources and competences from both Data Manager and Depositor, as well as from the service provider (Sigma2/NRIS) will be allocated to the migration task.

The underlying storage infrastructure is going out of production

If the underlying infrastructure (NIRD) is going out of production, the migration of the archived data onto the new infrastructure is part of the procurement and preparation-for-operations project for the renewal of the infrastructure. Migration is done during the acceptance testing period of the new infrastructure, and integrity checks are done before decommissioning the old infrastructure and putting the new one in production. Final deletion of the data from the old infrastructure is done one year after decommissioning.

Governance or funding scheme is suddenly changed 

Responsibility for the archive rests with the body ultimately accountable for Sigma2, which is the Sigma2 Board. Should the Sigma2 Board be dismissed, or Sigma2 be closed down, responsibility for the archived data will be taken over by Sikt.
