1. Introduction

This document describes general data management practices in line with the FAIR (Findable, Accessible, Interoperable and Reusable) guiding principles for scientific data management, and focuses on the management and sharing of dynamic geodata (i.e., geolocated data about processes in nature). The practices aim to establish a metadata-driven data management regime. The present handbook is a general text (under development) that can be applied by anyone. We have prepared templates for organisation-specific information that can be updated and merged into the general handbook by the user organisations.

The purpose of the Data Management Handbook (DMH) is threefold:

  1. to provide an overview of the principles for FAIR data management to be employed;

  2. to help personnel identify their roles and responsibilities for good data management; and

  3. to provide personnel with practical guidelines for carrying out good data management.

Data management is the term used to describe the handling of data in a systematic and cost-effective manner. The data management regime should evolve continuously, reflecting the evolving nature of data collection. Therefore, this DMH is a living document that will be revised and updated from time to time in order to maintain its relevance.

The primary focus of this DMH is on the management of dynamic geodata. Dynamic geodata is weather, environment and climate-related data that changes in space and time and is thus descriptive of processes in nature. Examples are weather observations, weather forecasts, pollution (environmental toxins) in water, air and sea, water flow in rivers, and driving conditions on the roads. Dynamic geodata provides important constraints for many decision-making processes and activities in society.

The document has four main pillars: overview, practical guide, insight and references. The overview gives a summary of the most important data management principles. The practical guide provides data producers with practical guidance on FAIR data management at their respective institutions. Further details and a deeper understanding of the FAIR data management principles are provided in the insight chapter.

The DMH is a strategic governing document and should be used as part of the organisation's quality framework.

2. Summary

2.1. The principles of data management for dynamic geodata

Principles of standardised data documentation, publication, sharing and preservation have been formalised in the FAIR Guiding Principles for scientific data management and stewardship [RD3] through a process facilitated by FORCE11. FAIR stands for findability, accessibility, interoperability and reusability.

By following the FAIR principles, it is easier to obtain a common approach to data management, or a unified data management model. One of the main motivations for implementing unified data management is to better serve the users of the data. Primarily, this is approached by letting user needs and requirements guide what data we provide and how. For example, how the specification of datasets should be determined is described below. By implementing the data management practices described here, it is expected that users will experience:

  • Ease of discovering, viewing and accessing datasets;

  • Standardised ways of accessing data, including downloading or streaming data, with reduced need for special solutions on the user side;

  • Reduced storage needs;

  • Simple and standard access to remote datasets and catalogues, using their own data visualisation and analysis tools;

  • Ability to compare and combine data from internal and external sources;

  • Ability to apply common data transformations, such as spatial, temporal and variable subsetting and reprojection, before downloading anything;

  • Possibility to build specialised metadata catalogues and data portals targeting a specific user community.

2.1.1. Dataset

A dataset is a collection of data. In the context of the data management model, the storage mode of the dataset is irrelevant, since access mechanisms can be decoupled from the storage layer as experienced by a data consumer. Typically, a dataset represents a number of variables in time and space. A more detailed definition is provided in the Glossary of Terms. In order to best serve the data through web services, the following principles are useful for guiding the dataset definition:

  1. A dataset can be a collection of variables stored in, for example, a relational database or as flat files;

  2. A dataset is defined as a number of spatial and/or temporal variables;

  3. A dataset should be defined by the information content and not the production method;

  4. A good dataset does not mix feature types, i.e., trajectories and gridded data should not be present in the same dataset.

Point 3 implies that the output of, e.g., a numerical model may be divided into several datasets that are related. This is also important in order to efficiently serve the data through web services. For instance, model variables defined on different vertical coordinates should be separated as linked datasets, since some OGC services (e.g., WMS) are unable to handle mixed coordinates in the same dataset. One important linked dataset relation is the parent-child relationship. In the numerical model example, the parent dataset would be the model simulation. This (parent) dataset encompasses all datasets created by the model simulation, such as two NetCDF-CF files (child datasets) with different information content.

Most importantly, a dataset should be defined to meet the consumer needs. This means that the specification of a dataset should follow not only the content guidelines just listed, but also address the consumer needs for data delivery, security and preservation.
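As a minimal illustration of the parent-child relation, a child dataset can point to its parent through the related_dataset attribute described among the ACDD extensions in Section 3.2.2.2. The sketch below uses netCDF4-python and a hypothetical identifier; replace the naming authority and uuid with your own:

import netCDF4

# Hypothetical example: link a child dataset to its parent simulation.
# The naming authority and uuid below are placeholders for illustration only.
with netCDF4.Dataset("child_dataset.nc", "a") as nc:
    nc.related_dataset = "no.met:b7cb7934-77ca-4439-812e-f560df3fe7eb (parent)"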

2.1.2. Metadata

Metadata is a broad concept. In our data management model the term "metadata" is used in several contexts, specifically the five categories that are briefly described in Table 1.

Table 1. Brief introduction to different types of metadata.
Type Purpose Description Examples

Discovery metadata

Used to find relevant data

Discovery metadata are also called index metadata and are a digital version of the library index card. They describe who did what, where and when, how to access data and potential constraints on the data. They shall also link to further information on the data, such as site metadata.

ISO 19115
GCMD/DIF

Use metadata

Used to understand data found

Use metadata describes the actual content of a dataset and how it is encoded. The purpose is to enable the user to understand the data without any further communication. They describe the content of variables using standardised vocabularies, units of variables, encoding of missing values, map projections, etc.

Climate and Forecast (CF) Convention
BUFR
GRIB

Site metadata

Used to understand data found

Site metadata are used to describe the context of observational data. They describe the location of an observation, the instrumentation, procedures, etc. To a certain extent they overlap with discovery metadata, but also extend discovery metadata. Site metadata can be used for observation network design. Site metadata can be considered a type of use metadata.

WIGOS
OGC O&M

Configuration metadata

Used to tune portal services for datasets intended for data consumers (e.g., WMS)

Configuration metadata are used to improve the services offered through a portal to the user community. This can, e.g., be how to best visualise a product.

System metadata

Used to understand the technical structure of the data management system and track changes in it

System metadata covers, e.g., technical details of the storage system, web services, their purpose and how they interact with other components of the data management system, available and consumed storage, number of users and other KPI elements.

The tools and facilities used to manage the information for efficient discovery and use are further described in Section 4.7.

2.2. Summary of the data management at (insert organisation name here)

2.2.1. Data Management roles

Role Description Responsibility

3. Practical Guides

This chapter includes how-to’s and other practical guidance for data producers.

3.1. Create a Data Management Plan (DMP)

The funding agency of your project will usually provide requirements, guidelines or a template for the DMP. If this is not the case, or for datasets that are not part of a project, use the template provided by your institution or the template based on the recommendations by Science Europe.

3.1.1. Using easyDMP

  1. Log in to easyDMP; use Dataporten if your institution supports it, otherwise pick one of the other login methods.

  2. Click on + Create a new plan and pick a template

  3. From page two onwards, you can use the Summary button to get an overview of all the questions.

3.1.2. Publishing the plan

Currently you can use the export function in easyDMP to download an HTML or PDF version of the DMP for further use. This might change if "Hosted DMP" gets implemented.

3.2. Submitting data as NetCDF-CF

3.2.1. Workflow

  1. Define your dataset (see dataset and Section 2.1.1)

  2. Create a NetCDF-CF file (see Section 3.2.2)

  3. Store the NetCDF-CF file in a suitable location, and distribute it via THREDDS or another DAP server (see, e.g., Section 3.2.3)

  4. Register your dataset in a searchable catalog (see Section 3.2.4)

    • Test that your dataset contains the necessary discovery metadata and create an MMD xml file (see Section 3.2.4.1)

    • Test the MMD xml file (see Section 3.2.4.2)

    • Push the MMD xml file to the discovery metadata catalog (see Section 3.2.4.3)

3.2.2. Creating NetCDF-CF files

By documenting and formatting your data using NetCDF following the CF conventions and the Attribute Convention for Data Discovery (ACDD), MMD files can be automatically generated from the NetCDF files. The CF conventions provide a controlled vocabulary and a definitive description of what the data in each variable represents, as well as the spatial and temporal properties of the data. The ACDD vocabulary describes attributes recommended for describing a NetCDF dataset to data discovery systems. See, e.g., the netCDF4-python or xarray documentation for how to create netCDF files.
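The following is a minimal sketch, with illustrative file name, variable names and values, of how such a file can be created with netCDF4-python (the required and recommended global attributes are detailed in the tables below):

import netCDF4

# Minimal sketch of a CF/ACDD documented NetCDF file (illustrative values only)
with netCDF4.Dataset("example_t2m.nc", "w", format="NETCDF4") as nc:
    # Coordinate variable with CF attributes
    nc.createDimension("time", None)
    time = nc.createVariable("time", "f8", ("time",))
    time.standard_name = "time"
    time.units = "seconds since 1970-01-01 00:00:00 +0000"
    time[:] = [0.0, 3600.0]

    # Data variable described with a CF standard name and units
    t2m = nc.createVariable("air_temperature_2m", "f4", ("time",))
    t2m.standard_name = "air_temperature"
    t2m.units = "K"
    t2m[:] = [273.15, 274.0]

    # A few of the ACDD global attributes discussed in the tables below
    nc.Conventions = "CF-1.10, ACDD-1.3"
    nc.title = "Example 2 m air temperature time series"
    nc.summary = "Minimal example dataset illustrating CF/ACDD documentation."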

The ACDD recommendations should be followed in order to properly document your netCDF-CF files. The tables below summarise the required and recommended ACDD attributes, as well as some additional attributes that are needed to properly populate a discovery metadata catalog fulfilling the requirements of international standards (e.g., GCMD/DIF and the INSPIRE and WMO profiles of ISO 19115).

3.2.2.1. Notes

Keywords describe the content of your dataset following a given vocabulary. You may use any vocabularies to define your keywords, but a link to the keyword definitions should be provided in the keywords_vocabulary attribute. This attribute provides information about the vocabulary defining the keywords used in the keywords attribute. Example:

:keywords_vocabulary = "GCMDSK:GCMD Science Keywords:https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords, GEMET:INSPIRE Themes:http://inspire.ec.europa.eu/theme, NORTHEMES:GeoNorge Themes:https://register.geonorge.no/metadata-kodelister/nasjonal-temainndeling" ;

Note that the GCMDSK, GEMET and NORTHEMES vocabularies are required for indexing in S-ENDA and Geonorge. You may find appropriate keywords at the following links:

The keywords should be provided by the keywords attribute as a comma separated list with a short name defining the vocabulary used, followed by the actual keyword, i.e., short_name:keyword. Example:

:keywords = "GCMDSK:Earth Science > Atmosphere > Atmospheric radiation, GEMET:Meteorological geographical features, GEMET:Atmospheric conditions, NORTHEMES:Weather and climate" ;

See https://adc.met.no/node/96 for more information about how to define the ACDD keywords.

A data license provides information about any restrictions on the use of the dataset. To support a linked data approach, it is strongly recommended to use identifiers and URLs from https://spdx.org/licenses/ and to use a form similar to <URL>(<Identifier>) using elements from the SPDX license list. Example:

:license = "http://spdx.org/licenses/CC-BY-4.0(CC-BY-4.0)" ;

3.2.2.2. List of Attributes

This section provides lists of CF and ACDD global netCDF attribute names that are required and recommended, as well as some extra elements that are needed to fully support our data management needs. The left columns in the below tables provide the CF/ACDD names, the centre columns provide the MET Norway Metadata Specification (MMD) fields that map to the CF/ACDD names (and our extensions to ACDD), and the right columns provide descriptions. Please refer to MMD for definitions of these elements, as well as the controlled vocabularies that should be used (these are also linked in the descriptions). Note that the below tables are automatically generated - please add an issue in py-mmd-tools or in the data-management-handbook if something is unclear.

In order to check your netCDF-CF files, and to create MMD xml files, you can use the nc2mmd.py script in the py-mmd-tools Python package.

3.2.2.2.1. Climate and Forecast conventions (CF)

The following attributes are required:

CF Attribute

MMD equivalent

Description

Conventions

None

Required. A comma-separated string with names of the conventions that are followed by the dataset, e.g., "CF-1.10, ACDD-1.3".

history

None

Required. Provides an audit trail for modifications to the original data. This attribute is also in the NetCDF Users Guide ('This is a character array with a line for each invocation of a program that has modified the dataset. Well-behaved generic netCDF applications should append a line containing date, time of day, user name, program name and command arguments'). To include a more complete description you can append a reference to an ISO Lineage entity; see NOAA EDM ISO Lineage guidance.

featureType

None

Recommended if the data can be described by the listed feature types. Specifies the type of discrete sampling geometry to which the data in the scope of this attribute belongs, and implies that all data variables in the scope of this attribute contain collections of features of that type. All of the data variables contained in a single file must be of the single feature type indicated by the global featureType attribute. The value assigned to the featureType attribute is case-insensitive; it must be one of the string values listed in the left column of Table 9.1 in chapter 9, Discrete Sampling Geometries, of the CF convention.

comment

None

Not required. Miscellaneous information about the data or methods used to produce it.

external_variables

None

Not required. Identifies variables which are named by cell_measures attributes in the file but which are not present in the file.

3.2.2.2.2. Attribute Convention for Data Discovery (ACDD)

The following ACDD elements are required:

ACDD Attribute

MMD equivalent

Description

id

metadata_identifier

An identifier for the dataset, provided by and unique within its naming authority. The combination of the "naming_authority" and the "id" should be globally unique, but the id can also be globally unique by itself. A UUID is recommended.

naming_authority

metadata_identifier

The organization that provides the initial id (see above) for the dataset. The naming authority should be uniquely specified by this attribute. We recommend using reverse-DNS naming for the naming authority.

date_created

last_metadata_update>update>datetime

The date on which this version of the data was created (modification of variable values implies a new version, hence this would be assigned the date of the most recent modification of variable values). Metadata changes are not considered when assigning the date_created. The ISO 8601:2004 extended date format is recommended, e.g., 2020-10-20T12:35:00Z.

title

title>title

A short phrase or sentence describing the dataset. In many discovery systems, the title will be displayed in the results list from a search, and therefore should be human readable and reasonable to display in a list of such names. This attribute is also recommended by the NetCDF Users Guide and the CF conventions.

summary

abstract>abstract

A paragraph describing the dataset, analogous to an abstract for a paper. Use ACDD extension "summary_no" for Norwegian translation.

time_coverage_start

temporal_extent>start_date

Describes the time of the first data point in the data set. Use the ISO 8601:2004 date format, preferably the extended format as recommended in the Attribute Content Guidance section. I.e. YYYY-MM-DDTHH:MM:SSZ (always use UTC).

geospatial_lat_max

geographic_extent>rectangle>north

Describes a simple upper latitude limit; may be part of a 2- or 3-dimensional bounding region. Geospatial_lat_max specifies the northernmost latitude covered by the dataset. Must be decimal degrees north.

geospatial_lat_min

geographic_extent>rectangle>south

Describes a simple lower latitude limit; may be part of a 2- or 3-dimensional bounding region. Geospatial_lat_min specifies the southernmost latitude covered by the dataset. Must be decimal degrees north.

geospatial_lon_max

geographic_extent>rectangle>east

Describes a simple longitude limit; may be part of a 2- or 3-dimensional bounding region. geospatial_lon_max specifies the easternmost longitude covered by the dataset. Cases where geospatial_lon_min is greater than geospatial_lon_max indicate the bounding box extends from geospatial_lon_max, through the longitude range discontinuity meridian (either the antimeridian for -180:180 values, or Prime Meridian for 0:360 values), to geospatial_lon_min; for example, geospatial_lon_min=170 and geospatial_lon_max=-175 incorporates 15 degrees of longitude (ranges 170 to 180 and -180 to -175). Must be decimal degrees east (negative westwards). A short sketch of this span computation is given after this table.

geospatial_lon_min

geographic_extent>rectangle>west

Describes a simple longitude limit; may be part of a 2- or 3-dimensional bounding region. geospatial_lon_min specifies the westernmost longitude covered by the dataset. See also geospatial_lon_max. Must be decimal degrees east (negative westwards).

license

use_constraint>resource

Provide the URL to a standard or specific license, enter "Freely Distributed" or "None", or describe any restrictions to data access and distribution in free text. It is strongly recommended to use identifiers and URLs from https://spdx.org/licenses/ and to use a form similar to <URL>(<Identifier>) using elements from the SPDX license list.

keywords

keywords>keyword

A comma-separated list of keywords and/or phrases. Keywords may be common words or phrases, terms from a controlled vocabulary (GCMD is required), or URIs for terms from a controlled vocabulary (see also the "keywords_vocabulary" attribute). If keywords are extracted from, e.g., GCMD Science Keywords, add keywords_vocabulary="GCMDSK" and, in any case, prefix each keyword with the appropriate prefix.

keywords_vocabulary

keywords>vocabulary

If you are using a controlled vocabulary for the words/phrases in your "keywords" attribute, this is the unique name or identifier of the vocabulary from which keywords are taken. If more than one keyword vocabulary is used, each may be presented with a key, a long name, and a url, followed by a comma, so that keywords may be prefixed with the controlled vocabulary key. Example; 'GCMDSK:GCMD Science Keywords:https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords, GEMET:INSPIRE Themes:http://inspire.ec.europa.eu/theme'.
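As noted for geospatial_lon_max above, bounding boxes may cross the antimeridian. The following is a minimal sketch of how the longitude span is computed in that case, reproducing the 15-degree example from the table:

def lon_span(lon_min, lon_max):
    """Longitude extent in degrees for -180:180 values, also across the antimeridian."""
    span = lon_max - lon_min
    return span if span >= 0 else span + 360

print(lon_span(170, -175))  # 15 degrees (ranges 170 to 180 and -180 to -175)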

The following ACDD elements are recommended (they should be used unless there is a good reason not to):

ACDD Attribute

MMD equivalent

Description

publisher_type

publisher_type

Specifies type of publisher as one of 'person', 'group', 'institution', or 'position'. If this attribute is not specified, the publisher is assumed to be a person.

publisher_email

publisher_email

The email address of the person (or other entity specified by the publisher_type attribute) responsible for publishing the data file or product to users, with its current metadata and format.

time_coverage_end

temporal_extent>end_date

Describes the time of the last data point in the data set. If the dataset is continuously updated with new measurements (e.g., a timeseries receiving new observations), this attribute can be omitted. Use ISO 8601:2004 date format, preferably the extended format as recommended in the Attribute Content Guidance section. I.e. YYYY-MM-DDTHH:MM:SSZ (always use UTC).

geospatial_bounds

geographic_extent>polygon

Describes the data’s 2D or 3D geospatial extent in OGC’s Well-Known Text (WKT) Geometry format (reference the OGC Simple Feature Access (SFA) specification). The meaning and order of values for each point’s coordinates depends on the coordinate reference system (CRS). The ACDD default is 2D geometry in the EPSG:4326 coordinate reference system. The default may be overridden with geospatial_bounds_crs and geospatial_bounds_vertical_crs (see those attributes). EPSG:4326 coordinate values are latitude (decimal degrees_north) and longitude (decimal degrees_east), in that order. Longitude values in the default case are limited to the [-180, 180) range. Example: 'POLYGON ((40.26 -111.29, 41.26 -111.29, 41.26 -110.29, 40.26 -110.29, 40.26 -111.29))'. Use this to improve the dataset findability through geospatial search.

processing_level

operational_status

A textual description of the processing level of the data. Valid keywords are listed in Section 4.5 of the MMD specification.

contributor_role

personnel>role

The role of any individuals, projects, or institutions that contributed to the creation of this data. May be presented as free text, or in a structured format compatible with conversion to ncML (e.g., insensitive to changes in whitespace, including end-of-line characters). Multiple roles should be presented in the same order and number as the names in contributor_names. Contributor roles should be defined using elements from the contact role types in the MMD specification.

creator_name

personnel>name

The name of the person (or other creator type specified by the creator_type attribute) principally responsible for creating this data. If multiple persons are involved, please list these as a comma-separated list. In that case, please remember to add comma-separated strings for creator_institution, creator_email and creator_role as well. Anyone who should be listed as a dataset creator on DOI landing pages should be added to this item.

contributor_name

personnel>name

The name of any individuals, projects, or institutions that contributed to the creation of this data. May be presented as free text, or in a structured format compatible with conversion to ncML (e.g., insensitive to changes in whitespace, including end-of-line characters). If multiple persons are involved, please list these as a comma separated list.

creator_type

personnel>creator_type

Specifies type of creator (one of 'person', 'group', 'institution', or 'position'). If this attribute is not specified, the creator is assumed to be a person. If multiple persons are involved, please list these as a comma-separated string. In that case, please remember to add comma-separated strings for creator_institution, creator_email and creator_role as well. Consistency between these fields is maintained from left to right.

creator_email

personnel>email

The email address of the person (or other creator type specified by the creator_type attribute) principally responsible for creating this data. See description of creator_type. Consistency across comma separated lists for all creator_* attributes is required.

creator_institution

personnel>organisation

The institution of the creator; should uniquely identify the creator’s institution. This attribute’s value should be specified even if it matches the value of publisher_institution, or if creator_type is institution. See description of creator_type. Consistency across comma separated lists for all creator_* attributes is required.

institution

data_center>data_center_name>long_name

The name of the institution principally responsible for originating this data. This attribute is recommended by the CF convention.

publisher_url

data_center>data_center_url

The URL of the person (or other entity specified by the publisher_type attribute) responsible for publishing the data file or product to users.

references

related_information>resource

A comma separated list of published or web-based references that describe the data or methods used to produce it. We recommend to use URIs (such as a URL or DOI) for papers or other references, and to use a form similar to <URL>(<Type>) using type elements from the related information types in the MMD specification. This attribute is defined in the CF conventions.

project

project

The name of the project(s) principally responsible for originating this data, in the format <long project name> (<short project name>). Multiple projects can be separated by commas, as described under Attribute Content Guidelines (ACDD examples: 'PATMOS-X', 'Extended Continental Shelf Project'). If a substring includes a keyword in parentheses, the content within the parentheses is interpreted as the short name for the project while the rest is the long name, e.g., 'Nansen Legacy (NLEG)'.

platform

platform>long_name

Name of the platform(s) that supported the sensor used to create this dataset. Platforms can be of any type, including satellite, ship, station, aircraft or other. Both MMD and GCMD have controlled vocabularies for platform names (the GCMD vocabulary is a large xml file in which the data producer must search for the correct platform name [use prefLabel], e.g., with <ctrl>-f "models</skos:prefLabel"). Indicate which controlled vocabulary is used in the platform_vocabulary attribute. Provide as a comma-separated list.

platform_vocabulary

platform>resource

Controlled vocabulary for the names used in the "platform" attribute, e.g., MMD or GCMD. Should be provided as urls in a comma separated list.

instrument

platform>instrument>long_name

Name of the instrument(s) or sensor(s) used to create this dataset. Both MMD and GCMD have controlled vocabularies for instrument names (the GCMD vocabulary is a large xml file in which the data producer must search for the correct instrument name [use prefLabel], e.g., with <ctrl>-f "thermometers</skos:prefLabel"). Indicate which controlled vocabulary is used in the instrument_vocabulary attribute. Provide as a comma-separated list.

instrument_vocabulary

platform>instrument>resource

Controlled vocabulary for the names used in the "instrument" attribute, e.g., MMD or GCMD. Should be provided as urls in a comma separated list.

source

activity_type

The method of production of the original data. This attribute is defined in the CF Conventions. Valid MMD values are listed in section 4.8 of the MMD specification.

creator_name

dataset_citation>author

The name of the person (or other creator type specified by the creator_type attribute) principally responsible for creating this data.

date_created

dataset_citation>publication_date

The date on which this version of the data was created (modification of variable values implies a new version, hence this would be assigned the date of the most recent modification of variable values). Metadata changes are not considered when assigning the date_created. The ISO 8601:2004 extended date format is recommended, e.g., 2020-10-20T12:35:00Z.

publisher_name

dataset_citation>publisher

The name of the person (or entity specified by the publisher_type attribute) responsible for publishing the data file or product to users.

metadata_link

dataset_citation>url

A URL that gives the location of more complete metadata. A persistent URL is recommended for this attribute.

The following elements are recommended ACDD extensions that are useful to improve (meta)data interoperability. Please refer to the documentation of MMD for more details:

Attribute

MMD equivalent

Description

spatial_representation

spatial_representation

The method used to spatially represent geographic information. Valid entries are vector, grid, point and trajectory (see section 4.16 of the MMD specification).

alternate_identifier

alternate_identifier>alternate_identifier

Alternative identifier for the dataset described by the metadata document. This identifier is used when datasets have multiple identifiers, e.g., identifiers depending on the framework through which the data are shared.

alternate_identifier_type

alternate_identifier>type

Type of identifier used. Currently no controlled vocabulary is defined, but one should be added once the relevant domains are better known.

title_no

title>title

Norwegian version of the title.

title_lang

title>lang

ISO language code for the title. Defaults to "en".

summary_no

abstract>abstract

Norwegian version of the abstract.

summary_lang

abstract>lang

ISO language code for the summary. Defaults to "en".

dataset_production_status

dataset_production_status

Production status for the dataset, using a controlled vocabulary. The valid keywords are listed in section 4.2 of the MMD specification. If set as "In Work", remember that end_date in section 2.8 of the MMD specification can (should) be empty.

access_constraint

access_constraint

Limitations on the access to the dataset. See section 4.6 of the MMD specification for a list of valid values.

license_identifier

use_constraint>identifier

Referring to the spdx licenseId. If the identifier is specified in the license attribute as <URL>(<Identifier>), license_identifier is not needed.

contributor_email

personnel>email

The email address of the contributor(s). Consistency across comma separated lists for all contributor_* attributes is required.

contributor_institution

personnel>organisation

The institution of the contributor(s). Consistency across comma separated lists for all contributor_* attributes is required.

institution_short_name

data_center>data_center_name>short_name

Short version of the institution name.

related_dataset

related_dataset

Specifies the relation between this dataset and another dataset in the form "<naming_authority:id> (relation type)". The type of relationship must be either "parent" (this dataset is a child dataset of the referenced dataset) or "auxiliary" (this dataset is auxiliary data for the referenced dataset).

iso_topic_category

iso_topic_category

ISO topic category fetched from a controlled vocabulary. Accepted elements are listed in the MMD specification.

quality_control

quality_control

The level of quality control performed on the dataset/product. Valid keywords are listed in section 4.22 of the MMD specification. Additional information about data quality control can be provided through the related_information element providing a URL to the quality control documentation.

doi

dataset_citation>doi

Digital Object Identifier (if available).

3.2.2.3. Default global attribute values for <organisation-name-here>

The below is an example of selected global attributes, as shown from the output of ncdump -h <filename>:

// global attributes:
    :creator_institution = "<to-be-added>" ;
    :institution = "<to-be-added>" ;
    :institution_short_name = "<to-be-added>" ;
    :keywords = "GCMDSK:<to-be-added-by-data-creator>, GEMET:<to-be-added-by-data-creator>, GEMET:<to-be-added-by-data-creator>, GEMET:<to-be-added-by-data-creator>, NORTHEMES:<to-be-added-by-data-creator>" ;
    :keywords_vocabulary = "GCMDSK:GCMD Science Keywords:https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords, GEMET:INSPIRE Themes:http://inspire.ec.europa.eu/theme, NORTHEMES:GeoNorge Themes:https://register.geonorge.no/metadata-kodelister/nasjonal-temainndeling" ;
    :license = "http://spdx.org/licenses/CC-BY-4.0(CC-BY-4.0)" ;
    :naming_authority = "<to-be-added>" ;
    :publisher_name = "<publisher-institution-name>" ;
    :publisher_type = "institution" ;
    :publisher_email = "<to-be-added>" ;
    :publisher_url = "<to-be-added>" ;
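These defaults can also be applied programmatically. The following is a minimal sketch using netCDF4-python, with the same placeholder values as in the listing above (to be filled in by the organisation):

import netCDF4

# Organisation-wide default global attributes (placeholders to be filled in)
ORG_DEFAULTS = {
    "creator_institution": "<to-be-added>",
    "institution": "<to-be-added>",
    "institution_short_name": "<to-be-added>",
    "license": "http://spdx.org/licenses/CC-BY-4.0(CC-BY-4.0)",
    "naming_authority": "<to-be-added>",
    "publisher_name": "<publisher-institution-name>",
    "publisher_type": "institution",
    "publisher_email": "<to-be-added>",
    "publisher_url": "<to-be-added>",
}

# Open an existing file in append mode and set all defaults in one call
with netCDF4.Dataset("your_dataset.nc", "a") as nc:
    nc.setncatts(ORG_DEFAULTS)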
3.2.2.4. Example script to update a NetCDF-CF file with correct discovery metadata

We use a test dataset containing the U component of wind at 10 m height, x_wind_10m, extracted from a MEPS 2.5 km file. The file can be downloaded by clicking on the link at access point 2 (HTTPServer).

Below is an example python script to add the required discovery metadata to the netCDF-CF file. The resulting file can be validated by the script scripts/nc2mmd.py in py-mmd-tools, which also can parse the netCDF-CF file into an MMD xml file.

#!/usr/bin/env python
from datetime import datetime, timezone
import sys
import netCDF4
from uuid import uuid4


with netCDF4.Dataset(sys.argv[1], "a") as f:
    f.id = str(uuid4())
    f.naming_authority = "no.met"

    f.creator_institution = "Norwegian Meteorological Institute"
    f.delncattr("creator_url")  # not needed, replaced by publisher_url

    f.institution = "Norwegian Meteorological Institute"
    f.institution_short_name = 'MET Norway'

    f.title = "U component of wind speed at ten meter height (example)"
    f.title_no = "U-komponent av vindhastighet i ti meters høyde (eksempel)"
    f.summary = (
        "Test dataset to demonstrate how to document a netcdf-cf file"
        " containing wind speed at 10 meter height from a MEPS 2.5km "
        "dataset with ACDD attributes.")
    f.summary_no = (
        "Test datasett for å demonstrere hvordan du dokumenterer en "
        "netcdf-cf-fil som inneholder vindhastighet i 10 meters "
        "høyde fra et MEPS 2,5 km datasett med ACDD-attributter.")
    f.references = "https://github.com/metno/NWPdocs/wiki (Users guide)"

    f.keywords = (
        "GCMDSK:Earth Science > Atmosphere > Atmospheric winds, "
        "GEMET:Meteorological geographical features, "
        "GEMET:Atmospheric conditions, "
        "GEMET:Oceanographic geographical features, "
        "NORTHEMES:Weather and climate")
    f.keywords_vocabulary = (
        "GCMDSK:GCMD Science Keywords:https://gcmd.earthdata.nasa.gov"
        "/kms/concepts/concept_scheme/sciencekeywords, "
        "GEMET:INSPIRE themes:https://inspire.ec.europa.eu/theme, "
        "NORTHEMES:GeoNorge Themes:https://register.geonorge.no/"
        "subregister/metadata-kodelister/kartverket/nasjonal-"
        "temainndeling")

    # Set ISO topic category
    f.iso_topic_category = "climatologyMeteorologyAtmosphere"

    # Set the correct license
    f.license = "http://spdx.org/licenses/CC-BY-4.0(CC-BY-4.0)"

    f.publisher_name = "Norwegian Meteorological Institute"
    f.publisher_type = "institution"
    f.publisher_email = "csw-services@met.no"
    f.publisher_url = "https://www.met.no/"

    # Extract time_coverage_start from the time variable
    tstart = f.variables['time'][0]  # masked array
    tstart = int(tstart[~tstart.mask].data[0])
    tstart = datetime.fromtimestamp(tstart)
    # The time zone is utc (this can be seen from the metadata of the
    # time variable) and the time in isoformat is
    # 2023-02-01T10:00:00+00:00
    f.time_coverage_start = tstart.replace(tzinfo=timezone.utc).isoformat()

    # Extract time_coverage_end from the time variable
    tend = f.variables['time'][-1]  # masked array
    tend = int(tend[~tend.mask].data[0])
    tend = datetime.fromtimestamp(tend)
    # The time in isoformat is 2023-02-03T23:00:00+00:00
    f.time_coverage_end = tend.replace(tzinfo=timezone.utc).isoformat()

    # The date_created is in this case recorded in the history attribute
    # We have hardcoded it here for simplicity
    f.date_created = datetime.strptime("2023-02-01T11:30:05", "%Y-%m-%dT%H:%M:%S").replace(
        tzinfo=timezone.utc).isoformat()

    # Set the spatial representation (a MET ACDD extension)
    f.spatial_representation = 'grid'

    # Update the conventions attribute to the correct ones
    f.Conventions = 'CF-1.10, ACDD-1.3'

To test it yourself, you can do the following:

$ git clone git@github.com:metno/data-management-handbook.git
$ cd data-management-handbook/example_scripts
$ wget https://thredds.met.no/thredds/fileServer/metusers/magnusu/test-2023-02-03/meps_reduced.nc
$ ./update_meps_file.py meps_reduced.nc
$ cd <path-to-py-mmd-tools>/script
$ ./nc2mmd.py -i <path-to-data-management-handbook>/example_scripts/meps_reduced.nc -o .
$ less meps_reduced.xml  # to see the MMD xml file

3.2.3. How to add NetCDF-CF data to thredds

This section should contain institution-specific information about how to add NetCDF-CF files to THREDDS.

3.2.4. How to register your data in the catalog service

In order to make a dataset findable, it must be registered in a searchable catalog with appropriate metadata. The (meta)data catalog is indexed and exposed through CSW.

The following needs to be done:

  1. Generate an MMD xml file from your NetCDF-CF file (see Section 3.2.4.1)

  2. Test your mmd xml metadata file (see Section 3.2.4.2)

  3. Push the MMD xml file to the discovery metadata catalog (see Section 3.2.4.3)

3.2.4.1. Generation of MMD xml file from NetCDF-CF

Clone the py-mmd-tools repo and make a local installation with, e.g., pip install . (this should bring in all needed dependencies; we recommend using a virtual environment).

Then, generate your MMD xml file as follows:

cd script
./nc2mmd.py -i <your netcdf file> -o <your xml output directory>

See ./nc2mmd.py --help for documentation and extra options.

You will find Extensible Stylesheet Language Transformations (XSLT) documents in the MMD repository. These can be used to translate the metadata documents from MMD to other vocabularies, such as ISO19115:

./bin/convert_from_mmd -i <your mmd xml file> -f iso -o <your iso output file name>

Note that the discovery metadata catalog ingestion tool takes care of translations from MMD, so you don’t need to worry about that unless you have a special interest in it.

3.2.4.2. Test the MMD xml file

Install the dmci app, and run the usage example locally. This will return an error message if anything is wrong with your MMD file.

3.2.4.3. Push the MMD xml file to the discovery metadata catalog

For development and verification purposes:

curl --data-binary "@<PATH_TO_MMD_FILE>" https://dmci.s-enda-*.k8s.met.no/v1/insert

where * should be either dev or staging.

For production (the official catalog):

curl --data-binary "@<PATH_TO_MMD_FILE>" https://dmci.s-enda.k8s.met.no/v1/insert
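The same request can also be sent from Python; the sketch below is equivalent to the curl command above (replace the placeholder with the path to your MMD file):

import requests

# Equivalent to the curl command above: POST the raw MMD file to the insert endpoint
with open("<PATH_TO_MMD_FILE>", "rb") as f:
    response = requests.post("https://dmci.s-enda.k8s.met.no/v1/insert", data=f.read())
print(response.status_code, response.text)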

3.3. Searching data in the Catalog Service for the Web (CSW) interface

3.3.1. Using OpenSearch

3.3.1.1. Local test machines

The vagrant-s-enda environment found at vagrant-s-enda provides OpenSearch support through PyCSW. To test OpenSearch via the browser, start the vagrant-s-enda vm (vagrant up) and go to the following address:

This will return a description document of the catalog service. The URL field in the description document is a template format that can be used to represent a parameterized form of the search. The search client will process the URL template and attempt to replace each instance of a template parameter, generally represented in the form {name}, with a value determined at query time (OpenSearch URL template syntax). The question mark following any search parameter means that the parameter is optional.

PyCSW OpenSearch only supports geographical searches querying for a bounding box. For more advanced geographical searches, one must write specific XML request files. For example:

  • To find all datasets containing a point:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<csw:GetRecords
    xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ogc="http://www.opengis.net/ogc"
    xmlns:gml="http://www.opengis.net/gml"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    service="CSW"
    version="2.0.2"
    resultType="results"
    maxRecords="10"
    outputFormat="application/xml"
    outputSchema="http://www.opengis.net/cat/csw/2.0.2"
    xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd" >
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <ogc:Contains>
          <ogc:PropertyName>ows:BoundingBox</ogc:PropertyName>
          <gml:Point>
            <gml:pos srsDimension="2">59.0 4.0</gml:pos>
          </gml:Point>
        </ogc:Contains>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>
  • To find all datasets intersecting a polygon:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<csw:GetRecords
    xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:gml="http://www.opengis.net/gml"
    xmlns:ogc="http://www.opengis.net/ogc"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    service="CSW"
    version="2.0.2"
    resultType="results"
    maxRecords="10"
    outputFormat="application/xml"
    outputSchema="http://www.opengis.net/cat/csw/2.0.2"
    xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd" >
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter>
        <ogc:Intersects>
          <ogc:PropertyName>ows:BoundingBox</ogc:PropertyName>
          <gml:Polygon>
            <gml:exterior>
              <gml:LinearRing>
                <gml:posList>
                  47.00 -5.00 55.00 -5.00 55.00 20.00 47.00 20.00 47.00 -5.00
                </gml:posList>
              </gml:LinearRing>
            </gml:exterior>
          </gml:Polygon>
        </ogc:Intersects>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>
  • Then, you can query the CSW endpoint with, e.g., python:

import requests
# my_xml_request is the path to one of the GetRecords XML request files shown above
response = requests.post('https://csw.s-enda.k8s.met.no', data=open(my_xml_request).read())
print(response.text)
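Alternatively, the same CSW endpoint can be queried with the OWSLib package, which builds the CSW requests for you. A minimal sketch, assuming OWSLib is installed (the free-text search term is only an example):

from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

# Connect to the CSW endpoint and search for records matching a free-text term
csw = CatalogueServiceWeb("https://csw.s-enda.k8s.met.no")
query = PropertyIsLike("csw:AnyText", "%wind%")
csw.getrecords2(constraints=[query], maxrecords=10)
for identifier, record in csw.records.items():
    print(identifier, record.title)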

3.3.3. Web portals

GeoNorge.no

TODO: describe how to search in geonorge, possibly with screenshots

3.3.4. QGIS

MET Norway’s S-ENDA CSW catalog service is available at https://csw.s-enda.k8s.met.no. This can be used from QGIS as follows:

  1. Select Web > MetaSearch > MetaSearch menu item

  2. Select Services > New

  3. Type, e.g., csw.s-enda.k8s.met.no for the name

  4. Type https://csw.s-enda.k8s.met.no for the URL

Under the Search tab, you can then add search parameters, click Search, and get a list of available datasets.

3.4. Practical Guidance to data producers at [INSTITUTION]

This chapter includes how-to’s and other practical guidance for…​

4. Insight

The purpose of the insight chapter is to provide the reader with more details regarding FAIR data management. This is described along four pillars: structuring and documenting data; data services; user portals and documentation; and data governance.

4.1. External data management requirements and forcing mechanisms

Any organisation that strives to implement a FAIR data management model has to relate to external forcing mechanisms concerning data management at several levels. At the national level, the organisation must comply with national regulations as decided by the government. Some of these are indications of expected behaviour (e.g., OECD regulations) and some are implemented through a legal framework. The Norwegian government has over time promoted free and open sharing of public data. Mechanisms for how to do this are governed by the Geodataloven (implemented as Geonorge), which is a national implementation of the European INSPIRE directive (to be amended in 2019). INSPIRE defines a federated multinational Spatial Data Infrastructure (SDI) for the European Union, similar to NSDI in the USA or UNSDI under the United Nations. The goal is to provide standardised access to data and the necessary tools to be able to work with the data in a unified manner. In short, these legal frameworks require standardised documentation (at discovery and use level; these concepts are described later) and access (through specified protocols) to the data identified.

Other external requirements and forcing mechanisms that are organisation-specific are provided in [specialized-external-requirements].

4.2. The data value chain

The process of getting the data from the data producer to the consumer can be viewed as a value chain. An example of a data value chain is presented in Figure 1. Typically, data from a wide variety of providers are used in the value chain. Traditionally, the data used have been transmitted on request from one data centre to another, and used in the specific processing chains that requested the data. The focus on reuse of data in various contexts has been missing.

Value chain
Figure 1. Value chain for data.

Datasets and metadata are what travels through the value chain. At the end of the data management value chain are the data consumers.

4.3. Metadata

Metadata is a broad concept. In our data management model the term "metadata" is used in several contexts, specifically the five categories that are briefly described in Table 2.

Table 2. Brief introduction to different types of metadata.
Type Purpose Description Examples

Discovery metadata

Used to find relevant data

Discovery metadata are also called index metadata and are a digital version of the library index card. They describe who did what, where and when, how to access data and potential constraints on the data. They shall also link to further information on the data, such as site metadata.

ISO 19115
GCMD/DIF

Use metadata

Used to understand data found

Use metadata describes the actual content of a dataset and how it is encoded. The purpose is to enable the user to understand the data without any further communication. They describe the content of variables using standardised vocabularies, units of variables, encoding of missing values, map projections, etc.

Climate and Forecast (CF) Convention
BUFR
GRIB

Site metadata

Used to understand data found

Site metadata are used to describe the context of observational data. They describe the location of an observation, the instrumentation, procedures, etc. To a certain extent they overlap with discovery metadata, but also extend discovery metadata. Site metadata can be used for observation network design. Site metadata can be considered a type of use metadata.

WIGOS
OGC O&M

Configuration metadata

Used to tune portal services for datasets intended for data consumers (e.g., WMS)

Configuration metadata are used to improve the services offered through a portal to the user community. This can, e.g., be how to best visualise a product.

System metadata

Used to understand the technical structure of the data management system and track changes in it

System metadata covers, e.g., technical details of the storage system, web services, their purpose and how they interact with other components of the data management system, available and consumed storage, number of users and other KPI elements.

The tools and facilities used to manage the information for efficient discovery and use are further described in Section 4.7.

4.4. A data management model based on the FAIR principles

The data management model is built upon the following principles:

  • Standardisation – compliance with established international standards;

  • Interoperability – enabling machine-to-machine interfaces including standardised documentation and encoding of data;

  • Integrity – ensuring that data and data access can be maintained over time, and ensuring that the consumer receives the same data at any time of request;

  • Traceability – documentation of the provenance of a dataset, i.e., all actions taken to produce and maintain the dataset and the usage of the data in downstream systems;

  • Modularisation – enabling replacement of one component of the system without necessitating other changes.

The model’s basic functions fall into three main categories:

  1. Documentation of data using discovery and use metadata. The documentation identifies who, what, when, where, and how, and shall make it easy for consumers to find and understand data. This requires application of information containers and utilisation of controlled vocabularies and ontologies where textual representation is required. It also covers the topic of data provenance which is used to describe the origin and all actions done on a dataset. Data provenance is closely linked with workflow management. Furthermore, it covers the relationship between datasets. Application of ontologies in data documentation is closely linked to the concept of linked data.

  2. Publication and sharing of data focuses on making data accessible to consumers internally and externally. Application of standardised approaches is vital, along with cost effective solutions that are sustainable. Direct integration of data in applications for analysis through data streaming minimises the complexity and overhead in dissemination solutions. This category also covers persistent identifiers for data.

  3. Preservation of data includes short and long term management of data, which secures access and availability throughout the lifespan of the data. Good solutions in this area depend on expected and actual usage of the data. Preservation of data includes the concept of data life cycle, i.e., the documented flow of data from initial storage through to obsolescence and permanent archiving (or deletion) and preserving the metadata for the same data (even after deleting).

4.5. Human roles in data management

4.5.1. Context

Data is processed and interpreted to generate knowledge (e.g., about the weather) for end users. The knowledge can be presented as information in the form of actual data, illustrations, text or other forms of communication. In this context, an illustration is a representation of data, whereas data means the numerical values needed to analyse and interpret a natural process (i.e., calibrated or with calibration information; it must be possible to understand the meaning of the numerical value from the available and machine-readable information).

information to knowledge

Definition:

Data here means the numerical values needed to analyse and interpret a natural process (i.e., calibrated or with calibration information, provenance, etc.; it must be possible to understand the meaning of the numerical value from the available and machine-readable information).

Advanced users typically consume some type of data in order to process and interpret it, and produce new knowledge, e.g., in the form of a new dataset or other information. The datasets can be organised in different levels, such as the WMO WIGOS definition for levels of data. Less advanced users apply information based on data (e.g., an illustration) to make decisions (e.g., clothing adapted to the forecast weather).

4.5.2. User definitions

We define two types of users:

  1. Producers: Those that create / produce data

  2. Consumers: Those that consume data

A consumer of one level of data is typically a producer of data at the next level. A user can both consume data and produce data, or just have one of these roles (e.g., at the start/end of the production chain).

user definitions
4.5.2.1. Data consumers

We distinguish between three types of data consumers: (1) advanced, (2) intermediate, and (3) simple. These are defined below.

4.5.2.1.1. Advanced consumers

Advanced consumers require information in the form of data and metadata (including provenance) to gain a full understanding of what data exists and how to use it (discovery and use metadata), and to automate the generation of derived data (new knowledge generation), verification (of information), and validation of data products.

Example questions:

  • Need all historical weather data that can be used to model/predict the weather in the future

Specific consumers:

  • Researcher (e.g., for climate projections within the "Klima i Norge 2100" research project)

4.5.2.1.2. Intermediate consumers

Intermediate consumers need enough information to find data and understand if it can answer their question(s) (discovery and use metadata). Also, they often want to cross reference a dataset with another dataset or metadata for inter-comparative verification of information.

Example questions:

  • Is this observation a record / weather extreme (coldest, warmest, wettest)?

  • What was the amount of rain in last month in a certain watershed?

Specific consumers:

  • Klimavakt (MET)

  • Developer (app, website, control systems, machine learning, etc.)

  • Energy sector (hydro, energy prices)

  • External partners

4.5.2.1.3. Simple consumers

Simple consumers do not have any prior knowledge about the data. Information in the form of text or illustrations is sufficient for their decision making. They do not need to understand either data or metadata, and they are most likely looking for answers to simple questions.

Example questions:

  • Will it be raining today?

  • Can the event take place, or will the weather impede it?

  • When should I harvest my crops?

Specific consumers:

  • Event organizer

  • Journalist

  • Farmer, or other people who work with the land like tree planters

Note

An advanced consumer may discover information pertaining to a role as a simple consumer. Such a user may, for some reason, be interested in tracking the data in order to use it together with other data (interoperability) or to verify the information. Therefore, it is important to have provenance metadata pointing to the basic data source(s) also at the simplest information level.

4.5.2.2. Data producers

A producer is an advanced consumer at one level of data who generates new information at a higher level. This new information could be in the form of actual data or simple information, such as an illustration or a text summary. It is essential that any information can be traced back to its source(s).

4.5.2.3. Data Management Roles

Between the data providers and data consumers are the processes that manage and deliver the datasets (cf. Figure 1). A number of human roles may be defined with responsibilities that, together, ensure that these processes are carried out in accordance with the data management requirements of the organisation. The definition and filling of these roles depend heavily on the particular organisation, and each organisation must devise its own best solution.

4.6. Summary of data management requirements

The data management regime described in this DMH follows the Arctic Data Centre model and shall ensure that:

  1. There are relevant metadata for all datasets, and both data and metadata are available in a form and in such a way that they can be utilised by both humans and machines

    • There are sufficient metadata for each dataset for both discovery and use purposes

    • Discovery metadata are indexed and can be retrieved from available services in a standard way and with standard protocols

    • There are interfaces for discovery, visualisation and download, as well as portals for human access, that operate seamlessly across institutions

    • The data are described in a relevant, standardised and managed vocabulary that supports machine-to-machine interfaces

    • Datasets have attached a unique and permanent identifier that enables traceability

    • Datasets have licensing that ensures free use and reuse wherever possible

    • Datasets are available for download in a standard form according to the FAIR guiding principles and through standard protocols that are accepted and utilised in the user environment

    • There are authentication and authorisation mechanisms that ensure access control to data with restrictions, and that are compatible with and coupled to relevant public authentication solutions (FEIDE, eduGAIN, Google, etc.)

  2. There is an organisation that provides for the management of each dataset throughout its lifetime (life cycle management)

    • There is documentation that describes physical storage, lifetime of each dataset, degree of storage redundancy, metadata consistency methods, how dataset versioning is implemented and unique IDs to ensure traceability

    • The organisation provides seamless access to data from distributed data centres through various portals

    • The above and a business model at dataset level are described in a Data Management Plan (DMP)

  3. There are services or tools that provide the following functionalities on the datasets:

    • Transformations

      • Subsetting

      • Slicing of gridded datasets to points, sections, profiles

      • Reprojection

      • Resampling

      • Reformatting

    • Visualisation (time series, mapping services, etc.)

    • Aggregation

    • Upload of new datasets (including enabling and configuring data access services)
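
As an illustration of the subsetting and slicing transformations listed above, the following sketch streams a gridded dataset over OPeNDAP and extracts a point time series before any bulk download. It assumes Python with xarray and an OPeNDAP-capable backend (e.g., netCDF4); the URL, variable name and coordinate names are hypothetical placeholders and do not refer to any particular service.

  # Minimal sketch: server-side subsetting/slicing via OPeNDAP (hypothetical URL and names)
  import xarray as xr

  url = "https://thredds.example.org/thredds/dodsC/forecasts/latest.nc"  # placeholder URL
  ds = xr.open_dataset(url)  # data are streamed on demand; nothing is downloaded up front

  # Slice the gridded field to a single point, i.e., a time series near a chosen location
  point = ds["air_temperature"].sel(latitude=59.91, longitude=10.75, method="nearest")

  # Temporal subsetting before any data values are transferred
  subset = point.sel(time=slice("2024-01-01", "2024-01-02"))
  print(subset.values)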

4.7. Structuring and Documenting Data

Purpose

In order to properly find, understand and use geophysical data, standardised encoding and documentation, i.e., metadata, are required.

Both discovery metadata and use metadata can be embedded in the files produced for a dataset through utilisation of self-explaining file formats. If this is done properly by the data producer, publication and preservation of data through services are simplified and can be automated.
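
As a minimal sketch of such embedding, the following Python example (assuming the netCDF4 library is available) writes CF use metadata and ACDD discovery metadata as attributes of a self-describing NetCDF file; all attribute values shown are illustrative placeholders, not institutional requirements.

  # Minimal sketch: embedding use (CF) and discovery (ACDD) metadata in a NetCDF file
  import numpy as np
  from netCDF4 import Dataset

  nc = Dataset("example_t2m.nc", "w", format="NETCDF4")

  # Use metadata (CF): dimensions, variables, units and standard names
  nc.createDimension("time", None)
  time = nc.createVariable("time", "f8", ("time",))
  time.units = "seconds since 1970-01-01 00:00:00"
  time.standard_name = "time"

  t2m = nc.createVariable("air_temperature", "f4", ("time",), fill_value=-9999.0)
  t2m.units = "K"
  t2m.standard_name = "air_temperature"

  # Discovery metadata (ACDD) embedded as global attributes
  nc.Conventions = "CF-1.8, ACDD-1.3"
  nc.title = "Example 2 m air temperature time series"        # placeholder
  nc.summary = "Illustrative dataset with embedded discovery metadata."
  nc.keywords = "EARTH SCIENCE > ATMOSPHERE > ATMOSPHERIC TEMPERATURE"
  nc.creator_name = "Jane Doe"                                 # placeholder
  nc.date_created = "2024-01-01T00:00:00Z"                     # placeholder
  nc.license = "https://creativecommons.org/licenses/by/4.0/"  # placeholder

  time[:] = np.array([0.0, 3600.0])
  t2m[:] = np.array([271.3, 271.9])
  nc.close()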

Implementation

An essential prerequisite for structuring and documenting data is the specification of the dataset(s), cf. Section 2.1.1. The dataset is the basic building block of our data management model; all the documentation and services described in this DMH are built on datasets. The dataset specification is the first step in structuring one’s data for efficient management, and it is mandatory.

4.7.1. Structuring and documenting data at [insert organisation here]

4.7.2. Current practice in structuring and documenting data

Table 3. Data types available at [insert institution here], with the file formats supported. The primary file format is marked in bold
Supported file formats/structures Datatype Available metadata Examples

Comments

4.7.3. Planned developments in the near-term (< 2 years)

4.7.4. Expected evolution in the longer term (> 2 years)

4.8. Data Services

Purpose

The purpose of this chapter is to describe services that benefit from the standardisation performed in the previous step. The information structures described in the previous chapter pave the way for efficient data discovery and use through tools and automated services. Implementation of the services must be in line with the institute’s delivery architecture.

Data services include:

  1. Data ingestion, storing the data in the proper locations for long term preservation and sharing;

  2. Data cataloging, extracting the relevant information for proper discovery of the data (see the sketch following this list);

  3. Configuration of visualisation and data publication services (e.g., OGC WMS and OPeNDAP).
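
As a minimal sketch of the cataloguing step, assuming that discovery metadata are embedded in the files as ACDD global attributes (cf. Section 4.7), the following Python example (using the netCDF4 library) reads those attributes into a simple record that could then be indexed by a catalogue service. The field list and the output structure are simplified illustrations, not a specific catalogue schema such as MMD or ISO 19115.

  # Minimal sketch: extract embedded ACDD discovery metadata for cataloguing
  import json
  from netCDF4 import Dataset

  ACDD_FIELDS = ["title", "summary", "keywords", "creator_name",
                 "date_created", "license", "Conventions"]

  def extract_discovery_record(path):
      """Read ACDD global attributes from a NetCDF file into a dict."""
      with Dataset(path) as nc:
          return {name: getattr(nc, name) for name in ACDD_FIELDS
                  if name in nc.ncattrs()}

  # Example usage with the file produced in the previous sketch (hypothetical path)
  record = extract_discovery_record("example_t2m.nc")
  print(json.dumps(record, indent=2))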

Implementation

When planning and implementing data services, a number of external requirements constrain the choices, especially if national and international reuse of solutions is intended for the data in question. At the national level, important constraints are imposed by the national implementation of the INSPIRE directive through Norge digitalt.

4.8.1. Legacy

4.8.2. Planned developments in near-term (< 2 years)

4.8.3. Expected evolution in the longer term (> 2 years)

4.9. User Portals and Documentation

Purpose

The purpose of this chapter is to describe the human interfaces which data consumers would use to navigate data and the related services. A portal is an entry point for data consumers, enabling them to discover and search for datasets and services, and providing sufficient documentation and guidance to ensure that they are able to serve themselves using the interactive and machine interfaces offered.

Here, we can distinguish between a general portal for all publishable datasets from the institution and targeted portals that offer a focused selection of data, which may include external datasets. Targeted portals cater to specific user groups and may have a limited lifetime, but can also be long-term commitments.

Implementation of user portals at [institution]

Table 4. User portals in use at [institution]
User portal Description General or targeted portal Data consumer

Example

4.9.1. Current implementation

4.9.2. Planned developments in near-term (< 2 years)

4.9.3. Expected evolution in the longer term (> 2 years)

4.10. Data Governance

Purpose

This chapter describes how we organise and steer data management activities in order to ensure that:

  1. The guidelines described above are implemented throughout the organisation;

  2. Our data management practices are in line with and contribute to the institute’s strategic aims;

  3. Our data management regime is subject to review, analysis and revision in a timely manner.

These higher level aspects of data management are often referred to as data governance. A useful definition is:

"Data governance …​ is the overall management of the availability, usability, integrity and security of data used in an enterprise. A sound data governance program includes a governing body or council, a defined set of procedures and a plan to execute those procedures."

In this chapter we address many aspects of this definition, but a full description of data governance touches on management structures that are beyond the scope of this handbook.

Data life cycle management

Data life cycle management is steered by documentation describing how data generated or used in an activity will be handled throughout the lifetime of the activity and after the activity has been completed. This is living documentation that follows the activity and specifies what kind of data will be generated or acquired, how the data will be described, where the data will be stored, whether and how the data can be shared, and how the data will be retired (archived or deleted). The purpose of life cycle management is to safeguard the data, not just during their “active” period but also for future reuse of the data, and to facilitate cost-effective data handling.

This DMH recommends that the following life cycle management concepts be implemented at the institution:

  • An institution specific Data Management Handbook (DMH) based on a common general template;

  • Extended discovery metadata for data in internal production chains (these are metadata elements that provide the necessary information for life cycle management just described); and

  • A Data Management Plan (DMP) document (a DMP is expected for datasets produced in external projects, but may also be useful for internal datasets, as a supplement to the extended discovery metadata).

The goal is that life cycle management information shall be readily available for every dataset managed by the institute. How these concepts are implemented is described in the subsections below.

Data Management Plan

A Data Management Plan (DMP) is a document that describes textually how the data life cycle management will be carried out for datasets used and produced in specific projects. Generally, these are externally financed projects for which such documentation is required by funding agencies. However, larger internal projects covering many datasets may also find it beneficial to create a specific document of this type.

Currently, agencies funding R&D (such as NFR and the EU) do not strictly require a DMP from the start of every project. However, for projects in the geosciences, data management is an issue that must be addressed, and the agencies strongly recommend a DMP. For example, NFR publishes guidelines for the contents of a DMP, including links to tools (templates and online services); these guidelines are recommended for any data management project or activity and, according to NFR, will in time become a requirement.

4.10.1. Data governance at [insert organisation here]

4.10.2. Current implementation

4.10.2.1. Organisational Roles
4.10.2.2. Status DMH
4.10.2.3. Status Discovery metadata
4.10.2.4. Status DMP

4.10.3. Planned developments in the near-term (< 2 years)

Revise DMH annually or when needed.

4.10.4. Expected evolution in the longer term (> 2 years)

Revise DMH annually or when needed.

5. References

5.1. FAIR principles

FAIR stands for findability, accessibility, interoperability and reusability.

By following the FAIR principles it is easier to obtain a common approach to data management, or a unified data management model. One of the main motivations for implementing a unified data management is to better serve the users of the data. Primarily, this can be approached by making user needs and requirements the guide for determining what data we provide and how. For example, it will be described below how the specification of datasets should be determined. By implementing the data management practices described here, it is expected that users will experience:

  • Ease of discovering, viewing and accessing datasets;

  • Standardised ways of accessing data, including downloading or streaming data, with reduced need for special solutions on the user side;

  • Reduced storage needs;

  • Simple and standard access to remote datasets and catalogues, with own data visualisation and analysis tools;

  • Ability to compare and combine data from internal and external sources;

  • Ability to apply common data transformations, like spatial, temporal and variables subsetting and reprojection, before downloading anything;

  • Possibility to build specialized metadata catalogues and data portals targeting a specific user community.

5.1.1. FAIR data management model

The data management model is built upon the following principles:

  • Standardisation – compliance with established international standards;

  • Interoperability – enabling machine-to-machine interfaces including standardised documentation and encoding of data;

  • Integrity – ensuring that data and data access can be maintained over time, and ensuring that the consumer receives the same data at any time of request;

  • Traceability – documentation of the provenance of a dataset, i.e., all actions taken to produce and maintain the dataset and the usage of the data in downstream systems;

  • Modularisation – enabling replacement of one component of the system without necessitating other changes.

The model’s basic functions fall into three main categories:

  1. Documentation of data using discovery and use metadata. The documentation identifies who, what, when, where, and how, and shall make it easy for consumers to find and understand data. This requires application of information containers and utilisation of controlled vocabularies and ontologies where textual representation is required. It also covers the topic of data provenance which is used to describe the origin and all actions done on a dataset. Data provenance is closely linked with workflow management. Furthermore, it covers the relationship between datasets. Application of ontologies in data documentation is closely linked to the concept of linked data.

  2. Publication and sharing of data focuses on making data accessible to consumers internally and externally. Application of standardised approaches is vital, along with cost effective solutions that are sustainable. Direct integration of data in applications for analysis through data streaming minimises the complexity and overhead in dissemination solutions. This category also covers persistent identifiers for data.

  3. Preservation of data includes short and long term management of data, which secures access and availability throughout the lifespan of the data. Good solutions in this area depend on expected and actual usage of the data. Preservation of data includes the concept of the data life cycle, i.e., the documented flow of data from initial storage through to obsolescence and permanent archiving (or deletion), as well as preservation of the metadata for the data (even after deletion).

Principles of standardised data documentation, publication, sharing and preservation have been formalised in the FAIR Guiding Principles for scientific data management and stewardship [RD3] through a process facilitated by FORCE11.

5.2. External data management requirements and forcing mechanisms

Any organisation that strives to implement a FAIR data management model has to relate to external forcing mechanisms concerning data management at several levels. At the national level, the organisation must comply with national regulations as decided by the government. Some of these are indications of expected behaviour (e.g., OECD regulations) and some are implemented through a legal framework. The Norwegian government has over time promoted free and open sharing of public data. Mechanisms for how to do this are governed by the Geodataloven (implemented as Geonorge), which is a national implementation of the European INSPIRE directive (to be amended in 2019). INSPIRE defines a federated multinational Spatial Data Infrastructure (SDI) for the European Union, similar to NSDI in the USA or UNSDI under the United Nations. The goal is to provide standardised access to data and the necessary tools to be able to work with the data in a unified manner. In short, these legal frameworks require standardised documentation (at discovery and use level; these concepts are described later) and access (through specified protocols) to the data identified.

Other external requirements and forcing mechanisms that are organisation-specific are provided in [specialized-external-requirements].

Acknowledgements

At various stages during the writing of the first version of this handbook, we have solicited comments on the manuscript from coworkers at MET Norway (in alphabetical order): Åsmund Bakketun, Arild Burud, Lara Ferrighi, Håvard Futsæter and Nina Larsgård. Their comments and advice are gratefully acknowledged. In addition, we thank members of the top management at MET Norway, including Lars-Anders Breivik, Bård Fjukstad, Jørn Kristensen, Anne-Cecilie Riiser, Roar Skålin and Cecilie Stenersen, who have provided valuable criticism and advice.

While working on the second version of this handbook, valuable input has come from Matteo De Stefano from NINA. This input has made it possible to transform this handbook into a tool that can be adopted by institutions outside of MET Norway.

Glossary of Terms and Names

Term Description

Application service

TBC

CDM dataset

A dataset that “may be a NetCDF, HDF5, GRIB, etc. file, an OPeNDAP dataset, a collection of files, or anything else which can be accessed through the NetCDF API.” Unidata Common Data Model

Controlled vocabulary

A carefully selected list of terms (words and phrases) controlled by some authority. They are used to tag information elements (such as datasets) so that they are easier to search for (see Wikipedia article). A basic element in the implementation of the Semantic Web.

Data Governance

See the definition quoted in Section 4.10, from TechTarget (https://searchdatamanagement.techtarget.com/definition/data-governance). An alternative definition by George Firican: “Data Governance is the discipline which provides all data management practices with the necessary foundation, strategy, and structure needed to ensure that data is managed as an asset and transformed into meaningful information.” (http://www.lightsondata.com/what-is-data-governance/ which also contains several more definitions.)

Data life cycle management

“Data life cycle management (DLM) is a policy-based approach to managing the flow of an information system’s data throughout its life cycle: from creation and initial storage to the time when it becomes obsolete and is deleted.” Excerpt from TechTarget article. Alias: life cycle management

Data Management Plan

“A data management plan (DMP) is a written document that describes the data you expect to acquire or generate during the course of a research project, how you will manage, describe, analyse, and store those data, and what mechanisms you will use at the end of your project to share and preserve your data.” Stanford Libraries

Data centre

A combination of a (distributed) data repository and the data availability services and information about them (e.g., a metadata catalog). A data centre may include contributions from several other data centres.

Data management

How datasets are handled by the organisation through the entire value chain, including receiving, storing, metadata management and data retrieval.

Data provenance

“The term ‘data provenance’ refers to a record trail that accounts for the origin of a piece of data (in a database, document or repository) together with an explanation of how and why it got to the present place.” (Gupta, 2009). See also Boohers (2015)

Data repository

A set of distributed components that will hold the data and ensure they can be queried and accessed according to agreed protocols. This component is also known as a Data Node.

Dataset

A dataset is a pre-defined grouping or collection of related data for an intended use. Datasets may be categorised by:

  • Source, such as observations (in situ, remotely sensed) and numerical model projections and analyses;

  • Processing level, such as "raw data" (values measured by an instrument), calibrated data, quality-controlled data, derived parameters (preferably with error estimates), temporally and/or spatially aggregated variables;

  • Data type, including point data, sections and profiles, lines and polylines, polygons, gridded data, volume data, and time series (of points, grids, etc.).

Data having all of the same characteristics in each category, but different independent variable ranges and/or responding to a specific need, are normally considered part of a single dataset. In the context of data preservation a dataset consists of the data records and their associated knowledge (information, tools). In practice, our datasets should conform to the Unidata CDM dataset definition, as much as possible.

Dynamic geodata

Data describing geophysical processes which are continuously evolving over time. Typically these data are used for monitoring and prediction of the weather, sea, climate and environment. Dynamic geodata is weather, environment and climate-related data that changes in space and time and is thus descriptive of processes in nature. Examples are weather observations, weather forecasts, pollution (environmental toxins) in water, air and sea, information on the drift of cod eggs and salmon lice, water flow in rivers, driving conditions on the roads and the distribution of sea ice. Dynamic geodata provides important constraints for many decision-making processes and activities in society.

FAIR principles

The four foundational principles of good data management and stewardship: Findability, Accessibility, Interoperability and Reusability. Nature article [RD3], FAIR Data Principles, FAIR metrics proposal, EU H2020 Guidelines

Feature type

A categorisation of data according to how they are stored, for example, grid, time series, profile, etc. It has been formalised in the NetCDF/CF feature type table, which currently defines eight feature types.

Geodataloven

"Norwegian regulation toward good and efficient access to public geographic information for public and private purposes." See Deling av geodata – Geodataloven.

Geonorge

"Geonorge is the national website for map data and other location information in Norway. Users of map data can search for any such information available and access it here." See Geonorge.

Geographic Information System

A geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage and present spatial or geographic data (Clarke, K. C., 1986). GIS systems have lately evolved into distributed Spatial Data Infrastructures (SDIs).

Glossary

Terms and their definitions, possibly with synonyms.

Interoperability

The ability of data or tools from non-cooperating resources to integrate or work together with minimal effort.

Linked data

A method of publishing structured data so that they can be interlinked and become more useful through semantic queries, i.e., through machine-machine interactions. (see Wikipedia article)

Metadata - Discovery metadata

See Discovery metadata definition in Table 5

Metadata - Configuration metadata

See Configuration metadata definition in Table 5

Metadata - Site metadata

See Site metadata definition in Table 5

Metadata - Use metadata

See Use metadata definition in Table 5

Ontology

A set of concepts with attributes and relationships that define a domain of knowledge.

OpenSearch

A collection of simple formats for the sharing of search results (OpenSearch)

Product

"Product" is not a uniquely defined term among the various providers of dynamical geodata, either nationally or internationally. It is often used synonymously with "dataset." For the sake of clarity, "product" is not used in this handbook. The term "dataset" is adequate for our purpose.

Semantic web

“The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries". W3C (see Wikipedia article)

Spatial Data Infrastructure

"Spatial Data Infrastructure (SDI) is defined as a framework of policies, institutional arrangements. technologies, data, and people that enables the sharing and effective usage of geographic information by standardising formats and protocols for access and interoperability." (Tonchovska et al, 2012). SDI has evolved from GIS. Among the largest implementations are: NSDI in the USA, INSPIRE in Europe and UNSDI as an effort by the United Nations. For areas in the Arctic, there is arctic-sdi.org.

Unified data management

A common approach to data management in a grouping of separate data management enterprises.

Web portal

A central website where all users can search, browse, access, transform, display and download datasets irrespective of the data repository in which the data are held.

Web service

Web services are used to communicate metadata and data, and to offer processing services. Much effort has been put into the standardisation of web services to ensure they are reusable in different contexts. In contrast to web applications, web services communicate with other programs, instead of interactively with users. (See TechTerms article)

Workflow management

Workflow management is the process of tracking the data, software and other actions that transform data into a new form of the data. It is related to data provenance, but is usually used in the context of workflow management systems.

(Scientific) Workflow management systems

A scientific workflow system is a specialised form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application. (Wikipedia) As of today, many different frameworks exist with their own proprietary languages; these might eventually be connected by using a common workflow definition language.

Metadata

Metadata is a broad concept. In our data management model the term "metadata" is used in several contexts, specifically the five categories that are briefly described in Table 5.

Table 5. Brief introduction to different types of metadata.
Type Purpose Description Examples

Discovery metadata

Used to find relevant data

Discovery metadata are also called index metadata and are a digital version of the library index card. They describe who did what, where and when, how to access data and potential constraints on the data. They shall also link to further information on the data, such as site metadata.

ISO 19115
GCMD/DIF

Use metadata

Used to understand data found

Use metadata describe the actual content of a dataset and how it is encoded. The purpose is to enable the user to understand the data without any further communication. They describe the content of variables using standardised vocabularies, units of variables, encoding of missing values, map projections, etc.

Climate and Forecast (CF) Convention
BUFR
GRIB

Site metadata

Used to understand data found

Site metadata are used to describe the context of observational data. They describe the location of an observation, the instrumentation, procedures, etc. To a certain extent they overlap with discovery metadata, but also extend discovery metadata. Site metadata can be used for observation network design. Site metadata can be considered a type of use metadata.

WIGOS
OGC O&M

Configuration metadata

Used to tune portal services for datasets intended for data consumers (e.g., WMS)

Configuration metadata are used to improve the services offered through a portal to the user community. This can, e.g., be how best to visualise a dataset.

System metadata

Used to understand the technical structure of the data management system and track changes in it

System metadata covers, e.g., technical details of the storage system, web services, their purpose and how they interact with other components of the data management system, available and consumed storage, number of users, and other KPI elements.

The tools and facilities used to manage the information for efficient discovery and use are further described in Section 4.7.

List of Acronyms

This list contains acronyms used throughout the DMH. The column "General/Specific" indicates if the acronyms are used in the general part of the DMH (from the template) or if they are organisation specific.

Acronym Meaning General/specific

ACDD

Attribute Convention for Dataset Discovery [RD5]

ADC

Arctic Data Centre (ADC)

AeN

Arven etter Nansen (English: Nansen Legacy)

BUFR

Binary Universal Form for the Representation of meteorological data. WMO standard format for binary data, particularly for non-gridded data (BUFR)

CDM

Unidata Common Data Model (CDM)

CF

Climate and Forecast Metadata Conventions (CF)

CMS

Content Management System

CSW

Catalog Service for the Web (CSW)

DAP

Data access protocol (DAP)

DBMS

DataBase Management System (DBMS)

DIANA

Digital Analysis tool for visualisation of geodata, open source from MET Norway (DIANA)

diana-WMS

WMS implementation in DIANA

DIAS

Copernicus Data and Information Access Services (DIAS)

DIF

Directory Interchange Format of GCMD (DIF)

DLM

Data life cycle management (DLM)

DM

Data Manager

DMH

Data Management Handbook (this document)

DMCG

Data Management Coordination Group

DMP

Data Management Plan (DMP definition, easyDMP tool)

DOI

Digital Object Identifier (DOI)

eduGAIN

The Global Academic Interfederation Service (eduGAIN)

ENVRI

European Environmental Research Infrastructures (ENVRI)

ENVRI FAIR

"Making the ENV RIs data services FAIR." A proposal to the EU’s Horizon 2020 call INFRAEOSC-04

EOSC

European Open Science Cloud (EOSC)

ERDDAP

NOAA Environmental Research Division Data Access Protocol (ERDDAP)

ESA

European Space Agency (ESA)

ESGF

Earth System Grid Federation (ESGF)

EWC

European Weather Cloud (EWC)

FAIR

Findability, Accessibility, Interoperability and Reusability [RD3]

FEIDE

Identity Federation of the Norwegian National Research and Education Network (UNINETT) (FEIDE)

FFI

Norwegian Defence Research Establishment (FFI)

FORCE11

Future of Research Communication and e-Scholarship (FORCE11)

GCMD

Global Change Master Directory (GCMD)

GCW

Global Cryosphere Watch (GCW)

GeoAccessNO

An NFR-funded infrastructure project, 2015- (GeoAccessNO)

GIS

Geographic Information System

GRIB

GRIdded Binary or General Regularly-distributed Information in Binary form. WMO standard file format for gridded data (GRIB)

HDF, HDF5

Hierarchical Data Format (HDF)

Hyrax

OPeNDAP 4 Data Server (Hyrax)

IMR

Institute of Marine Research (IMR)

INSPIRE

Infrastructure for Spatial Information in Europe (INSPIRE)

ISO 19115

ISO standard for geospatial metadata (ISO 19115-1:2014).

IPY

International Polar Year (IPY)

JRCC

Joint Rescue Coordination Centre (Hovedredningssentralen)

KDVH

KlimaDataVareHus

Specific to MET Norway

KPI

Key Performance Indicator (KPI)

METCIM

MET Norway Crisis and Incident Management (METCIM)

Specific to MET Norway

METSIS

MET Norway Scientific Information System

Specific to MET Norway

MMD

Met.no Metadata Format MMD

MOAI

Meta Open Archives Initiative server (MOAI)

ncWMS

WMS implementation for NetCDF files (ncWMS)

NERSC

Nansen Environmental and Remote Sensing Center (NERSC)

NetCDF

Network Common Data Format (NetCDF)

NetCDF/CF

A common combination of NetCDF file format with CF-compliant attributes.

NFR

The Research Council of Norway (NFR)

NILU

Norwegian Institute for Air Research (NILU)

NIVA

Norwegian Institute for Water Research (NIVA)

NMDC

Norwegian Marine Data Centre, NFR-supported infrastructure project 2013-2017 (NMDC)

NorDataNet

Norwegian Scientific Data Network, an NFR-funded project 2015-2020 (NorDataNet)

Norway Digital

Norwegian national spatial data infrastructure organisation (Norway Digital). Norwegian: Norge digitalt

NORMAP

Norwegian Satellite Earth Observation Database for Marine and Polar Research, an NFR-funded project 2010-2016 (NORMAP)

NRPA

Norwegian Radiation Protection Authority (NRPA)

NSDI

National Spatial Data Infrastructure, USA (NSDI)

NVE

Norwegian Water Resources and Energy Directorate (NVE)

NWP

Numerical Weather Prediction

OAI-PMH

Open Archives Initiative - Protocol for Metadata Harvesting (OAI-PMH)

OAIS

Open Archival Information System (OAIS)

OCEANOTRON

Web server dedicated to the dissemination of ocean in situ observation data collections (OCEANOTRON)

OECD

The Organisation for Economic Co-operation and Development (OECD)

OGC

Open Geospatial Consortium (OGC)

OGC O&M

OGC Observations and Measurements standard (OGC O&M)

OLA

Operational-level Agreement (OLA)

OPeNDAP

Open-source Project for a Network Data Access Protocol (OPeNDAP) - reference server implementation

PID

Persistent Identifier (PID)

RM-ODP

Reference Model of Open Distributed Processing (RM-ODP)

PROV

A W3C Working Group on provenance and a Family of Documents (PROV)

SAON

Sustaining Arctic Observing Networks (SAON/IASC)

SDI

Spatial Data Infrastructure

SDN

SeaDataNet, Pan-European infrastructure for ocean & marine data management

SIOS

Svalbard Integrated Arctic Earth Observing System

SIOS-KC

SIOS Knowledge Centre, an NFR-supported project 2015-2018 (SIOS-KC)

SKOS

Simple Knowledge Organization System (SKOS)

SLA

Service-level Agreement (SLA)

SolR

Apache Enterprise search server with a REST-like API (SolR)

StInfoSys

MET Norway’s Station Information System

Specific to MET Norway

TDS

THREDDS Data Server (TDS)

THREDDS

Thematic Real-time Environmental Distributed Data Services

UNSDI

United Nations Spatial Data Infrastructure (UNSDI)

UUID

Universally Unique Identifier (UUID)

W3C

World Wide Web Consortium (W3C)

WCS

OGC Web Coverage Service (WCS)

WFS

OGC Web Feature Service (WFS)

WIGOS

WMO Integrated Global Observing System (WIGOS)

WIS

WMO Information System (WIS)

WMO

World Meteorological Organisation (WMO)

WMS

OGC Web Map Service (WMS)

WPS

OGC Web Processing Service (WPS)

YOPP

Year of Polar Prediction (YOPP Data Portal)

Appendix A: List of Referenced Software or Services

Name Description Reference

Fimex package

including fimex

File Interpolation, Manipulation and EXtraction library for gridded geospatial data

wiki.met.no documentation

github repository

frost2nc

Dump observational time series from KDVH to NetCDF files

github repository

met_moai

OAI-PMH implementation based on MOAI

github repository

mdharvest

Perl and Python code to harvest discovery metadata using OAI-PMH, OpenSearch and OGC CSW

github repository

METSIS-data-ingestion

A generic utility to index MMD datasets and thumbnails to SolR.

github repository: metsis-metadata

METSIS-data-drupal

A module linking the METSIS back-end services to the Drupal CMS

github repository: metsis-metadata

METSIS-station-handling

TBC

TBC

METSIS-ts

WPS or HTTP interface to graphical diagrams.

Not yet openly available, but a beta version is in use in the ADC, SIOS, GCW, NorDataNet, YOPP and APPLICATE portals.

MMD XSD

XML Schema document for MMD

github repository: mmd

nc_to_mmd.py

Builds MMD metadata from ACDD-compliant NetCDF file attributes.

github repository: py-mmd-tools

NorDataNet validator

Validates NetCDF files for CF and ACDD compliance.

Access URL

threddsIso

Extracting discovery metadata from NetCDF/CF files with ACDD to ISO 19115

github repository

Appendix B: Users of [insert institution]’s data

Users are divided into categories by type of collaboration with [insert institution], not by type of service they consume.

User category Description Examples