Example of creating a timeseries dataset in xarray

Example of creating a simple timeseries in xarray with attributes for S-ENDA

In [ ]:
%pip install xarray netCDF4
In [ ]:
import xarray as xr
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

Create a timeseries dataset

This creates the dataset from a DataFrame built in memory; the same DataFrame could instead be read from a CSV file with pd.read_csv.

In [ ]:
now = datetime.utcnow()
df = pd.DataFrame(
    dict(
        time=[(now+timedelta(days=d)).replace(microsecond=0) for d in range(0,5)],
        temperature=[4, None, 8, 22, -1],
        turbidity=[None, 23.8, 2.5, 32.2, 4.1],
    )
)
ds = xr.Dataset.from_dataframe(df.set_index(["time"]))
ds
Out[ ]:
<xarray.Dataset>
Dimensions:      (time: 5)
Coordinates:
  * time         (time) datetime64[ns] 2024-01-17T14:04:20 ... 2024-01-21T14:...
Data variables:
    temperature  (time) float64 4.0 nan 8.0 22.0 -1.0
    turbidity    (time) float64 nan 23.8 2.5 32.2 4.1
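
The CSV route mentioned above could look roughly like this. It is a minimal sketch: the CSV content and the names csv_text, df_csv and ds_csv are made up for illustration, and a real file path would normally be passed to pd.read_csv instead of an in-memory buffer.

```python
import io

import pandas as pd
import xarray as xr

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = """time,temperature,turbidity
2024-01-17T14:04:20,4,
2024-01-18T14:04:20,,23.8
2024-01-19T14:04:20,8,2.5
"""

# parse_dates turns the time column into datetime values;
# empty cells become NaN, just like None in the DataFrame above.
df_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

# Same conversion as before: the index becomes the time dimension.
ds_csv = xr.Dataset.from_dataframe(df_csv.set_index("time"))
```

The resulting dataset has the same layout as the one built from the in-memory DataFrame: one time dimension and one data variable per remaining column.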

Update coordinates with location and metadata

A dataset supports metadata (attributes) on each variable, including its coordinates.

In [ ]:
lat, lon = 60.3833, 5.3443
ds = ds.assign_coords(
    dict(
        # we don't need location dimension when it is just one point
        longitude=xr.Variable((), lon, dict(standard_name="longitude", long_name="Longitude", units="degree_east", axis="X")),
        latitude=xr.Variable((), lat, dict(standard_name="latitude", long_name="Latitude", units="degree_north", axis="Y")),
        time=xr.Variable("time", ds.time, dict(standard_name="time", long_name="Time of measurement", axis="T")),
    )
)
ds
Out[ ]:
<xarray.Dataset>
Dimensions:      (time: 5)
Coordinates:
    longitude    float64 5.344
    latitude     float64 60.38
  * time         (time) datetime64[ns] 2024-01-17T14:04:20 ... 2024-01-21T14:...
Data variables:
    temperature  (time) float64 4.0 nan 8.0 22.0 -1.0
    turbidity    (time) float64 nan 23.8 2.5 32.2 4.1
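
The text summary does not print the coordinate attributes, but they are stored with each variable. The following sketch, using a throwaway dataset named demo (not part of the notebook), shows how a scalar coordinate built with xr.Variable keeps its attributes and has no dimensions.

```python
import xarray as xr

# A scalar (zero-dimensional) variable: empty dims tuple, one value, attrs dict.
lon_var = xr.Variable((), 5.3443, dict(standard_name="longitude", units="degree_east"))

# Throwaway dataset used only to illustrate the behaviour.
demo = xr.Dataset().assign_coords(longitude=lon_var)

# Attributes travel with the coordinate and are exposed as a plain dict.
print(demo.longitude.attrs)
# A scalar coordinate has an empty dims tuple.
print(demo.longitude.dims)
```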

Add station name

In [ ]:
ds["station_name"] = xr.DataArray("store_lungen", dims=(), attrs=dict(cf_role="timeseries_id"))

Add metadata for each data variable

In [ ]:
ds.temperature.attrs["standard_name"] = "sea_water_temperature"
ds.temperature.attrs["long_name"] = "Sea Water Temperature"
ds.temperature.attrs["units"] = "degree_Celsius"
ds.temperature.attrs["comment"] = "I lost the thermometer in Store Lungegårdsvann"

ds.turbidity.attrs["standard_name"] = "sea_water_turbidity"
ds.turbidity.attrs["long_name"] = "Sea Water Turbidity"
ds.turbidity.attrs["units"] = "NTU"
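
When there are many variables, the same assignments can be driven from a single dictionary. This is only a sketch of one possible layout: the dataset demo and the mapping variable_attrs are illustrative stand-ins, and the values are made up.

```python
import numpy as np
import xarray as xr

# Stand-in dataset with the same variable names as above (values are made up).
demo = xr.Dataset(
    dict(
        temperature=("time", np.array([4.0, np.nan, 8.0])),
        turbidity=("time", np.array([np.nan, 23.8, 2.5])),
    )
)

# One dict per variable keeps the metadata in a single place.
variable_attrs = {
    "temperature": dict(standard_name="sea_water_temperature", units="degree_Celsius"),
    "turbidity": dict(standard_name="sea_water_turbidity", units="NTU"),
}

# attrs is an ordinary dict, so update() adds or overwrites keys in place.
for name, attrs in variable_attrs.items():
    demo[name].attrs.update(attrs)
```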

Assign global attributes

In [ ]:
ds = ds.assign_attrs(
    dict(
        id="e5d54ede-685d-4951-917b-25157ce67314", # can also be set later
        naming_authority="bb.badebussen", # can also be set later
        title="Measurements in the middle of Store Lungegårdsvann",
        title_no="Målinger midt i Store Lungegårdsvann",
        summary="Measurements taken at a fixed point in Store Lungegårdsvann during my daily swim",
        summary_no="Målinger tatt på eit fast punkt under min daglige svømmetur i Store Lungegårdsvann",
        keywords=",".join(
            [
                "GCMDSK:EARTH SCIENCE > HUMAN DIMENSIONS > SUSTAINABILITY > SUSTAINABLE DEVELOPMENT",
                "GCMDLOC:CONTINENT > EUROPE > NORTHERN EUROPE > SCANDINAVIA > NORWAY",
            ]
        ),
        keywords_vocabulary=",".join(
            [
                "GCMDSK:GCMD Science Keywords:https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords",
                "GCMDLOC:GCMD Locations:https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/locations",
            ]
        ),
        iso_topic_category="Not available",
        featureType="timeseries",
        date_created=datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"),
        project="Store Lungen",
        time_coverage_start=np.datetime_as_string(ds.time.min().values, unit="s", timezone="UTC"),
        time_coverage_end=np.datetime_as_string(ds.time.max().values, unit="s", timezone="UTC"),
        geospatial_lat_min=float(ds.latitude.min()),
        geospatial_lat_max=float(ds.latitude.max()),
        geospatial_lon_min=float(ds.longitude.min()),
        geospatial_lon_max=float(ds.longitude.max()),
        spatial_representation="point",
        creator_type='institution',
        creator_institution='Badebussen',
        institution='Badebussen',
        institution_short_name='BB',
        creator_email='badebussen@lungen.bb',
        creator_url='https://badebussen.bb',
        data_owner='Badebussen',
        processing_level='Operational',
        Conventions='CF-1.7, ACDD-1.3',
        publisher_name='badebussen',
        publisher_email='publisher@badebussen.bb',
        publisher_url='https://badebussen.bb',
        license='http://spdx.org/licenses/CC-BY-4.0(CC-BY-4.0)',
        history='Created on jupyterhub',
    )
)
ds
Out[ ]:
<xarray.Dataset>
Dimensions:       (time: 5)
Coordinates:
    longitude     float64 5.344
    latitude      float64 60.38
  * time          (time) datetime64[ns] 2024-01-17T14:04:20 ... 2024-01-21T14...
Data variables:
    temperature   (time) float64 4.0 nan 8.0 22.0 -1.0
    turbidity     (time) float64 nan 23.8 2.5 32.2 4.1
    station_name  <U12 'store_lungen'
Attributes: (12/33)
    id:                      e5d54ede-685d-4951-917b-25157ce67314
    naming_authority:        bb.badebussen
    title:                   Measurements in the middle of Store Lungegårdsvann
    title_no:                Målinger midt i Store Lungegårdsvann
    summary:                 Measurements taken at a fixed point in Store Lun...
    summary_no:              Målinger tatt på eit fast punkt under min daglig...
    ...                      ...
    Conventions:             CF-1.7, ACDD-1.3
    publisher_name:          badebussen
    publisher_email:         publisher@badebussen.bb
    publisher_url:           https://badebussen.bb
    license:                 http://spdx.org/licenses/CC-BY-4.0(CC-BY-4.0)
    history:                 Created on jupyterhub
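
The time_coverage_start and time_coverage_end attributes rely on np.datetime_as_string to produce ISO 8601 strings. A small sketch of that call in isolation, with made-up timestamps standing in for ds.time:

```python
import numpy as np

# Stand-in timestamps (made up); in the notebook these come from ds.time.
times = np.array(["2024-01-17T14:04:20", "2024-01-21T14:04:20"], dtype="datetime64[ns]")

# unit="s" drops the sub-second digits; timezone="UTC" appends the "Z" suffix.
start = np.datetime_as_string(times.min(), unit="s", timezone="UTC")
end = np.datetime_as_string(times.max(), unit="s", timezone="UTC")
print(start, end)
```

Note that the values are treated as UTC already; no timezone conversion takes place, which matches the notebook's use of datetime.utcnow().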

Store the dataset

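The encoding can be specified as a dictionary. The CF conventions do not permit missing values in coordinate variables, so _FillValue is disabled for them, and some programs cannot handle int64 values, so time is written as int32.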

In [ ]:
ds.to_netcdf(
    "badebussen.nc",
    unlimited_dims=["time"],
    encoding=dict(
        time={"dtype": "int32", "_FillValue": None, "units": "seconds since 1970-01-01 00:00:00"},
        longitude={"_FillValue": None},
        latitude={"_FillValue": None},
    ),
)
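
To make the time encoding more concrete, the sketch below reproduces in plain NumPy roughly what "int32 seconds since 1970-01-01 00:00:00" means. It is an illustration of the idea, not xarray's actual implementation, and the timestamps are made up.

```python
import numpy as np

# Stand-in timestamps (made up); xarray does the equivalent when writing the file.
times = np.array(["2024-01-17T14:04:20", "2024-01-18T14:04:20"], dtype="datetime64[ns]")

epoch = np.datetime64("1970-01-01T00:00:00")

# Whole seconds since the epoch, stored as 32-bit integers.
seconds = ((times - epoch) // np.timedelta64(1, "s")).astype("int32")

# Decoding reverses the step: epoch + seconds.
decoded = epoch + seconds.astype("timedelta64[s]")
```

One consequence of this choice is that int32 seconds since 1970 can only represent times up to January 2038, which is fine for this dataset but worth keeping in mind for long-running series.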