Example of creating a timeseries dataset in xarray¶
Example of creating a simple timeseries in xarray with attributes for S-ENDA
In [ ]:
%pip install xarray netCDF4
In [ ]:
import xarray as xr
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
Create a timeseries dataset¶
This builds the dataset from a pandas DataFrame created in memory; the same data could equally be read from a CSV file with pd.read_csv.
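As a rough sketch of the CSV route mentioned above: the snippet below parses a small CSV and converts it to a dataset the same way as the DataFrame example. The CSV text, its column layout, and the names `csv_text`, `df_csv`, and `ds_csv` are assumptions made for illustration; a real file would be passed to pd.read_csv by path instead of through io.StringIO.

```python
import io

import pandas as pd
import xarray as xr

# Hypothetical CSV content standing in for a real file on disk.
# Empty fields become NaN, matching the None values in the DataFrame example.
csv_text = """time,temperature,turbidity
2024-01-17T14:04:20,4,
2024-01-18T14:04:20,,23.8
2024-01-19T14:04:20,8,2.5
"""

# parse_dates turns the time column into datetime values instead of strings.
df_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

# Index by time so that time becomes the dimension of the dataset.
ds_csv = xr.Dataset.from_dataframe(df_csv.set_index("time"))
```

The resulting `ds_csv` has a single `time` dimension with `temperature` and `turbidity` as data variables, so the remaining steps apply to it unchanged.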
In [ ]:
now = datetime.utcnow()
df = pd.DataFrame(
    dict(
        time=[(now+timedelta(days=d)).replace(microsecond=0) for d in range(0,5)],
        temperature=[4, None, 8, 22, -1],
        turbidity=[None, 23.8, 2.5, 32.2, 4.1],
    )
)
ds = xr.Dataset.from_dataframe(df.set_index(["time"]))
ds
Out[ ]:
<xarray.Dataset>
Dimensions:      (time: 5)
Coordinates:
  * time         (time) datetime64[ns] 2024-01-17T14:04:20 ... 2024-01-21T14:...
Data variables:
    temperature  (time) float64 4.0 nan 8.0 22.0 -1.0
    turbidity    (time) float64 nan 23.8 2.5 32.2 4.1
Update coordinates with location and metadata¶
A dataset supports metadata (attributes) on every variable, including its coordinates.
In [ ]:
lat, lon = 60.3833, 5.3443
ds = ds.assign_coords(
    dict(
        # we don't need location dimension when it is just one point
        longitude=xr.Variable((), lon, dict(standard_name="longitude", long_name="Longitude", units="degree_east", axis="X")),
        latitude=xr.Variable((), lat, dict(standard_name="latitude", long_name="Latitude", units="degree_north", axis="Y")),
        time=xr.Variable("time", ds.time, dict(standard_name="time", long_name="Time of measurement", axis="T")),
    )
)
ds
Out[ ]:
<xarray.Dataset>
Dimensions:      (time: 5)
Coordinates:
    longitude    float64 5.344
    latitude     float64 60.38
  * time         (time) datetime64[ns] 2024-01-17T14:04:20 ... 2024-01-21T14:...
Data variables:
    temperature  (time) float64 4.0 nan 8.0 22.0 -1.0
    turbidity    (time) float64 nan 23.8 2.5 32.2 4.1
Add station name¶
In [ ]:
ds["station_name"] = xr.DataArray("store_lungen", dims=(), attrs=dict(cf_role="timeseries_id"))
Add metadata for each data variable¶
In [ ]:
ds.temperature.attrs["standard_name"] = "sea_water_temperature"
ds.temperature.attrs["long_name"] = "Sea Water Temperature"
ds.temperature.attrs["units"] = "degree_Celsius"
ds.temperature.attrs["comment"] = "I lost the thermometer in Store Lungegårdsvann"
ds.turbidity.attrs["standard_name"] = "sea_water_turbidity"
ds.turbidity.attrs["long_name"] = "Sea Water Turbidity"
ds.turbidity.attrs["units"] = "NTU"
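When there are many variables, setting attributes one line at a time gets repetitive. One possible alternative, sketched here on a small stand-in dataset, keeps the metadata in a plain dictionary and applies it in a loop. The names `ds_demo` and `variable_attrs` are hypothetical and only serve the illustration.

```python
import xarray as xr

# Small stand-in dataset with the same variable names as above.
ds_demo = xr.Dataset(
    {
        "temperature": ("time", [4.0, 8.0]),
        "turbidity": ("time", [23.8, 2.5]),
    }
)

# Hypothetical mapping from variable name to its attributes.
variable_attrs = {
    "temperature": {
        "standard_name": "sea_water_temperature",
        "long_name": "Sea Water Temperature",
        "units": "degree_Celsius",
    },
    "turbidity": {
        "standard_name": "sea_water_turbidity",
        "long_name": "Sea Water Turbidity",
        "units": "NTU",
    },
}

# attrs is a regular dict on each variable, so update() modifies it in place.
for name, attrs in variable_attrs.items():
    ds_demo[name].attrs.update(attrs)
```

Keeping the attributes in one dictionary also makes it easier to load them from a configuration file later, if that is useful.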
Assign global attributes¶
In [ ]:
ds = ds.assign_attrs(
    dict(
        id="e5d54ede-685d-4951-917b-25157ce67314", # can also be set later
        naming_authority="bb.badebussen", # can also be set later
        title="Measurements in the middle of Store Lungegårdsvann",
        title_no="Målinger midt i Store Lungegårdsvann",
        summary="Measurements taken at a fixed point in Store Lungegårdsvann during my daily swim",
        summary_no="Målinger tatt på eit fast punkt under min daglige svømmetur i Store Lungegårdsvann",
        keywords=",".join(
            [
                "GCMDSK:EARTH SCIENCE > HUMAN DIMENSIONS > SUSTAINABILITY > SUSTAINABLE DEVELOPMENT",
                "GCMDLOC:CONTINENT > EUROPE > NORTHERN EUROPE > SCANDINAVIA > NORWAY",
            ]
        ),
        keywords_vocabulary=",".join(
            [
                "GCMDSK:GCMD Science Keywords:https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/sciencekeywords",
                "GCMDLOC:GCMD Locations:https://gcmd.earthdata.nasa.gov/kms/concepts/concept_scheme/locations",
            ]
        ),
        iso_topic_category="Not available",
        featureType="timeseries",
        date_created=datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"),
        project="Store Lungen",
        time_coverage_start=np.datetime_as_string(ds.time.min().values, unit="s", timezone="UTC"),
        time_coverage_end=np.datetime_as_string(ds.time.max().values, unit="s", timezone="UTC"),
        geospatial_lat_min=float(ds.latitude.min()),
        geospatial_lat_max=float(ds.latitude.max()),
        geospatial_lon_min=float(ds.longitude.min()),
        geospatial_lon_max=float(ds.longitude.max()),
        spatial_representation="point",
        creator_type="institution",
        creator_institution="Badebussen",
        institution="Badebussen",
        institution_short_name="BB",
        creator_email="badebussen@lungen.bb",
        creator_url="https://badebussen.bb",
        data_owner="Badebussen",
        processing_level="Operational",
        Conventions="CF-1.7, ACDD-1.3",
        publisher_name="badebussen",
        publisher_email="publisher@badebussen.bb",
        publisher_url="https://badebussen.bb",
        license="http://spdx.org/licenses/CC-BY-4.0(CC-BY-4.0)",
        history="Created on jupyterhub",
    )
)
ds
Out[ ]:
<xarray.Dataset>
Dimensions:       (time: 5)
Coordinates:
    longitude     float64 5.344
    latitude      float64 60.38
  * time          (time) datetime64[ns] 2024-01-17T14:04:20 ... 2024-01-21T14...
Data variables:
    temperature   (time) float64 4.0 nan 8.0 22.0 -1.0
    turbidity     (time) float64 nan 23.8 2.5 32.2 4.1
    station_name  <U12 'store_lungen'
Attributes: (12/33)
    id:                      e5d54ede-685d-4951-917b-25157ce67314
    naming_authority:        bb.badebussen
    title:                   Measurements in the middle of Store Lungegårdsvann
    title_no:                Målinger midt i Store Lungegårdsvann
    summary:                 Measurements taken at a fixed point in Store Lun...
    summary_no:              Målinger tatt på eit fast punkt under min daglig...
    ...                      ...
    Conventions:             CF-1.7, ACDD-1.3
    publisher_name:          badebussen
    publisher_email:         publisher@badebussen.bb
    publisher_url:           https://badebussen.bb
    license:                 http://spdx.org/licenses/CC-BY-4.0(CC-BY-4.0)
    history:                 Created on jupyterhub
Store the dataset¶
You can specify the encoding as a dictionary. CF does not allow a _FillValue on coordinate variables, and some programs do not handle int64, so time is written as int32.
In [ ]:
ds.to_netcdf(
    "badebussen.nc",
    unlimited_dims=["time"],
    encoding=dict(
        time={"dtype": "int32", "_FillValue": None, "units": "seconds since 1970-01-01 00:00:00"},
        longitude={"_FillValue": None},
        latitude={"_FillValue": None},
    ),
)