Developer notes: Pint and Sparse

Note

This page is for people contributing patches to the xarray_mongodb library itself.

If you only want to use Pint or Sparse, make sure you satisfy the dependencies (see Installation) and feed the data through! Also read the documentation of the ureg parameter when initialising XarrayMongoDB.

For how pint and sparse objects are stored in the database, see Database Reference.

What is NEP18, and how it impacts xarray_mongodb

Several “numpy-like” libraries support a duck-type interface, specified in NEP18, so that both numpy and other NEP18-compatible libraries can transparently wrap around them.
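As an illustration of the protocol, here is a minimal sketch of a duck array that implements the NEP18 hook, __array_function__. The Tagged class and its label attribute are hypothetical and exist only for this example; they are not part of xarray_mongodb or of any of the libraries discussed here. The sketch only unwraps top-level positional arguments, which is enough for simple calls such as np.sum.

```python
import numpy as np


class Tagged:
    """Hypothetical duck array: a numpy.ndarray plus a free-form label."""

    def __init__(self, values, label):
        self.values = np.asarray(values)
        self.label = label

    def __array_function__(self, func, types, args, kwargs):
        # numpy calls this hook instead of running func directly.
        # Unwrap top-level Tagged arguments, run the plain numpy function,
        # then wrap the result again so that the label is preserved.
        unwrapped = tuple(
            arg.values if isinstance(arg, Tagged) else arg for arg in args
        )
        return Tagged(func(*unwrapped, **kwargs), self.label)


t = Tagged([1.0, 2.0, 3.0], "demo")
out = np.sum(t)  # dispatched to Tagged.__array_function__
print(type(out).__name__, out.label, float(out.values))
```

The libraries listed in the next section follow the same idea, with far more complete implementations.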

xarray_mongodb does not use NEP18 itself. However, it explicitly supports several data types that NEP18 makes possible, namely:

An xarray.Variable can directly wrap:
- a numpy.ndarray, or
- a pint.Quantity, or
- a sparse.COO, or
- a dask.array.Array.
The wrapped object is accessible through the .data property.

Note

xarray.IndexVariable wraps a pandas.Index, but the .data property converts it on the fly to a numpy.ndarray.

A pint.Quantity can directly wrap:
- a numpy.ndarray, or
- a sparse.COO, or
- a dask.array.Array.

Note

Vanilla pint can also wrap int, float, and decimal.Decimal, but these are automatically converted to numpy.ndarray as soon as xarray wraps around the Quantity.

The wrapped object is accessible through the .magnitude property.

A dask.array.Array can directly wrap:
- a numpy.ndarray, or
- a sparse.COO.
The wrapped object cannot be accessed until the dask graph is computed; however, its metadata is visible without computing, through the ._meta property.

Note

dask wrapping pint, while theoretically possible due to how NEP18 works, is not supported.

A sparse.COO is always backed by two numpy.ndarray objects, .data and .coords.

Worst case

The most complicated use case that xarray_mongodb has to deal with is

an xarray.Variable, which wraps around
a pint.Quantity, which wraps around
a dask.array.Array, which wraps around
a sparse.COO, which is built on top of
two numpy.ndarray objects.

The order is always the one described above. Simpler use cases may remove any of the intermediate layers; at the top there is always an xarray.Variable, and at the bottom the data is always stored in numpy.ndarray objects.

Note

At the time of writing, the example below doesn’t work; see pint#878.

>>> import dask.array as da
>>> import numpy as np
>>> import pint
>>> import sparse
>>> import xarray
>>> ureg = pint.UnitRegistry()
>>> a = xarray.DataArray(
...     ureg.Quantity(
...         da.from_array(
...             sparse.COO.from_numpy(
...                 np.array([0, 0, 1.1])
...             )
...         ), "kg"
...     )
... )
>>> a
<xarray.DataArray (dim_0: 3)>
dask.array<array, shape=(3,), dtype=float64, chunksize=(3,), chunktype=pint.Quantity>
Dimensions without coordinates: dim_0
>>> a.data
<Quantity(<dask.array<array, shape=(3,), dtype=float64, chunksize=(3,),
           chunktype=COO>>, 'kilogram')>
>>> a.data.magnitude
dask.array<array, shape=(3,), dtype=float64, chunksize=(3,), chunktype=COO>
>>> a.data.units
<Unit('kilogram')>
>>> a.data.magnitude._meta
<COO: shape=(0,), dtype=float64, nnz=0, fill_value=0.0>
>>> a.data.magnitude.compute()
<COO: shape=(3,), dtype=float64, nnz=1, fill_value=0.0>
>>> a.data.magnitude.compute().data
array([1.1])
>>> a.data.magnitude.compute().coords
array([[2]])