Database Reference
xarray_mongodb stores data on MongoDB in a format that is agnostic to Python; this allows writing clients in different languages.

As in GridFS, data is split across two collections, <prefix>.meta and <prefix>.chunks. By default, these are xarray.meta and xarray.chunks.
Note
At the time of writing, support for units and sparse arrays has not been implemented yet.
xarray.meta
The <prefix>.meta collection contains one document per xarray.Dataset or xarray.DataArray object, formatted as follows:
{
    '_id': bson.ObjectId(...),
    'attrs': bson.SON(...),
    'chunkSize': 261120,
    'coords': bson.SON(...),
    'data_vars': bson.SON(...),
    'name': '<str>' (optional),
}
Where:

- _id is the unique ID of the xarray object.
- attrs, coords, and data_vars are bson.SON objects with the same order as the collections.OrderedDict objects in the xarray object.
- attrs are the Dataset.attrs, in native MongoDB format. Python object types that are not recognized by PyMongo are not supported.
- chunkSize is the maximum number of bytes stored in each document of the <prefix>.chunks collection. This is not to be confused with the dask chunk size; for each dask chunk there are one or more MongoDB documents in the <prefix>.chunks collection (see later).
- name is the DataArray.name; omit for unnamed arrays and Datasets.
- coords and data_vars contain one key/value pair for every xarray.Variable, where the key is the variable name and the value is a dict defined as follows:

{
    'chunks': [[2, 2], [2, 2]],
    'dims': ['x', 'y'],
    'dtype': '<i8',
    'shape': [4, 4],
    'type': <'ndarray'|'COO'>,
    'fill_value': <bytes> (optional),
    'units': '<str>' (optional),
}
- chunks are the dask chunk sizes at the moment of storing the array, or None if the variable was not backed by dask at the moment of storing the object.
- dims are the names of the variable's dimensions.
- dtype is the dtype of the numpy/dask variable, always in string format.
- shape is the overall shape of the numpy/dask array.
- type is the backend array type: ndarray for dense objects and COO for sparse.COO objects.
- fill_value is the default value of the sparse array. It is a bytes buffer in little-endian encoding of as many bytes as implied by the dtype. This format allows encoding dtypes that are not native to MongoDB, e.g. complex numbers. Omit when type=ndarray.
- units is the string representation of a pint.UnitRegistry().Unit, e.g. kg * m / s ** 2. The exact meaning of each symbol is deliberately omitted here and deferred to pint (or whatever other engine is used to handle units of measure). Omit for unit-less objects.
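To illustrate the fill_value encoding described above, a little-endian one-element buffer can be produced with the standard library's struct module. This is a minimal sketch, not the driver's actual implementation (which would likely use numpy's tobytes()); the function name encode_fill_value and the dtype table are hypothetical:

```python
import struct

def encode_fill_value(value, dtype: str) -> bytes:
    """Encode a scalar as a little-endian buffer of exactly one element
    of the given dtype string. Only a few dtypes are mapped for brevity.
    """
    # Map numpy dtype strings to struct format characters (little endian)
    fmts = {"<i4": "<i", "<i8": "<q", "<f4": "<f", "<f8": "<d"}
    return struct.pack(fmts[dtype], value)

encode_fill_value(0.0, "<f8")  # 8 zero bytes: float64 fill_value of 0.0
```

The same buffer, read back with struct.unpack, round-trips to the original scalar regardless of whether MongoDB has a native representation for the dtype.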
xarray.DataArray objects are identifiable by having exactly one variable in data_vars, conventionally named __DataArray__.
Note
When dealing with dask variables, shape and/or chunks may contain NaN instead of integer sizes when the variable size is unknown at the moment of graph definition. Also, dtype, type, fill_value, and units may potentially be wrong in the meta document and may be overridden by the chunks documents (see below).
xarray.chunks
The <prefix>.chunks collection contains the numpy data underlying the array. There is an N:1 relationship between the chunks and the meta documents. Each document is formatted as follows:
{
    '_id': bson.ObjectId(...),
    'meta_id': bson.ObjectId(...),
    'name': 'variable name',
    'chunk': [0, 0],
    'dtype': '<i8',
    'shape': [1, 2],
    'n': 0,
    'type': <'ndarray'|'COO'>,
    'data': <bytes>,

    # For COO only; omit in case of ndarray
    'coords': <bytes>,
    'nnz': <int>,
    'fill_value': <bytes>,

    # For pint only; omit for unit-less arrays
    'units': <str>,
}
Where:

- meta_id is the _id of the matching document in the <prefix>.meta collection.
- name is the variable name, matching the one defined in <prefix>.meta.
- chunk is the dask chunk ID, or None for variables that were not backed by dask at the moment of storing the object.
- dtype is the numpy dtype. It may be mismatched with, and overrides, the one defined in the meta collection.
- shape is the size of the current chunk. Unlike the shape and chunks variables defined in <prefix>.meta, it is never NaN.
- n is the sequential document counter for the current variable and chunk (see below).
- type is the raw array type: ndarray for dense arrays, COO for sparse ones. It may be mismatched with, and overrides, the one defined in the meta collection.
- data is the raw numpy buffer, in row-major (C) order and little-endian encoding.
- units is the string representation of a pint.UnitRegistry().Unit, e.g. kg * m / s ** 2. Omit for unit-less objects. It may be mismatched with, and overrides, the one defined in the meta collection.
Since numpy arrays and dask chunks can be larger than the maximum size a MongoDB document can hold (typically 16 MB), each numpy array or dask chunk may be split across multiple documents, much as in GridFS.

If the number of bytes in data would be larger than chunkSize, then it is split across multiple documents, with n=0, n=1, etc. The split happens after converting the numpy array into a raw bytes buffer, and may result in numpy points being split across different documents if chunkSize is not an exact multiple of the dtype size.
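The splitting rule above can be sketched in a few lines of plain Python. This is an illustration, not the driver's actual code; the function name split_buffer is hypothetical:

```python
def split_buffer(data: bytes, chunk_size: int) -> list[dict]:
    """Split a raw bytes buffer into documents of at most chunk_size
    bytes, numbered with a sequential counter n. Note that the split is
    byte-oriented: a single numpy point may straddle two documents when
    chunk_size is not an exact multiple of the dtype size.
    """
    if not data:
        # A zero-size buffer (e.g. a COO chunk with nnz=0) still
        # produces one document, with an empty payload.
        return [{"n": 0, "data": b""}]
    return [
        {"n": i, "data": data[offset : offset + chunk_size]}
        for i, offset in enumerate(range(0, len(data), chunk_size))
    ]

split_buffer(b"x" * 10, 4)  # three documents of 4 + 4 + 2 bytes
```

Concatenating the data fields of the documents in ascending order of n reconstructs the original buffer exactly.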
Sparse arrays
Sparse arrays (constructed using the Python class sparse.COO) differ from dense arrays as follows:
In xarray.meta:

- The type field has value COO.
- Extra field fill_value contains the value for all cells that are not explicitly listed. It is a raw binary blob in little-endian encoding containing exactly one element of the indicated dtype.

In xarray.chunks:

- The type field has value COO.
- Extra field fill_value contains the value for all cells that are not explicitly listed.
- Extra field nnz is a non-negative integer (possibly zero) counting the number of cells that differ from fill_value.
- The data field contains the sparse values. It is a one-dimensional array of the indicated dtype with as many elements as nnz.
- Extra field coords is a binary blob representing a two-dimensional numpy array, with as many rows as the number of dimensions (see shape) and as many columns as nnz. It always contains unsigned integers in little-endian format, regardless of the declared dtype. The word length is:
  - if max(shape) < 256, 1 byte
  - if 256 <= max(shape) < 2**16, 2 bytes
  - if 2**16 <= max(shape) < 2**32, 4 bytes
  - otherwise, 8 bytes
- Each column of coords indicates the coordinates of the matching value in data.
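The word-length rule for the coords buffer translates directly into code. A minimal sketch (the function name coords_itemsize is hypothetical):

```python
def coords_itemsize(shape) -> int:
    """Word length, in bytes, of the unsigned little-endian integers in
    the coords buffer, chosen from max(shape) as described above."""
    m = max(shape)
    if m < 256:
        return 1
    if m < 2**16:
        return 2
    if m < 2**32:
        return 4
    return 8

coords_itemsize([2, 3])  # 1: both dimensions fit in a single byte
```

The thresholds are on the dimension sizes, not on the indices actually present, so a reader can size the buffer from the chunk document's shape alone.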
See next section for examples.
When the total of the data and coords bytes exceeds chunkSize, the information is split across multiple documents, as follows:

- documents containing slices of data; in all but the last one, coords is a bytes object of size 0
- documents containing slices of coords; in all but the first one, data is a bytes object of size 0
Note
When nnz=0, both data and coords are bytes objects of size 0.
Examples
xarray object:
xarray.Dataset(
{"x": [[0, 1.1, 0],
[0, 0, 2.2]]
}
)
chunks document (dense):
{
    '_id': bson.ObjectId(...),
    'meta_id': bson.ObjectId(...),
    'name': 'x',
    'chunk': [0, 0],
    'dtype': '<f8',
    'shape': [2, 3],
    'n': 0,
    'type': 'ndarray',
    'data': <bytes>,  # 48-byte buffer containing [0, 1.1, 0, 0, 0, 2.2]
}
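The dense data buffer above can be reproduced with the standard library's struct module (the driver would normally use numpy, so treat this as an equivalent sketch): six float64 values in row-major order, little endian ('<f8' corresponds to the struct format "<d").

```python
import struct

# The six elements of the [2, 3] array, flattened in row-major (C) order
values = [0, 1.1, 0, 0, 0, 2.2]
data = struct.pack("<%dd" % len(values), *values)

len(data)                   # 48 bytes: 6 elements x 8 bytes each
struct.unpack("<6d", data)  # round-trips to the original values
```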
chunks document (sparse):
{
    '_id': bson.ObjectId(...),
    'meta_id': bson.ObjectId(...),
    'name': 'x',
    'chunk': [0, 0],
    'dtype': '<f8',
    'shape': [2, 3],
    'n': 0,
    'type': 'COO',
    'nnz': 2,
    'fill_value': b'\x00\x00\x00\x00\x00\x00\x00\x00',
    'data': <bytes>,    # 16-byte buffer containing [1.1, 2.2]
    'coords': <bytes>,  # 4-byte buffer containing [[0, 1],
                        #                           [1, 2]]
}
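For the sparse document above, max(shape) is 3, so coords uses 1-byte unsigned integers and the 2x2 coords matrix is the 4-byte buffer b'\x00\x01\x01\x02' in row-major order: value 1.1 sits at (0, 1) and 2.2 at (1, 2). A sketch of decoding it (the function name decode_coords is hypothetical, and only the 1-byte word length is handled):

```python
def decode_coords(buf: bytes, ndim: int, nnz: int) -> list[list[int]]:
    """Decode a coords buffer of 1-byte unsigned integers into an
    ndim x nnz matrix: one row per dimension, one column per value."""
    assert len(buf) == ndim * nnz
    # Row-major layout: row d occupies bytes [d * nnz, (d + 1) * nnz)
    return [list(buf[d * nnz : (d + 1) * nnz]) for d in range(ndim)]

decode_coords(b"\x00\x01\x01\x02", ndim=2, nnz=2)  # [[0, 1], [1, 2]]
```

Column i of the result gives the coordinates of element i of data: here column 0 is (0, 1) for 1.1 and column 1 is (1, 2) for 2.2.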
Indexing
Documents in <prefix>.chunks are identifiable by a unique functional key (meta_id, name, chunk, n). The driver automatically creates a non-unique index (meta_id, name, chunk) on the collection. Indexing n is unnecessary, as all the segments for a chunk are always read back together.
Missing data
<prefix>.chunks may miss some or all of the documents needed to reconstruct the xarray object. This typically happens when:

- the user invokes put(), but then does not compute the returned future
- some or all of the dask chunks fail to compute because of a fault at any point upstream in the dask graph
- there is a fault in MongoDB, e.g. the database becomes unreachable between the moment put() is invoked and the moment the future is computed, or the disk becomes full
The document in <prefix>.meta allows defining the (meta_id, name, chunk) search key for all objects in <prefix>.chunks and identifying any missing documents. When a chunk is split across multiple documents, one can figure out whether the retrieved documents (n=0, n=1, …) form the complete set:

- for dense arrays (type=ndarray), the number of bytes in data must be the same as the product of shape multiplied by dtype.size.
- for sparse arrays (type=COO), the number of bytes in data plus coords must be the same as nnz * (dtype.size + len(shape) * coords.dtype.size), where coords.dtype.size is either 1, 2, 4 or 8 depending on max(shape) (see above).
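The two completeness rules can be sketched as a single check. This is an illustration under assumed inputs, not the driver's API: the function name is_complete is hypothetical, and it expects the dtype item size and coords word length to have been resolved beforehand, with the per-document buffers already concatenated in ascending order of n:

```python
from math import prod

def is_complete(meta: dict, buffers: dict) -> bool:
    """Return True if the concatenated n=0, n=1, ... buffers for one
    chunk form a complete set.

    meta holds 'type', 'dtype_size', 'shape', and for COO also 'nnz'
    and 'coords_itemsize'; buffers holds the joined 'data' bytes
    (and, for COO, the joined 'coords' bytes).
    """
    if meta["type"] == "ndarray":
        # Dense: len(data) must equal prod(shape) * dtype.size
        expected = prod(meta["shape"]) * meta["dtype_size"]
        return len(buffers["data"]) == expected
    # COO: len(data) + len(coords) must equal
    # nnz * (dtype.size + len(shape) * coords.dtype.size)
    expected = meta["nnz"] * (
        meta["dtype_size"] + len(meta["shape"]) * meta["coords_itemsize"]
    )
    return len(buffers["data"]) + len(buffers["coords"]) == expected
```

With the dense example above (shape [2, 3], float64), a 48-byte data buffer passes the check; with the sparse example (nnz=2, 1-byte coords words), 16 bytes of data plus 4 bytes of coords passes.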