Database Reference
xarray_mongodb stores data on MongoDB in a format that is agnostic to Python; this allows writing clients in different languages.
Like with GridFS, data is
split across two collections, <prefix>.meta and <prefix>.chunks. By
default, these are xarray.meta and xarray.chunks.
Note
At the moment of writing, support for sparse arrays has not been implemented yet.
xarray.meta
The <prefix>.meta collection contains one document per
xarray.Dataset or xarray.DataArray object, formatted as
follows:
{
'_id': bson.ObjectId(...),
'attrs': bson.SON(...) (optional),
'chunkSize': 261120,
'coords': bson.SON(...),
'data_vars': bson.SON(...),
'name': '<str>' (optional),
}
Where:
_idis the unique ID of the xarray objectattrs,coords, anddata_varsarebson.SONobjects with the same order as the dictionaries in the xarray object (note how dicts preserve insertion order starting from Python 3.6).attrsare theDataset.attrsorDataArray.attrs, in native MongoDB format. Python object types that are not recognized by PyMongo are not supported. Omit when no attrs are available.chunkSize is the number of bytes stored at most in each document in the
<prefix>.chunkscollection. This is not to be confused with dask chunk size; for each dask chunk there are one or more MongoDB documents in the<prefix>.chunkscollection (see later).nameis theDataArray.name; omit for unnamed arrays and Datasets.coordsanddata_varscontain one key/value pair for everyxarray.Variable, where the key is the variable name and the value is a dict defined as follows:{ 'chunks': [[2, 2], [2, 2]], 'dims': ['x'], 'dtype': '<i8', 'shape': [4, 4], 'type': <'ndarray'|'COO'>, 'attrs': bson.SON(...) (optional), 'units': <str> (optional), # For ndarray only; omit in case of sparse.COO 'data': <bytes> (optional), # For sparse.COO only; omit in case of ndarray 'fill_value': <bytes> (optional), 'sparse_data': <bytes> (optional), 'sparse_coords': <bytes> (optional), 'nnz': <int> (optional), }
chunksare the dask chunk sizes at the moment of storing the array, or None if the variable was not backed by dask at the moment of storing the object.dimsare the names of the variable dimensionsdtypeis the dtype of the numpy/dask variable, always in string formatshapeis the overall shape of the numpy/dask arraytypeis the backend array type;ndarrayfor dense objects andCOOforsparse.COOobjects.attrsare the variable attributes, if anyunitsis the string representation ofpint.Unit, e.g.kg * m /s ** 2. The exact meaning of each symbol is deliberately omitted here and remitted to pint (or whatever other engine is used to handle units of measures). Omit for unit-less objects.datacontains the raw numpy buffer of the variable in the metadata document. It is meant to be used for small variables only. The buffer is in row-major (C) order and little endian encoding. Ifdatais defined,typemust be set to ndarray,chunksmust always be None, and there must not be any documents for the variable in the<prefix>.chunkscollection.fill_valueis the default value of a sparse array. It is a bytes buffer in little endian encoding of as many bytes as implied by dtype. This format allows encoding dtypes that are not native to MongoDB, e.g. complex numbers. Never present when type=ndarray.sparse_data,sparse_coordsandnnzstore embedded sparse arrays. See sparse_arrays.
xarray.DataArray objects are identifiable by having exactly one variable in
data_vars, conventionally named __DataArray__. Note how DataArray.attrs are
the same as the attributes of its data variable; in xarray_mongodb they are only stored
in the top-level attrs key (there is never a data_vars.__DataArray__.attrs key).
Note
When dealing with dask variables, shape and/or chunks may contain NaN instead
of integer sizes when the variable size is unknown at the moment of graph definition.
Also, dtype, type, and fill_value may potentially be wrong in the
meta document and may be overridden by the chunks documents (see below).
xarray.chunks
The <prefix>.chunks collection contains the numpy data underlying the
array. There is a N:1 relationship between the chunks and the meta documents.
Each document is formatted as follows:
{
'_id': bson.ObjectId(...),
'meta_id': bson.ObjectId(...),
'name': 'variable name',
'chunk': [0, 0],
'dtype': '<i8',
'shape': [1, 2]},
'n': 0,
'type': <'ndarray'|'COO'>,
# For ndarray only; omit in case of sparse.COO
'data': <bytes>,
# For sparse.COO only; omit in case of ndarray
'sparse_data': <bytes>,
'sparse_coords': <bytes>',
'nnz': <int>,
'fill_value': <bytes>,
}
Where:
meta_idis the Object Id of the<prefix>.metacollectionnameis the variable name, matching one defined in<prefix>.metachunkis the dask chunk ID, or None for variables that were not backed by dask at the moment of storing the objectdtypeis the numpy dtype. It may be mismatched with, and overrides, the one defined in themetacollection.shapeis the size of the current chunk. Unlike theshapeandchunksvariables defined in<prefix>.meta, it is never NaN.nis the sequential document counter for the current variable and chunk (see below)typeis the raw array type;ndarrayfor dense arrays;COOfor sparse ones. It may be mismatched with, and overrides, the one defined in themetacollection.datais the raw numpy buffer, in row-major (C) order and little endian encoding.
Since numpy arrays and dask chunks can be larger than the maximum size a MongoDB
document can hold (typically 16MB), each numpy array or dask chunk may be split across
multiple documents, much like it happens in GridFS.
If the number of bytes in data would be larger than chunkSize, then it is split
across multiple documents, with n=0, n=1, … etc. The split happens after converting
the numpy array into a raw bytes buffer, and may result in having numpy points split
across different documents if chunkSize is not an exact multiple of the
dtype size.
Note
It is possible for all variables to be embedded into the metadata. In such a case, there won’t be any documents in the chunks collection.
Sparse arrays
Sparse arrays (constructed using the Python class sparse.COO) differ from
dense arrays as follows:
In
xarray.meta,The
typefield has valueCOOExtra field
fill_valuecontains the value for all cells that are not explicitly listed. It is a raw binary blob in little endian encoding containing exactly one element of the indicated dtype.
In
xarray.chunks,The
typefield has valueCOOExtra field
fill_valuecontains the value for all cells that are not explicitly listedExtra field
nnzis a non-negative integer (possibly zero) counting the number of cells that differ fromfill_value.There is no
datafield.The
sparse_datafield contains sparse values. It is a binary blob representing a one-dimensional numpy array of the indicated dtype with as many elements asnnz.The field
sparse_coordsis a binary blob representing a two-dimensional numpy array, with as many rows as the number of dimensions (seeshape) and as many columns asnnz. It always contains unsigned integers in little endian format, regardless of the declared dtype. The word length is:If max(shape) < 256, 1 byte
If 256 <= max(shape) < 2**16, 2 bytes
If 2**16 <= max(shape) < 2**32, 4 bytes
Otherwise, 8 bytes
Each column of
sparse_coordsindicates the coordinates of the matching value insparse_data.
See next section for examples.
When the total of the sparse_data and sparse_coords bytes exceeds chunkSize,
then the information is split across multiple documents, as follows:
Documents containing slices of
sparse_data; in all but the last one,sparse_coordsis a bytes object of size 0Documents containing slices of
sparse_coords; in all but the first one,sparse_datais a bytes object of size 0
Note
When nnz=0, both data and coords are bytes objects of size 0.
Examples
xarray object:
xarray.Dataset(
{"x": [[0, 1.1, 0],
[0, 0, 2.2]]
}
)
chunks document (dense):
{
'_id': bson.ObjectId(...),
'meta_id': bson.ObjectId(...),
'name': 'x',
'chunk': [0, 0],
'dtype': '<f8',
'shape': [2, 3],
'n': 0,
'type': 'ndarray',
'data': # 48 bytes buffer that contains [0, 1.1, 0, 0, 0, 2.2]
}
chunks document (sparse):
{
'_id': bson.ObjectId(...),
'meta_id': bson.ObjectId(...),
'name': 'x',
'chunk': [0, 0],
'dtype': '<f8',
'shape': [2, 3]},
'n': 0,
'type': 'COO',
'nnz': 2,
'fill_value': b'\x00\x00\x00\x00\x00\x00\x00\x00',
'sparse_data': # 16 bytes buffer that contains [1.1, 2.2]
'sparse_coords': # 4 bytes buffer that contains [[0, 1,
# [1, 2]]
}
Indexing
Documents in <prefix>.chunks are identifiable by a unique functional key
(meta_id, name, chunk, n). The driver automatically creates a non-unique index
(meta_id, name, chunk) on the collection. Indexing n is unnecessary as
all the segments for a chunk are always read back together.
Missing data
<prefix>.chunks may miss some or all of the documents needed to
reconstruct the xarray object. This typically happens when:
the user invokes
put(), but then does not compute the returned futuresome or all of the dask chunks fail to compute because of a fault at any point upstream in the dask graph
there is a fault in MongoDB, e.g. the database becomes unreachable between the moment
put()is invoked and the moment the future is computed, or if the disk becomes full.
The document in <prefix>.meta allows defining the
(meta_id, name, chunk) search key for all objects in <prefix>.chunks
and identify any missing documents. When a chunk is split across multiple
documents, one can figure out if the retrieved documents (n=0, n=1, …) are
the complete set:
for dense arrays (type=ndarray), the number of bytes in
datamust be the same as the productory ofshapemultiplied bydtype.size.for sparse arrays(type=COO), the number of bytes in
datapluscoordsmust be the same asnnz * (dtype.size + len(shape) * coords.dtype.size)wherecoords.dtype.sizeis either 1, 2, 4 or 8 depending onmax(shape)(see above).