# Segmentation datasets

`SegmentationDataset` and `SegmentationObject` are implemented in `segmentation.py` (`syconn.reps`).
They are accompanied by helper functions in `segmentation_helper.py` and `rep_helper.py` (`syconn.reps`) for
basic functionality such as loading and storing, and by `sd_proc.py` (`syconn.proc`) for intensive processing that
is usually parallelized.

Typically, the voxel storage of a `SegmentationDataset` is created first (e.g. by the object extraction).
Please check the corresponding documentation to learn more about that.

On a fundamental level, each `SegmentationObject` owns voxels, attributes, a skeleton and a mesh, which
are stored in different dictionaries (`VoxelDict`, `AttributeDict`, `SkeletonDict`, `MeshDict`; see section 'Backend').
Each dictionary holds the associated data of many objects and compresses the data of each object individually for
efficient storage. The number of dictionaries per data type can be defined with `n_folders_fs` (only powers of 10).
Note that when a `SegmentationDataset` is created in the usual way, the same value has to be passed to the object extraction as well.
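To illustrate why a power of 10 is a convenient choice, here is a minimal sketch of one possible way to assign an object ID to one of `n_folders_fs` storage dictionaries by its trailing decimal digits. The function `storage_folder_for_id` is hypothetical and is not taken from SyConn; the actual mapping may differ.

```python
import math


def storage_folder_for_id(obj_id: int, n_folders_fs: int) -> str:
    """Hypothetical helper (not SyConn API): map an ID to a folder name."""
    if n_folders_fs < 1:
        raise ValueError("n_folders_fs must be a positive power of 10")
    n_digits = round(math.log10(n_folders_fs))
    if 10 ** n_digits != n_folders_fs:
        raise ValueError("n_folders_fs must be a power of 10")
    # The last n_digits decimal digits select one of n_folders_fs buckets.
    bucket = obj_id % n_folders_fs
    return str(bucket).zfill(n_digits)


# Three example IDs distributed over 100 folders.
folders = [storage_folder_for_id(i, 100) for i in (7, 123, 4500)]
```

With a power of 10, the folder name can be read directly off the end of the ID, which keeps the mapping easy to inspect by hand.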

## Initialization

To load (or create) a `SegmentationDataset`, at least the `obj_type` has to be defined. Defaults exist for
other parameters such as `version` and `working_dir`. These are stored in `config.ini` (e.g. `version`) in
the `working_dir` and project-wide in `config.global_params` (e.g. `working_dir`).

```
from syconn.reps.segmentation import SegmentationDataset
sd_cell_sv = SegmentationDataset("sv", working_dir="path/to/wd")
```

It is useful to run `sd_proc.dataset_analysis(...)` when loading a `SegmentationDataset` the first time 
(after writing its voxel storage) or after making changes to the attributes. `dataset_analysis` creates global `numpy`
arrays for fast access for each attribute and calculates some attributes itself (such as `size` and `bounding box`). This can 
be viewed as a distributed column store of the underlying database.

```
from syconn.proc import sd_proc
sd_proc.dataset_analysis(sd_cell_sv)
```

When running `dataset_analysis`, one can include only a subset of the attributes to avoid problems with inconsistent
entries (see below). Like most functions, `dataset_analysis` can run either on a single shared-memory system or on
a distributed cluster using `qsub`.
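The column-store idea can be sketched with toy data. The following is an illustration only, not the `dataset_analysis` implementation: `build_columns` and `toy_attrs` are hypothetical names, and plain Python lists stand in for the `numpy` arrays. It shows how per-object attribute dictionaries become aligned columns, and why restricting the attribute subset sidesteps entries that are missing for some objects.

```python
def build_columns(attr_dicts, attr_keys):
    """Collect selected attributes of all objects into aligned columns.

    attr_dicts: mapping of object ID -> attribute dictionary (toy data).
    attr_keys: the subset of attributes to include.
    """
    ids = sorted(attr_dicts)
    columns = {"id": ids}
    for key in attr_keys:
        # Every column follows the order of `ids`; a missing entry raises KeyError.
        columns[key] = [attr_dicts[obj_id][key] for obj_id in ids]
    return columns


toy_attrs = {
    3: {"size": 120, "label": "a"},
    1: {"size": 45},  # "label" is missing for this object
    2: {"size": 300, "label": "b"},
}

# Restricting the analysis to "size" avoids the inconsistent "label" entries.
columns = build_columns(toy_attrs, ["size"])
```

Requesting `"label"` in this toy setup fails with a `KeyError` because one object lacks that entry.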

It is also recommended to preprocess the meshes of the `SegmentationObject`s.
See `mesh_proc_chunked` in `syconn/proc/sd_proc.py`.

## Usage

If `sd_proc.dataset_analysis(...)` has been applied, the `SegmentationDataset` can access the values of an attribute of all objects
as an array. For instance, the attribute `size` can be accessed via

```
sizes = sd_cell_sv.load_numpy_data("size")
```

Some attributes, such as `size` and `id`, are also available as object attributes (e.g. `sd_cell_sv.sizes`). Values in 
different attribute arrays are always sorted in the same way. Hence, one can use the id array (`sd_cell_sv.ids`) as a reference.
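Because all attribute arrays share the same ordering, the id array can serve as an index into any other array. A minimal sketch with invented values, where the lists `ids` and `sizes` stand in for `sd_cell_sv.ids` and `sd_cell_sv.sizes`:

```python
# Toy stand-ins for sd_cell_sv.ids and sd_cell_sv.sizes (values are invented).
ids = [11, 12, 13, 14]
sizes = [500, 20, 1300, 75]

# Position i refers to the same object in every attribute array.
index_of = {obj_id: i for i, obj_id in enumerate(ids)}
size_of_13 = sizes[index_of[13]]

# Select the IDs of all objects with at least `min_size` voxels.
min_size = 100
large_ids = [obj_id for obj_id, size in zip(ids, sizes) if size >= min_size]
```

With real `numpy` arrays, the same selection would typically be written as a boolean mask applied to the id array.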

A `SegmentationDataset` allows easy access to its `SegmentationObjects` by

```
cell_sv_obj = sd_cell_sv.get_segmentation_object(obj_id)
```

Each `SegmentationObject` has four associated data structures: voxels (`VoxelStorage`), attributes
(`AttributeDict`), meshes (`MeshStorage`) and skeletons (`SkeletonStorage`).
Typically, every `SegmentationObject` owns the first three, while only supervoxels (`sv`) have a skeleton.
Voxels, meshes and skeletons are predefined data types, whereas attributes are an arbitrary key-value store.
It is advisable, however, to be consistent in the type and naming of attributes across the `SegmentationDataset`
to avoid problems with the aforementioned numpy arrays.

The different data structures can be accessed by e.g.

```
voxels = cell_sv_obj.voxels
mesh = cell_sv_obj.mesh
skeleton = cell_sv_obj.skeleton
attr_value = cell_sv_obj.lookup_in_attribute_dict("attr_key")
```

The attribute dict can also be accessed as a whole

```
cell_sv_obj.load_attr_dict()
attr_dict = cell_sv_obj.attr_dict
```

`SegmentationObject`s cache data that has been accessed. This can be disabled by

```
cell_sv_obj.mesh_caching = False
cell_sv_obj.voxel_caching = False
cell_sv_obj.skeleton_caching = False
```

and the cache can be cleared by 

```
cell_sv_obj.clear_cache()
```
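The effect of the caching flags can be sketched with a small stand-in class. `CachedObject` and `fake_loader` are hypothetical and only mimic the behaviour described above; this is not how `SegmentationObject` is implemented.

```python
class CachedObject:
    """Hypothetical stand-in for an object that loads its mesh lazily."""

    def __init__(self, loader):
        self._loader = loader  # callable that performs the expensive load
        self._mesh = None
        self.mesh_caching = True

    @property
    def mesh(self):
        if self._mesh is not None:
            return self._mesh  # served from the cache
        mesh = self._loader()
        if self.mesh_caching:
            self._mesh = mesh  # keep the result for later accesses
        return mesh

    def clear_cache(self):
        self._mesh = None


load_calls = []


def fake_loader():
    """Pretend to read a mesh from storage and record the call."""
    load_calls.append(1)
    return [0.0, 1.0, 2.0]


obj = CachedObject(fake_loader)
_ = obj.mesh
_ = obj.mesh  # second access does not trigger another load
calls_cached = len(load_calls)

obj.clear_cache()
obj.mesh_caching = False
_ = obj.mesh
_ = obj.mesh  # every access loads again while caching is disabled
calls_uncached = len(load_calls) - calls_cached
```

Disabling caching trades repeated loading for lower memory use, which can matter when iterating over many objects.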