# Merge Tables
## Why

A pipeline may diverge when we want to process the same data in different ways. Merge Tables allow us to join divergent pipelines together and unify downstream processing steps. For a more in-depth discussion, please refer to this notebook and the related discussions here and here.
## What

A Merge Table is fundamentally a master table with one part for each divergent pipeline. By convention...

- The master table has one primary key, `merge_id`, a UUID, and one secondary attribute, `source`, which gives the part table name. Both are managed with the custom `insert` function of this class.
- Each part table inherits the final table in its respective pipeline, and shares the same name as this table.
```python
import datajoint as dj

from spyglass.utils.dj_merge_tables import _Merge

# Assumes `schema` and the upstream tables `One` and `Two` are defined elsewhere.


@schema
class MergeOutput(_Merge):
    definition = """
    merge_id: uuid
    ---
    source: varchar(32)
    """

    class One(dj.Part):
        definition = """
        -> master
        ---
        -> One
        """

    class Two(dj.Part):
        definition = """
        -> master
        ---
        -> Two
        """
```
By convention, Merge Tables have been named with the pipeline name plus `Output` (e.g., `LFPOutput`, `PositionOutput`). Using the underscore alias for this class allows us to circumvent a DataJoint protection that interprets the class as a table itself.
## How

### Merging
The Merge class in Spyglass's utils is a subclass of DataJoint's Manual Table and adds functions to make the awkwardness of part tables more manageable. These functions are described in the API section, under `utils.dj_merge_tables`.
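For instance, a minimal sketch of adding an entry, assuming the upstream table `One` above is populated and keyed by a hypothetical `one_id` attribute; the custom `insert` described earlier matches the key to the appropriate part and manages `merge_id` and `source`:

```python
# Hypothetical key into the upstream table `One`; the attribute name
# `one_id` is an assumption for illustration.
key = (One & {"one_id": 1}).fetch1("KEY")

# The custom `insert` determines which part table the key belongs to and
# fills in `merge_id` and `source` automatically.
MergeOutput.insert([key])
```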
### Restricting
In short: restrict Merge Tables with arguments, not the `&` operator.

- Normally: `Table & "field='value'"`
- Instead: `MergeTable.merge_view(restriction="field='value'")`
Caution. The `&` operator may look like it's working when using a `dict`, but this is because invalid keys will be ignored. `Master & {'part_field':'value'}` is equivalent to `Master` alone (source).
When provided as arguments, methods like `merge_get_part` and `merge_get_parent` will override the permissive treatment of mappings described above to only return relevant tables.
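A short sketch of the argument-based style, where `field` stands in for a real attribute of one part's parent table:

```python
# Restrict the unified view with an argument rather than the `&` operator.
MergeOutput.merge_view(restriction="field='value'")

# These return only the part (or parent) table(s) for which the restriction
# is valid, rather than silently ignoring unknown keys.
part = MergeOutput.merge_get_part(restriction={"field": "value"})
parent = MergeOutput.merge_get_parent(restriction={"field": "value"})
```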
### Building Downstream

A downstream analysis will ideally be able to use all divergent pipelines interchangeably. If there are parameters that may be required for downstream processing, they should be included in the final table of the pipeline. In the example above, both `One` and `Two` might have a secondary key `params`. A downstream Computed table could do the following:
```python
from datajoint.errors import DataJointError


def make(self, key):
    # Fetch params from whichever parent pipeline produced this entry,
    # falling back to defaults if the parent lacks a `params` attribute.
    try:
        params = MergeTable.merge_get_parent(restriction=key).fetch("params")
    except DataJointError:
        params = default_params
    processed_data = self.processing_func(key, params)
```

Note that the `try`/`except` above catches the error raised in the event that `params` is not present in the parent.
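In context, such a method would live on a table that depends directly on the merge table, so each computed entry is tied to a single `merge_id`. The class below is a hedged sketch; the table name, `result` attribute, and `processing_func` are illustrative assumptions:

```python
@schema
class DownstreamAnalysis(dj.Computed):
    definition = """
    -> MergeOutput
    ---
    result: longblob  # output of the hypothetical processing_func
    """

    def make(self, key):
        # Parameter lookup via the try/except pattern shown above.
        try:
            params = MergeOutput.merge_get_parent(restriction=key).fetch("params")
        except DataJointError:
            params = default_params
        self.insert1({**key, "result": self.processing_func(key, params)})
```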
## Example
For example usage, see our Merge Table notebook.