# Recompute
## Where

- `spyglass.spikesorting.v0.spikesorting_recording.SpikeSortingRecording`
- `spyglass.spikesorting.v1.recording.SpikeSortingRecording`
## Why
Some analysis files generated by Spyglass are very unlikely to be reaccessed.
Those generated by `SpikeSortingRecording` tables were identified as taking up
tens of terabytes of space while being very seldom accessed after their first
generation. By finding a way to recompute these files on demand, we can save
significant server space at the cost of roughly ten minutes of recompute time
per file in the unlikely event a file is needed again.
Spyglass 0.5.5 introduces the ability to delete and later recompute files, both
those newly generated after this release and old files generated before it.
More practically, this feature allows...

- User A to run a computation, generate the associated file(s), and interact
  with them (e.g., `fetch_nwb`) using the IDs stored in the tables.
- User B to run the recompute pipeline and delete the file(s) to free up space
  on disk.
- User A to attempt the same interactions as above. If the relevant file is not
  found...
    - Spyglass will check for copies on Kachery and DANDI.
    - Spyglass will attempt to recompute the file(s) using the `_make_file`
      method and check that the contents match the expected hash.
    - If the hash matches, the file(s) will be saved to the appropriate
      location and the relevant tables updated. If the hash does not match, the
      file(s) will be deleted and an error raised.
To prevent unexpected hash mismatches, we store a record of the dependencies
used to generate the file(s) in a new `UserEnvironment` table. Only files that
have been successfully recomputed should be deleted via
`TableRecompute.delete_files` (see below).
## How
Common methods like `fetch_nwb` will now check for the existence of the
relevant file(s) and, if missing, will attempt to recompute them via the
`_make_file` method. For current and future cases, this is a table method that
generates the requisite files, checks their hash, and saves them if the hash
matches the expected value. If the hash does not match, the file(s) are deleted
and an error is raised.
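As a minimal sketch of this behavior (using a free `restriction` variable, as
in the examples below), fetching from an entry whose file has been deleted
transparently triggers the restore-or-recompute logic:

```python
from spyglass.spikesorting.v1.recording import SpikeSortingRecording

# Select an entry whose analysis file may have been removed from disk
key = (SpikeSortingRecording() & restriction).fetch1("KEY")

# If the file is missing, Spyglass checks Kachery/DANDI, then attempts
# _make_file; a hash mismatch raises an error instead of returning data
nwb_data = (SpikeSortingRecording() & key).fetch_nwb()
```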
To generate and maintain records of these hashes and the dependencies required,
we've added downstream recompute tables for each place where this feature is
implemented.
### New files
NOTE: This feature relies on all users managing their environments with conda.
If any user is not using conda, you may need to treat their generated files as
if they were old files, replicating them after the fact.
Newly generated files will automatically record information about their
dependencies and the code that generated them in the `UserEnvironment` and
`RecomputeSelection` tables. To see the dependencies of a file, you can access
`RecordingRecomputeSelection`. The only requirement for setting up this feature
is modifying the existing table structure to include the new fields.
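As a hedged sketch of inspecting those dependencies (assuming the selection
table stores the `env_id` that links it to `UserEnvironment`):

```python
from spyglass.common import UserEnvironment
from spyglass.spikesorting.v1.recompute import RecordingRecomputeSelection

# env_id is assumed to link a selection entry to its recorded environment
env_id = (RecordingRecomputeSelection() & restriction).fetch1("env_id")

# Full conda/pip dependency record for that environment
UserEnvironment() & {"env_id": env_id}
```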
### Old files
Retroactively demonstrating that files can be recomputed requires some
additional work: testing various dependencies and recording the results. To
ensure the replicability of old files prior to deletion, we'll need to...

- Update the tables with the new fields, as shown above.
- Attempt file recompute, and record dependency info for successful attempts.
- Investigate failed attempts and make environment changes as necessary.
- Delete files that were successfully recomputed.
Changing the spikeinterface version to comply with an older version of spyglass
may trigger an import error in `v0.spikesorting_curation`. There is a built-in
bypass for users listed as admin in `common.LabMember`.
#### Attempting recompute
`RecomputeSelection` tables will record information about the dependencies of
the environments used to attempt recompute. These attempts are run using the
`populate` method of the respective `Recompute` table.
```python
from spyglass.spikesorting.v0 import spikesorting_recording as v0_recording
from spyglass.spikesorting.v0 import spikesorting_recompute as v0_recompute

# Alter tables to include new fields, updating values
my_keys = (v0_recording.SpikeSortingRecording() & restriction).fetch("KEY")
v0_recompute.RecordingRecomputeVersions().populate(my_keys)
v0_recompute.RecordingRecomputeSelection().insert(my_keys)
v0_recompute.RecordingRecompute().populate()
```
```python
from spyglass.spikesorting.v1 import recording as v1_recording
from spyglass.spikesorting.v1 import recompute as v1_recompute

# Alter tables to include new fields, updating values
my_keys = (v1_recording.SpikeSortingRecording() & restriction).fetch("KEY")
v1_recompute.RecordingRecomputeVersions().populate(my_keys)
v1_recompute.RecordingRecomputeSelection().insert(my_keys)  # (1)!
v1_recompute.RecordingRecompute().populate(my_keys)
```
1. Optionally, you can set your preferred precision for recompute when
   inserting into `RecordingRecomputeSelection`. The default is 4.
The respective `Versions` tables will record the dependencies of existing files
(i.e., spikeinterface and probeinterface versions for v0, and pynwb
dependencies for v1). By default, these insert methods will generate an entry
in `common.UserEnvironment`. This table stores all conda and pip dependencies
for the environment used and associates them with a unique `env_id` identifier
(`{USER}_{CONDA_ENV_NAME}_{##}`) that increments if the environment is updated.
You can customize the `env_id` by inserting an entry prior to recompute.
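A hypothetical sketch of such a customization (`insert1` is standard DataJoint;
any fields beyond `env_id` depend on the table definition and are omitted
here):

```python
from spyglass.common import UserEnvironment

# Hypothetical: pre-register a custom env_id so subsequent recompute
# inserts associate files with it; other required fields are omitted
UserEnvironment().insert1({"env_id": "myuser_myenv_01"}, skip_duplicates=True)
```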
Newly generated recompute files are stored in a subdirectory according to this
`env_id`: `{SPYGLASS_TEMP_DIR}/{recompute schema name}/{env_id}`.
#### Investigating failed attempts
Recompute tables are set up to have `Name` and `Hash` part tables that track...

- `Name`: The name of any file or object that was missing from the old file
  (i.e., the original) or the new file (i.e., the recompute attempt).
- `Hash`: The names of files or objects for which the hash did not match
  between the original and the recompute attempt.
```python
from spyglass.spikesorting.v1.recompute import RecordingRecompute

# Part tables listing missing and hash-mismatched files/objects
RecordingRecompute().Name()
RecordingRecompute().Hash()

# Fetch the key of a mismatched entry, then retrieve both objects
hash_key = (RecordingRecompute().Hash() & restriction).fetch1("KEY")
old_nwb_obj, new_nwb_obj = RecordingRecompute().Hash().get_objs(hash_key)

# Recursively print differences between original and recomputed objects
RecordingRecompute().Hash().compare(hash_key)
```
The `compare` methods will print out the differences between the two objects or
files, recursively checking for differences in nested objects. Output is
truncated for large objects, and the comparison may fail for objects that
cannot be read as JSON.
To expand the functionality of this feature, please either post a GitHub issue
or make a pull request with edits to
`spyglass.utils.h5_helper_fn.H5pyComparator`.
With this information, you can make changes to the environment and try another
recompute attempt. For the best record-keeping of attempts, we recommend
cloning your conda environment and making changes to the clone. This allows you
to attempt recomputes with different dependencies without affecting the
original environment. In the event of a successful recompute, you can update
your base environment to match the clone.
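A sketch of that workflow with conda (environment and package names here are
placeholders):

```bash
# Clone the current environment, then make changes only to the clone
conda create --name spyglass_recompute_test --clone spyglass
conda activate spyglass_recompute_test

# Example dependency change to test; the pin below is a placeholder
pip install "spikeinterface==0.99.1"
```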
#### Deleting files
Each recompute table has a `delete_files` method. For files with a successful
recompute attempt, this method will delete both the original and the recomputed
files.
These methods have...

- A `restriction` argument to delete only the files whose table entries match
  the restriction.
- A `dry_run` argument that will print the files that would be deleted without
  actually deleting them.
- A confirmation prompt before deleting files.
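A hedged usage sketch, assuming the restriction is passed as the first
argument:

```python
from spyglass.spikesorting.v1 import recompute as v1_recompute

# Preview which files would be deleted; nothing is removed
v1_recompute.RecordingRecompute().delete_files(restriction, dry_run=True)

# Delete both originals and recomputed copies, after a confirmation prompt
v1_recompute.RecordingRecompute().delete_files(restriction, dry_run=False)
```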
### Folders vs. NWB files
The implementation for tables that generate folders (e.g.,
`v0.SpikeSortingRecording`) differs from those that generate NWB files (e.g.,
`v1.SpikeSortingRecording`). NWB files record some information about their
dependencies and the code that generated them, which we can use to prevent
recompute attempts that would mismatch. See the `RecomputeVersions` tables for
more information.

NWB files also store data objects in a more structured way, which allows us to
make decisions about the degree of precision required for recompute attempts.
## Managing Environments
To manage environments, we've added a `UserEnvironment` table that stores all
conda and pip dependencies for the environment used in a recompute attempt. To
make a comparison between the current environment and a previous environment...
```python
from spyglass.common import UserEnvironment

UserEnvironment().has_matching_env(
    env_id="previous_env_id",
    relevant_deps=["pynwb", "spikeinterface"],  # (1)!
    show_diffs=True,
)
```
1. `relevant_deps` is an optional argument that filters the comparison to only
   the dependencies listed. If the comparison fails, the method will print out
   the mismatching versions and/or missing dependencies.
To install an environment from a previous recompute attempt, first save a yaml
file of the environment, `previous_env_id.yaml`, with ...
```python
from spyglass.common import UserEnvironment

UserEnvironment().write_env_yaml(
    env_id="previous_env_id",
    dest_path="/your/path/",  # Optional. Otherwise, uses the current directory
)
```
Then, create a new environment with ...
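For example, with conda (the environment name defaults to the one recorded in
the yaml file):

```bash
# Recreate the environment from the exported yaml
conda env create -f previous_env_id.yaml
```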
## Misc Notes
`match` and `precision` are MySQL keywords, so we use `matched` and `rounding`,
respectively.