
Recompute

Where

  • spyglass.spikesorting.v0.spikesorting_recording.SpikeSortingRecording
  • spyglass.spikesorting.v1.recording.SpikeSortingRecording

Why

Some analysis files generated by Spyglass are very unlikely to be reaccessed. Those generated by SpikeSortingRecording tables were identified as taking up tens of terabytes of space while being seldom accessed after their first generation. By finding a way to recompute these files on demand, we can save significant server space at the cost of roughly ten minutes of recompute time per file in the unlikely event a file is needed again.

Spyglass 0.5.5 introduces the ability to delete and later recompute both files generated after this release and older files generated before it.

More practically, this feature allows...

  1. User A to run a computation, generate associated file(s), and interact with them (e.g., fetch_nwb) using the IDs stored in the tables.
  2. User B to run the recompute pipeline and delete the file(s) to free up space on disk.
  3. User A to attempt the same interactions as above. If the relevant file is not found...
    1. Spyglass will check for copies on Kachery and Dandi.
    2. Spyglass will attempt to recompute the file(s) using the _make_file method and check that the contents match the expected hash.
  4. If the hash matches, the file(s) will be saved to the appropriate location and the relevant tables updated. If the hash does not match, the file(s) will be deleted and an error raised.

To prevent unexpected hash mismatches, we store a record of the dependencies used to generate the file(s) in a new UserEnvironment table. Only files that have been successfully recomputed should be deleted via TableRecompute.delete_files (see below).

How

Common methods like fetch_nwb will now check for the existence of the relevant file(s) and, if missing, will attempt to recompute them via the _make_file method. For current and future cases, this is a table method that generates the requisite files, checks their hash against the expected value, and saves them if the hashes match. If the hashes do not match, the file(s) will be deleted and an error raised.
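
For example, a downstream fetch might look like the following (a minimal sketch; the key value is a placeholder, and recompute only happens if the file is missing from disk and remote sources):

from spyglass.spikesorting.v1.recording import SpikeSortingRecording

# Placeholder: primary key of an existing SpikeSortingRecording entry
key = dict(recording_id="...")

# If the underlying analysis file is missing from disk, fetch_nwb will look
# for remote copies (Kachery/Dandi) and, failing that, recompute the file via
# _make_file and verify its hash before returning the data.
nwb_data = (SpikeSortingRecording & key).fetch_nwb()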

To generate and maintain records of the required hashes and dependencies, we've added downstream recompute tables for each place where this feature is implemented.

New files

NOTE: This feature relies on all users managing their environments with conda. If any user is not using conda, you may need to treat their generated files as if they were old files, replicating them after the fact.

Newly generated files will automatically record information about their dependencies and the code that generated them in the UserEnvironment and RecomputeSelection tables. To see the dependencies of a file, you can query RecordingRecomputeSelection (see the example after the code below). The only requirement for setting up this feature is modifying the existing table structure to include the new fields.

from spyglass.spikesorting.v0 import spikesorting_recording as v0_recording

# Alter tables to include new fields, updating values
v0_recording.SpikeSortingRecording().alter()
v0_recording.SpikeSortingRecording().update_ids()

from spyglass.spikesorting.v1 import recording as v1_recording

# Alter tables to include new fields, updating values
v1_recording.SpikeSortingRecording().alter()
v1_recording.SpikeSortingRecording().update_ids()
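
After this update, you can inspect the recorded dependency information for a given file by restricting the selection table (a minimal sketch; the restriction value is a placeholder):

from spyglass.spikesorting.v1 import recompute as v1_recompute

# Placeholder restriction identifying the file(s) of interest
my_restriction = dict(recording_id="...")

# View the dependency record(s) associated with the matching file(s)
v1_recompute.RecordingRecomputeSelection() & my_restriction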

Old files

Retroactively demonstrating that files can be recomputed requires some additional work in testing various dependencies and recording the results. To ensure the replicability of old files prior to deletion, we'll need to...

  1. Update the tables for new fields, as shown above.
  2. Attempt file recompute, and record dependency info for successful attempts.
  3. Investigate failed attempts and make environment changes as necessary.
  4. Delete files that successfully recomputed.

Changing your spikeinterface version to comply with an older environment may trigger an import error in v0.spikesorting_curation. There is a built-in bypass for users listed as admin in common.LabMember.

Attempting recompute

RecomputeSelection tables will record information about the dependencies of the environments used to attempt recompute. These attempts are run using the Recompute table populate method.

from spyglass.spikesorting.v0 import spikesorting_recording as v0_recording
from spyglass.spikesorting.v0 import spikesorting_recompute as v0_recompute

# Select keys to attempt, record their versions, then run recompute attempts
my_keys = (v0_recording.SpikeSortingRecording() & restriction).fetch("KEY")
v0_recompute.RecordingRecomputeVersions().populate(my_keys)
v0_recompute.RecordingRecomputeSelection().insert(my_keys)
v0_recompute.RecordingRecompute().populate()

from spyglass.spikesorting.v1 import recording as v1_recording
from spyglass.spikesorting.v1 import recompute as v1_recompute

# Select keys to attempt, record their versions, then run recompute attempts
my_keys = (v1_recording.SpikeSortingRecording() & restriction).fetch("KEY")
v1_recompute.RecordingRecomputeVersions().populate(my_keys)
v1_recompute.RecordingRecomputeSelection().insert(my_keys)  # (1)!
v1_recompute.RecordingRecompute().populate(my_keys)
  1. Optionally, you can set your preferred precision for recompute when inserting into RecordingRecomputeSelection. The default is 4.
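
For example, the precision could be attached to each key before insertion (a hypothetical sketch; the field is named rounding, as noted under Misc Notes, but the exact insert signature may differ):

# Hypothetical: request 8-decimal precision instead of the default 4
keys_with_rounding = [dict(key, rounding=8) for key in my_keys]
v1_recompute.RecordingRecomputeSelection().insert(keys_with_rounding)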

The respective Versions tables will record the dependencies of existing files (i.e., spikeinterface and probeinterface versions for v0 and pynwb dependencies for v1). By default, these insert methods will generate an entry in common.UserEnvironment. This table stores all conda and pip dependencies for the environment used, and associates them with a unique env_id identifier ({USER}_{CONDA_ENV_NAME}_{##}) that increments if the environment is updated. You can customize the env_id by inserting an entry prior to recompute.

from spyglass.common import UserEnvironment

UserEnvironment().insert_current_env("my_env_id")

Newly generated recompute files are stored in a subdirectory according to this env_id: {SPYGLASS_TEMP_DIR}/{recompute schema name}/{env_id}.

Investigating failed attempts

Recompute tables are set up to have Name and Hash part tables that track...

  1. Name: The name of any file or object that was missing from the old file (i.e., the original) or the new file (i.e., the recompute attempt).
  2. Hash: The names of files or objects for which the hash did not match between the original and recompute attempt.
from spyglass.spikesorting.v0.spikesorting_recompute import RecordingRecompute

RecordingRecompute().Name()
RecordingRecompute().Hash()
(RecordingRecompute().Hash() & key).compare()  # OR Hash().compare(key)

from spyglass.spikesorting.v1.recompute import RecordingRecompute

RecordingRecompute().Name()
RecordingRecompute().Hash()

# Retrieve the old and new objects for a given Hash entry
old_nwb_obj, new_nwb_obj = RecordingRecompute().Hash().get_objs(key)

# Or fetch a specific Hash key and print a diff of the mismatched objects
hash_key = (RecordingRecompute().Hash() & restriction).fetch1("KEY")
RecordingRecompute().Hash().compare(hash_key)

The compare methods will print out the differences between the two objects or files, recursively checking for differences in nested objects. This will truncate output for large objects and may fail for objects that cannot be read as JSON.

To expand the functionality of this feature, please either post a GitHub issue or make a pull request with edits to spyglass.utils.h5_helper_fn.H5pyComparator.

With this information, you can make changes to the environment to try another recompute attempt. For the best record keeping of attempts, we recommend cloning your conda environment and making changes to the clone. This will allow you to attempt recomputes with different dependencies without affecting the original environment. In the event of a successful recompute, you can update your base environment to match the clone.

conda create --name my_clone --clone my_default

Deleting files

Each recompute table has a delete_files method. For files with a successful recompute attempt, this method deletes both the original and the recomputed file.

from spyglass.spikesorting.v0.spikesorting_recompute import RecordingRecompute

# Inspect failed attempts before deciding what to delete
fails = RecordingRecompute() & "matched=0"
subset = fails & my_restriction
# Delete the original and recomputed files for entries matching my_restriction
RecordingRecompute().delete_files(my_restriction, dry_run=False)

from spyglass.spikesorting.v1.recompute import RecordingRecompute

# Inspect failed attempts before deciding what to delete
fails = RecordingRecompute() & "matched=0"
subset = fails & my_restriction
# Delete the original and recomputed files for entries matching my_restriction
RecordingRecompute().delete_files(my_restriction, dry_run=False)

These methods have ...

  1. A restriction argument that limits deletion to files whose table entries match the restriction.
  2. A dry_run argument that will print the files that would be deleted without actually deleting them.
  3. A confirmation prompt before deleting files.
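
For example, to preview a deletion before committing to it (a sketch reusing the my_restriction placeholder from above):

# Print the files that would be deleted, without deleting anything
RecordingRecompute().delete_files(my_restriction, dry_run=True)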

Folders vs. NWB files

The implementation for tables that generate folders (e.g., v0.SpikeSortingRecording) differs from those that generate NWB files (e.g., v1.SpikeSortingRecording). NWB files record some information about their dependencies and the code that generated them, which we can use to prevent recompute attempts under a mismatched environment. See the RecomputeVersions tables for more information.

NWBs also store data objects in a more structured way, which allows us to make decisions about the degree of precision required for recompute attempts.

Managing Environments

To manage environments, we've added a UserEnvironment table that stores all conda and pip dependencies for the environment used in a recompute attempt. To make a comparison between the current environment and a previous environment...

from spyglass.common import UserEnvironment

UserEnvironment().has_matching_env(
    env_id="previous_env_id",
    relevant_deps=["pynwb", "spikeinterface"],  # (1)!
    show_diffs=True,
)
  1. relevant_deps is an optional argument that will filter the comparison to only the dependencies listed. If the comparison fails, the method will print out the mismatching versions and/or missing dependencies.

To install an environment from a previous recompute attempt, first save a yaml file of the environment, previous_env_id.yaml, with ...

from spyglass.common import UserEnvironment

UserEnvironment().write_env_yaml(
    env_id="previous_env_id",
    dest_path="/your/path/",  # Optional. Otherwise, uses the current directory
)

Then, create a new environment with ...

conda env create -f /your/path/previous_env_id.yaml

Misc Notes

match and precision are reserved MySQL keywords, so the corresponding fields are named matched and rounding, respectively.