[{"content":"Actionability elevates a research object from a static collection of metadata to an operational unit. A research object is actionable when its procedures can be carried out by executing its contents \u0026ndash; not merely read about.\nActionability is cross-cutting: it applies to every other STAMPED dimension. Tracking is more actionable when recorded commands can be re-executed (e.g., datalad rerun), not just inspected. Modularity is more actionable when components can be composed via tooling (e.g., git submodule), not just organized into directories. Portability is more actionable when environments can be instantiated from a specification (e.g., singularity run), not just documented in a README.\nSee the STAMPED paper for the full treatment.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/stamped_principles/a/","section":"STAMPED Properties","summary":"","title":"A — Actionable"},{"content":"The Accessible principle requires that, once data is found, users know how it can be accessed. This involves:\nMaking data retrievable by their identifier using a standardized communication protocol. Ensuring the protocol is open, free, and universally implementable. Supporting authentication and authorization where necessary. Making metadata accessible even when the data itself is no longer available. STAMPED supports accessibility through practices like using standard transfer protocols (HTTP, SSH), hosting data in established repositories, providing clear access documentation, and maintaining metadata independently of data availability. 
Tools like git-annex and DataLad enable flexible access to data across multiple storage backends while keeping a consistent interface.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/fair_principles/a/","section":"FAIR Principles","summary":"","title":"Accessible"},{"content":"Aspirations are the higher-level goals that motivate the adoption of STAMPED principles and structured data management. They represent the qualities that researchers strive for in their work:\nReproducibility \u0026ndash; The ability to reproduce results exactly. Re-running the same analysis on the same data yields identical outcomes. Rigor \u0026ndash; Methodological rigor and correctness. Ensuring that data management practices support sound scientific methodology. Transparency \u0026ndash; Openness and clarity of the research process. Making it clear what was done, how, and why. Efficiency \u0026ndash; Practical efficiency of data management workflows. Minimizing friction and overhead so researchers can focus on science. These aspirations are not mutually exclusive; they reinforce each other. Transparent processes tend to be more reproducible. Rigorous methods tend to be more efficient in the long run because they prevent costly errors. STAMPED principles provide concrete practices that advance all of these goals simultaneously.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/aspirations/","section":"Aspirations","summary":"","title":"Aspirations"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/container/","section":"Tags","summary":"","title":"Container"},{"content":"The misconception #A common objection to containers goes like this: \u0026ldquo;I tried Docker, but I have to rebuild the image every time I change my code. I can\u0026rsquo;t use my editor normally. 
It\u0026rsquo;s too slow for development.\u0026rdquo;\nThis treats containers as monolithic, sealed artifacts — you either bake everything in, or you don\u0026rsquo;t use them at all. The result is that developers either avoid containers entirely (losing reproducibility) or endure painfully slow rebuild-restart cycles (losing productivity).\nThe reality is simpler: containers are reusable environment providers. You can mount your local source code into a running container via a bind mount, and the container supplies only what is hard to set up locally — a specific Python version, system libraries, pre-compiled packages, GPU drivers. Your code stays on the host, editable with your normal tools, and changes are visible inside the container instantly.\nThe pattern # flowchart TB subgraph host[\"HOST FILESYSTEM\"] code[\"📁 Source code + pyproject.toml\"] venv[\"📁 .venv/\n(persistent on host)\"] end runtime[\"Container runtime\"] subgraph container[\"CONTAINER\"] env[\"Python + system libs +\npre-installed packages\"] work[\"📁 /work/ = bind-mounted 📁\"]:::dashed end runtime -.-\u003e|starts| container code ==\u003e|\"bind mount\n(same files, two views)\"| work env --\u003e|base packages| venv work --\u003e|\"pip install .\"| venv classDef dashed stroke-dasharray: 5 5 Both the container runtime and your source code live on the host. The runtime starts the container and sets up bind mounts — it is not inside the container itself. The bind mount creates an overlap: your code appears inside the container at /work/, while physically remaining on your filesystem, editable with your normal tools.\nThe container provides the heavy, slow-to-build parts: a pinned Python version, compiled system libraries, and optionally pre-installed Python packages. The host provides the fast-changing parts: your source code (bind-mounted into the container) and a pyproject.toml (or lock file) that declares project-specific dependencies.\nThe venv bridges the two. 
It can operate in two modes:\nFresh venv (plain python -m venv or uv venv, without --system-site-packages) — a clean, isolated environment where only explicitly installed packages are available. Use this with minimal containers that provide Python but no pre-installed packages. Overlay venv (python -m venv --system-site-packages) — inherits all packages already installed in the container, and installs only the additional dependencies your project needs on top. Use this to take full advantage of a container that already ships heavy dependencies. Scenario 1: Stock uv container + pyproject.toml #The simplest case: you have a Python project with a pyproject.toml and want a reproducible environment without installing anything on your host beyond a container runtime. A stock uv container provides Python and uv — you supply the code and dependencies.\nDocker / Podman:\ndocker run --rm -v \u0026#34;$(pwd)\u0026#34;:/work -w /work \\ ghcr.io/astral-sh/uv:python3.12-trixie-slim \\ sh -c \u0026#39; uv venv .venv uv pip install . .venv/bin/python -m myproject \u0026#39; Singularity / Apptainer equivalent:\nsingularity exec --cleanenv \\ -B \u0026#34;$(pwd)\u0026#34;:/work --pwd /work \\ docker://ghcr.io/astral-sh/uv:python3.12-trixie-slim \\ sh -c \u0026#39; uv venv .venv uv pip install . .venv/bin/python -m myproject \u0026#39; Because the code is bind-mounted, edits on the host are immediately visible inside the container — no rebuild needed. The .venv/ directory is also on the host (created inside the bind mount), so subsequent runs can reuse it without reinstalling everything.\nThe uv pip install . command reads dependencies directly from pyproject.toml. For interactive development (e.g., with docker run -it), use uv pip install -e . 
(editable install) so that changes to your Python source files take effect immediately without reinstalling.\nUse pyproject.toml to specify upper and/or lower bounds for your project\u0026rsquo;s dependencies, and uv will resolve compatible versions at install time. However, if you need to pin exact versions for reproducibility, use a lock file (uv.lock, requirements.txt) instead — then exactly those versions are installed.\nA testable example #The following creates a minimal Python project and runs it in the stock uv container. If you don\u0026rsquo;t have docker installed yet, follow \u0026ldquo;Get Docker\u0026rdquo; or your OS instructions to get it running.\nscript#!/bin/sh set -eux PS4=\u0026#39;\u0026gt; \u0026#39; cd \u0026#34;$(mktemp -d \u0026#34;${TMPDIR:-/tmp}/venv-overlay-XXXXXXX\u0026#34;)\u0026#34; # -- create a minimal Python project -- mkdir -p greet # -- declare dependencies of the project -- cat \u0026gt; pyproject.toml \u0026lt;\u0026lt; \u0026#39;EOF\u0026#39; [build-system] requires = [\u0026#34;hatchling\u0026#34;] build-backend = \u0026#34;hatchling.build\u0026#34; [project] name = \u0026#34;greet\u0026#34; version = \u0026#34;0.1.0\u0026#34; requires-python = \u0026#34;\u0026gt;=3.10\u0026#34; dependencies = [\u0026#34;pyyaml\u0026#34;] EOF # -- write a module that prints items listed in the config file with a greeting -- cat \u0026gt; greet/__init__.py \u0026lt;\u0026lt; \u0026#39;PYEOF\u0026#39; import yaml import sys from pathlib import Path def main(): config = yaml.safe_load(Path(\u0026#34;config.yaml\u0026#34;).read_text()) print(config[\u0026#34;greeting\u0026#34;]) for item in config[\u0026#34;items\u0026#34;]: print(f\u0026#34; - {item}\u0026#34;) print(f\u0026#34;Python {sys.version_info.major}.{sys.version_info.minor}\u0026#34;) if __name__ == \u0026#34;__main__\u0026#34;: main() PYEOF # -- edit data on the host, but run in container via bind mount -- cat \u0026gt; config.yaml \u0026lt;\u0026lt; 
\u0026#39;EOF\u0026#39; greeting: Hello from the container items: - alpha - bravo - charlie EOF # -- run in the stock uv container -- docker run --rm -v \u0026#34;$(pwd)\u0026#34;:/work -w /work \\ ghcr.io/astral-sh/uv:python3.12-trixie-slim \\ sh -c \u0026#39; uv venv .venv uv pip install . .venv/bin/python -c \u0026#34;from greet import main; main()\u0026#34; \u0026#39; # -- edit again, no container rebuild needed -- FAST -- cat \u0026gt; config.yaml \u0026lt;\u0026lt; \u0026#39;EOF\u0026#39; greeting: Edited again on the host items: - alpha - bravo - charlie EOF docker run --rm -v \u0026#34;$(pwd)\u0026#34;:/work -w /work \\ ghcr.io/astral-sh/uv:python3.12-trixie-slim \\ sh -c \u0026#39;.venv/bin/python -c \u0026#34;from greet import main; main()\u0026#34;\u0026#39;The second docker run reuses the existing .venv/ (it persists on the host via the bind mount) and picks up the edited config.yaml without rebuilding anything.\nScenario 2: Reusing an \u0026ldquo;unrelated\u0026rdquo; container (venv overlay) #Sometimes the container you need already exists but was built for a different purpose — a JupyterHub image, a bioinformatics pipeline image, a machine learning training image. These containers typically bundle heavy dependencies (NumPy, SciPy, TensorFlow, CUDA libraries) that are time-consuming to install.\nThe key insight: you don\u0026rsquo;t need to build a custom image. You override the entrypoint and create a venv overlay that inherits the container\u0026rsquo;s packages:\nDocker / Podman:\ndocker run --rm --entrypoint /bin/sh \\ -v \u0026#34;$(pwd)\u0026#34;:/work -w /work \\ jupyter/scipy-notebook:latest \\ -c \u0026#39; python -m venv --system-site-packages .venv . .venv/bin/activate pip install . 
python -m my_analysis \u0026#39; Singularity / Apptainer equivalent (no entrypoint override needed — singularity exec ignores container entrypoints):\nsingularity exec --cleanenv \\ -B \u0026#34;$(pwd)\u0026#34;:/work --pwd /work \\ docker://jupyter/scipy-notebook:latest \\ sh -c \u0026#39; python -m venv --system-site-packages .venv . .venv/bin/activate pip install . python -m my_analysis \u0026#39; The --system-site-packages flag is what makes this work: the overlay venv can import everything already installed in the container (numpy, scipy, matplotlib, etc.) while pip install . adds only the packages your project needs on top. You get the container\u0026rsquo;s pre-built environment plus your project\u0026rsquo;s specific dependencies, without building a custom image.\nAs in Scenario 1, the .venv/ lives in the bind-mounted directory and persists on the host between runs. Subsequent invocations skip the install step entirely — just activate and run. The venv\u0026rsquo;s Python is a symlink to the container\u0026rsquo;s interpreter, so it only works inside the same (or compatible) container image.\nScenario 3: Ephemeral venv for CI and testing #During development, a persistent .venv/ on the host is convenient — fast restarts, no reinstalling. But for CI pipelines and testing, you want the opposite: a guaranteed clean state every run, with no leftover packages from previous iterations.\nThe solution: place the venv inside the container\u0026rsquo;s filesystem (e.g., /tmp/venv) instead of in the bind-mounted directory. 
Since container filesystems are ephemeral, the venv is destroyed when the container exits.\nflowchart TB subgraph host[\"HOST FILESYSTEM\"] code[\"📁 Source code + pyproject.toml\"] end runtime[\"Container runtime\"] subgraph container[\"CONTAINER\"] env[\"Python + system libs +\npre-installed packages\"] work[\"📁 /work/ = bind-mounted 📁\"]:::dashed venv-eph[\"📁 /tmp/venv/\n(ephemeral, destroyed on exit)\"] end runtime -.-\u003e|starts| container code ==\u003e|bind mount| work env --\u003e|\"--system-site-packages\"| venv-eph work --\u003e|\"pip install .\"| venv-eph classDef dashed stroke-dasharray: 5 5 Compare with the diagram above: the venv now lives inside the container, not in the project folder. Nothing persists on the host except your source code.\ndocker run --rm --entrypoint /bin/sh \\ -v \u0026#34;$(pwd)\u0026#34;:/work -w /work \\ jupyter/scipy-notebook:latest \\ -c \u0026#39; python -m venv --system-site-packages /tmp/venv . /tmp/venv/bin/activate pip install . python -m my_analysis \u0026#39; Every run starts from the container\u0026rsquo;s base packages and installs project dependencies fresh. This is slower than the persistent approach but guarantees that the environment matches what a new user (or CI runner) would see.\nThe generalizable recipe #The pattern has two independent dimensions:\nIsolated vs Overlay: whether the venv inherits the container\u0026rsquo;s pre-installed packages (--system-site-packages) or starts clean. Persistent vs Ephemeral: whether the venv lives in the bind-mounted directory (persists on host between runs) or inside the container\u0026rsquo;s filesystem (recreated every run). Isolated (fresh venv) Overlay (--system-site-packages) Persistent (.venv/ in bind mount) Scenario 1 Scenario 2 Ephemeral (/tmp/venv in container) (valid but less common) Scenario 3 Persistent + Isolated (Scenario 1 — stock uv container):\nuv venv .venv uv pip install . 
.venv/bin/python -m myproject Persistent + Overlay (Scenario 2 — reuse heavy container):\npython -m venv --system-site-packages .venv . .venv/bin/activate pip install . python -m myproject Ephemeral + Overlay (Scenario 3 — CI/testing):\npython -m venv --system-site-packages /tmp/venv . /tmp/venv/bin/activate pip install . python -m myproject For projects that are not a full Python package (e.g., a standalone script with a few dependencies), you likely will not have a pyproject.toml. Then just create a simple requirements.txt with a list of (versioned) dependencies and use pip install -r requirements.txt or uv pip install -r requirements.txt instead of pip install ..\nKey flags:\n--system-site-packages — makes the container\u0026rsquo;s installed packages importable in the overlay venv. pip respects them during dependency resolution (avoids duplicating what is already installed), but uv currently ignores system packages during resolution and may reinstall packages already present — functionally correct but wastes space. --entrypoint /bin/sh (Docker) — overrides the container\u0026rsquo;s default entrypoint so you can run arbitrary commands. Not needed with Singularity/Apptainer, which always uses exec semantics. -v $(pwd):/work -w /work (Docker) or -B $(pwd):/work --pwd /work (Singularity) — bind-mounts your local code into the container. Alternatively, use -v $(pwd):$(pwd) -w $(pwd) to keep the same path inside and outside the container — useful when tools record absolute paths for provenance, or when you want to avoid confusion about where files actually live. Examples in the wild #This pattern is not theoretical — it is deployed in production across multiple projects.\nDANDI JupyterHub — The DANDI Archive JupyterHub uses a lightweight venv overlay on top of its conda base environment. Users get a fully configured scientific Python stack from the container and can install additional packages in their personal overlay without affecting the base image or other users. 
The overlay is cheap to create, fast to customize, and disposable.\nNeuroDesk — NeuroDesk provides neuroimaging software (FreeSurfer, FSL, ANTs, and dozens more) through transparent Singularity/Apptainer containers. Rather than building one container per tool, NeuroDesk packages related tools into shared containers and makes them accessible via a desktop environment (NeuroDesktop) or command-line modules (NeuroCommand). Users bind-mount their data into whichever container provides the tool they need. The same container image serves many researchers across institutions — reuse (FAIR R) at scale.\nSTAMPED analysis # Property How the pattern embodies it Self-contained pyproject.toml declares all dependencies; combined with a pinned container tag (or digest), the full environment is specified Actionable A single docker run or singularity exec command reproduces the environment — no manual setup steps Portable The container pins the Python version and system libraries; pyproject.toml (or a lock file) pins package versions; the pattern works on any host with a container runtime Ephemeral Each container invocation starts from a clean base; the overlay venv can be ephemeral (Scenario 3 — recreated every run for guaranteed reproducibility) or persistent (Scenarios 1–2 — kept across runs for faster iteration) — the choice is yours ","date":"2 March 2026","permalink":"https://stamped-principles.github.io/stamped-examples/examples/container-venv-overlay-development/","section":"Examples","summary":"Demonstrates how to mount local code into stock or third-party containers and create lightweight venv overlays, bridging the container\u0026rsquo;s fixed environment with project-specific dependencies.","title":"Container venv Overlay for Python 
Development"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/development/","section":"Tags","summary":"","title":"Development"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/docker/","section":"Tags","summary":"","title":"Docker"},{"content":"If a research object can produce its results in a temporary, disposable environment built solely from its own contents, this provides strong evidence that its other STAMPED properties hold in practice. Inputs must be exhaustively specified (S), outputs deposited correctly (T), and nothing outside the boundary relied upon. Ephemerality is a form of validation: \u0026ldquo;make it a habit to destroy the environment.\u0026rdquo;\nBeyond validation, ephemeral environments enable scaling \u0026ndash; when each computational job runs in an independent, disposable instance, work can be parallelized across subjects, parameters, or datasets.\nAt a minimum, a research object should be able to produce results from a fresh clone on a system that meets its stated requirements (P). At the ideal end, every computation runs in a disposable environment that is created and destroyed per execution.\nSee the STAMPED paper for the full treatment.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/stamped_principles/e/","section":"STAMPED Properties","summary":"","title":"E — Ephemeral"},{"content":"Efficiency in data management means minimizing the friction and overhead of organizing, finding, processing, and sharing data so that researchers can spend their time on science rather than on data wrangling.\nEfficient data management practices include:\nAutomation \u0026ndash; Scripted pipelines that eliminate repetitive manual steps. Clear organization \u0026ndash; Consistent structures that make it easy to find what you need without searching. 
Reusable components \u0026ndash; Modular datasets and code that can be composed into new analyses without starting from scratch. Streamlined collaboration \u0026ndash; Standard formats and tools that reduce the cost of sharing data between people and systems. STAMPED principles contribute to efficiency by establishing conventions up front that prevent costly disorganization later. The initial investment in structure, tooling, and automation pays dividends as projects grow, teams change, and data is reused across studies. What looks like overhead at the start becomes a significant time saver over the lifetime of a research project.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/aspirations/efficiency/","section":"Aspirations","summary":"","title":"Efficiency"},{"content":"This section collects concrete, runnable examples that illustrate the STAMPED principles in action. Each example focuses on a specific practice or workflow and shows how it contributes to sound dataset version control.\nTaxonomy tags #Every example is tagged along four dimensions:\nSTAMPED principles \u0026ndash; which of the seven principles (Self-containment, Tracking, Actionability, Modularity, Portability, Ephemerality, Distributability) the example demonstrates. FAIR mapping \u0026ndash; which FAIR goals (Findable, Accessible, Interoperable, Reusable) the practice helps achieve. Instrumentation level \u0026ndash; how much tooling the example requires, from plain conventions to infrastructure-dependent workflows. Aspirational goals \u0026ndash; higher-level objectives the practice serves, such as reproducibility, rigor, or transparency. Most examples carry tags in several dimensions, because good practices tend to serve multiple goals at once.\nBrowsing #You can explore the examples in several ways:\nScroll below to see the full list on this page. 
Use the navigation to filter by any single taxonomy \u0026ndash; for example, view all examples related to the Tracking principle or all examples at a particular instrumentation level. Use the search bar to find examples by keyword. Difficulty range #Examples are arranged to cover a spectrum of complexity:\nBeginner \u0026ndash; simple conventions and directory layouts that require no special tools. Intermediate \u0026ndash; practices that use lightweight scripts, checksums, or standard Git features. Advanced \u0026ndash; workflows involving specialized tools (DataLad, DVC, Git-annex) or multi-step pipelines demonstrating several patterns together. Pick the level that matches your current setup and expand from there.\nExample states #Each example has a state field in its front matter indicating its editorial status:\nState Meaning uncurated-ai AI-generated draft that has not yet been reviewed by a human. Details may be inaccurate or incomplete. wip Work in progress — under active development, content may be incomplete or change significantly. final Reviewed, curated, and ready for use. No banner is shown. Examples without a state field (or with state: final) are considered ready.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/examples/","section":"Examples","summary":"","title":"Examples"},{"content":"The FAIR Principles provide a set of guiding goals for scientific data management:\nFindable \u0026ndash; Data and metadata should be easy to find, with rich metadata and unique persistent identifiers. Accessible \u0026ndash; Data should be retrievable via standardized protocols, with clear access conditions. Interoperable \u0026ndash; Data should use formal, shared vocabularies and reference other data. Reusable \u0026ndash; Data should be richly described with clear provenance and licensing for reuse. FAIR and STAMPED are complementary. 
FAIR describes what good data management looks like from the consumer\u0026rsquo;s perspective, while STAMPED focuses on the engineering practices that help achieve those goals from the producer\u0026rsquo;s perspective. A project that follows STAMPED principles will naturally tend toward FAIR compliance, because practices like version control, structured metadata, provenance tracking, and standardized organization directly support findability, accessibility, interoperability, and reusability.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/fair_principles/","section":"FAIR Principles","summary":"","title":"FAIR Principles"},{"content":"Instrumentation levels describe the degree of tooling and automation applied to data management practices. They form a spectrum from simple conventions that require no special software to sophisticated automated workflows:\nData Organization \u0026ndash; Directory layouts, naming conventions, file organization patterns. The foundation that requires no special tools. Tool \u0026ndash; Specific software tools that implement principles (git, git-annex, DataLad, etc.). Single-purpose utilities that address particular needs. Workflow \u0026ndash; Multi-step pipelines combining tools. Orchestrated sequences for data processing, analysis, and publication. Pattern \u0026ndash; Architectural design patterns applied to data management. Higher-level organizational strategies that guide how tools and workflows are composed. Not every project needs full automation. The appropriate level of instrumentation depends on the project\u0026rsquo;s scale, complexity, and collaboration requirements. Even the simplest level \u0026ndash; thoughtful data organization \u0026ndash; delivers significant benefits. 
Each subsequent level builds on the ones below it, adding capabilities while also adding complexity.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/instrumentation_levels/","section":"Instrumentation Levels","summary":"","title":"Instrumentation Levels"},{"content":"A research object that is self-contained, tracked, and modular may still fail to reproduce if it depends on undocumented host environment state \u0026ndash; hardcoded paths, implicitly available tools, or specific OS configurations. Portability requires that procedures can be executed on different hosts, given documented system requirements.\nComputational environments must be explicitly defined (not implicitly assumed), machine-reproducible, and version controlled alongside code and data. Whether via containers (Docker, Singularity/Apptainer) or declarative package managers (Nix, Guix), what matters is that environments are specified, versioned, and available within the project.\nSee the STAMPED paper for the full treatment.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/stamped_principles/p/","section":"STAMPED Properties","summary":"","title":"P — Portable"},{"content":"The Pattern instrumentation level represents architectural design patterns applied to data management. Just as software engineering has patterns like MVC (Model-View-Controller) that guide system design, data management has higher-level organizational strategies that guide how datasets, tools, and workflows are composed.\nExamples of data management patterns:\nYODA (YODA\u0026rsquo;s Organizer of Data Assets) \u0026ndash; A pattern for structuring nested datasets with clear separation of inputs, outputs, and code, enabling modular and portable analyses. BIDS (Brain Imaging Data Structure) \u0026ndash; A community standard for organizing neuroimaging data with prescribed directory structures, naming conventions, and metadata files. 
Linked dataset graphs \u0026ndash; Connecting datasets as a directed acyclic graph of dependencies, so that provenance flows naturally through the structure. Patterns operate at a higher level of abstraction than individual tools or workflows. They provide a mental model and a set of conventions that make it easier to reason about, communicate about, and maintain complex data management setups over time.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/instrumentation_levels/pattern/","section":"Instrumentation Levels","summary":"","title":"Pattern"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/python/","section":"Tags","summary":"","title":"Python"},{"content":"Reproducibility is the cornerstone aspiration of scientific data management. A result is reproducible when re-running the same analysis on the same data yields identical outcomes. This requires:\nFixed inputs \u0026ndash; The exact data used in an analysis must be identified and retrievable. Captured computation \u0026ndash; The code, software environment, parameters, and execution order must be recorded. Deterministic execution \u0026ndash; Given the same inputs and computation, the outputs must be the same. STAMPED principles support reproducibility through version control (pinning exact states of data and code), provenance tracking (recording what was run and how), containerization (freezing computational environments), and modular dataset organization (making it possible to re-assemble all components of an analysis). 
When every element of an analysis is tracked and versioned, reproduction becomes a matter of checking out the right state and re-executing.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/aspirations/reproducibility/","section":"Aspirations","summary":"","title":"Reproducibility"},{"content":"The Reusable principle aims to ensure that data and metadata are well-described so they can be used and combined in future research. This requires:\nDescribing data with a plurality of accurate and relevant attributes. Releasing data with a clear and accessible data usage license. Associating data with detailed provenance information. Meeting domain-relevant community standards for data and metadata. STAMPED practices directly support reusability through comprehensive provenance tracking, machine-readable metadata, explicit licensing, and modular data organization. When every processing step is recorded and the full history of a dataset is available, other researchers can confidently reuse and build upon the work.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/fair_principles/r/","section":"FAIR Principles","summary":"","title":"Reusable"},{"content":"A research object must never rely on implicit external state \u0026ndash; the \u0026ldquo;don\u0026rsquo;t look up\u0026rdquo; rule. All modules and components essential to replicate computational execution must be contained within a single top-level boundary.\nComponents may be included literally (files committed directly) or by reference (subdatasets, registered data URLs), provided the references are explicit and tracked. 
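The by-reference style can be sketched with nothing more than plain git submodules; the repository names and paths below are illustrative, with a local sibling repository standing in for a published component:

```shell
set -eu
cd $(mktemp -d)
# a reusable component, versioned on its own (stand-in for a subdataset)
git init -q component
git -C component -c user.name=ex -c user.email=ex@example.org \
    commit -q --allow-empty -m component-v1
# the research object references the component instead of copying it
git init -q study
cd study
git -c protocol.file.allow=always \
    submodule add -q ../component inputs/component
cat .gitmodules   # the explicit, tracked reference
```

The superproject records the component\u0026rsquo;s URL in .gitmodules and pins its exact commit in the tree, so the reference is explicit and versioned rather than an implicit host dependency.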
Self-containment is the foundational property upon which the remaining STAMPED properties build.\nSee the STAMPED paper for the full treatment.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/stamped_principles/s/","section":"STAMPED Properties","summary":"","title":"S — Self-contained"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/singularity/","section":"Tags","summary":"","title":"Singularity"},{"content":"This site is a companion resource to the STAMPED paper on properties of a reproducible research object. It provides concrete, pragmatic examples that demonstrate the seven STAMPED properties in practice \u0026ndash; from simple naming conventions to complex multi-tool workflows.\nWhat is STAMPED? #STAMPED defines seven properties that characterize a well-formed reproducible research object \u0026ndash; a collection of data, code, and metadata that together represent a complete unit of research output. See the paper for the full treatment; the table below is a quick reference:\nProperty Core idea S \u0026ndash; Self-contained Everything needed to replicate results is within a single top-level boundary \u0026ndash; the \u0026ldquo;don\u0026rsquo;t look up\u0026rdquo; rule. T \u0026ndash; Tracked All components are content-addressed and version-controlled; provenance of every modification is recorded. A \u0026ndash; Actionable Procedures are executable specifications, not just documentation \u0026ndash; a cross-cutting property that applies to every other STAMPED dimension. M \u0026ndash; Modular Components are organized as independently versioned modules that can be composed, updated, and reused separately. P \u0026ndash; Portable Procedures do not depend on undocumented host state; computational environments are explicitly specified and versioned. 
E \u0026ndash; Ephemeral Results can be produced in temporary, disposable environments built solely from the research object\u0026rsquo;s contents \u0026ndash; validating that other properties hold. D \u0026ndash; Distributable The research object and all its components are persistently retrievable by others, packaged like a software distribution. These properties reinforce one another. Self-containment makes portability practical, tracking enables actionability, and modularity supports distributability.\nHow examples are organized #Each example on this site is tagged along multiple dimensions so you can explore the collection from whatever angle is most useful to you:\nSTAMPED principles \u0026ndash; which of the seven principles does the example primarily demonstrate? FAIR mapping \u0026ndash; which of the FAIR goals (Findable, Accessible, Interoperable, Reusable) does the practice help achieve? Instrumentation level \u0026ndash; how much tooling does the example require, from plain conventions that need no special software to workflows that depend on specific version-control infrastructure? Aspirational goals \u0026ndash; what higher-level objectives (reproducibility, transparency, rigor, efficiency) does the practice serve? Get started #Head to the Examples section to browse the full collection. You can also explore by taxonomy using the footer links, or use the search bar to find examples relevant to your needs.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/","section":"STAMPED Principles — Examples","summary":"","title":"STAMPED Principles — Examples"},{"content":"STAMPED defines seven properties that characterize a well-formed reproducible research object. 
The framework originates from the YODA principles and is described in full in the STAMPED paper.\nS \u0026ndash; Self-contained: All modules and components essential to replicate results are within a single top-level boundary.\nT \u0026ndash; Tracked: All components are content-addressed; provenance of every modification is recorded.\nA \u0026ndash; Actionable: Procedures are executable specifications, not just documentation. A cross-cutting property that applies to every other dimension.\nM \u0026ndash; Modular: Components are organized as independently versioned modules that can be composed and reused.\nP \u0026ndash; Portable: Procedures do not depend on undocumented host state; environments are explicitly specified and versioned.\nE \u0026ndash; Ephemeral: Results can be produced in temporary, disposable environments built solely from the research object\u0026rsquo;s contents.\nD \u0026ndash; Distributable: The research object and all its components are persistently retrievable, packaged like a software distribution.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/stamped_principles/","section":"STAMPED Properties","summary":"","title":"STAMPED Properties"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/","section":"Tags","summary":"","title":"Tags"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/uv/","section":"Tags","summary":"","title":"Uv"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/venv/","section":"Tags","summary":"","title":"Venv"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/awk/","section":"Tags","summary":"","title":"Awk"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/containers/","section":"Tags","summary":"","title":"Containers"},{"content":"Self-containment 
(S) establishes that everything needed is within the research object\u0026rsquo;s boundary. Distributability promises that those references actually deliver \u0026ndash; that the research object and its components can be shared, retrieved, and used by others in a state consistent with reuse.\nThe distinction mirrors the concept of a software distribution: a curated, versioned bundle in which all components are resolved to specific versions and packaged for consumption. Simply sharing scripts with loose dependencies does not constitute distribution in this sense.\nThe spectrum ranges from publicly accessible components with retrieval instructions, through persistent hosting on archival infrastructure (Zenodo, PyPI, conda-forge, DANDI) with frozen versions and content-addressed identifiers, to a fully self-contained archive (e.g., a built container or a zipped RO-Crate).\nSee the STAMPED paper for the full treatment.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/stamped_principles/d/","section":"STAMPED Properties","summary":"","title":"D — Distributable"},{"content":"The task #Sum prices from a tiny CSV — a grocery receipt:\nitem,price apples,1.50 bread,2.30 milk,3.20 Processing: awk -F, 'NR\u0026gt;1 {sum+=$2} END {printf \u0026quot;%.2f\\n\u0026quot;, sum}' prices.csv\nResult: 7.00\nThis is trivially understandable, yet enough to demonstrate every STAMPED property across four progressive scenarios. Each scenario script follows the ephemeral shell reproducer skeleton for portability.\nScenario 1: Self-contained script (S, E, P) #The simplest case: a single script that creates the data inline, sums the prices, and prints the total. 
Requires only POSIX sh and awk.\nscript#!/bin/sh # Grocery receipt: sum prices from a CSV set -eux PS4=\u0026#39;\u0026gt; \u0026#39; cd \u0026#34;$(mktemp -d \u0026#34;${TMPDIR:-/tmp}/grocery-XXXXXXX\u0026#34;)\u0026#34; # --- generate data --- cat \u0026gt; prices.csv \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; item,price apples,1.50 bread,2.30 milk,3.20 EOF # --- process: sum the prices --- export LC_ALL=C awk -F, \u0026#39;NR\u0026gt;1 {sum+=$2} END {printf \u0026#34;%.2f\\n\u0026#34;, sum}\u0026#39; prices.csv \u0026gt; total.txt echo \u0026#34;=== Total ===\u0026#34; cat total.txtThis script is self-contained (the data is generated inline) and ephemeral (runs in a fresh temp directory).\nWhy LC_ALL=C? The decimal point is locale-dependent. On a system with LC_ALL=de_DE.UTF-8, awk might interpret 1.50 as 1 (treating . as a thousands separator) or produce output with commas instead of periods. Setting LC_ALL=C forces POSIX numeric conventions — consistent behavior regardless of the host locale makes the script more Portable (assuming availability of awk).\nScenario 2: Makefile as actionable specification (+ T, A) #The same analysis, but now organized as a git repository with a Makefile that declares the dependency graph. 
This adds tracking (git records every change) and actionability (make re-derives results from source).\nscript#!/bin/sh # Grocery receipt as a tracked, actionable git repository set -eux PS4=\u0026#39;\u0026gt; \u0026#39; cd \u0026#34;$(mktemp -d \u0026#34;${TMPDIR:-/tmp}/grocery-XXXXXXX\u0026#34;)\u0026#34; git init grocery-analysis cd grocery-analysis git config user.email \u0026#34;demo@example.com\u0026#34; git config user.name \u0026#34;Demo User\u0026#34; # --- data --- cat \u0026gt; prices.csv \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; item,price apples,1.50 bread,2.30 milk,3.20 EOF # --- analysis script --- cat \u0026gt; sum-prices.sh \u0026lt;\u0026lt;\u0026#39;SCRIPT\u0026#39; #!/bin/sh set -eu export LC_ALL=C awk -F, \u0026#39;NR\u0026gt;1 {sum+=$2} END {printf \u0026#34;%.2f\\n\u0026#34;, sum}\u0026#39; prices.csv \u0026gt; total.txt SCRIPT chmod +x sum-prices.sh # --- Makefile: the actionable specification --- cat \u0026gt; Makefile \u0026lt;\u0026lt;\u0026#39;MF\u0026#39; .POSIX: all: total.txt total.txt: prices.csv sum-prices.sh ./sum-prices.sh clean: rm -f total.txt .PHONY: all clean MF # --- README --- cat \u0026gt; README.md \u0026lt;\u0026lt;\u0026#39;README\u0026#39; # Grocery Receipt Analysis Run `make` to produce `total.txt` from `prices.csv`. Requires: POSIX sh, awk, make. README # --- .gitignore: outputs are derived, not tracked --- cat \u0026gt; .gitignore \u0026lt;\u0026lt;\u0026#39;GI\u0026#39; total.txt GI git add -A git commit -m \u0026#34;Initial commit: grocery receipt analysis\u0026#34; # --- run it --- make echo \u0026#34;=== Total ===\u0026#34; cat total.txt echo \u0026#34;\u0026#34; echo \u0026#34;=== Provenance: the Makefile + git log ===\u0026#34; git log --onelineBrowse grocery-analysisThe Makefile is the actionable specification: it declares that total.txt depends on prices.csv and sum-prices.sh, and make will only re-run the analysis when an input changes. 
Git tracks the full history.\nThis is a substantial improvement over a loose script: git clone + make is all anyone needs to reproduce the result. But make records what to run, not what environment to run it in — the host\u0026rsquo;s awk version is still implicit.\nScenario 3: Containerized execution with Alpine (+ P) #To pin the computational environment, we run the analysis inside a minimal Alpine Linux container (~3 MB as a .sif image). Alpine includes BusyBox awk — exactly what our script needs, nothing more.\nThe examples below use Singularity to pull and execute the container. The same approach works with Apptainer (the open-source fork — just replace singularity with apptainer), or with Docker/Podman if you prefer an OCI-native workflow (docker run --rm -v \u0026quot;$PWD:$PWD\u0026quot; -w \u0026quot;$PWD\u0026quot; alpine:3.21 ./sum-prices.sh).\nscript#!/bin/sh # Grocery receipt with containerized execution via Alpine set -eux PS4=\u0026#39;\u0026gt; \u0026#39; cd \u0026#34;$(mktemp -d \u0026#34;${TMPDIR:-/tmp}/grocery-XXXXXXX\u0026#34;)\u0026#34; # --- pull a minimal container image --- singularity pull docker://alpine:3.21 git init grocery-analysis cd grocery-analysis git config user.email \u0026#34;demo@example.com\u0026#34; git config user.name \u0026#34;Demo User\u0026#34; # --- same data and script as Scenario 2 --- cat \u0026gt; prices.csv \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; item,price apples,1.50 bread,2.30 milk,3.20 EOF cat \u0026gt; sum-prices.sh \u0026lt;\u0026lt;\u0026#39;SCRIPT\u0026#39; #!/bin/sh set -eu export LC_ALL=C awk -F, \u0026#39;NR\u0026gt;1 {sum+=$2} END {printf \u0026#34;%.2f\\n\u0026#34;, sum}\u0026#39; prices.csv \u0026gt; total.txt SCRIPT chmod +x sum-prices.sh # --- Makefile: run inside the container --- cat \u0026gt; Makefile \u0026lt;\u0026lt;\u0026#39;MF\u0026#39; .POSIX: SIF = ../alpine_3.21.sif all: total.txt total.txt: prices.csv sum-prices.sh $(SIF) singularity exec --cleanenv $(SIF) ./sum-prices.sh clean: rm -f 
total.txt .PHONY: all clean MF cat \u0026gt; .gitignore \u0026lt;\u0026lt;\u0026#39;GI\u0026#39; total.txt GI cat \u0026gt; README.md \u0026lt;\u0026lt;\u0026#39;README\u0026#39; # Grocery Receipt Analysis (containerized) Run `make` to produce `total.txt` from `prices.csv`. The analysis runs inside an Alpine Linux container to guarantee identical results regardless of the host system\u0026#39;s awk version. Requires: POSIX sh, make, singularity (or apptainer). The container image (`alpine_3.21.sif`) must be present in the parent directory — see Makefile for details. README git add -A git commit -m \u0026#34;Initial commit: containerized grocery receipt analysis\u0026#34; # --- run it --- make echo \u0026#34;=== Total ===\u0026#34; cat total.txtBrowse grocery-analysisNow every collaborator gets the same BusyBox awk regardless of whether their host has gawk, mawk, or something else. This demonstrates portability: the script no longer depends on whatever happens to be installed on the host.\nBut the container reference docker://alpine:3.21 is not pinned — the 3.21 tag is mutable (Alpine publishes point releases under the same tag). And the script depends on Docker Hub being available: if the network is down or the registry is unavailable, the pull fails.\nS is weakened — the container lives on Docker Hub, not in our repository. T is weak — we know \u0026ldquo;Alpine 3.21\u0026rdquo; but not which exact build. 
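The difference between a mutable name and a content-derived digest can be demonstrated with plain files (a hedged aside using sha256sum, not part of the scenarios themselves): a name, like a tag, can point at different bytes over time, while a sha256 digest is computed from the bytes.

```shell
# A filename plays the role of a mutable tag here: same name, different content.
printf 'build A' > image.bin
sha256sum image.bin           # one digest
printf 'build B' > image.bin  # the "tag" (filename) is unchanged...
sha256sum image.bin           # ...but the digest differs: the name alone
                              # tells you nothing about the content
```

The same logic applies to `docker://alpine:3.21` versus a digest reference: only the digest identifies the content exactly.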
Scenario 3b: Pinning the container by digest (recovering T) #A simple fix for the tracking problem: reference the image by its content-addressed digest rather than a mutable tag.\nThe only line that changes from Scenario 3:\nsnippet# Before (mutable tag — could change between builds): singularity pull docker://alpine:3.21 # After (pinned digest — immutable): singularity pull docker://alpine@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88cWith a digest, two people running the script a year apart will pull byte-identical image content — the registry is physically unable to serve different bits for the same sha256. This recovers tracking: the provenance now records exactly which environment was used, down to every library version.\nBut self-containment is still missing. The image lives on Docker Hub, not inside our project. If the registry imposes pull rate limits, or the network is simply unavailable (an air-gapped HPC cluster), the script cannot obtain its dependency. The digest is a precise reference, not a local copy.\nThis is the gap that Scenario 4 closes.\nScenario 4: Container committed to git (recovering S, + M, D) #The Alpine .sif image is only ~3 MB — small enough to commit directly to the git repository. Now the container travels with the code and data. No network access needed to reproduce.\nscript#!/bin/sh # Grocery receipt: fully self-contained with container in git set -eux PS4=\u0026#39;\u0026gt; \u0026#39; cd \u0026#34;$(mktemp -d \u0026#34;${TMPDIR:-/tmp}/grocery-XXXXXXX\u0026#34;)\u0026#34; # --- build the container image from a pinned digest --- singularity pull env.sif docker://alpine@sha256:a8560b36e8b8210634f77d9f7f9efd7ffa463e380b75e2e74aff4511df3ef88c git init grocery-analysis cd grocery-analysis git config user.email \u0026#34;demo@example.com\u0026#34; git config user.name \u0026#34;Demo User\u0026#34; # --- commit the container image into the repository --- cp ../env.sif . 
git add env.sif git commit -m \u0026#34;Add Alpine container image (3 MB, pinned by digest)\u0026#34; # --- raw data as a git submodule (modularity) --- ( cd .. git init --bare raw-data.git git clone raw-data.git raw-data-work cd raw-data-work git config user.email \u0026#34;demo@example.com\u0026#34; git config user.name \u0026#34;Demo User\u0026#34; cat \u0026gt; prices.csv \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; item,price apples,1.50 bread,2.30 milk,3.20 EOF git add prices.csv git commit -m \u0026#34;Add grocery prices\u0026#34; git push ) # In a real project, use a proper URL (https://... or git@...:...). # For this local demo, we must allow the file:// transport # (restricted by default since Git 2.38.1, CVE-2022-39253). git -c protocol.file.allow=always submodule add ../raw-data.git raw-data # --- analysis script --- cat \u0026gt; sum-prices.sh \u0026lt;\u0026lt;\u0026#39;SCRIPT\u0026#39; #!/bin/sh set -eu export LC_ALL=C awk -F, \u0026#39;NR\u0026gt;1 {sum+=$2} END {printf \u0026#34;%.2f\\n\u0026#34;, sum}\u0026#39; raw-data/prices.csv \u0026gt; total.txt SCRIPT chmod +x sum-prices.sh # --- Makefile: run inside the local container --- cat \u0026gt; Makefile \u0026lt;\u0026lt;\u0026#39;MF\u0026#39; .POSIX: SIF = env.sif all: total.txt total.txt: raw-data/prices.csv sum-prices.sh $(SIF) singularity exec --cleanenv $(SIF) ./sum-prices.sh clean: rm -f total.txt .PHONY: all clean MF cat \u0026gt; .gitignore \u0026lt;\u0026lt;\u0026#39;GI\u0026#39; total.txt GI cat \u0026gt; README.md \u0026lt;\u0026lt;\u0026#39;README\u0026#39; # Grocery Receipt Analysis Run `make` to produce `total.txt` from raw price data. The analysis runs inside an Alpine Linux container (`env.sif`) that is committed to this repository — no network access needed. Raw data lives in the `raw-data/` git submodule. git clone --recurse-submodules \u0026lt;url\u0026gt; make Requires: POSIX sh, make, singularity (or apptainer). 
README git add -A git commit -m \u0026#34;Add analysis script, Makefile, and README\u0026#34; # --- run it --- make echo \u0026#34;=== Total ===\u0026#34; cat total.txt echo \u0026#34;\u0026#34; echo \u0026#34;=== Repository structure ===\u0026#34; git submodule status git log --onelineBrowse grocery-analysisBrowse raw-data-workThis recovers the full STAMPED stack using only git, make, and singularity — no specialized research data management tools required:\nProperty How it is realized S — Self-contained Container image (env.sif), analysis script, and Makefile are all committed to git. Raw data is pinned via a git submodule at a specific commit. git clone --recurse-submodules + make is all anyone needs. T — Tracked Git records every change to code, data (in the submodule), and even the container image. The Makefile declares the exact dependency graph. A — Actionable make re-derives results from source. The README.md tells a collaborator exactly what to run. M — Modular Raw data is a separate git repository included as a submodule — reusable in other projects, versioned independently. P — Portable The container pins the awk implementation; POSIX shell + LC_ALL=C pins the script behavior. E — Ephemeral The entire analysis runs in a temp directory built from scratch. D — Distributable Standard git push to any remote. The repository can be pushed to multiple hosts (GitHub, GitLab, institutional server) simultaneously. For archival, git bundle creates a single-file snapshot of the entire history. 
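The git bundle option mentioned in the table can be sketched in a few lines (using a throwaway demo repository rather than the grocery example):

```shell
#!/bin/sh
# Archive an entire repository history into one file, then clone from
# that file with no network access.
set -eu
cd "$(mktemp -d)"
git init -q demo && cd demo
git -c user.email=d@e -c user.name=D commit -q --allow-empty -m "initial"
git bundle create ../demo.bundle --all   # one file: all refs + full history
cd .. && git clone -q demo.bundle restored
git -C restored log --oneline            # history recovered from a single file
```

The resulting `.bundle` file can be deposited on archival infrastructure like any other artifact.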
The progression across all four scenarios illustrates a general pattern: each STAMPED property you add removes a class of failure, but introducing an external dependency (the container) can remove properties you already had (self-containment) unless you provision for it explicitly.\nBeyond git: scaling with git-annex and DataLad #For projects where the data or container images outgrow what is practical to commit to git directly, tools like git-annex or DataLad extend this pattern with content-addressed storage and multi-remote availability tracking — the same dataset can be distributed to GitHub, Figshare (with a DOI), S3, or institutional archives, and the availability information (which remotes hold which files) travels with the dataset so that a fresh clone can assemble itself from whichever sources are reachable.\nIn particular, datalad-container simplifies container management within DataLad datasets: it maintains a local catalog of container images (tracked by git-annex), and its datalad containers-run command records which container was used for each computation — adding container identity to the provenance chain automatically.\nFor neuroimaging and other scientific domains, ReproNim/containers provides a ready-made DataLad dataset of popular containerized tools (FreeSurfer, fMRIPrep, BIDS Apps, etc.). 
It is itself a STAMPED research object: a modular collection that can be included as a git submodule or DataLad sub-dataset, providing portable access to pinned container versions without each project having to manage its own images.\n","date":"20 February 2026","permalink":"https://stamped-principles.github.io/stamped-examples/examples/stamped-awk-evolution/","section":"Examples","summary":"Four scenarios showing how to incrementally add STAMPED properties to a shell-based analysis using git, make, and singularity.","title":"From Script to STAMPED Research Object"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/git/","section":"Tags","summary":"","title":"Git"},{"content":"Rather than managing a research object as one indivisible whole, STAMPED promotes a compositional approach: independently versioned modules (input datasets, processing scripts, computational environments) can be updated or replaced separately, minimizing disruption and maximizing reusability.\nAn idiomatic layout delineates components into structured directories \u0026ndash; code/, inputs/, envs/, docs/, results/ \u0026ndash; clarifying how they interact and supporting domain-specific standards (e.g., BIDS). 
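Such a layout might look like this (an illustrative sketch following the conventions above, not a mandated structure):

```
my-project/
├── code/      # analysis scripts, independently versioned
├── inputs/    # input datasets, possibly subdatasets
├── envs/      # environment specifications (containers, lockfiles)
├── docs/      # documentation
└── results/   # derived outputs
```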
Components may be included directly or linked as subdatasets (git submodules), each with its own independent version history.\nSee the STAMPED paper for the full treatment.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/stamped_principles/m/","section":"STAMPED Properties","summary":"","title":"M — Modular"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/make/","section":"Tags","summary":"","title":"Make"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/posix/","section":"Tags","summary":"","title":"Posix"},{"content":"Rigor in data management means ensuring that the practices surrounding data collection, processing, and analysis uphold the standards of sound scientific methodology. Rigorous data management helps prevent:\nAccidental data corruption or loss. Silent errors in processing pipelines. Confusion between different versions or stages of data. Unintentional mixing of training and test data, or other methodological mistakes. STAMPED principles promote rigor by enforcing structure and discipline: version control prevents silent overwrites, provenance tracking creates an auditable trail, automated pipelines reduce human error, and clear separation of raw and processed data guards against contamination. When the data management infrastructure itself enforces good practices, researchers are less likely to make mistakes that compromise their results.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/aspirations/rigor/","section":"Aspirations","summary":"","title":"Rigor"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/shell/","section":"Tags","summary":"","title":"Shell"},{"content":"Version information must be recorded for all components, ideally using the same content-addressed version control system. 
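Content addressing can be seen directly in git, where an object identifier is a hash of the content alone (a minimal demonstration; the filenames are arbitrary):

```shell
#!/bin/sh
# Identical content always yields the identical object ID, regardless of name.
set -eu
cd "$(mktemp -d)"
printf 'hello\n' > a.txt
printf 'hello\n' > b.txt
git hash-object a.txt b.txt   # prints the same ID twice
```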
The primary value is not version numbering (\u0026ldquo;v1\u0026rdquo; vs \u0026ldquo;v2\u0026rdquo;) but content-addressed identification \u0026ndash; two datasets with identical content hashes are provably identical.\nTracking encompasses not only version history but also provenance: what actions produced or modified each component, what inputs were consumed, and what versions of code and environment were involved. For code-driven modifications, provenance should be captured programmatically rather than by manual annotation.\nSee the STAMPED paper for the full treatment.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/stamped_principles/t/","section":"STAMPED Properties","summary":"","title":"T — Tracked"},{"content":"Transparency means that the research process is open and comprehensible to others (and to your future self). A transparent project makes it clear:\nWhat data was collected or used. How it was processed, transformed, and analyzed. Why particular decisions were made along the way. Where the data and code can be found. STAMPED principles support transparency through detailed provenance records, human-readable metadata, public version control histories, and well-documented data organization. When every step of a research project is recorded and accessible, collaborators and reviewers can follow the entire chain from raw data to published results. 
Transparency builds trust and enables meaningful peer review of not just the conclusions, but the process that produced them.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/aspirations/transparency/","section":"Aspirations","summary":"","title":"Transparency"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/bug-report/","section":"Tags","summary":"","title":"Bug-Report"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/data-exploration/","section":"Tags","summary":"","title":"Data-Exploration"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/datalad/","section":"Tags","summary":"","title":"Datalad"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/dataset-organization/","section":"Tags","summary":"","title":"Dataset-Organization"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/datasette/","section":"Tags","summary":"","title":"Datasette"},{"content":"The problem: datasets as inert files #Most research datasets are distributed as flat files \u0026ndash; CSVs, TSVs, JSON dumps, or binary formats. To explore them, a user must:\nDownload the files. Write a script (often in Python or R) to load and query the data. Figure out the schema by reading a README or inspecting column headers. Iterate through increasingly complex queries as they form hypotheses. Each of these steps is a barrier. The dataset is inert: it does not help the user understand its contents. The first meaningful interaction with the data requires writing code.\nThis is at odds with the Actionability principle, which says that dataset operations should be executable, not just documented. 
If exploring a dataset requires writing bespoke scripts before even seeing the data, the dataset is not actionable \u0026ndash; it is passive.\nThe solution: datasette makes data instantly explorable #Datasette is an open-source tool that takes a SQLite database file and serves it as an interactive web application. With a single command, it provides:\nA browseable web UI for every table, with sorting, filtering, and full-text search. A SQL query interface where users can write and share arbitrary queries. A JSON API for programmatic access to any query result. Export to CSV, JSON, and other formats directly from the browser. The setup cost is minimal: one file (the SQLite database) and one command (datasette serve). There is no database server to configure, no ORM to write, no web application framework to learn.\nThe MVC analogy #Datasette naturally follows a Model-View-Controller (MVC) decomposition, which helps explain why it works so well as a data exploration tool:\nModel: the SQLite database #The SQLite file is the model. It contains the data, the schema (table definitions, column types, indexes), and optionally metadata (views, triggers, comments).\nresearch.db |-- participants (id, age, group, consent_date) |-- measurements (participant_id, session, score, timestamp) |-- stimuli (id, category, filename, duration_ms) SQLite is a particularly good fit for the Self-containment principle:\nSingle file. The entire database is one file. No server process, no connection strings, no configuration. Self-describing schema. The table structure, column types, and relationships are embedded in the file. Cross-platform. SQLite files are portable across operating systems and architectures. Versioned. A SQLite file can be tracked in Git (for small datasets) or git-annex (for larger ones). View: the datasette web UI #Datasette generates a web interface automatically from the database schema. Each table gets a paginated, sortable, filterable page. 
Each row links to a detail view. The SQL query page provides a scratchpad for ad-hoc analysis.\nThis is the view layer: it presents the model\u0026rsquo;s data in a human-readable form without altering the underlying data.\nController: SQL queries and plugins #The controller layer is the SQL query interface, augmented by datasette plugins. Users interact with the data by:\nApplying column filters (translated to WHERE clauses). Sorting by columns (translated to ORDER BY). Writing custom SQL queries. Using plugins for specialized visualizations (maps, charts, dashboards). The controller does not modify the data (datasette is read-only by default). It translates user intent into queries against the model and passes the results to the view.\nStep-by-step: from CSV to explorable dataset #1. Prepare your data as CSV #Suppose you have a research dataset as CSV files:\ndata/ participants.csv measurements.csv stimuli.csv With participants.csv containing:\nid,age,group,consent_date P001,34,control,2025-06-15 P002,28,treatment,2025-06-16 P003,41,control,2025-06-17 P004,35,treatment,2025-06-18 ... 2. 
Convert CSV to SQLite #Use the sqlite-utils companion tool (by the same author as datasette) to import CSV files into a SQLite database:\npip install sqlite-utils sqlite-utils insert research.db participants data/participants.csv --csv --detect-types sqlite-utils insert research.db measurements data/measurements.csv --csv --detect-types sqlite-utils insert research.db stimuli data/stimuli.csv --csv --detect-types The --detect-types flag tells sqlite-utils to infer column types (integer, float, text, date) rather than storing everything as text.\nYou can also add metadata to make the database more self-documenting:\n# Add a description to the database sqlite-utils insert research.db _metadata - --json \u0026lt;\u0026lt; \u0026#39;EOF\u0026#39; [{\u0026#34;key\u0026#34;: \u0026#34;title\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Cognitive Study Dataset\u0026#34;}, {\u0026#34;key\u0026#34;: \u0026#34;description\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;Behavioral measurements from a controlled cognitive study, 2025\u0026#34;}, {\u0026#34;key\u0026#34;: \u0026#34;license\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;CC-BY-4.0\u0026#34;}, {\u0026#34;key\u0026#34;: \u0026#34;contact\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;researcher@university.edu\u0026#34;}] EOF # Create useful indexes for common queries sqlite-utils create-index research.db measurements participant_id sqlite-utils create-index research.db measurements session 3. Serve with datasette #pip install datasette datasette serve research.db Open http://localhost:8001 in a browser. You immediately see:\nA list of tables (participants, measurements, stimuli). Click any table to browse its rows with sorting and filtering. A SQL query page to write arbitrary queries. JSON and CSV export links on every page. 4. 
Add a metadata file for richer presentation #Datasette supports a metadata.yml (or .json) file that adds descriptions, titles, and licensing information to the web interface:\ntitle: \u0026#34;Cognitive Study Dataset\u0026#34; description: \u0026#34;Behavioral measurements from a controlled cognitive study\u0026#34; license: \u0026#34;CC-BY-4.0\u0026#34; databases: research: description: \u0026#34;Main study database\u0026#34; tables: participants: description: \u0026#34;Demographics and consent information for study participants\u0026#34; columns: id: \u0026#34;Unique participant identifier\u0026#34; age: \u0026#34;Age in years at time of consent\u0026#34; group: \u0026#34;Experimental group assignment (control or treatment)\u0026#34; consent_date: \u0026#34;Date informed consent was obtained\u0026#34; measurements: description: \u0026#34;Behavioral scores recorded per session\u0026#34; columns: participant_id: \u0026#34;Foreign key to participants.id\u0026#34; session: \u0026#34;Session number (1-indexed)\u0026#34; score: \u0026#34;Composite behavioral score (0-100 scale)\u0026#34; timestamp: \u0026#34;UTC timestamp of measurement\u0026#34; stimuli: description: \u0026#34;Stimulus materials used across sessions\u0026#34; Serve with metadata:\ndatasette serve research.db --metadata metadata.yml Now each table page shows its description and column-level documentation directly in the browser.\n5. Example queries users can run immediately #The SQL page allows anyone to explore the data without writing Python or R. 
Some examples:\nAverage score by group:\nSELECT p.\u0026#34;group\u0026#34;, AVG(m.score) AS mean_score, COUNT(*) AS n_measurements FROM measurements m JOIN participants p ON m.participant_id = p.id GROUP BY p.\u0026#34;group\u0026#34;; Participant scores across sessions (for a single participant):\nSELECT session, score, timestamp FROM measurements WHERE participant_id = \u0026#39;P001\u0026#39; ORDER BY session; Stimuli summary by category:\nSELECT category, COUNT(*) AS n_stimuli, AVG(duration_ms) AS mean_duration_ms FROM stimuli GROUP BY category; Each of these queries is a URL. Datasette encodes the query in the URL parameters, so results can be shared by simply copying the browser address bar. This turns ad-hoc data exploration into shareable, reproducible interactions.\nHow this makes data \u0026ldquo;actionable\u0026rdquo; #Traditional datasets require a multi-step setup process before any interaction is possible. Datasette collapses this:\nWorkflow step Traditional CSV Datasette + SQLite Get the data Download files Download one .db file Understand the schema Read README, inspect headers Browse table pages with docs First query Write Python/R script, import pandas Click a table, apply filters Complex query More scripting Write SQL in the browser Share a result Email a script or screenshot Share a URL Programmatic access Parse files with custom code Call the JSON API The dataset goes from \u0026ldquo;inert file that requires programming to explore\u0026rdquo; to \u0026ldquo;interactive application that answers questions immediately.\u0026rdquo;\nSelf-containment: the SQLite file as a self-documenting artifact #The SQLite file embodies the Self-containment principle in a way that CSVs do not:\nSchema is embedded. Column types, constraints, and indexes are part of the file, not a separate data dictionary. Relationships are explicit. Foreign keys connect tables, making the data model navigable. Metadata can be included. 
Descriptions, licenses, and provenance notes can be stored in dedicated metadata tables. Queries are portable. SQL is a universal language; the queries themselves serve as documentation of what the data contains. A single .db file plus a metadata.yml file is a complete, self-contained, explorable dataset.\nExtending with plugins #Datasette has a rich plugin ecosystem for specialized use cases:\ndatasette-cluster-map: Automatically display rows with latitude/longitude columns on an interactive map. datasette-vega: Add Vega-Lite chart visualizations to any query result. datasette-export-notebook: Export query results as Jupyter notebooks. datasette-graphql: Expose the database as a GraphQL API in addition to REST. datasette-publish-fly: Deploy a datasette instance to Fly.io with a single command. Install and enable plugins to customize the view and controller layers without changing the underlying data model:\ndatasette install datasette-vega datasette serve research.db --metadata metadata.yml Now every query result page includes an option to render the result as a bar chart, line chart, or scatter plot, directly in the browser.\nDeployment options for sharing #Datasette can be deployed in multiple ways, making it practical for both local exploration and public data sharing:\n# Local exploration (default) datasette serve research.db # Static JSON export (no server needed, hostable on GitHub Pages) datasette publish github-pages research.db --metadata metadata.yml # Containerized deployment datasette package research.db --metadata metadata.yml -t my-dataset:latest docker run -p 8001:8001 my-dataset:latest # Cloud deployment datasette publish fly research.db --app my-study-data --metadata metadata.yml The static export option is particularly interesting for research: it generates a set of static JSON files that can be hosted on any web server, including GitHub Pages. 
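Whether served live or exported statically, results are addressed by URL: Datasette encodes the SQL in the query string, so such links can also be constructed programmatically. A minimal Python sketch (the host, port, and the `research` database name are assumptions carried over from this walkthrough; port 8001 is datasette's default):

```python
from urllib.parse import urlencode

# Sketch: build a shareable Datasette query URL.
# BASE is an assumption: a local `datasette serve research.db` on port 8001.
BASE = "http://127.0.0.1:8001/research"

def query_url(sql: str, as_json: bool = False) -> str:
    """Return the HTML results-page URL, or the JSON API URL if as_json is True."""
    suffix = ".json" if as_json else ""
    return f"{BASE}{suffix}?{urlencode({'sql': sql})}"

# The per-participant query from the examples above, as a machine-readable link:
url = query_url(
    "SELECT session, score, timestamp FROM measurements "
    "WHERE participant_id = 'P001' ORDER BY session",
    as_json=True,
)
print(url)
```

Pasting the printed URL into a browser (or fetching it from a script) runs the query; dropping `as_json` yields the human-readable results page instead.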
The dataset becomes a permanent, citable, explorable web resource with no running server required.\nSummary #Datasette demonstrates that making data actionable does not require building a custom web application. By storing data in SQLite (a self-contained, self-describing format) and serving it with datasette (an off-the-shelf exploration tool), you get:\nImmediate explorability: no code required to start querying. Shareable interactions: every query is a URL. Programmatic access: JSON API for scripted workflows. Self-containment: one file carries data, schema, and metadata. Extensibility: plugins add visualizations and specialized views. The MVC decomposition \u0026ndash; SQLite as model, datasette UI as view, SQL queries and plugins as controller \u0026ndash; is a pattern that can be applied to any tabular research dataset to make it instantly actionable.\n","date":"19 February 2026","permalink":"https://stamped-principles.github.io/stamped-examples/examples/datasette-mvc-actionable/","section":"Examples","summary":"Shows how datasette turns static datasets into interactive, queryable web applications following an MVC-like pattern.","title":"Datasette and MVC Pattern for Actionable Data Exploration"},{"content":"The pattern #When a user encounters a bug or unexpected behavior in a command-line tool, one of the most effective responses is to write a minimal shell script that reproduces the problem from scratch. The script creates a temporary directory, sets up just enough state (repositories, files, configuration) to trigger the issue, runs the offending commands, and exits. 
The temporary directory can then be inspected — or simply thrown away.\nThis pattern is ubiquitous in the git, git-annex, and DataLad communities.\nAnatomy of a reproducer script #A bare-bones example # script#!/bin/sh # Reproducer for \u0026#34;file not found\u0026#34; — a filename typo # -- setup shell environment -- set -eux PS4=\u0026#39;\u0026gt; \u0026#39; cd \u0026#34;$(mktemp -d \u0026#34;${TMPDIR:-/tmp}/repro-XXXXXXX\u0026#34;)\u0026#34; # -- setup your case -- touch preciouss.dat # -- collect extra information -- ls # -- trigger -- test -e precious.dat The script fails — set -e aborts as soon as test -e precious.dat returns non-zero. The trace (set -x) already tells the full story:\noutput \u0026gt; cd /tmp/repro-m6CM6OZ \u0026gt; touch preciouss.dat \u0026gt; ls preciouss.dat \u0026gt; test -e precious.dat No extra diagnostics needed, although you would know where to look (/tmp/repro-m6CM6OZ) — the ls output and the failing test make the typo obvious. This is maximally portable (POSIX sh + coreutils) and self-contained.\nKey elements #1. Shebang: #!/bin/sh (prefer POSIX)\nUse #!/bin/sh for maximum portability. Only reach for #!/bin/bash when you genuinely need bash-specific features (arrays, [[ ]], process substitution).\n2. Strict mode and tracing: set -eux and PS4='\u0026gt; '\n-e — exit immediately on any non-zero return. If the setup steps fail, there is no point continuing to the \u0026ldquo;trigger\u0026rdquo; phase. -u — treat unset variables as errors. Catches typos and missing configuration. -x — print every command before it executes — invaluable when sharing the script with someone who needs to see exactly what happened. Always pair -x with an explicit PS4 assignment:\nsnippetset -eux PS4=\u0026#39;\u0026gt; \u0026#39;as PS4 controls the prefix printed before each traced command (the default is + ). 
Setting it explicitly serves two purposes beyond readability:\nReproducibility — the output is identical regardless of what the user\u0026rsquo;s shell profile sets PS4 to, making traces diffable across environments. Portability — some systems define PS4 with shell-specific expansions (timestamps, function names) that can cause errors or garbled output when the script runs under a different shell. A simple literal value avoids this entirely. If the script is later invoked externally as bash -x script.sh, having PS4 defined inside the script ensures consistent output regardless of how it was launched.\n3. Ephemeral workspace: mktemp -d\nsnippetcd \u0026#34;$(mktemp -d \u0026#34;${TMPDIR:-/tmp}/dl-XXXXXXX\u0026#34;)\u0026#34;This is the core of ephemerality: every run starts in a brand-new, empty directory. Using mktemp rather than a hardcoded path like cd /tmp/mytest is also a security measure — on shared systems, a predictable path under /tmp is vulnerable to symlink attacks where another user pre-creates a symlink pointing to a victim location. mktemp generates an unpredictable name atomically.\nThe ${TMPDIR:-/tmp} fallback respects system conventions across Linux and macOS. The prefix (dl-, gx-, ann-) identifies which tool the script tests, making it easy to find (or clean up) leftover directories.\nNo trap ... EXIT cleanup is usually needed — /tmp is cleaned by the OS, you often want to inspect the result after a failure, and the set -x trace already shows the initial cd path.\n4. Self-contained setup\nThe script creates everything it needs from scratch — touch, mkdir, echo content \u0026gt; file. It does not depend on pre-existing files on the user\u0026rsquo;s machine. This makes the script self-contained — anyone with POSIX sh can run it.\n5. Tracked externals\nWhen a reproducer must pull in external materials, that is fine — git clone, docker pull, wget are all normal. 
The key is to reference exact, immutable identifiers so the script stays tracked:\ngit — pin to a commit hash or tag, not a branch: git clone --branch v1.2.3 https://github.com/org/repo containers — pin by digest, not a mutable tag: docker pull alpine@sha256:a8560b36e8... URLs — use version-pinned URLs or archived snapshots (e.g., Wayback Machine links) rather than a \u0026ldquo;latest\u0026rdquo; URL that may change or vanish. The script does not need to contain every byte — it needs to point to an exact, reproducible state of every dependency.\nSTAMPED analysis # Property How the pattern embodies it Self-contained Everything needed is created inline — no external state required beyond the tool under test Tracked The script is the record: copy-pasteable into an issue, attachable to a commit Actionable Running the script is the reproduction — it is an executable specification of the bug, not a prose description Portable POSIX sh + mktemp + ${TMPDIR:-/tmp} works across Linux and macOS; explicit PS4 avoids shell-specific trace behavior; no hardcoded paths Ephemeral Each run operates in a fresh temp directory; the entire workspace can be discarded after inspection From reproducer to test case #A reproducer script is often the first draft of a regression test. The progression is natural:\nBug report — paste the script into a GitHub issue. Anyone can run it. Bisection driver — wrap the script\u0026rsquo;s exit code in git bisect run to find the introducing commit. Red/green test — translate the shell commands into the project\u0026rsquo;s test framework (e.g., pytest). The setup phase becomes a fixture, the trigger becomes the test body, and the inspection becomes an assertion. 
This progression from throwaway script to permanent test case mirrors the Red/Green cycle of TDD: the reproducer is the \u0026ldquo;red\u0026rdquo; test that fails, the fix makes it \u0026ldquo;green\u0026rdquo;, and the test prevents regressions.\nPractical guidelines # Name scripts after issue numbers: bug-3686.sh, gh-6296.sh, annex-4369.sh. When you return months later, the filename links directly to the discussion.\nUse a descriptive prefix in mktemp: dl- for DataLad, gx- for git-annex, ann- for general annex tests. This makes orphaned temp directories identifiable.\nAlways set PS4: Even if you omit set -x from the script itself, setting PS4='\u0026gt; ' ensures consistent trace output when someone runs bash -x script.sh externally.\nPrint version information early: git --version or python3 --version at the top helps recipients match your environment.\nDo not clean up on success: Leave the temp directory intact so you (or the recipient) can inspect the state. /tmp is cleaned on reboot.\nKeep scripts minimal: Every line that is not strictly necessary to trigger the bug is noise. Minimal scripts are easier to review, faster to bisect, and more likely to be turned into test cases.\nTest your own instructions: Before sharing a reproducer (e.g., in a GitHub issue), copy-paste the invocation instructions you gave the recipient and run them yourself on a different machine or in a fresh shell. 
This catches implicit assumptions — a forgotten dependency, a path that only exists on your system, or a missing chmod +x — before someone else hits them.\n","date":"19 February 2026","permalink":"https://stamped-principles.github.io/stamped-examples/examples/ephemeral-shell-reproducer/","section":"Examples","summary":"Distills a common practice among open-source developers: writing throwaway shell scripts that set up a fresh environment, reproduce a problem, and can be shared as actionable bug reports or starting points for test cases.","title":"Ephemeral Shell Scripts for Reproducing Issues"},{"content":"The problem: monolithic datasets resist reuse #Research projects rarely work with a single, self-contained blob of data. A typical neuroimaging study might use a standard brain atlas maintained by one group, a set of stimuli curated by another, and raw scanner output that is unique to the study. When all of this is dumped into a single repository, several problems emerge:\nNo independent versioning. The atlas is at version 2.3, but there is no record of that inside the monolithic repo \u0026ndash; just a snapshot of files. When the atlas releases version 2.4 you cannot cleanly upgrade. No reuse across projects. A colleague running a different study that needs the same atlas cannot pull it from your project without manually copying files. Two copies now drift independently. Bloated history. Every project that embeds the atlas carries a full copy of its history (or, worse, no history at all). Cloning becomes slow and storage costs multiply. The root cause is that a flat directory tree conflates composition (assembling components into a project) with ownership (maintaining each component).\nThe solution: git submodules separate composition from ownership #Git submodules let you nest one Git repository inside another. 
The parent repository records which child repository to include and at which commit, but the child retains its own .git directory, its own history, and its own remote. This is exactly the separation we need:\nThe parent (your research project) controls the composition. Each child (atlas, stimuli, raw data) controls its own content and version history. This maps directly to the YODA principle that the inputs/ directory of a dataset should contain independently versioned subdatasets rather than loose copies of external data.\nStep-by-step walkthrough #1. Create the parent project #Start with a fresh repository that will serve as the top-level research project:\nmkdir my-study \u0026amp;\u0026amp; cd my-study git init Create the YODA-style directory skeleton:\nmkdir -p code inputs outputs Add a minimal README and commit:\ncat \u0026gt; README.md \u0026lt;\u0026lt; \u0026#39;EOF\u0026#39; # My Study Research project following YODA conventions. - `code/` -- analysis scripts (tracked directly) - `inputs/` -- input datasets (git submodules) - `outputs/` -- results (ephemeral, regenerable) EOF git add README.md code inputs outputs git commit -m \u0026#34;Initialize project skeleton\u0026#34; 2. Add an external dataset as a submodule #Suppose the brain atlas lives in its own repository on GitHub. Add it as a submodule under inputs/:\ngit submodule add https://github.com/example-org/brain-atlas.git inputs/brain-atlas Git does three things here:\nClones brain-atlas into inputs/brain-atlas/. Creates (or updates) a .gitmodules file at the project root recording the URL and local path. Stages a special \u0026ldquo;gitlink\u0026rdquo; entry that records the exact commit SHA of the submodule. 
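Because .gitmodules is plain INI-style text, the URL-and-path mapping it records can be read without any Git tooling. A minimal sketch using Python's stdlib configparser (the file content below is the entry this walkthrough creates):

```python
import configparser

# Sketch: parse a .gitmodules file with the stdlib INI parser.
# The content mirrors the submodule entry added in this walkthrough.
GITMODULES = '''
[submodule "inputs/brain-atlas"]
    path = inputs/brain-atlas
    url = https://github.com/example-org/brain-atlas.git
'''

parser = configparser.ConfigParser()
parser.read_string(GITMODULES)

for section in parser.sections():
    # Section headers look like: submodule "inputs/brain-atlas"
    name = section.split('"')[1]
    print(name, "->", parser[section]["url"])
```

This is enough to enumerate a project's declared inputs programmatically, e.g. to check that every submodule URL is reachable or pinned.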
Inspect what changed:\ngit status # On branch main # Changes to be committed: # new file: .gitmodules # new file: inputs/brain-atlas The .gitmodules file looks like this:\n[submodule \u0026#34;inputs/brain-atlas\u0026#34;] path = inputs/brain-atlas url = https://github.com/example-org/brain-atlas.git Commit the addition:\ngit commit -m \u0026#34;Add brain-atlas v2.3 as input submodule\u0026#34; 3. Add a second submodule #Add a stimulus set the same way:\ngit submodule add https://github.com/example-org/visual-stimuli.git inputs/visual-stimuli git commit -m \u0026#34;Add visual-stimuli as input submodule\u0026#34; 4. Resulting directory structure #After these steps, the project looks like this:\nmy-study/ .git/ .gitmodules README.md code/ analyze.py # tracked directly in the parent inputs/ brain-atlas/ # submodule -\u0026gt; github.com/example-org/brain-atlas @ abc1234 atlas.nii.gz labels.tsv README.md visual-stimuli/ # submodule -\u0026gt; github.com/example-org/visual-stimuli @ def5678 stim_001.png stim_002.png metadata.json outputs/ (empty, will hold results) The key insight: inputs/brain-atlas/ is a complete Git repository with its own history. You can cd inputs/brain-atlas \u0026amp;\u0026amp; git log to see the atlas\u0026rsquo;s full commit history, completely independent of the parent project.\n5. Cloning a project that uses submodules #When a collaborator clones your project, submodule directories will exist but will be empty by default. They need one extra step:\ngit clone https://github.com/you/my-study.git cd my-study git submodule update --init Or, to do both in one command:\ngit clone --recurse-submodules https://github.com/you/my-study.git This fetches the parent and then checks out each submodule at the exact commit recorded by the parent.\n6. Updating a submodule to a newer version #When the atlas releases version 2.4, you can update the submodule pointer:\ncd inputs/brain-atlas git fetch git checkout v2.4 # or: git pull origin main cd ../.. 
git add inputs/brain-atlas git commit -m \u0026#34;Update brain-atlas to v2.4\u0026#34; The parent now records the new commit SHA. Anyone who runs git submodule update will get the updated atlas. The old version is still accessible via the parent\u0026rsquo;s history \u0026ndash; just check out the previous parent commit and run git submodule update again.\nConnection to YODA principles #The YODA layout convention places input data under inputs/ and analysis code under code/. Git submodules implement the Modularity principle for the inputs/ directory:\nYODA directory Tracked how? Why? code/ Directly in the parent repo Code is authored by the project team inputs/ As submodules (or subdatasets) Input data is maintained by external parties outputs/ Ignored or ephemeral Results are regenerable from code + inputs This separation means you can:\nPin your analysis to a specific version of each input. Upgrade an input independently without touching code or other inputs. Share an input dataset across projects without copying it. Credit the maintainers of each input by pointing to their repository. Limitations and when to prefer DataLad subdatasets #Git submodules are a built-in Git feature and require no additional tools, which makes them a good starting point. However, they have limitations that become significant at scale:\nNo large-file support. Git submodules do not change how Git handles file content. If your atlas contains large binary files, each clone downloads the full history of those files. Git-annex or Git LFS is needed to manage large data efficiently.\nManual management. Adding, updating, and removing submodules requires several commands and careful attention to .gitmodules and .git/config. It is easy to leave a project in an inconsistent state.\nNo partial fetch. You cannot easily fetch only a subset of a submodule\u0026rsquo;s files. For large datasets where you only need a slice, this is wasteful.\nNo recursive save/push. 
Each submodule must be committed and pushed independently, bottom-up. In a deeply nested hierarchy this becomes tedious.\nDataLad subdatasets build on Git submodules but solve these problems by integrating git-annex for large-file management and providing commands like datalad save (recursive commit across all nesting levels), datalad get (on-demand file retrieval), and datalad push (recursive push). If your project involves large files or deep nesting, DataLad subdatasets are the natural next step from plain Git submodules.\nSummary # Aspect Plain Git submodule DataLad subdataset Tooling required Git only Git + DataLad (+ git-annex) Large file handling None (full clone) git-annex (on-demand fetch) Recursive operations Manual per submodule datalad save, datalad push Metadata integration .gitmodules only .datalad/config, structured metadata Best for Small/medium text-heavy repos Any size, especially large data Start with Git submodules if your data is small and your nesting is shallow. Graduate to DataLad subdatasets when scale or convenience demands it. Either way, the underlying principle is the same: compose your project from independently versioned, reusable components.\n","date":"19 February 2026","permalink":"https://stamped-principles.github.io/stamped-examples/examples/git-submodules-modularity/","section":"Examples","summary":"Demonstrates how git submodules enable independent versioning and composition of dataset components.","title":"Git Submodules for Modular Dataset Composition"},{"content":"The Interoperable principle ensures that data can be integrated with other data and can work with applications and workflows for analysis, storage, and processing. This means:\nUsing a formal, accessible, shared, and broadly applicable language for knowledge representation. Using vocabularies that follow FAIR principles themselves. Including qualified references to other datasets and metadata. 
In STAMPED terms, interoperability is promoted by adopting standard file formats, using community-accepted metadata schemas, structuring data so that it can be combined across studies, and maintaining explicit links between datasets and their dependencies. Version control and provenance tracking make these cross-references robust and verifiable.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/fair_principles/i/","section":"FAIR Principles","summary":"","title":"Interoperable"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/mvc/","section":"Tags","summary":"","title":"Mvc"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/provenance/","section":"Tags","summary":"","title":"Provenance"},{"content":"The problem: provenance as dead documentation #Many systems record provenance \u0026ndash; the chain of steps that produced a result. Workflow management tools generate DAGs, lab notebooks describe procedures, README files list the commands that were run. But in most cases these records are inert: they describe what happened, but they cannot make it happen again.\nConsider a Git commit message that says:\nNormalize and filter raw survey responses Ran: python code/clean.py --threshold 0.05 inputs/raw/survey.csv outputs/clean/survey.csv This is useful documentation, but it is just text. To re-execute the step, a human must read the message, extract the command, check that the files are in place, and run it manually. 
If the message has a typo, or the file paths have changed, or the script has been updated, the re-execution will silently produce different results \u0026ndash; or fail entirely.\nThe gap between \u0026ldquo;recorded provenance\u0026rdquo; and \u0026ldquo;executable provenance\u0026rdquo; is the gap between documentation and actionability.\nThe solution: datalad rerun turns records into actions #When a computation is recorded with datalad run (see the companion example), the resulting commit contains a machine-readable run record \u0026ndash; a JSON object specifying the exact command, inputs, and outputs. datalad rerun reads this record and re-executes the command automatically:\ndatalad rerun \u0026lt;commit-hash\u0026gt; That single command does all of the following:\nParses the run record from the specified commit. Gets the declared input files (fetching from a remote annex if necessary). Unlocks the declared output files so they can be overwritten. Executes the recorded command string. Saves the result as a new commit, linking back to the original run record. No manual extraction of commands from commit messages. No guessing about file paths or flags. The provenance record is the execution plan.\nConcrete example: verification workflow #Suppose your dataset has a commit from datalad run that generated statistical results:\ngit log --oneline # a1b2c3d (HEAD -\u0026gt; main) [DATALAD RUNCMD] Compute group statistics # f6e5d4c Add preprocessed data # 9a8b7c6 Initial commit You want to verify that the results are reproducible. 
First, inspect the run record to understand what was recorded:\ngit log -1 --format=%B a1b2c3d Output:\n[DATALAD RUNCMD] Compute group statistics === Do not change lines below === { \u0026#34;cmd\u0026#34;: \u0026#34;python code/analyze.py outputs/preprocessed/ outputs/statistics/results.json\u0026#34;, \u0026#34;inputs\u0026#34;: [\u0026#34;outputs/preprocessed/\u0026#34;], \u0026#34;outputs\u0026#34;: [\u0026#34;outputs/statistics/results.json\u0026#34;], \u0026#34;exit\u0026#34;: 0, \u0026#34;pwd\u0026#34;: \u0026#34;.\u0026#34; } ^^^ Do not change lines above ^^^ Now re-execute:\ndatalad rerun a1b2c3d DataLad fetches the inputs (if needed), runs the exact same command, and commits the result. If the output is identical to what was there before, you have confirmed reproducibility. If it differs, the git diff will show you exactly what changed, pointing to non-determinism in the computation or a change in the software environment.\nChecking for differences #After the rerun, compare the current output to the original:\n# Did the rerun produce identical files? git diff HEAD~1 -- outputs/statistics/results.json If the diff is empty, the computation is reproducible. If not, you have a concrete starting point for investigation: the exact same command, on the exact same inputs, produced different outputs. That narrows the problem to the software environment (library versions, random seeds, floating-point ordering, etc.).\nConcrete example: updating workflow #A second powerful use of datalad rerun is propagating changes through a pipeline. Suppose the raw input data is corrected (a data entry error is fixed). 
You want to regenerate all downstream results:\ngit log --oneline # b2c3d4e (HEAD -\u0026gt; main) [DATALAD RUNCMD] Generate figures # a1b2c3d [DATALAD RUNCMD] Compute group statistics # f6e5d4c [DATALAD RUNCMD] Preprocess raw data # 9a8b7c6 Fix data entry error in raw/survey.csv # 1234567 Add raw data To re-execute the full pipeline from the preprocessing step onward:\ndatalad rerun --since 9a8b7c6 The --since flag tells DataLad to re-execute every datalad run commit after the specified commit. It will:\nRerun the preprocessing step (commit f6e5d4c). Rerun the statistics computation (commit a1b2c3d). Rerun the figure generation (commit b2c3d4e). Each step uses the (now corrected) outputs of the previous step as its inputs. The entire pipeline is re-executed in order, and the results reflect the corrected raw data.\nThe difference between datalad run and datalad rerun #These two commands are complementary halves of the same workflow:\nAspect datalad run datalad rerun Purpose Record a new computation Re-execute a previously recorded computation Input A command typed by the user A commit hash (or range) Creates A new commit with a run record A new commit that re-executes an existing run record When to use First time you run a command Verification, updating, or re-execution Think of datalad run as recording and datalad rerun as playback. The recording captures the full specification; playback faithfully reproduces it.\nConnection to Actionability (A) #The STAMPED Actionability principle states that dataset operations should be executable, not just documented. datalad rerun is the mechanism that makes this principle concrete:\nA commit message that says \u0026ldquo;we ran script X\u0026rdquo; is documentation. A datalad run commit that contains a structured run record is actionable documentation. datalad rerun is the action \u0026ndash; it reads the documentation and executes it. 
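The run record itself is ordinary JSON framed by fixed delimiter lines, so even outside DataLad it can be recovered mechanically from a commit message. A minimal sketch (the message text mirrors the example record shown earlier):

```python
import json

# Sketch: extract the machine-readable run record from a
# [DATALAD RUNCMD] commit message, using the delimiter lines
# shown in the example record above.
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"

def parse_run_record(message: str) -> dict:
    body = message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)

message = """[DATALAD RUNCMD] Compute group statistics

=== Do not change lines below ===
{
 "cmd": "python code/analyze.py outputs/preprocessed/ outputs/statistics/results.json",
 "inputs": ["outputs/preprocessed/"],
 "outputs": ["outputs/statistics/results.json"],
 "exit": 0,
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = parse_run_record(message)
print(record["cmd"])
```

datalad rerun does this parsing (plus input retrieval, output unlocking, and committing) for you; the sketch only illustrates why the record is actionable rather than mere prose.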
Without datalad rerun, run records would be valuable metadata but still require manual interpretation. With it, the entire provenance chain becomes a push-button operation.\nConnection to Ephemerality (E) #The Ephemerality principle states that derived and regenerable content should be treated as ephemeral. datalad rerun is what makes this practical:\nIf every derived file was produced by a datalad run commit, then every derived file can be regenerated by datalad rerun. This means derived files do not need to be permanently stored \u0026ndash; they can be dropped from local storage (using datalad drop) and regenerated on demand. The repository stays lean: it stores the recipes (run records) rather than the products (large derived files). The combination of datalad run (recording), datalad rerun (re-execution), and datalad drop (reclaiming space) forms a complete lifecycle for ephemeral data:\nrun drop rerun raw --\u0026gt; derived (committed) --\u0026gt; pointer only -------\u0026gt; derived (regenerated) [recipe recorded] [space reclaimed] [recipe re-executed] Practical considerations #Software environment matters #datalad rerun re-executes the command, but it does not recreate the software environment. If you ran the original command with Python 3.10 and scikit-learn 1.2, but your current environment has Python 3.12 and scikit-learn 1.4, the results may differ.\nFor full reproducibility, combine datalad run with environment capture:\nContainer images: Use datalad containers-run to execute commands inside a Docker or Singularity container. The container image is recorded in the run record alongside the command. Lock files: Track requirements.txt or conda-lock.yml in the repository so the exact package versions are part of the dataset\u0026rsquo;s version history. Rerunning a single step vs. 
a range ## Re-execute a single recorded step datalad rerun a1b2c3d # Re-execute all recorded steps after a given commit datalad rerun --since 9a8b7c6 # Re-execute all recorded steps in the entire history datalad rerun --since \u0026#34;\u0026#34; Handling failures #If a rerun fails (non-zero exit code), DataLad will not commit the broken output. The working tree will contain the partial results, and you can inspect what went wrong before deciding how to proceed.\nCombining with --script #You can extract the commands from a range of run records into a shell script without executing them:\ndatalad rerun --since 9a8b7c6 --script pipeline.sh This produces a standalone script containing the exact commands in order. It is useful for review, for running on a cluster, or for porting to a system where DataLad is not installed.\nSummary #datalad rerun closes the loop between provenance and action. When every data transformation is recorded with datalad run, the dataset\u0026rsquo;s history is not just a log of what happened \u0026ndash; it is a complete, re-executable specification of how to produce the current state from the original inputs. This turns provenance from passive metadata into an active tool for verification, updating, and space management.\n","date":"19 February 2026","permalink":"https://stamped-principles.github.io/stamped-examples/examples/datalad-rerun-actionability/","section":"Examples","summary":"Demonstrates how datalad rerun enables re-execution of previously recorded datalad run commands, turning provenance records into actionable recipes.","title":"Re-executing Computations with datalad rerun"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/re-execution/","section":"Tags","summary":"","title":"Re-Execution"},{"content":"The problem: undocumented transformations #A depressingly common scenario in data-intensive research looks like this:\nA researcher downloads raw data into data/raw/. 
They write a Python script that cleans and transforms it. They run the script from the command line, perhaps tweaking flags along the way. The cleaned output lands in data/processed/. Six months later, a reviewer asks: \u0026ldquo;How exactly was the processed data generated?\u0026rdquo; The researcher checks their notes. The script is there, but which version was actually run? What arguments were passed? Were the raw inputs the same files that are in the repository now, or were they updated since then? The answers are not recorded anywhere because the transformation was run \u0026ldquo;by hand\u0026rdquo; \u0026ndash; outside the version control system.\nThis is a provenance gap: the data is versioned, but the process that produced it is not.\nThe solution: datalad run wraps commands with provenance #DataLad\u0026rsquo;s run command solves this by acting as a thin wrapper around any shell command. Instead of running your script directly, you run it through datalad run, which:\nRecords the exact command string. Records which files were used as inputs. Records which files were produced as outputs. Executes the command. Saves the result as a Git commit with a structured, machine-readable run record. The result is a commit that is not merely \u0026ldquo;some files changed\u0026rdquo; but a complete, re-executable recipe: what was run, on what, producing what.\nConcrete example: converting DICOM to NIfTI #Suppose you have a neuroimaging dataset following YODA conventions:\nmy-study/ code/ convert.py # DICOM-to-NIfTI conversion script inputs/ dicoms/ # raw DICOM files (submodule or subdataset) sub-01/ sub-02/ outputs/ nifti/ # will hold converted NIfTI files Running without provenance (the old way) #python code/convert.py inputs/dicoms/ outputs/nifti/ git add outputs/nifti/ git commit -m \u0026#34;Convert DICOM to NIfTI\u0026#34; This records that something changed, but not how. 
The commit message is prose, not a machine-readable recipe.\nRunning with datalad run (the provenance-aware way) #datalad run \u0026#34;Convert DICOM to NIfTI\u0026#34; \\ -i \u0026#34;inputs/dicoms/\u0026#34; \\ -o \u0026#34;outputs/nifti/\u0026#34; \\ \u0026#34;python code/convert.py inputs/dicoms/ outputs/nifti/\u0026#34; Let us break down the flags:\nFlag Purpose -m \u0026quot;Convert DICOM to NIfTI\u0026quot; Human-readable commit message -i \u0026quot;inputs/dicoms/\u0026quot; Declare input files \u0026ndash; DataLad will get them if not yet available -o \u0026quot;outputs/nifti/\u0026quot; Declare output files \u0026ndash; DataLad will unlock them before the command runs and save them afterward \u0026quot;python code/convert.py ...\u0026quot; The actual command to execute DataLad executes the command, then creates a commit that bundles two things: the usual file changes and a machine-readable run record.\nAnatomy of the resulting commit #After datalad run finishes, git log -1 shows something like:\ncommit a1b2c3d4e5f6... Author: Jane Researcher \u0026lt;jane@university.edu\u0026gt; Date: Thu Feb 19 14:30:00 2026 +0000 [DATALAD RUNCMD] Convert DICOM to NIfTI === Do not change lines below === { \u0026#34;chain\u0026#34;: [], \u0026#34;cmd\u0026#34;: \u0026#34;python code/convert.py inputs/dicoms/ outputs/nifti/\u0026#34;, \u0026#34;dsid\u0026#34;: \u0026#34;abcd1234-5678-...\u0026#34;, \u0026#34;exit\u0026#34;: 0, \u0026#34;extra_inputs\u0026#34;: [], \u0026#34;inputs\u0026#34;: [\u0026#34;inputs/dicoms/\u0026#34;], \u0026#34;outputs\u0026#34;: [\u0026#34;outputs/nifti/\u0026#34;], \u0026#34;pwd\u0026#34;: \u0026#34;.\u0026#34; } ^^^ Do not change lines above ^^^ The block between the delimiters is a JSON run record embedded directly in the commit message. 
It contains:\nField Meaning cmd The exact command string that was executed inputs Files/directories the command read from outputs Files/directories the command wrote to exit The exit code of the command (0 = success) pwd The working directory relative to the dataset root dsid The unique identifier of the dataset This run record is what makes the commit actionable rather than merely informative. It is not just a note saying \u0026ldquo;files were converted\u0026rdquo; \u0026ndash; it is a complete recipe that can be re-executed mechanically.\nHow datalad run satisfies STAMPED principles #Tracking (T) #The run record captures the full provenance chain:\nWhat was run: the exact command string. On what inputs: the declared input files. Producing what outputs: the declared output files. With what result: the exit code. When: the commit timestamp. By whom: the commit author. This is far richer than a manual git commit -m \u0026quot;processed data\u0026quot;.\nActionability (A) #Because the run record is structured and machine-readable, it is not just documentation \u0026ndash; it is an executable specification. DataLad can parse the run record from any commit and re-execute the exact command using datalad rerun (covered in a separate example).\nHandling large files transparently #When DataLad is used with git-annex (its default configuration for binary files), datalad run integrates seamlessly:\nBefore execution: -i flags trigger datalad get to ensure input files are present locally. If they are stored in a remote annex, they are fetched on demand. Before execution: -o flags trigger datalad unlock on output files so they can be overwritten. After execution: DataLad adds the new/changed output files to git-annex and commits them. This means datalad run works correctly even when your dataset is partially cloned and most files exist only as lightweight annex pointers.\nA more complex example: multi-step pipeline #Real analyses often involve several steps. 
Each step can be a separate datalad run invocation, building a chain of provenance:\n# Step 1: Preprocess raw data datalad run \\ -m \u0026#34;Preprocess: normalize and filter\u0026#34; \\ -i \u0026#34;inputs/raw/*.csv\u0026#34; \\ -o \u0026#34;outputs/preprocessed/*.csv\u0026#34; \\ \u0026#34;python code/preprocess.py inputs/raw/ outputs/preprocessed/\u0026#34; # Step 2: Run statistical analysis datalad run \\ -m \u0026#34;Analyze: compute group statistics\u0026#34; \\ -i \u0026#34;outputs/preprocessed/*.csv\u0026#34; \\ -o \u0026#34;outputs/statistics/results.json\u0026#34; \\ \u0026#34;python code/analyze.py outputs/preprocessed/ outputs/statistics/results.json\u0026#34; # Step 3: Generate figures datalad run \\ -m \u0026#34;Plot: generate publication figures\u0026#34; \\ -i \u0026#34;outputs/statistics/results.json\u0026#34; \\ -o \u0026#34;outputs/figures/*.pdf\u0026#34; \\ \u0026#34;python code/plot.py outputs/statistics/results.json outputs/figures/\u0026#34; Each step produces a commit with a run record. The full pipeline is captured in the Git history as a sequence of machine-readable, re-executable steps. A colleague can read the git log to understand not just what files exist, but the exact chain of computations that produced them.\nPractical tips #Use glob patterns in -i and -o #DataLad supports glob patterns for input and output specifications. This is useful when you do not know the exact filenames in advance:\ndatalad run -i \u0026#34;inputs/scans/sub-*/*.dcm\u0026#34; -o \u0026#34;outputs/nifti/sub-*/*.nii.gz\u0026#34; ... Keep the command self-contained #The command string should be self-contained: it should not rely on shell variables, aliases, or environment-specific paths. 
Anyone re-running the command on a different machine should get the same result (assuming the same software versions).\nGood:\ndatalad run -m \u0026#34;Convert\u0026#34; \u0026#34;python code/convert.py inputs/ outputs/\u0026#34; Avoid:\ndatalad run -m \u0026#34;Convert\u0026#34; \u0026#34;$MY_SCRIPT $INPUT_DIR $OUTPUT_DIR\u0026#34; Declare all inputs and outputs explicitly #It is tempting to omit -i and -o and let DataLad simply commit whatever changes. This works mechanically but loses the key provenance benefit: the run record will not list what the command consumed and produced. Always declare inputs and outputs explicitly for maximum provenance value.\nConnection to datalad rerun #A datalad run commit is useful on its own as documentation, but its real power is that it can be re-executed using datalad rerun. This enables verification (\u0026ldquo;do I get the same results?\u0026rdquo;) and updating (\u0026ldquo;I changed an input; rerun the pipeline\u0026rdquo;). See the datalad rerun example for a detailed walkthrough of re-execution workflows.\n","date":"19 February 2026","permalink":"https://stamped-principles.github.io/stamped-examples/examples/datalad-run-provenance/","section":"Examples","summary":"Shows how datalad run wraps arbitrary commands to record inputs, outputs, and the exact command in machine-reexecutable form.","title":"Recording Computational Provenance with datalad 
run"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/reproducer/","section":"Tags","summary":"","title":"Reproducer"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/reproducibility/","section":"Tags","summary":"","title":"Reproducibility"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/sqlite/","section":"Tags","summary":"","title":"Sqlite"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/submodules/","section":"Tags","summary":"","title":"Submodules"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/tags/testing/","section":"Tags","summary":"","title":"Testing"},{"content":"The Tool instrumentation level involves adopting specific software that implements one or more data management principles. These are typically single-purpose utilities that address particular needs:\ngit \u0026ndash; Version control for code and small files, providing history tracking and collaboration. git-annex \u0026ndash; Large file management integrated with git, enabling tracking of files without storing them directly in the repository. DataLad \u0026ndash; A data management tool built on git and git-annex that adds dataset nesting, provenance capture, and streamlined access to remote data. containers (Docker, Singularity/Apptainer) \u0026ndash; Computational environment encapsulation for reproducibility. make, doit, snakemake \u0026ndash; Build systems and task runners for automating data processing steps. Adopting individual tools is a natural next step after establishing good data organization. 
Each tool addresses a specific gap \u0026ndash; version control, large file handling, environment reproducibility \u0026ndash; and can be introduced incrementally.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/instrumentation_levels/tool/","section":"Instrumentation Levels","summary":"","title":"Tool"},{"content":"Companion to the STAMPED paper #This site is the companion resource to the STAMPED paper (\u0026ldquo;STAMPED: Properties of a Reproducible Research Object\u0026rdquo;). While the paper develops the framework and evaluates existing tools against it, this site provides the practical side: concrete, runnable examples that show how the properties look when applied to real data management tasks.\nSTAMPED and YODA #The YODA principles (YODA\u0026rsquo;s Organigram on Data Analysis) established foundational conventions for structuring DataLad datasets. STAMPED extends and generalizes those ideas beyond any single tool:\nWhere YODA focuses on DataLad dataset organization, STAMPED formalizes properties of a research object that make YODA effective and expresses them in a tool-agnostic way. STAMPED adds properties (such as Ephemerality and Distributability) that were implicit in YODA practice but not explicitly named. By decoupling the properties from a specific tool, STAMPED provides a vocabulary for evaluating any dataset management approach \u0026ndash; whether built on DataLad, DVC, Git LFS, Hugging Face Datasets, or plain Git with conventions. Multi-dimensional taxonomy #Examples on this site are organized along four independent dimensions:\nSTAMPED properties \u0026ndash; Self-contained, Tracked, Actionable, Modular, Portable, Ephemeral, and Distributable. FAIR mapping \u0026ndash; which of the FAIR goals (Findable, Accessible, Interoperable, Reusable) the practice supports. 
Instrumentation level \u0026ndash; ranging from conventions that require no special tooling, through lightweight tools, to full version-control workflows. Aspirational goals \u0026ndash; higher-level objectives such as reproducibility, rigor, transparency, and efficiency. An example may be tagged with multiple values in each dimension. A README convention, for instance, supports Self-containment, Findability, and Reusability while requiring no special instrumentation.\nRange of examples #Examples span a wide range of complexity:\nSimple conventions \u0026ndash; directory layouts, naming schemes, README templates that require nothing more than discipline. Lightweight practices \u0026ndash; using checksums, manifests, or small scripts to add provenance without heavy infrastructure. Tool-assisted workflows \u0026ndash; leveraging Git, Git-annex, DataLad, DVC, or similar tools for automated tracking and distribution. Advanced pipelines \u0026ndash; multi-step, multi-tool workflows that demonstrate several STAMPED principles working together. Contributing #We welcome contributions of new examples, corrections, and improvements. The source for this site lives at https://github.com/stamped-principles/stamped-examples. To contribute:\nFork the repository. Add or edit example pages under content/examples/. Tag your example with the appropriate STAMPED principles, FAIR mappings, instrumentation level, and aspirational goals in the front matter. Open a pull request with a brief description of what the example demonstrates. 
See the repository README for details on front-matter fields and the taxonomy values available.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/about/","section":"About","summary":"","title":"About"},{"content":"","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/categories/","section":"Categories","summary":"","title":"Categories"},{"content":"Data Organization is the foundational instrumentation level. It encompasses the conventions, naming schemes, and directory structures that bring order to research data without requiring any specialized software.\nExamples include:\nConsistent directory hierarchies that separate raw data, processed data, code, and documentation. File naming conventions that encode meaningful metadata (subject IDs, dates, conditions). README files and data dictionaries that describe the contents and structure of a dataset. Separation of inputs and outputs to clarify data flow. These practices are universally applicable. Every researcher can adopt them immediately, and they form the necessary foundation upon which higher instrumentation levels (tools, workflows, patterns) are built. Even when using sophisticated tooling, a clear organizational scheme remains essential.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/instrumentation_levels/data-organization/","section":"Instrumentation Levels","summary":"","title":"Data Organization"},{"content":"The first FAIR principle states that data and metadata should be easy to find for both humans and machines. This requires:\nAssigning globally unique and persistent identifiers (e.g., DOIs) to data and metadata. Describing data with rich metadata. Registering or indexing data in searchable resources. Including the identifier in the metadata so the connection is explicit. 
In the context of STAMPED, findability is supported by practices such as structured directory layouts, consistent naming conventions, machine-readable metadata files, and the use of version control systems that provide stable references to specific states of a dataset.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/fair_principles/f/","section":"FAIR Principles","summary":"","title":"Findable"},{"content":"The Workflow instrumentation level combines multiple tools into coordinated, multi-step pipelines. Rather than using tools in isolation, workflows define how data flows through a sequence of processing, analysis, and publication steps.\nCharacteristics of workflow-level instrumentation:\nMultiple tools are chained together in a defined order. Inputs and outputs of each step are explicitly declared. Execution can be partially or fully automated. Provenance is captured across the entire pipeline, not just individual steps. Workflows can be re-run to reproduce results or updated when inputs change. Examples include using DataLad\u0026rsquo;s run and rerun commands to capture entire analysis pipelines, CI/CD systems that automatically validate data upon submission, and orchestration tools like Nextflow or Snakemake that manage complex dependency graphs across compute environments.\n","date":null,"permalink":"https://stamped-principles.github.io/stamped-examples/instrumentation_levels/workflow/","section":"Instrumentation Levels","summary":"","title":"Workflow"}]
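The workflow characteristics listed above \u0026ndash; explicitly declared inputs and outputs, provenance across steps, and selective re-execution when inputs change \u0026ndash; can be sketched in a few lines. This is a toy illustration, not any real tool's API: the Step class and steps_to_rerun helper are hypothetical, standing in for the dependency bookkeeping that tools like Snakemake or datalad rerun derive from declared input/output specifications.

```python
from dataclasses import dataclass

# Toy model of workflow-level bookkeeping. Each step declares its inputs
# and outputs, so a runner can decide which steps are stale when a file
# changes. Step and steps_to_rerun() are illustrative only -- they stand
# in for what Snakemake or `datalad rerun` compute from declared -i/-o
# specifications.

@dataclass
class Step:
    name: str
    inputs: set
    outputs: set

# The three-step pipeline from the datalad run example above.
pipeline = [
    Step('preprocess', {'inputs/raw'}, {'outputs/preprocessed'}),
    Step('analyze', {'outputs/preprocessed'}, {'outputs/statistics'}),
    Step('plot', {'outputs/statistics'}, {'outputs/figures'}),
]

def steps_to_rerun(pipeline, changed_path):
    '''Names of the steps that must re-run, in order, after
    `changed_path` changes. Assumes `pipeline` is in execution order.'''
    dirty = {changed_path}
    affected = []
    for step in pipeline:
        if step.inputs & dirty:      # step consumes something stale
            affected.append(step.name)
            dirty |= step.outputs    # so its outputs are stale too
    return affected

print(steps_to_rerun(pipeline, 'outputs/preprocessed'))  # -> ['analyze', 'plot']
```

Changing inputs/raw would mark all three steps stale, while changing a terminal output marks none; this selective re-execution is exactly what explicit input/output declarations make possible.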