From Files to Trusted Records: Securing Your Digital Collection's Future
Simple steps to prove authenticity and maintain integrity for long-term preservation
According to ISO 15489, the international standard for records management, authoritative records must demonstrate four key characteristics: reliability, integrity, usability, and authenticity. For anyone preserving digital collections with the USC Digital Repository and their newly operational Filecoin node – whether you're an archivist, researcher, or cultural heritage professional managing photographs, recordings, documents, or other digital artifacts – understanding these principles is essential.
This walkthrough offers concrete steps to strengthen the integrity and provenance of your materials. If you want to ensure your valuable digital assets remain authentic and verifiable for years to come, but aren't necessarily a cryptography expert, these straightforward practices will prepare your collection for proper long-term preservation.
At its simplest, this means hashing and signing your data before placing a copy of it at the Repository. Think of it as sealing a letter with a wax stamp: wherever the letter travels, its provenance travels with it, intact.
The Essentials: What We Need
The files themselves.
These are your original digital artifacts, whether photographs, recordings, documents, or any other data types. Your collection is only as trustworthy as its unaltered source, so it’s critical to start with the most original (raw, unmodified) files.

Signed hashes of the files.
Hashes act like digital fingerprints – they are unique cryptographic identifiers computed for each file. By signing these hashes, you attach a verifiable mark of authenticity and provenance to them: “this public key was in possession of this specific file.”
These authenticity markers serve three purposes:
Prove data integrity: Comparing hashes before and after transport confirms the file hasn't been altered—if the hashes match perfectly, you're looking at an exact copy of the original. This verification process ensures that every piece of your collection retains its evidentiary or cultural value.
Establish chain of custody: Digital signatures cryptographically link each hash to a specific key and its owner, creating an unbroken record of who possessed the file. This verifiable trail of ownership strengthens the legal and archival standing of your materials.
Enable long-term verification: Timestamped hashes can be anchored to immutable record systems like blockchains for enhanced legal robustness. This approach provides independent, third-party verification that can withstand scrutiny years or decades into the future.
By combining the raw files with their signed hashes, you can create a robust method for maintaining data integrity and establishing provenance. This is the key to ensuring that every piece of your collection is preserved exactly as it was originally produced, safeguarding its evidentiary or cultural value for years to come.
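As a first concrete step, here is what computing and re-checking a fingerprint might look like on the command line. This is a minimal sketch using the standard `sha256sum` tool; the file name is hypothetical, and the scratch directory exists only to keep the example self-contained.

```shell
# Work in a scratch directory so the example is self-contained.
workdir=$(mktemp -d)
cd "$workdir"

# A stand-in for one of your original files (hypothetical name).
printf 'field photo, roll 3, frame 12\n' > photo_012.tif

# Compute the SHA-256 fingerprint and record it alongside the file.
sha256sum photo_012.tif > photo_012.tif.sha256
cat photo_012.tif.sha256

# Later (e.g. after transport), re-check: prints "photo_012.tif: OK"
# only if the file is bit-for-bit identical to the original.
sha256sum -c photo_012.tif.sha256
```

The recorded `.sha256` file is exactly the "hash from t0" described later: keep it (and its signature) with the collection so anyone can repeat the check.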
Grouping & Organizing Your Data
Before diving into cryptographic operations, it’s vital to group and organize your data in a thoughtful, deliberate way. Performance considerations aside, the idea is to preserve the semantic relationships between files.
Grouping data also minimizes the number of cryptographic operations needed. Instead of computing and verifying hundreds or thousands of individual hashes, you can authenticate entire clusters of related files with a single hash and signature. This is particularly useful when dealing with large datasets that can be gigabytes of information.
Beyond reducing computational overhead, grouping also simplifies future verification: with clearly defined data clusters, one hash can authenticate an entire group, speeding up the verification process and making it easier to pinpoint any discrepancies.
To determine these groupings, consider the inherent value and context of your files:
Individual Units: For important singular files, such as field photographs or witness testimony recordings, it’s often best to treat them as standalone units. Each file’s individual authenticity can then be verified directly, ensuring high evidentiary value.
Semantic Clusters: Other data only gains meaning when viewed alongside its peers. For example, code dependencies, build files, or a directory of thumbnails only tell their full story when examined as a group. Grouping these files preserves their semantic proximity, ensuring that their collective integrity is maintained.
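A semantic cluster can be authenticated as a unit by writing one manifest for the group and then hashing the manifest itself. The sketch below uses only `sha256sum`; the directory and file names are hypothetical, and the single manifest hash is a flat stand-in for the Merkle-tree approach discussed next.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical cluster of related files that only make sense together.
mkdir thumbnails
printf 'thumb a\n' > thumbnails/a.jpg
printf 'thumb b\n' > thumbnails/b.jpg

# One manifest for the whole cluster, in a stable (sorted) order...
find thumbnails -type f | sort | xargs sha256sum > thumbnails.sha256
cat thumbnails.sha256

# ...then one fingerprint over the manifest itself. This single hash
# now stands in for the whole group: sign it once, verify it once.
sha256sum thumbnails.sha256
```

Sorting the file list matters: the group hash is only reproducible if the manifest is built in the same order every time.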
Choosing Your Approach: Individual Hashes vs. Merkle Trees
For smaller collections or when you need to verify individual files independently, traditional hashing tools like sha256sum work well – simply generate one hash per file and sign each one.
However, for larger datasets or collections with natural groupings, Merkle trees (using tools like merkdir) offer significant advantages: they create a single "root hash" that can verify an entire directory structure, dramatically reducing the number of signatures you need to manage while still allowing you to pinpoint exactly which files have changed.
Choose individual hashes when simplicity matters and file counts are manageable; choose Merkle trees when you're dealing with hundreds of files, complex folder structures, or want to authenticate entire project directories as unified collections.
In Practice: How It Might Work
Before transferring your collection, you will:
✅ Group & organize files meaningfully (e.g., individual vs. grouped datasets).
✅ Hash files using tools like sha256sum or merkdir (Merkle trees).
✅ Sign hashes using lightweight cryptographic tools (minisign, PGP, or x509 certs).
✅ Verify the integrity after transport—confirming that your hash from t0 (before transfer) matches the hash at t1 (post-transfer).
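The checklist above might be sketched end-to-end as follows. This example uses `openssl` for the signing step purely because it is almost universally installed; the guide's suggested tools (minisign, PGP, x509 certs) follow the same pattern, and all file names here are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# 1. Hash: a stand-in collection file and its manifest (t0).
printf 'interview recording\n' > interview.wav
sha256sum interview.wav > manifest.sha256

# 2. Sign: generate a keypair and sign the manifest.
#    (openssl stands in for minisign/PGP/x509 here.)
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out key.pem 2>/dev/null
openssl pkey -in key.pem -pubout -out pub.pem
openssl dgst -sha256 -sign key.pem -out manifest.sig manifest.sha256

# 3. Verify (t1): anyone holding pub.pem can check the signature
#    ("Verified OK") and, via the manifest, every file's integrity.
openssl dgst -sha256 -verify pub.pem -signature manifest.sig manifest.sha256
sha256sum -c manifest.sha256
```

Ship the files, the manifest, and the signature together; the public key is what the recipient uses to tie the whole bundle back to you.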
This simple but powerful process keeps history intact. It ensures photographs, recordings, or other digital assets retain their evidentiary value, even years from now.
To get started 🚀, check out this GitHub Gist from our software engineer Cole Capilongo, which outlines the use of the hashing and signing tools above. We’d also be happy to work hand-in-hand with you to support your project. Just reach out to <info@starlinglab.org>.
As we work together to safeguard truth in the digital age, your data’s integrity matters. Let’s make sure it’s preserved—securely and verifiably.
Limitations: What This Guide Doesn’t Cover
While this guide lays out the essentials for ensuring data integrity and establishing provenance through hashing and signing, it doesn’t cover every aspect of digital preservation. In particular, it stops short of detailing advanced verification methods and external validation systems. For instance, the complexities of Public Key Infrastructure (PKI)—with its certificate management and intricate trust hierarchies—are beyond the scope of this primer. Similarly, emerging standards like C2PA attestations that provide enhanced content provenance by linking digital assets to verified identities are not discussed in depth here. Moreover, while third-party record holders, such as blockchain services or other immutable ledger systems, can offer an additional layer of public verification, this guide doesn’t elaborate on how to integrate these external record-keeping systems into your preservation workflow. Consider this a focused introduction to the core practices of data integrity, with many advanced topics available for further exploration.

