Dataset

Every tagged release of sohamhamso publishes a dataset bundle to Zenodo. It contains the same verses, glosses, and translations the site renders — packaged for analysis, archival, and downstream reuse.

What's in it

CSV — flat tables: texts.csv, verses.csv, word_glosses.csv, translations.csv. Easiest to load into pandas, DuckDB, or a spreadsheet.
JSON shards — one file per text, with nested verse/word/translation structure preserved.
TEI-XML — one document per text, structured to interoperate with SARIT and other Indological TEI corpora.
checksums.sha256 — SHA-256 digests of every file in the bundle, signed alongside the release.

Loading the data

The CSV bundle loads directly with pandas:

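import pandas as pd

# Load the flat tables (run from the directory that holds the CSV files).
texts        = pd.read_csv("texts.csv")
verses       = pd.read_csv("verses.csv")
translations = pd.read_csv("translations.csv")

# Look up the id of the Shiva Sutras by slug, then select its verses.
# .iloc[0] raises IndexError if no text has that slug.
text_id = texts.loc[texts.slug == "shiva-sutras", "id"].iloc[0]
ss = verses[verses.text_id == text_id]
print(ss.head())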
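
To pair each verse with its translations, the two tables can be joined. The following is a minimal sketch, not the project's documented schema: it assumes verses.csv has an id column and translations.csv has a verse_id column that points at it. Both column names are assumptions, so check the CSV headers in your bundle before relying on them.

```python
import pandas as pd


def verses_with_translations(
    verses: pd.DataFrame, translations: pd.DataFrame
) -> pd.DataFrame:
    """Left-join translations onto verses, one row per (verse, translation).

    Assumes verses has an `id` column and translations has a `verse_id`
    column. Both names are hypothetical and may differ in the real bundle.
    Verses without a translation are kept, with NaN in the translation columns.
    """
    return verses.merge(
        translations,
        how="left",
        left_on="id",
        right_on="verse_id",
        # Keep the verse column names unchanged if both tables share a name.
        suffixes=("", "_translation"),
    )
```

With the frames loaded above, verses_with_translations(ss, translations) would return the Shiva Sutras verses alongside whatever translations exist for them. A left join is used so that untranslated verses stay visible instead of silently dropping out.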

Versioning

Release tags follow vYYYY.MM.DD. Schema changes are additive within a year; any breaking change is announced in the release notes and the changelog. The Translation Status Contract documents the stability guarantees for the badge-relevant provenance fields (ai_assisted, status, model, model_version, prompt_version, judge_score).
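
Because tags encode a date, they can be parsed and compared programmatically. The sketch below is illustrative only: the helper names are hypothetical, the example tag in the docstring is made up, and it assumes "within a year" means the calendar year in the tag.

```python
import re
from datetime import date

# Matches tags of the form vYYYY.MM.DD, e.g. "v2026.05.31" (a made-up example).
TAG_PATTERN = re.compile(r"v(\d{4})\.(\d{2})\.(\d{2})")


def parse_release_tag(tag: str) -> date:
    """Convert a vYYYY.MM.DD tag into a datetime.date."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a vYYYY.MM.DD tag: {tag!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() itself raises ValueError for impossible dates such as month 13.
    return date(year, month, day)


def same_release_year(tag_a: str, tag_b: str) -> bool:
    """True when both tags fall in the same calendar year.

    Assumption: this is the window in which schema changes stay additive.
    """
    return parse_release_tag(tag_a).year == parse_release_tag(tag_b).year
```

A pipeline pinned to one release could call same_release_year on the pinned tag and a candidate upgrade, and send cross-year upgrades to a manual review of the release notes.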

Integrity

Every bundle ships with checksums.sha256. Verify before use:

sha256sum --check checksums.sha256
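
Where sha256sum is not available (for example on Windows), the same check can be done in Python. This is a minimal sketch with a hypothetical function name. It assumes the manifest uses the usual sha256sum layout, meaning a hex digest, whitespace, an optional * marker, then a path relative to the manifest. It does not verify any release signature.

```python
import hashlib
from pathlib import Path


def verify_bundle(bundle_dir: str, manifest_name: str = "checksums.sha256") -> list[str]:
    """Return the manifest entries that are missing or whose digest differs.

    An empty list means every listed file matched its SHA-256 digest.
    """
    root = Path(bundle_dir)
    failures = []
    manifest = (root / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, name = line.split(maxsplit=1)
        if name.startswith("*"):
            name = name[1:]  # '*' marks binary mode in sha256sum output
        path = root / name
        if not path.is_file():
            failures.append(name)
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            # Hash in 1 MiB chunks so large files do not need to fit in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            failures.append(name)
    return failures
```

Calling verify_bundle(".") from the unpacked bundle directory returns an empty list when all files match. Any names it returns should be downloaded again before use.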

Citation

The Zenodo deposit issues a versioned DOI per release and a concept DOI that always resolves to the latest release. Until the first release lands, the DOI shown below is a placeholder.

@dataset{sohamhamso_vYYYY_MM_DD,
  author    = {sohamhamso contributors},
  title     = {sohamhamso: Tantric Sanskrit canon dataset},
  year      = {YYYY},
  version   = {vYYYY.MM.DD},
  doi       = {10.5281/zenodo.PLACEHOLDER},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.PLACEHOLDER},
  license   = {CC-BY-SA-4.0}
}

Additional citation formats — Chicago, MLA, plain — live on /cite and are regenerated on each release.

License

The dataset is released under CC-BY-SA 4.0: free to share and adapt, including commercially, provided you credit the source and license derivatives under the same terms. Where an upstream source carries stricter terms (e.g., Muktabodha pending-permission), the most restrictive applicable license governs that file. Full details are on the License page; per-source attribution is on Sources.

Where to get it

Zenodo deposit (concept DOI): TBD on first release
GitHub release tags: github.com/sohamhamso/sohamhamso/releases

Last revised: 2026-05-31.