Methodology & Sources

How we select, structure, and validate the regulatory knowledge that powers Anvil Index RAG (Retrieval-Augmented Generation) Packs.

Sourcing Criteria

Every document included in an Anvil Index pack meets a rigorous set of sourcing standards. This ensures that the chunks you ingest into your vector database are authoritative, current, and legally sound.

Government and Institutional Sources Only

We select documents from government agencies, legislatures, standards bodies, and international regulatory organizations. This includes the FDA, NIST, SEC, EU Parliament, EMA, OWASP, GAO, and peer bodies. We do not include blog posts, vendor commentary, or secondary analysis.

Publicly Available

All source documents must be freely accessible without paywalls or subscriptions. Regulatory guidance must be published by the issuing authority and available for unrestricted download or web access.

Authoritative and Official

Sources must be official guidance, binding regulations, established frameworks, or published recommendations from the issuing body. We prioritize binding regulatory text over non-binding guidance, though both are included when relevant to the domain.

Primary Sources Over Secondary

We source from original documents: the actual regulation, framework, or guidance issued by the regulatory body. We do not create packs from interpretations, summaries, or analyses of those documents.

Licensing Verification

Every source document is licensed for distribution. This includes public domain documents, Creative Commons licensed materials with appropriate attribution, and materials explicitly published under open distribution by the issuing authority. The OWASP Top 10 for LLM Applications, for example, is licensed under CC BY-SA 4.0 and is included with full attribution metadata.

Chunking Methodology

The integrity of a RAG pack depends on how source documents are segmented into chunks. Our approach balances atomic semantic completeness with practical token constraints for embedding and retrieval.

Atomic Information Principle

Each chunk contains one self-contained concept, requirement, or rule. A chunk on data minimization, for example, stands alone and can be understood without reading surrounding sections. This maximizes the chunk's utility in vector search and reduces noise when the chunk is retrieved alongside other results.

Token Ceiling and Segmentation

Chunks are capped at approximately 1,500 tokens to stay within the input limits of common embedding models (such as those from OpenAI and Cohere, or models served through Ollama) and to keep retrieved context manageable in retrieval workflows. In practice, the average chunk across all packs is 400 to 500 tokens, which provides ample detail while maintaining precision.

Natural Document Boundaries

Chunks follow the structure of the source document: sections, articles, subsections, and individual requirements are natural breaking points. This preserves the document's logic and makes it easy to trace a chunk back to its source.
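As a rough illustration of boundary-first segmentation with a token ceiling, the sketch below splits a Markdown document at heading lines and flags any piece that exceeds the cap. The heading pattern, the four-characters-per-token estimate, and all function names are assumptions made for this example; they do not describe the actual Anvil Index pipeline, and a real tokenizer would give different counts.

```python
import re

MAX_TOKENS = 1500  # approximate ceiling stated in this section


def estimate_tokens(text: str) -> int:
    """Rough size estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def split_on_headings(document: str) -> list[str]:
    """Split at Markdown heading lines so every piece starts on a section boundary."""
    pieces = re.split(r"(?m)^(?=#{1,6} )", document)
    return [p.strip() for p in pieces if p.strip()]


def oversized(chunks: list[str], limit: int = MAX_TOKENS) -> list[int]:
    """Return the indexes of chunks whose estimated size exceeds the ceiling."""
    return [i for i, c in enumerate(chunks) if estimate_tokens(c) > limit]


# Placeholder document, invented for illustration only.
doc = "# Article 1\nScope of the rule.\n\n## Article 2\nDefinitions.\n"
chunks = split_on_headings(doc)
print(len(chunks), oversized(chunks))
```

Any index returned by oversized() would need a further split at the next structural level down (subsection or individual requirement) rather than at an arbitrary character offset.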

Self-Sufficient Context

Each chunk preserves enough surrounding context to stand alone. References to related sections are included as explicit cross-references in the metadata. A chunk on AI auditing requirements will include enough detail about the entity being audited and the audit scope, even if the source document defines those terms elsewhere.

Dual Format Delivery

Every chunk is delivered in two formats. Markdown (.md) is human-readable, suitable for review and documentation. JSON (.json) includes full metadata and is optimized for programmatic ingestion into vector databases and search systems.
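To show how the two formats relate, here is a minimal sketch that loads a JSON chunk and renders a Markdown view of it. The field names come from the schema described on this page; the record values are placeholders and the Markdown layout is an assumption, not the layout of the shipped .md files.

```python
import json

# Made-up sample record: field names follow the schema, values are placeholders.
raw = """
{
  "chunk_id": "EU-AI-ACT-001",
  "semantic_title": "Placeholder title",
  "content": "Placeholder chunk text.",
  "source_document": "EU AI Act (Regulation EU 2024/1689)",
  "section_reference": "Article 1"
}
"""


def to_markdown(record: dict) -> str:
    """Render a JSON chunk as a small Markdown document (layout is an assumption)."""
    return "\n".join([
        f"# {record['semantic_title']}",
        "",
        f"Source: {record['source_document']} ({record['section_reference']})",
        "",
        record["content"],
    ])


record = json.loads(raw)
markdown = to_markdown(record)
print(markdown)
```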

Schema Walkthrough

Each chunk carries structured metadata drawn from a schema of 17+ fields: a core set shared by every pack plus fields specific to individual packs. These fields enable hybrid search, faceted filtering, and full traceability to the source, which raw documents alone do not provide.

chunk_id

Unique identifier following pack-specific conventions. Example: EU-AI-ACT-001, OWASP-LLM01-DEF. IDs are human-readable and sortable.

semantic_title

A concise, descriptive title for the chunk written for human comprehension. Suitable for display in search results or documentation.

content

The full text of the chunk in Markdown format. Includes inline formatting, lists, and structure while remaining plaintext for embedding.

domain

Controlled vocabulary field: AI_Governance, AI_Security, AI_Privacy, AI_FinServ, or AI_Healthcare. Enables domain-specific filtering and faceted search.

source_document

The name of the originating document, exactly as published. Example: "EU AI Act (Regulation EU 2024/1689)".

source_authority

The issuing body. Example: NIST, EU Parliament, FDA, OWASP, SEC. Enables filtering by regulatory jurisdiction.

section_reference

Original section, article, subsection, or requirement number from the source. Used to locate the chunk in the original document.

cross_references

Array of chunk IDs from other packs or documents that address related topics. Enables semantic linkage across the pack ecosystem.

pack_version

Semantic version of the pack (e.g., "1.2.0"). Allows customers to track pack age and request updates if needed.

entity_type

Security pack specific. Classifies whether the chunk addresses risks to models, data, systems, or operators.

affected_asset

Security pack specific. Identifies the asset class (LLM, vector database, embedding model, application) addressed by the requirement.

tactical_phase

Security pack specific. Maps to MITRE ATLAS phases: reconnaissance, resource development, initial access, execution, persistence, and others.

obligation_type

Privacy pack specific. Classifies the requirement as data subject right, controller obligation, processor obligation, or cross-border restriction.

jurisdiction

Privacy pack specific. Indicates the applicable legal jurisdiction: EU, US Federal, US State (with state code), or international.

data_category

Privacy pack specific. Specifies the type of personal data addressed: general, special category, biometric, or sensitive.

regulatory_domain

FinServ pack specific. Indicates the financial sector: banking, capital markets, insurance, payments, or cross-sector.

financial_sector

FinServ pack specific. Specifies the applicable market segment: retail, institutional, infrastructure, or market participants.
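The fields above can be checked at ingestion time with a small validator. The sketch below is illustrative only: it assumes the first nine fields are required on every chunk, that the pack-specific fields are required for the matching domain value, and that "Security pack" corresponds to the domain AI_Security (and likewise for Privacy and FinServ). None of these rules is stated by the schema itself.

```python
CORE_FIELDS = {
    "chunk_id", "semantic_title", "content", "domain", "source_document",
    "source_authority", "section_reference", "cross_references", "pack_version",
}

DOMAINS = {"AI_Governance", "AI_Security", "AI_Privacy", "AI_FinServ", "AI_Healthcare"}

# Assumed mapping from domain value to pack-specific fields.
PACK_FIELDS = {
    "AI_Security": {"entity_type", "affected_asset", "tactical_phase"},
    "AI_Privacy": {"obligation_type", "jurisdiction", "data_category"},
    "AI_FinServ": {"regulatory_domain", "financial_sector"},
}


def validate(record: dict) -> list[str]:
    """Return a sorted list of problems; an empty list means the record passed."""
    problems = [f"missing field: {name}" for name in CORE_FIELDS - set(record)]
    domain = record.get("domain")
    if domain not in DOMAINS:
        problems.append(f"unknown domain: {domain!r}")
    for name in PACK_FIELDS.get(domain, set()) - set(record):
        problems.append(f"missing pack field: {name}")
    if not isinstance(record.get("cross_references", []), list):
        problems.append("cross_references must be an array")
    return sorted(problems)


# Placeholder record, invented for illustration only.
sample = {name: "placeholder" for name in CORE_FIELDS}
sample.update(domain="AI_Governance", cross_references=[])
print(validate(sample))
```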

Hybrid Search Capability

The schema enables two simultaneous search strategies. Semantic search finds chunks by meaning and similarity in the vector space. Metadata filtering narrows results by domain, source authority, jurisdiction, and other structured fields. Together, they allow you to ask questions like: "Find all security requirements from NIST that apply to LLMs" or "List all EU privacy obligations for customer data in FinServ."

Controlled vocabularies ensure consistent filtering across all chunks. All sources, authorities, entities, and obligations use the same standardized terminology, making faceted search reliable and predictable.
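A minimal sketch of the two-stage idea follows: filter on exact metadata values first, then rank the survivors by cosine similarity. The three-dimensional vectors, the "embedding" key, and the chunk IDs are invented for this example; in practice the vectors come from your embedding model and the filtering is usually pushed down into the vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(chunks: list[dict], query_vec: list[float],
                  filters: dict, top_k: int = 3) -> list[dict]:
    """Keep chunks matching every metadata filter, then rank by similarity."""
    candidates = [c for c in chunks
                  if all(c.get(key) == value for key, value in filters.items())]
    ranked = sorted(candidates,
                    key=lambda c: cosine(c["embedding"], query_vec),
                    reverse=True)
    return ranked[:top_k]


# Toy data: IDs and vectors are hypothetical.
chunks = [
    {"chunk_id": "SEC-A", "source_authority": "NIST", "affected_asset": "LLM",
     "embedding": [1.0, 0.0, 0.0]},
    {"chunk_id": "SEC-B", "source_authority": "NIST", "affected_asset": "LLM",
     "embedding": [0.6, 0.8, 0.0]},
    {"chunk_id": "SEC-C", "source_authority": "OWASP", "affected_asset": "LLM",
     "embedding": [1.0, 0.0, 0.0]},
]

results = hybrid_search(chunks, [0.9, 0.1, 0.0],
                        {"source_authority": "NIST", "affected_asset": "LLM"})
print([c["chunk_id"] for c in results])
```

Because the filter compares exact strings, it only works reliably when the metadata uses a controlled vocabulary, which is the point made in the paragraph above.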

Source Index

All 42 source documents organized by pack and domain. Each entry includes the document name, issuing authority, and chunk count.

Pack 01: AI Governance

339 chunks / 8 sources
  • EU AI Act (Regulation EU 2024/1689)
    European Parliament & Council
    158 chunks
  • NIST AI RMF 1.0 (AI 100-1)
    NIST
    16 chunks
  • NIST AI RMF Playbook
    NIST
    66 chunks
  • GPAI Code of Practice: Transparency
    EU AI Office
    6 chunks
  • NIST AI 600-1 GenAI Profile
    NIST
    34 chunks
  • GPAI Code of Practice: Safety & Security
    EU AI Office
    35 chunks
  • GPAI Code of Practice: Copyright
    EU AI Office
    8 chunks
  • OECD Due Diligence Guidance for Responsible AI (2026)
    OECD
    16 chunks

Pack 02: AI Security

190 chunks / 5 sources
  • OWASP Top 10 for LLM Applications 2025 (v2.0)
    OWASP Foundation
    30 chunks
  • MITRE ATLAS STIX Bundle
    MITRE Corporation
    132 chunks
  • CISA/NSA Joint AI Data Security Guidelines
    CISA/NSA
    13 chunks
  • NIST IR 8596 CSF AI Profile
    NIST
    6 chunks
  • NIST AI 100-2e2025 AML Taxonomy
    NIST
    9 chunks

Pack 03: AI Privacy

111 chunks / 6 sources
  • Colorado SB 24-205 (Consumer Protections for AI)
    Colorado Legislature
    21 chunks
  • Texas HB 149 / TRAIGA
    Texas Legislature
    29 chunks
  • California SB 53 / TFAIA
    California Legislature
    16 chunks
  • EDPB-EDPS Joint Opinion 1/2026 (AI Omnibus)
    EDPB/EDPS
    17 chunks
  • EU AI Act Articles 10, 26, 27 (Privacy Cross-Reference)
    European Parliament & Council
    15 chunks
  • CISA 2025 SBOM Minimum Elements
    CISA
    13 chunks

Pack 04: Financial Services

206 chunks / 10 sources
  • CRI FS AI RMF Guidebook v1.0
    Cyber Risk Institute / Treasury
    40 chunks
  • FINRA 2026 Annual Regulatory Oversight Report
    FINRA
    29 chunks
  • GAO-25-107197 AI Use & Oversight in FinServ
    GAO
    28 chunks
  • IOSCO AI in Capital Markets CR/01/2025
    IOSCO
    26 chunks
  • Treasury AI in Financial Services (Dec 2024)
    U.S. Treasury
    24 chunks
  • SR 11-7 Model Risk Management
    Federal Reserve/OCC
    17 chunks
  • SEC FY2026 Examination Priorities
    SEC
    15 chunks
  • CFPB Supervisory Highlights: Advanced Tech Issue 38
    CFPB
    10 chunks
  • NYDFS Industry Letter: AI Cyber Risks
    NYDFS
    10 chunks
  • FinCEN Alert: Deepfake Fraud (FIN-2024-Alert004)
    FinCEN
    7 chunks

Pack 05: Healthcare AI

190 chunks / 13 sources
  • FDA AI-Enabled Device Software Functions
    FDA
    28 chunks
  • GAO Healthcare AI Review
    GAO
    21 chunks
  • FDA PCCP Guidance
    FDA
    18 chunks
  • ONC HTI-1 Final Rule
    ONC
    18 chunks
  • EMA AI Reflection Paper
    EMA
    17 chunks
  • AHRQ AI Patient Safety & CDS
    AHRQ
    15 chunks
  • HHS AI Strategy
    HHS
    13 chunks
  • FDA ML Transparency
    FDA/Health Canada/MHRA
    12 chunks
  • GMLP Principles
    FDA/Health Canada/MHRA
    11 chunks
  • VA Trustworthy AI Framework
    VA
    11 chunks
  • CMS AI Guidance
    CMS
    10 chunks
  • FDA SaMD Action Plan
    FDA
    9 chunks
  • HHS OCR AI Nondiscrimination
    HHS OCR
    7 chunks

Try Before You Buy

Inspect 20 sample chunks from across all five packs. Review the schema, chunk quality, and metadata structure to confirm Anvil Index packs meet your needs.

Explore Sample Dataset on Hugging Face

Versioning & Updates

Anvil Index packs are living products. We monitor source documents monthly and release updates as regulatory guidance evolves.

Semantic Versioning

Each pack uses semantic versioning (MAJOR.MINOR.PATCH). Major versions indicate significant structural changes or substantial new source additions. Minor versions add new chunks or sources. Patch versions correct errors or improve metadata without adding new content.
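For customers who want to react to version changes programmatically, the sketch below parses two MAJOR.MINOR.PATCH strings and reports which component changed. The function names are hypothetical, and the code assumes plain three-part numeric versions with no pre-release or build suffixes.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the change from old to new as major, minor, patch, or none."""
    o, n = parse(old), parse(new)
    if n <= o:
        return "none"  # same version or a downgrade
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


print(bump_kind("1.2.0", "1.3.0"))
```

Comparing the parsed tuples rather than the raw strings matters: as strings, "1.10.0" sorts before "1.9.3", while as integer tuples it correctly sorts after.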

Change Documentation

Every pack includes a CHANGELOG documenting all changes between versions. You can see exactly what changed, which chunks were added or modified, and when.
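If you keep the previous version of a pack on hand, you can also compare two versions yourself. This sketch diffs two lists of chunk records by chunk_id and content. It is an assumption about how a customer might verify changes locally, not a description of how the CHANGELOG is produced, and the sample records are invented.

```python
def diff_packs(old: list[dict], new: list[dict]) -> dict[str, list[str]]:
    """Report chunk IDs that were added, removed, or had their content changed."""
    old_by_id = {c["chunk_id"]: c for c in old}
    new_by_id = {c["chunk_id"]: c for c in new}
    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())
    modified = sorted(
        cid for cid in old_by_id.keys() & new_by_id.keys()
        if old_by_id[cid]["content"] != new_by_id[cid]["content"]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical records standing in for two versions of the same pack.
old_pack = [{"chunk_id": "X-001", "content": "a"},
            {"chunk_id": "X-002", "content": "b"}]
new_pack = [{"chunk_id": "X-001", "content": "a, amended"},
            {"chunk_id": "X-003", "content": "c"}]

changes = diff_packs(old_pack, new_pack)
print(changes)
```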

Monthly Monitoring

We review all source documents monthly for updates, amendments, and new releases. When a source document is updated, the affected chunks are regenerated with current content and refreshed metadata.

New Sources and Cross-References

New source documents are evaluated against our sourcing criteria. When a new document meets the threshold for inclusion, it is added to the appropriate pack. Cross-references are maintained across packs, so updates in one pack may trigger crosswalk updates in others.

Update Notifications

Customers receive update notifications for packs they have purchased. You can review the changelog and decide whether to upgrade.