Methodology & Sources
How we select, structure, and validate the regulatory knowledge that powers Anvil Index RAG (Retrieval-Augmented Generation) Packs.
Sourcing Criteria
Every document included in an Anvil Index pack meets a rigorous set of sourcing standards. This ensures that the chunks you ingest into your vector database are authoritative, current, and legally sound.
Government and Institutional Sources Only
We select documents from federal agencies, standards bodies, and international regulatory organizations. This includes the FDA, NIST, SEC, EU Parliament, EMA, OWASP, GAO, and peer bodies. We do not include blog posts, vendor commentary, or secondary analysis.
Publicly Available
All source documents must be freely accessible without paywalls or subscriptions. Regulatory guidance must be published by the issuing authority and available for unrestricted download or web access.
Authoritative and Official
Sources must be official guidance, binding regulations, established frameworks, or published recommendations from the issuing body. We prioritize binding regulatory text over non-binding guidance, though both are included when relevant to the domain.
Primary Sources Over Secondary
We source from original documents: the actual regulation, framework, or guidance issued by the regulatory body. We do not create packs from interpretations, summaries, or analyses of those documents.
Licensing Verification
Every source document is licensed for distribution. This includes public domain documents, Creative Commons licensed materials with appropriate attribution, and materials explicitly published under open distribution by the issuing authority. The OWASP Top 10 for LLM Applications, for example, is licensed under CC BY-SA 4.0 and is included with full attribution metadata.
Chunking Methodology
The integrity of a RAG pack depends on how source documents are segmented into chunks. Our approach balances atomic semantic completeness with practical token constraints for embedding and retrieval.
Atomic Information Principle
Each chunk contains one self-contained concept, requirement, or rule. A chunk on data minimization, for example, stands alone and can be understood without reading surrounding sections. This maximizes the chunk's utility in vector search and reduces noise when the chunk is retrieved alongside other results.
Token Ceiling and Segmentation
Chunks are capped at approximately 1,500 tokens to ensure compatibility with common embedding models and model toolchains (OpenAI, Claude, Cohere, Ollama) and to keep context windows manageable in retrieval workflows. In practice, the average chunk across all packs is 400 to 500 tokens, providing ample detail while maintaining precision.
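As a rough illustration of the ceiling, the sketch below flags chunks whose estimated size exceeds a token budget. The `estimate_tokens` helper, its four-characters-per-token heuristic, and the `EXAMPLE-…` IDs are assumptions made for illustration; a real check would use the tokenizer of whichever embedding model you target.

```python
TOKEN_CEILING = 1500  # approximate cap described in the text


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)


def oversized_chunks(chunks: list[dict], ceiling: int = TOKEN_CEILING) -> list[str]:
    """Return the chunk_id of every chunk whose estimated size exceeds the ceiling."""
    return [c["chunk_id"] for c in chunks if estimate_tokens(c["content"]) > ceiling]


# Placeholder chunks, not real pack content.
sample = [
    {"chunk_id": "EXAMPLE-001", "content": "word " * 400},   # 2,000 characters
    {"chunk_id": "EXAMPLE-002", "content": "word " * 2000},  # 10,000 characters
]
print(oversized_chunks(sample))
```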
Natural Document Boundaries
Chunks follow the structure of the source document: sections, articles, subsections, and individual requirements are natural breaking points. This preserves the document's logic and makes it easy to trace a chunk back to its source.
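One way to picture boundary-based segmentation is the sketch below, which splits text at heading lines and keeps each heading as the section reference. The `HEADING` pattern and the sample text are hypothetical; real source documents mark their sections in many different ways, and this is not a description of the actual Anvil Index pipeline.

```python
import re

# Hypothetical heading pattern; real documents vary in how sections are marked.
HEADING = re.compile(r"^(Article|Section)\s+\d+\b.*$", re.MULTILINE)


def split_on_headings(text: str) -> list[dict]:
    """Produce one segment per heading, keeping the heading as section_reference."""
    matches = list(HEADING.finditer(text))
    segments = []
    for i, match in enumerate(matches):
        # A segment runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        segments.append({
            "section_reference": match.group(0).strip(),
            "content": text[match.end():end].strip(),
        })
    return segments


# Placeholder text, not a quotation from any regulation.
doc = """Article 1 Scope
This text applies to providers.

Article 2 Definitions
Provider means the party that develops a system.
"""
for seg in split_on_headings(doc):
    print(seg["section_reference"], "->", seg["content"])
```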
Self-Sufficient Context
Each chunk preserves enough surrounding context to stand alone. References to related sections are included as explicit cross-references in the metadata. A chunk on AI auditing requirements, for example, includes enough detail about the audited entity and the audit scope to be understood on its own, even if the source document defines those terms elsewhere.
Dual Format Delivery
Every chunk is delivered in two formats. Markdown (.md) is human-readable, suitable for review and documentation. JSON (.json) includes full metadata and is optimized for programmatic ingestion into vector databases and search systems.
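To show how the two formats relate, here is a minimal sketch that loads a JSON chunk and renders a Markdown view of it. The field names come from the schema on this page; the Markdown layout and every value in the sample record are placeholders assumed for illustration, not the actual file layout of a pack.

```python
import json


def chunk_to_markdown(chunk: dict) -> str:
    """Render a human-readable Markdown view of a JSON chunk (layout is an assumption)."""
    lines = [
        f"# {chunk['semantic_title']}",
        "",
        f"Source: {chunk['source_document']} ({chunk['section_reference']})",
        "",
        chunk["content"],
    ]
    return "\n".join(lines)


# Placeholder record; values are illustrative only.
raw = json.dumps({
    "chunk_id": "EXAMPLE-001",
    "semantic_title": "Example title",
    "content": "Placeholder chunk text.",
    "source_document": "Example Source Document",
    "section_reference": "Section 1",
})
chunk = json.loads(raw)
print(chunk_to_markdown(chunk))
```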
Schema Walkthrough
Each chunk carries 17+ metadata fields that enable hybrid search, faceted filtering, and full traceability to the source. This structured metadata is what makes RAG packs more powerful than raw documents.
chunk_id
Unique identifier following pack-specific conventions. Example: EU-AI-ACT-001, OWASP-LLM01-DEF. IDs are human-readable and sortable.
semantic_title
A concise, descriptive title for the chunk written for human comprehension. Suitable for display in search results or documentation.
content
The full text of the chunk in Markdown format. Includes inline formatting, lists, and structure while remaining plaintext for embedding.
domain
Controlled vocabulary field: AI_Governance, AI_Security, AI_Privacy, AI_FinServ, or AI_Healthcare. Enables domain-specific filtering and faceted search.
source_document
The name of the originating document, exactly as published. Example: "EU AI Act (Regulation EU 2024/1689)".
source_authority
The issuing body. Example: NIST, EU Parliament, FDA, OWASP, SEC. Enables filtering by regulatory jurisdiction.
section_reference
Original section, article, subsection, or requirement number from the source. Used to locate the chunk in the original document.
cross_references
Array of chunk IDs from other packs or documents that address related topics. Enables semantic linkage across the pack ecosystem.
pack_version
Semantic version of the pack (e.g., "1.2.0"). Allows customers to track pack age and request updates if needed.
entity_type
Security pack specific. Classifies whether the chunk addresses risks to models, data, systems, or operators.
affected_asset
Security pack specific. Identifies the asset class (LLM, vectordb, embedding model, application) addressed by the requirement.
tactical_phase
Security pack specific. Maps to MITRE ATLAS phases: reconnaissance, resource development, initial access, execution, persistence, and others.
obligation_type
Privacy pack specific. Classifies the requirement as data subject right, controller obligation, processor obligation, or cross-border restriction.
jurisdiction
Privacy pack specific. Indicates the applicable legal jurisdiction: EU, US Federal, US State (with state code), or international.
data_category
Privacy pack specific. Specifies the type of personal data addressed: general, special category, biometric, or sensitive.
regulatory_domain
FinServ pack specific. Indicates the regulated area of financial services: banking, capital markets, insurance, payments, or cross-sector.
financial_sector
FinServ pack specific. Specifies the applicable sector: retail, institutional, infrastructure, or market participants.
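The fields above could be modeled roughly as follows. This is a sketch under assumptions: the `Chunk` class, the `extra` dictionary for pack-specific fields, and the placeholder values are all hypothetical, and the only validation shown is the `domain` vocabulary listed above.

```python
from dataclasses import dataclass, field

# Controlled vocabulary for the domain field, as listed in the schema.
DOMAINS = {"AI_Governance", "AI_Security", "AI_Privacy", "AI_FinServ", "AI_Healthcare"}


@dataclass
class Chunk:
    chunk_id: str
    semantic_title: str
    content: str
    domain: str
    source_document: str
    source_authority: str
    section_reference: str
    pack_version: str
    cross_references: list[str] = field(default_factory=list)
    # Pack-specific fields (e.g. jurisdiction, tactical_phase) kept in one mapping here.
    extra: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


# Placeholder values for illustration only.
example = Chunk(
    chunk_id="EXAMPLE-001",
    semantic_title="Example title",
    content="Placeholder chunk text.",
    domain="AI_Privacy",
    source_document="Example Source Document",
    source_authority="Example Authority",
    section_reference="Section 1",
    pack_version="1.2.0",
    extra={"jurisdiction": "EU"},
)
print(example.chunk_id, example.domain, example.extra)
```

Rejecting unknown `domain` values at construction time is one way to keep faceted filters reliable; the same pattern could be extended to the other controlled fields.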
Hybrid Search Capability
The schema enables two simultaneous search strategies. Semantic search finds chunks by meaning and similarity in the vector space. Metadata filtering narrows results by domain, source authority, jurisdiction, and other structured fields. Together, they allow you to ask questions like: "Find all security requirements from NIST that apply to LLMs" or "List all EU privacy obligations for customer data in FinServ."
Controlled vocabularies ensure consistent filtering across all chunks. All sources, authorities, entities, and obligations use the same standardized terminology, making faceted search reliable and predictable.
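The combination of the two strategies can be sketched in a few lines: filter on exact metadata matches first, then rank the survivors by cosine similarity. The `embedding` key, the toy two-dimensional vectors, and the `A-…` IDs are assumptions for illustration; in practice the vectors come from your embedding model and the filtering is usually done by the vector database itself.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec: list[float], chunks: list[dict],
                  filters: dict[str, str], top_k: int = 3) -> list[dict]:
    """Keep chunks matching every metadata filter, then rank them by similarity."""
    candidates = [c for c in chunks if all(c.get(k) == v for k, v in filters.items())]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:top_k]


# Toy records with made-up embeddings.
chunks = [
    {"chunk_id": "A-001", "source_authority": "NIST", "affected_asset": "LLM", "embedding": [0.9, 0.1]},
    {"chunk_id": "A-002", "source_authority": "NIST", "affected_asset": "vectordb", "embedding": [0.8, 0.2]},
    {"chunk_id": "A-003", "source_authority": "OWASP", "affected_asset": "LLM", "embedding": [1.0, 0.0]},
    {"chunk_id": "A-004", "source_authority": "NIST", "affected_asset": "LLM", "embedding": [0.1, 0.9]},
]
results = hybrid_search([1.0, 0.0], chunks, {"source_authority": "NIST", "affected_asset": "LLM"})
print([c["chunk_id"] for c in results])
```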
Source Index
All 42 source documents organized by pack and domain. Each entry includes the document name, issuing authority, and chunk count.
Pack 01: AI Governance
339 chunks / 8 sources
- 158 chunks: EU AI Act (Regulation EU 2024/1689)
- 16 chunks: NIST AI RMF 1.0 (AI 100-1)
- 66 chunks: NIST AI RMF Playbook
- 6 chunks: GPAI Code of Practice: Transparency
- 34 chunks: NIST AI 600-1 GenAI Profile
- 35 chunks: GPAI Code of Practice: Safety & Security
- 8 chunks: GPAI Code of Practice: Copyright
- 16 chunks: OECD Due Diligence Guidance for Responsible AI (2026)
Pack 02: AI Security
190 chunks / 5 sources
- 30 chunks: OWASP Top 10 for LLM Applications 2025 (v2.0)
- 132 chunks: MITRE ATLAS STIX Bundle
- 13 chunks: CISA/NSA Joint AI Data Security Guidelines
- 6 chunks: NIST IR 8596 CSF AI Profile
- 9 chunks: NIST AI 100-2e2025 AML Taxonomy
Pack 03: AI Privacy
111 chunks / 6 sources
- 21 chunks: Colorado SB 24-205 (Consumer Protections for AI)
- 29 chunks: Texas HB 149 / TRAIGA
- 16 chunks: California SB 53 / TFAIA
- 17 chunks: EDPB-EDPS Joint Opinion 1/2026 (AI Omnibus)
- 15 chunks: EU AI Act Articles 10, 26, 27 (Privacy Cross-Reference)
- 13 chunks: CISA 2025 SBOM Minimum Elements
Pack 04: Financial Services
206 chunks / 10 sources
- 40 chunks: CRI FS AI RMF Guidebook v1.0
- 29 chunks: FINRA 2026 Annual Regulatory Oversight Report
- 28 chunks: GAO-25-107197 AI Use & Oversight in FinServ
- 26 chunks: IOSCO AI in Capital Markets CR/01/2025
- 24 chunks: Treasury AI in Financial Services (Dec 2024)
- 17 chunks: SR 11-7 Model Risk Management
- 15 chunks: SEC FY2026 Examination Priorities
- 10 chunks: CFPB Supervisory Highlights: Advanced Tech Issue 38
- 10 chunks: NYDFS Industry Letter: AI Cyber Risks
- 7 chunks: FinCEN Alert: Deepfake Fraud (FIN-2024-Alert004)
Pack 05: Healthcare AI
190 chunks / 13 sources
- 28 chunks: FDA AI-Enabled Device Software Functions
- 21 chunks: GAO Healthcare AI Review
- 18 chunks: FDA PCCP Guidance
- 18 chunks: ONC HTI-1 Final Rule
- 17 chunks: EMA AI Reflection Paper
- 15 chunks: AHRQ AI Patient Safety & CDS
- 13 chunks: HHS AI Strategy
- 12 chunks: FDA ML Transparency
- 11 chunks: GMLP Principles
- 11 chunks: VA Trustworthy AI Framework
- 10 chunks: CMS AI Guidance
- 9 chunks: FDA SaMD Action Plan
- 7 chunks: HHS OCR AI Nondiscrimination
Try Before You Buy
Inspect 20 sample chunks from across all five packs. Review the schema, chunk quality, and metadata structure to confirm Anvil Index packs meet your needs.
Explore Sample Dataset on Hugging Face
Versioning & Updates
Anvil Index packs are living products. We monitor source documents monthly and release updates as regulatory guidance evolves.
Semantic Versioning
Each pack uses semantic versioning (MAJOR.MINOR.PATCH). Major versions indicate significant structural changes or substantial new source additions. Minor versions add new chunks or sources. Patch versions correct errors or improve metadata without adding new content.
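A consumer of the packs might classify an upgrade by comparing version strings, as in this minimal sketch. The function names are hypothetical, and the code assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the change from old to new as major, minor, patch, or none."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        return "none"  # same version or a downgrade
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


for new in ("2.0.0", "1.3.0", "1.2.1"):
    print("1.2.0 ->", new, "is a", bump_kind("1.2.0", new), "update")
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as treating "1.10.0" as older than "1.9.0".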
Change Documentation
Every pack includes a CHANGELOG documenting all changes between versions. You can see exactly what changed, which chunks were added or modified, and when.
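If you keep the JSON files from two versions of a pack, you can cross-check the changelog yourself. The sketch below compares chunks by `chunk_id` and `content`; the `diff_packs` helper and the sample records are hypothetical, and it does not reflect how the CHANGELOG itself is produced.

```python
def diff_packs(old: list[dict], new: list[dict]) -> dict[str, list[str]]:
    """Report chunk IDs that were added, removed, or whose content changed."""
    old_by_id = {c["chunk_id"]: c["content"] for c in old}
    new_by_id = {c["chunk_id"]: c["content"] for c in new}
    shared = old_by_id.keys() & new_by_id.keys()
    return {
        "added": sorted(new_by_id.keys() - old_by_id.keys()),
        "removed": sorted(old_by_id.keys() - new_by_id.keys()),
        "modified": sorted(cid for cid in shared if old_by_id[cid] != new_by_id[cid]),
    }


# Placeholder records standing in for two pack versions.
old_pack = [
    {"chunk_id": "X-001", "content": "a"},
    {"chunk_id": "X-002", "content": "b"},
]
new_pack = [
    {"chunk_id": "X-001", "content": "a"},
    {"chunk_id": "X-002", "content": "b revised"},
    {"chunk_id": "X-003", "content": "c"},
]
print(diff_packs(old_pack, new_pack))
```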
Monthly Monitoring
We review all source documents monthly for updates, amendments, and new releases. When a source document is updated, the affected chunks are regenerated with current content and refreshed metadata.
New Sources and Cross-References
New source documents are evaluated against our sourcing criteria. When a new document meets the threshold for inclusion, it is added to the appropriate pack. Cross-references are maintained across packs, so updates in one pack may trigger crosswalk updates in others.
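Because cross-references span packs, an update to one pack can leave references in another pointing at IDs that no longer exist. A simple integrity check, sketched here with a hypothetical `dangling_references` helper and placeholder IDs, is to load every pack you own and look for references to unknown chunk IDs.

```python
def dangling_references(chunks: list[dict]) -> dict[str, list[str]]:
    """Map each chunk_id to the cross-referenced IDs that are not present in chunks."""
    known = {c["chunk_id"] for c in chunks}
    problems: dict[str, list[str]] = {}
    for chunk in chunks:
        missing = [ref for ref in chunk.get("cross_references", []) if ref not in known]
        if missing:
            problems[chunk["chunk_id"]] = missing
    return problems


# Placeholder records combining chunks from two hypothetical packs.
all_chunks = [
    {"chunk_id": "GOV-001", "cross_references": ["SEC-001"]},
    {"chunk_id": "SEC-001", "cross_references": ["GOV-001", "SEC-999"]},
    {"chunk_id": "SEC-002"},
]
print(dangling_references(all_chunks))
```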
Update Notifications
Customers receive update notifications for packs they have purchased. You can review the changelog and decide whether to upgrade.