Methodology & Sources

How we select, structure, and validate the regulatory knowledge that powers Anvil Index RAG (Retrieval-Augmented Generation) Packs.

Sourcing Criteria

Every document included in an Anvil Index pack meets a rigorous set of sourcing standards. This ensures that the chunks you ingest into your vector database are authoritative, current, and legally sound.

Government and Institutional Sources Only

We select documents from federal agencies, standards bodies, and international regulatory organizations. This includes the FDA, NIST, SEC, EU Parliament, EMA, OWASP, GAO, and peer bodies. We do not include blog posts, vendor commentary, or secondary analysis.

Publicly Available

All source documents must be freely accessible without paywalls or subscriptions. Regulatory guidance must be published by the issuing authority and available for unrestricted download or web access.

Authoritative and Official

Sources must be official guidance, binding regulations, established frameworks, or published recommendations from the issuing body. We prioritize binding regulatory text over non-binding guidance, though both are included when relevant to the domain.

Primary Sources Over Secondary

We source from original documents: the actual regulation, framework, or guidance issued by the regulatory body. We do not create packs from interpretations, summaries, or analyses of those documents.

Licensing Verification

Every source document is licensed for distribution. This includes public domain documents, Creative Commons licensed materials with appropriate attribution, and materials explicitly published under open distribution by the issuing authority. The OWASP Top 10 for LLM Applications, for example, is licensed under CC BY-SA 4.0 and is included with full attribution metadata.

Chunking Methodology

The integrity of a RAG pack depends on how source documents are segmented into chunks. Our approach balances atomic semantic completeness with practical token constraints for embedding and retrieval.

Atomic Information Principle

Each chunk contains one self-contained concept, requirement, or rule. A chunk on data minimization, for example, stands alone and can be understood without reading surrounding sections. This maximizes the chunk's utility in vector search and reduces noise when the chunk is retrieved alongside other results.

Token Ceiling and Segmentation

Chunks are capped at approximately 1,500 tokens to ensure compatibility with common embedding and LLM stacks (OpenAI, Claude, Cohere, Ollama) and to keep context windows manageable in retrieval workflows. In practice, the average chunk across all packs is 400 to 500 tokens, which provides ample detail while keeping retrieval precise.
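If you want to confirm the ceiling on your side before embedding, a rough check might look like the sketch below. Exact token counts depend on the tokenizer your embedding model uses, so this example assumes a crude heuristic of about four characters per token for English text. The function names and sample records are hypothetical and are not part of the pack.

```python
import math

TOKEN_CEILING = 1500  # approximate cap described above


def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token (an assumption, not a real tokenizer)."""
    return math.ceil(len(text) / 4)


def over_ceiling(chunks: list[dict], ceiling: int = TOKEN_CEILING) -> list[str]:
    """Return the chunk_id of every chunk whose estimated size exceeds the ceiling."""
    return [c["chunk_id"] for c in chunks if estimate_tokens(c["content"]) > ceiling]


# Hypothetical records that only carry the two fields this check needs.
sample = [
    {"chunk_id": "EXAMPLE-001", "content": "Providers shall keep logs. " * 20},
    {"chunk_id": "EXAMPLE-002", "content": "x" * 8000},
]
flagged = over_ceiling(sample)
print(flagged)
```

For production use, swap estimate_tokens for the tokenizer that matches your embedding model, since the heuristic can be off by a wide margin for non-English or heavily formatted text.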

Natural Document Boundaries

Chunks follow the structure of the source document: sections, articles, subsections, and individual requirements are natural breaking points. This preserves the document's logic and makes it easy to trace a chunk back to its source.
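To illustrate the idea of splitting on a document's own structure, here is a minimal sketch that cuts a text at heading lines. The heading pattern ("Article N" or "Section N") and the sample text are assumptions made for the example; this is not the pipeline used to build the packs.

```python
import re

# Hypothetical heading pattern: lines that start with "Article 10", "Section 3.2", etc.
HEADING = re.compile(r"^(?:Article|Section)\s+[\w.]+.*$", re.MULTILINE)


def split_on_headings(text: str) -> list[tuple[str, str]]:
    """Split text into (heading, body) pairs at each heading line.

    Any text that appears before the first heading is ignored.
    """
    matches = list(HEADING.finditer(text))
    pieces = []
    for i, m in enumerate(matches):
        # The body runs from the end of this heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pieces.append((m.group(0).strip(), text[m.end():end].strip()))
    return pieces


doc = """Article 1 Scope
Placeholder body text for the first article.

Article 2 Definitions
Placeholder body text for the second article.
"""
sections = split_on_headings(doc)
for heading, body in sections:
    print(heading, "->", body)
```

Because each piece keeps its heading, the original section reference can travel with the chunk, which is what makes tracing back to the source straightforward.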

Self-Sufficient Context

Each chunk preserves enough surrounding context to stand alone. References to related sections are included as explicit cross-references in the metadata. A chunk on AI auditing requirements will include enough detail about the entity being audited and the audit scope, even if the source document defines those terms elsewhere.

Dual Format Delivery

Every chunk is delivered in two formats. Markdown (.md) is human-readable, suitable for review and documentation. JSON (.json) includes full metadata and is optimized for programmatic ingestion into vector databases and search systems.
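As a sketch of how the two formats relate, the example below parses a JSON chunk and renders a simple Markdown view of it. The field names come from the schema described on this page; the sample values and the Markdown layout are hypothetical and may differ from the files shipped in a pack.

```python
import json

# Hypothetical chunk with placeholder values; real chunks carry more fields.
raw = """{
  "chunk_id": "EXAMPLE-001",
  "semantic_title": "Example requirement title",
  "content": "Placeholder body text.",
  "source_document": "Example Source Document",
  "section_reference": "Section 1"
}"""


def to_markdown(chunk: dict) -> str:
    """Render a chunk as a small Markdown document (assumed layout)."""
    return (
        f"# {chunk['semantic_title']}\n\n"
        f"{chunk['content']}\n\n"
        f"Source: {chunk['source_document']}, "
        f"{chunk['section_reference']} ({chunk['chunk_id']})\n"
    )


chunk = json.loads(raw)
md = to_markdown(chunk)
print(md)
```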

Schema Walkthrough

Each chunk carries 17+ metadata fields that enable hybrid search, faceted filtering, and full traceability to the source. This structured metadata is what makes RAG packs more powerful than raw documents.

chunk_id

Unique identifier following pack-specific conventions. Example: EU-AI-ACT-001, OWASP-LLM01-DEF. IDs are human-readable and sortable.

semantic_title

A concise, descriptive title for the chunk written for human comprehension. Suitable for display in search results or documentation.

content

The full text of the chunk in Markdown format. Includes inline formatting, lists, and structure while remaining plaintext for embedding.

domain

Controlled vocabulary field: AI_Governance, AI_Security, AI_Privacy, AI_FinServ, or AI_Healthcare. Enables domain-specific filtering and faceted search.

source_document

The name of the originating document, exactly as published. Example: "EU AI Act (Regulation EU 2024/1689)".

source_authority

The issuing body. Example: NIST, EU Parliament, FDA, OWASP, SEC. Enables filtering by regulatory jurisdiction.

section_reference

Original section, article, subsection, or requirement number from the source. Used to locate the chunk in the original document.

cross_references

Array of chunk IDs from other packs or documents that address related topics. Enables semantic linkage across the pack ecosystem.

pack_version

Semantic version of the pack (e.g., "1.2.0"). Allows customers to track pack age and request updates if needed.

entity_type

Security pack specific. Classifies whether the chunk addresses risks to models, data, systems, or operators.

affected_asset

Security pack specific. Identifies the asset class (LLM, vectordb, embedding model, application) addressed by the requirement.

tactical_phase

Security pack specific. Maps to MITRE ATLAS tactics: reconnaissance, resource development, initial access, execution, persistence, and others.

obligation_type

Privacy pack specific. Classifies the requirement as data subject right, controller obligation, processor obligation, or cross-border restriction.

jurisdiction

Privacy pack specific. Indicates the applicable legal jurisdiction: EU, US Federal, US State (with state code), or international.

data_category

Privacy pack specific. Specifies the type of personal data addressed: general, special category, biometric, or sensitive.

regulatory_domain

FinServ pack specific. Indicates the financial sector: banking, capital markets, insurance, payments, or cross-sector.

financial_sector

FinServ pack specific. Specifies the applicable sector: retail, institutional, infrastructure, or market participants.
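One way to work with these fields in code is to model the common ones as a typed record and validate the controlled vocabulary on load. The sketch below is an assumption about how a consumer might do this, not an official client: the Chunk class, the extra dictionary for pack-specific fields, and the placeholder section reference and content are all hypothetical.

```python
from dataclasses import dataclass, field

# Controlled vocabulary for the domain field, as listed above.
DOMAINS = {"AI_Governance", "AI_Security", "AI_Privacy", "AI_FinServ", "AI_Healthcare"}


@dataclass
class Chunk:
    chunk_id: str
    semantic_title: str
    content: str
    domain: str
    source_document: str
    source_authority: str
    section_reference: str
    pack_version: str
    cross_references: list[str] = field(default_factory=list)
    # Pack-specific fields (e.g. jurisdiction, tactical_phase) kept in one dict.
    extra: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject values outside the controlled vocabulary early.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


example = Chunk(
    chunk_id="EU-AI-ACT-001",
    semantic_title="Example title",
    content="Placeholder body text.",
    domain="AI_Governance",
    source_document="EU AI Act (Regulation EU 2024/1689)",
    source_authority="EU Parliament",
    section_reference="Article 1",  # placeholder value
    pack_version="1.2.0",
)
print(example.chunk_id, example.domain)
```

Keeping pack-specific fields in a separate dictionary is only one design choice; separate subclasses per pack would give stricter typing at the cost of more code.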

Hybrid Search Capability

The schema enables two simultaneous search strategies. Semantic search finds chunks by meaning and similarity in the vector space. Metadata filtering narrows results by domain, source authority, jurisdiction, and other structured fields. Together, they allow you to ask questions like: "Find all security requirements from NIST that apply to LLMs" or "List all EU privacy obligations for customer data in FinServ."

Controlled vocabularies ensure consistent filtering across all chunks. All sources, authorities, entities, and obligations use the same standardized terminology, making faceted search reliable and predictable.
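The combination of metadata filtering and semantic ranking can be sketched in a few lines of plain Python. This example assumes each record already has an embedding vector attached under a hypothetical "embedding" key (the packs themselves do not define that field), and it uses tiny two-dimensional vectors purely for illustration. A real deployment would delegate both steps to a vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, chunks, filters, top_k=3):
    """Filter by exact metadata match, then rank the survivors by similarity."""
    candidates = [c for c in chunks if all(c.get(k) == v for k, v in filters.items())]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return [c["chunk_id"] for c in ranked[:top_k]]


# Hypothetical records: IDs and vectors are made up for the example.
chunks = [
    {"chunk_id": "A", "source_authority": "NIST", "affected_asset": "LLM", "embedding": [1.0, 0.0]},
    {"chunk_id": "B", "source_authority": "NIST", "affected_asset": "LLM", "embedding": [0.6, 0.8]},
    {"chunk_id": "C", "source_authority": "OWASP", "affected_asset": "LLM", "embedding": [1.0, 0.0]},
]
result = hybrid_search([1.0, 0.1], chunks, {"source_authority": "NIST", "affected_asset": "LLM"})
print(result)
```

Filtering before ranking means a chunk from the wrong authority can never appear in the results, no matter how similar its text is to the query.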

Source Index

All 42 source documents organized by pack and domain. Each entry includes the document name, issuing authority, and chunk count.

Pack 01: AI Governance

339 chunks / 8 sources

EU AI Act (Regulation EU 2024/1689) | European Parliament & Council | 158 chunks
NIST AI RMF 1.0 (AI 100-1) | NIST | 16 chunks
NIST AI RMF Playbook | NIST | 66 chunks
GPAI Code of Practice: Transparency | EU AI Office | 6 chunks
NIST AI 600-1 GenAI Profile | NIST | 34 chunks
GPAI Code of Practice: Safety & Security | EU AI Office | 35 chunks
GPAI Code of Practice: Copyright | EU AI Office | 8 chunks
OECD Due Diligence Guidance for Responsible AI (2026) | OECD | 16 chunks

Pack 02: AI Security

190 chunks / 5 sources

OWASP Top 10 for LLM Applications 2025 (v2.0) | OWASP Foundation | 30 chunks
MITRE ATLAS STIX Bundle | MITRE Corporation | 132 chunks
CISA/NSA Joint AI Data Security Guidelines | CISA/NSA | 13 chunks
NIST IR 8596 CSF AI Profile | NIST | 6 chunks
NIST AI 100-2e2025 AML Taxonomy | NIST | 9 chunks

Pack 03: AI Privacy

111 chunks / 6 sources

Colorado SB 24-205 (Consumer Protections for AI) | Colorado Legislature | 21 chunks
Texas HB 149 / TRAIGA | Texas Legislature | 29 chunks
California SB 53 / TFAIA | California Legislature | 16 chunks
EDPB-EDPS Joint Opinion 1/2026 (AI Omnibus) | EDPB/EDPS | 17 chunks
EU AI Act Articles 10, 26, 27 (Privacy Cross-Reference) | European Parliament & Council | 15 chunks
CISA 2025 SBOM Minimum Elements | CISA | 13 chunks

Pack 04: Financial Services

206 chunks / 10 sources

CRI FS AI RMF Guidebook v1.0 | Cyber Risk Institute / Treasury | 40 chunks
FINRA 2026 Annual Regulatory Oversight Report | FINRA | 29 chunks
GAO-25-107197 AI Use & Oversight in FinServ | GAO | 28 chunks
IOSCO AI in Capital Markets CR/01/2025 | IOSCO | 26 chunks
Treasury AI in Financial Services (Dec 2024) | U.S. Treasury | 24 chunks
SR 11-7 Model Risk Management | Federal Reserve/OCC | 17 chunks
SEC FY2026 Examination Priorities | SEC | 15 chunks
CFPB Supervisory Highlights: Advanced Tech Issue 38 | CFPB | 10 chunks
NYDFS Industry Letter: AI Cyber Risks | NYDFS | 10 chunks
FinCEN Alert: Deepfake Fraud (FIN-2024-Alert004) | FinCEN | 7 chunks

Pack 05: Healthcare AI

190 chunks / 13 sources

FDA AI-Enabled Device Software Functions | FDA | 28 chunks
GAO Healthcare AI Review | GAO | 21 chunks
FDA PCCP Guidance | FDA | 18 chunks
ONC HTI-1 Final Rule | ONC | 18 chunks
EMA AI Reflection Paper | EMA | 17 chunks
AHRQ AI Patient Safety & CDS | AHRQ | 15 chunks
HHS AI Strategy | HHS | 13 chunks
FDA ML Transparency | FDA/Health Canada/MHRA | 12 chunks
GMLP Principles | FDA/Health Canada/MHRA | 11 chunks
VA Trustworthy AI Framework | VA | 11 chunks
CMS AI Guidance | CMS | 10 chunks
FDA SaMD Action Plan | FDA | 9 chunks
HHS OCR AI Nondiscrimination | HHS OCR | 7 chunks

Try Before You Buy

Inspect 20 sample chunks from across all five packs. Review the schema, chunk quality, and metadata structure to confirm Anvil Index packs meet your needs.

Explore Sample Dataset on Hugging Face

Versioning & Updates

Anvil Index packs are living products. We monitor source documents monthly and release updates as regulatory guidance evolves.

Semantic Versioning

Each pack uses semantic versioning (MAJOR.MINOR.PATCH). Major versions indicate significant structural changes or substantial new source additions. Minor versions add new chunks or sources. Patch versions correct errors or improve metadata without adding new content.
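If you track pack versions programmatically, the MAJOR.MINOR.PATCH scheme can be compared with a few lines of code. The sketch below assumes plain three-part version strings such as "1.2.0" with no pre-release suffixes; the function names are hypothetical.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the change from old to new as 'major', 'minor', 'patch', or 'none'."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        return "none"  # same version or a downgrade
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


print(bump_kind("1.2.0", "1.3.0"))
print(bump_kind("1.2.0", "2.0.0"))
```

Comparing integer tuples rather than raw strings avoids the classic mistake of treating "1.10.0" as older than "1.9.0".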

Change Documentation

Every pack includes a CHANGELOG documenting all changes between versions. You can see exactly what changed, which chunks were added or modified, and when.
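To cross-check a CHANGELOG against the files you actually received, you could diff two versions of a pack yourself. The sketch below is one possible approach and makes no claim about how the CHANGELOG is produced: it assumes each version is loaded as a list of dictionaries with chunk_id and content keys, and it detects modifications by hashing the content. The sample records are hypothetical.

```python
import hashlib


def fingerprint(chunk: dict) -> str:
    """SHA-256 of the chunk content, used to detect text changes."""
    return hashlib.sha256(chunk["content"].encode("utf-8")).hexdigest()


def diff_packs(old: list[dict], new: list[dict]) -> dict[str, list[str]]:
    """Compare two pack versions by chunk_id and content fingerprint."""
    old_fp = {c["chunk_id"]: fingerprint(c) for c in old}
    new_fp = {c["chunk_id"]: fingerprint(c) for c in new}
    shared = old_fp.keys() & new_fp.keys()
    return {
        "added": sorted(new_fp.keys() - old_fp.keys()),
        "removed": sorted(old_fp.keys() - new_fp.keys()),
        "modified": sorted(k for k in shared if old_fp[k] != new_fp[k]),
    }


# Hypothetical pack contents for two versions.
v1 = [
    {"chunk_id": "EX-001", "content": "Original text."},
    {"chunk_id": "EX-002", "content": "Unchanged text."},
]
v2 = [
    {"chunk_id": "EX-001", "content": "Amended text."},
    {"chunk_id": "EX-002", "content": "Unchanged text."},
    {"chunk_id": "EX-003", "content": "New text."},
]
changes = diff_packs(v1, v2)
print(changes)
```

This only compares the content field, so a patch release that changes metadata alone would show no differences here; hashing the full record instead would catch those.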

Monthly Monitoring

We review all source documents monthly for updates, amendments, and new releases. When a source document is updated, the affected chunks are regenerated with current content and refreshed metadata.

New Sources and Cross-References

New source documents are evaluated against our sourcing criteria. When a new document meets the threshold for inclusion, it is added to the appropriate pack. Cross-references are maintained across packs, so updates in one pack may trigger crosswalk updates in others.

Update Notifications

Customers receive update notifications for packs they have purchased. You can review the changelog and decide whether to upgrade.