Executive Summary
This document presents the architectural validation and deployment results of an enterprise Retrieval-Augmented Generation (RAG) system within the air-gapped perimeter of a Top-20 bank. The system enables intelligent search across 100,000+ internal documents without transmitting data to external clouds.
Key Business Outcomes:
- Privacy: 100% of data remains within the perimeter — regulatory compliance achieved
- Accuracy: 92% Faithfulness (RAGAS) — hallucination minimization is critical for compliance
- Performance: TTFT < 0.4 sec (instant UI response), E2E 3–5 sec for detailed legal queries on proprietary GPU infrastructure (NVIDIA A100)
- Security: Active Directory integration — users only see documents they're authorized to access
Market Context: Gartner forecasts that by 2026, 80% of enterprise organizations will migrate from public LLM APIs to self-hosted solutions due to regulatory requirements.
1. Problem Statement: Public LLM Risks
1.1 Initial State
The bank utilized full-text search (Elasticsearch) across its corporate knowledge base. When querying "What is the transfer limit for corporate accounts?", the system returned 200+ document links — employees spent up to 30 minutes locating the correct regulation.
1.2 Why OpenAI/Anthropic Cannot Be Used
Table 1. Regulatory Constraints for Financial Sector
| Requirement | OpenAI API | Self-Hosted LLM |
|---|---|---|
| GDPR / Data Residency | Data transmitted to US | Within perimeter |
| Financial Regulations | No storage control | Full control |
| Banking Secrecy Laws | Leak risk | Air-gapped |
| Audit and Traceability | Black box | Complete logging |
| SLA and Availability | Vendor-dependent | Own infrastructure |
1.3 Limitations of Pure Vector Search
Table 2. Dense Retrieval Constraints
| Scenario | Vector Search | Problem |
|---|---|---|
| "Account limit 40817..." | Low recall | Account numbers lack semantics |
| "ISO 20022" | Inaccurate | Abbreviations embed poorly |
| "Appendix 3 to Regulation #42" | Misses | Exact references require keyword match |
Conclusion: Banking documents require Hybrid Search (BM25 + Vectors).
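The fusion of BM25 and dense rankings referenced above is typically done with reciprocal rank fusion (RRF). A minimal sketch with hypothetical document IDs; the rankings and the constant `k=60` are illustrative, not values mandated by the deployment:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 finds the exact regulation reference; dense search finds paraphrases.
bm25_ranking = ["reg_42_app3", "reg_42", "limits_faq"]
dense_ranking = ["limits_faq", "reg_42_app3", "transfer_policy"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

A document that appears high in both lists (here `reg_42_app3`) outranks one that is top in only a single list, which is why RRF is robust to the very different score scales of BM25 and cosine similarity.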
2. Architectural Decisions
2.1 Self-Hosted Inference Stack
Fig. 1. Secure RAG Architecture. The red dashed boundary denotes the Air-Gapped perimeter. All ML components are deployed on the bank's proprietary servers.
2.2 Technology Stack Rationale
2.2.1 LLM Selection: Llama 3 vs Alternatives
Table 3. Self-Hosted LLM Comparison
| Model | Parameters | License | Multilingual | VRAM (FP16) |
|---|---|---|---|---|
| Llama 3 70B | 70B | Meta License | Good | 140 GB |
| Mistral 7B | 7B | Apache 2.0 | Moderate | 14 GB |
| Qwen 72B | 72B | Qwen License | Excellent | 144 GB |
| GPT-J | 6B | Apache 2.0 | Limited | 12 GB |
LLM Selection
- Selected: Llama 3 70B + AWQ — balance of quality and GPU requirements; 4-bit quantization (AWQ) fits the 70B model on 4x A100.
- Rejected: Mistral 7B — insufficient quality for legal texts.
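The VRAM figures in Table 3 and the quantization argument can be sanity-checked with back-of-the-envelope arithmetic (weights only; the KV cache and activations add overhead, so these are lower bounds):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param

fp16 = weight_vram_gb(70, 2.0)   # FP16: 2 bytes/param  -> 140 GB (Table 3)
awq4 = weight_vram_gb(70, 0.5)   # AWQ 4-bit: 0.5 bytes/param -> 35 GB
```

The ~4x reduction is what lets a 70B model, together with its KV cache for long legal contexts, run comfortably on the bank's 4x A100 80GB node.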
2.2.2 Vector DB Selection: Qdrant vs Alternatives
Table 4. Vector Database Comparison
| Criterion | Pinecone | Weaviate | Qdrant |
|---|---|---|---|
| Self-Hosted | SaaS only | Yes | Yes |
| Pre-Filtering (ACL) | Limited | Yes | Native |
| Hybrid Search | No | Yes | BM25 + Dense |
| Production Ready | Yes | Partial | Yes |
| License | Proprietary | BSD-3 | Apache 2.0 |
Vector DB Selection
- Selected: Qdrant — native pre-filtering by metadata, critical for ACL; filtering occurs at the index level, not in post-processing.
- Rejected: Pinecone — SaaS only, so data would leave the bank's perimeter.
2.2.3 Inference Engine: vLLM vs Alternatives
Table 5. Inference Server Comparison
| Engine | Throughput (tok/s) | Paged Attention | Continuous Batching |
|---|---|---|---|
| HuggingFace TGI | ~30 | Yes | Yes |
| vLLM | ~50 | PagedAttention | Yes |
| NVIDIA Triton | ~45 | Manual | Yes |
| llama.cpp | ~20 | No | No |
Inference Engine Selection
- Selected: vLLM — best throughput (~50 tok/s) via PagedAttention; OpenAI-compatible API simplifies integration.
- Rejected: HuggingFace TGI — lower throughput (~30 tok/s).
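Because vLLM exposes an OpenAI-compatible `/v1/chat/completions` endpoint, integration reduces to a plain HTTP call. A sketch of the request body, assuming a hypothetical in-perimeter URL and model name (both are illustrative, not the production values):

```python
import json

# Hypothetical in-perimeter endpoint; no data leaves the bank's network.
VLLM_URL = "http://llm.internal:8000/v1/chat/completions"

payload = {
    "model": "llama-3-70b-awq",  # model name as registered in vLLM (assumed)
    "messages": [
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "What is the transfer limit for corporate accounts?"},
    ],
    "temperature": 0.0,  # deterministic output for compliance use cases
    "stream": True,      # token streaming keeps TTFT low for the UI
}
body = json.dumps(payload)
# e.g. requests.post(VLLM_URL, data=body, headers={"Content-Type": "application/json"})
```

Any OpenAI-compatible client library can be pointed at the internal base URL, which is what "simplifies integration" means in practice.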
3. Reliability and Accuracy Mechanisms
3.1 Hybrid Search Pipeline
Fig. 2. Hybrid Search Pipeline. ACL filter is applied at the retrieval stage, not post-filtering — critical for security and performance.
Key Mechanism: ACL filter is applied at the index level (pre-filtering), not after obtaining results. This guarantees that users physically cannot retrieve documents without appropriate access.
async def secure_hybrid_search(query: str, user_groups: list[str], top_k: int = 20):
    # ACL filter applied at the index level (pre-filtering)
    acl_filter = models.Filter(
        must=[models.FieldCondition(
            key="allowed_groups",
            match=models.MatchAny(any=user_groups)
        )]
    )
    # Dense search with pre-filtering
    dense_results = await qdrant_client.search(
        collection_name="bank_docs",
        query_vector=await embed_query(query),
        query_filter=acl_filter,
        limit=top_k
    )
    # Sparse (BM25) search under the same ACL filter
    sparse_results = await bm25_search(query, acl_filter, limit=top_k)
    # RRF merge, then cross-encoder reranking of the top candidates
    merged = reciprocal_rank_fusion(dense_results, sparse_results, k=60)
    reranked = await reranker.rerank(query, merged[:20])
    return reranked[:3]

3.2 Reranking for Accuracy Enhancement
Bi-Encoder Problem: Embedding models (E5, BGE) are fast but less accurate — query and document are embedded independently.
Cross-Encoder Solution: BGE-Reranker processes the (query, document) pair together, yielding better relevance understanding.
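The bi-encoder/cross-encoder distinction can be sketched as follows. Here `score_pair` is a toy stand-in for a real cross-encoder such as BGE-Reranker (which scores the concatenated pair with a transformer), not its actual implementation:

```python
def score_pair(query: str, document: str) -> float:
    """Toy stand-in for a cross-encoder: token-overlap score.
    A real BGE-Reranker feeds (query, document) jointly through a transformer."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, docs: list[str], top_n: int = 3) -> list[str]:
    # Unlike a bi-encoder, each pair is scored together, not via two cached vectors.
    return sorted(docs, key=lambda doc: score_pair(query, doc), reverse=True)[:top_n]

docs = [
    "Transfer limits for corporate accounts are set in Regulation #42.",
    "Cafeteria menu for the current week.",
]
top = rerank("transfer limit for corporate accounts", docs)
```

The trade-off is cost: the cross-encoder must run once per candidate at query time, which is why it is applied only to the top ~20 results after retrieval rather than to the whole corpus.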
Table 6. Reranking Impact on Metrics
| Metric | Without Reranking | With BGE-Reranker | Improvement |
|---|---|---|---|
| MRR@10 | 0.61 | 0.89 | +46% |
| NDCG@10 | 0.58 | 0.85 | +47% |
| Recall@3 | 0.72 | 0.94 | +31% |
3.3 Structured Citations (Hallucination Prevention)
To minimize hallucinations, a strict system prompt with rules is used:
- Answers based only on the provided context
- Mandatory source citation in the format [Document: name, section X.X]
- Prohibition on inferring information not present in the context
Output is structured via Pydantic models: the response, a list of citations (document/section/page), and a confidence score derived from semantic similarity.
4. Results and Metrics
4.1 Comparative Analysis
Table 7. Key Metrics: OpenAI Wrapper vs Secure RAG
| Metric | OpenAI API (Baseline) | Secure RAG (On-Premise) |
|---|---|---|
| Data Privacy | Data in US | 100% within bank |
| Latency (p50) | 1.2s | 1.8s |
| Latency (p99) | 3.5s | 2.5s (more stable) |
| Faithfulness (RAGAS) | 78% | 92% (+Reranking) |
| Answer Relevancy | 81% | 88% |
| ACL Compliance | Ignored | Native |
| Cost per Query | $0.03 | ~$0 marginal (amortized CapEx) |
| Availability SLA | 99.9% (vendor) | 99.95% (own infra) |
4.2 Economic Impact
Table 8. TCO Analysis (3 years)
| Item | OpenAI API | Self-Hosted |
|---|---|---|
| API Costs (1M queries/month) | $1,080,000 | $0 |
| Infrastructure (4x A100) | $0 | $400,000 (Client-side Hardware Investment) |
| Engineering (setup + support) | $50,000 | $200,000 |
| Compliance Risk | BLOCKING / HIGH ¹ | ZERO |
| Total 3Y TCO | $1,130,000 | $600,000 |
¹ Compliance Risk = BLOCKING: risk of regulatory sanctions, with GDPR fines of up to 4% of global annual turnover. The TCO calculation assumes GPT-4-class models (high intelligence); using smaller models (gpt-4o-mini and analogues) in FinTech is unacceptable due to the critical hallucination risk when working with legal and financial data.
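The API-cost line in Table 8 follows from straightforward arithmetic; a quick check using the figures from the tables (assumed constant over the 3-year horizon):

```python
queries_per_month = 1_000_000
cost_per_query = 0.03       # $ per query, GPT-4-class pricing from Table 7
months = 36                 # 3-year horizon

api_cost = queries_per_month * cost_per_query * months  # $1,080,000
self_hosted = 400_000 + 200_000                         # hardware + engineering
savings = (api_cost + 50_000) - self_hosted             # vs $1,130,000 total API-route TCO
```

At this query volume the self-hosted route breaks even well inside the first year of the 3-year window, before accounting for the compliance risk.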
4.3 Business Results
- Search Time: from 30 minutes to 15 seconds (120x faster)
- L2 Support: ticket processing time reduced by 40%
- Compliance: successfully passed security audit and Penetration Test
- Adoption: 3,500 active users within 3 months
5. Infrastructure and Scaling
5.1 Hardware Requirements
Table 9. Server Specifications
| Component | Configuration | Purpose |
|---|---|---|
| LLM Server | 4x NVIDIA A100 80GB, 512GB RAM | vLLM inference |
| Embedding Server | 1x NVIDIA A10, 64GB RAM | E5-large-v2 |
| Reranker | 1x NVIDIA A10, 64GB RAM | BGE-Reranker |
| Qdrant Cluster | 3 nodes, 128GB RAM, NVMe SSD | Vector storage |
| Elasticsearch | 3 nodes, 64GB RAM, SSD | BM25 index |
5.2 Kubernetes Deployment
For high availability within the banking perimeter, vLLM is deployed with a minimum of two replicas (eliminating the single point of failure). Horizontal scaling is triggered by the vllm_requests_running metric with a threshold of 10 concurrent requests.
6. Conclusions and Recommendations
This deployment confirmed that self-hosted RAG based on Llama 3, Qdrant, and vLLM is a practical path to enterprise-grade AI search in regulated industries.
Key Takeaways:
- Self-Hosted LLM is mandatory for GDPR compliance and financial regulations
- Hybrid Search (BM25 + Vectors) is critical for accurate search across structured documents
- Reranking improves accuracy by 30-50% — a mandatory component
- ACL Pre-Filtering must be at retrieval level, not post-processing
- vLLM with PagedAttention provides production-grade throughput
Recommendation: This architecture is applicable to any organization with data residency requirements: government sector, healthcare, defense, telecom.