Executive Summary
This document presents the architectural validation and deployment results of an enterprise Retrieval-Augmented Generation (RAG) system within the air-gapped perimeter of a Top-20 bank. The system enables intelligent search across 100,000+ internal documents without transmitting data to external clouds.
Key Business Outcomes:
- Privacy: 100% of data remains within the perimeter — regulatory compliance achieved
- Accuracy: 92% Faithfulness (RAGAS) — hallucination minimization is critical for compliance
- Performance: TTFT < 0.4 sec (instant UI response), E2E 3–5 sec for detailed legal queries on proprietary GPU infrastructure (NVIDIA A100)
- Security: Active Directory integration — users only see documents they're authorized to access
Market Context: Gartner forecasts that by 2026, 80% of enterprise organizations will migrate from public LLM APIs to self-hosted solutions due to regulatory requirements.
1. Problem Statement: Public LLM Risks
1.1 Initial State
The bank utilized full-text search (Elasticsearch) across its corporate knowledge base. When querying "What is the transfer limit for corporate accounts?", the system returned 200+ document links — employees spent up to 30 minutes locating the correct regulation.
1.2 Why OpenAI/Anthropic Cannot Be Used
Table 1. Regulatory Constraints for Financial Sector
| Requirement | OpenAI API | Self-Hosted LLM |
|---|---|---|
| GDPR / Data Residency | Data transmitted to US | Within perimeter |
| Financial Regulations | No storage control | Full control |
| Banking Secrecy Laws | Leak risk | Air-gapped |
| Audit and Traceability | Black box | Complete logging |
| SLA and Availability | Vendor-dependent | Own infrastructure |
1.3 Limitations of Pure Vector Search
Table 2. Dense Retrieval Constraints
| Scenario | Vector Search | Problem |
|---|---|---|
| "Account limit 40817..." | Low recall | Account numbers lack semantics |
| "ISO 20022" | Inaccurate | Abbreviations embed poorly |
| "Appendix 3 to Regulation #42" | Misses | Exact references require keyword match |
Conclusion: Banking documents require Hybrid Search (BM25 + Vectors).
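The fusion of BM25 and dense rankings referenced above is typically done with reciprocal rank fusion (RRF). A minimal sketch with hypothetical document IDs; the rankings and the constant `k=60` are illustrative, not values mandated by the deployment:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 finds the exact regulation reference; dense search finds paraphrases.
bm25_ranking = ["reg_42_app3", "reg_42", "limits_faq"]
dense_ranking = ["limits_faq", "reg_42_app3", "transfer_policy"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

A document that appears high in both lists (here `reg_42_app3`) outranks one that is top in only a single list, which is why RRF is robust to the very different score scales of BM25 and cosine similarity.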
2. Architectural Decisions
2.1 Self-Hosted Inference Stack
Fig. 1. Secure RAG Architecture. The red dashed boundary denotes the Air-Gapped perimeter. All ML components are deployed on the bank's proprietary servers.
2.2 Technology Stack Rationale
2.2.1 LLM Selection: Llama 3 vs Alternatives
Table 3. Self-Hosted LLM Comparison
| Model | Parameters | License | Multilingual | VRAM (FP16) |
|---|---|---|---|---|
| Llama 3 70B | 70B | Meta License | Good | 140 GB |
| Mistral 7B | 7B | Apache 2.0 | Moderate | 14 GB |
| Qwen 72B | 72B | Qwen License | Excellent | 144 GB |
| GPT-J | 6B | Apache 2.0 | Limited | 12 GB |
LLM Selection
- Selected: Llama 3 70B + AWQ — balance of quality and GPU requirements; 4-bit quantization (AWQ) fits the 70B model on 4x A100.
- Rejected: Mistral 7B — insufficient quality for legal texts.
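The VRAM figures in Table 3 and the quantization argument can be sanity-checked with back-of-the-envelope arithmetic (weights only; the KV cache and activations add overhead, so these are lower bounds):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param

fp16 = weight_vram_gb(70, 2.0)   # FP16: 2 bytes/param  -> 140 GB (Table 3)
awq4 = weight_vram_gb(70, 0.5)   # AWQ 4-bit: 0.5 bytes/param -> 35 GB
```

The ~4x reduction is what lets a 70B model, together with its KV cache for long legal contexts, run comfortably on the bank's 4x A100 80GB node.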
2.2.2 Vector DB Selection: Qdrant vs Alternatives
Table 4. Vector Database Comparison
| Criterion | Pinecone | Weaviate | Qdrant |
|---|---|---|---|
| Self-Hosted | SaaS only | Yes | Yes |
| Pre-Filtering (ACL) | Limited | Yes | Native |
| Hybrid Search | No | Yes | BM25 + Dense |
| Production Ready | Yes | Partial | Yes |
| License | Proprietary | BSD-3 | Apache 2.0 |
Vector DB Selection
- Selected: Qdrant — native pre-filtering by metadata, critical for ACL; filtering occurs at the index level, not in post-processing.
- Rejected: Pinecone — SaaS only, so data would leave the bank's perimeter.
2.2.3 Inference Engine: vLLM vs Alternatives
Table 5. Inference Server Comparison
| Engine | Throughput (tok/s) | Paged Attention | Continuous Batching |
|---|---|---|---|
| HuggingFace TGI | ~30 | Yes | Yes |
| vLLM | ~50 | PagedAttention | Yes |
| NVIDIA Triton | ~45 | Manual | Yes |
| llama.cpp | ~20 | No | No |
Inference Engine Selection
- Selected: vLLM — best throughput (~50 tok/s) via PagedAttention; OpenAI-compatible API simplifies integration.
- Rejected: HuggingFace TGI — lower throughput (~30 tok/s).
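Because vLLM exposes an OpenAI-compatible `/v1/chat/completions` endpoint, integration reduces to a plain HTTP call. A sketch of the request body, assuming a hypothetical in-perimeter URL and model name (both are illustrative, not the production values):

```python
import json

# Hypothetical in-perimeter endpoint; no data leaves the bank's network.
VLLM_URL = "http://llm.internal:8000/v1/chat/completions"

payload = {
    "model": "llama-3-70b-awq",  # model name as registered in vLLM (assumed)
    "messages": [
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "What is the transfer limit for corporate accounts?"},
    ],
    "temperature": 0.0,  # deterministic output for compliance use cases
    "stream": True,      # token streaming keeps TTFT low for the UI
}
body = json.dumps(payload)
# e.g. requests.post(VLLM_URL, data=body, headers={"Content-Type": "application/json"})
```

Any OpenAI-compatible client library can be pointed at the internal base URL, which is what "simplifies integration" means in practice.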
3. Reliability and Accuracy Mechanisms
3.1 Hybrid Search Pipeline
Fig. 2. Hybrid Search Pipeline. ACL filter is applied at the retrieval stage, not post-filtering — critical for security and performance.
Key Mechanism: ACL filter is applied at the index level (pre-filtering), not after obtaining results. This guarantees that users physically cannot retrieve documents without appropriate access.
async def secure_hybrid_search(query: str, user_groups: list[str], top_k: int = 20):
    # ACL filter applied at the index level (pre-filtering)
    acl_filter = models.Filter(
        must=[models.FieldCondition(
            key="allowed_groups",
            match=models.MatchAny(any=user_groups)
        )]
    )
    # Dense search with pre-filtering
    dense_results = await qdrant_client.search(
        collection_name="bank_docs",
        query_vector=await embed_query(query),
        query_filter=acl_filter,
        limit=top_k
    )
    # Sparse (BM25) search under the same ACL filter
    sparse_results = await bm25_search(query, acl_filter, limit=top_k)
    # RRF merge, then cross-encoder reranking of the top candidates
    merged = reciprocal_rank_fusion(dense_results, sparse_results, k=60)
    reranked = await reranker.rerank(query, merged[:20])
    return reranked[:3]

3.2 Reranking for Accuracy Enhancement
Bi-Encoder Problem: Embedding models (E5, BGE) are fast but less accurate — query and document are embedded independently.
Cross-Encoder Solution: BGE-Reranker processes the (query, document) pair together, yielding better relevance understanding.
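The bi-encoder/cross-encoder distinction can be sketched as follows. Here `score_pair` is a toy stand-in for a real cross-encoder such as BGE-Reranker (which scores the concatenated pair with a transformer), not its actual implementation:

```python
def score_pair(query: str, document: str) -> float:
    """Toy stand-in for a cross-encoder: token-overlap score.
    A real BGE-Reranker feeds (query, document) jointly through a transformer."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, docs: list[str], top_n: int = 3) -> list[str]:
    # Unlike a bi-encoder, each pair is scored together, not via two cached vectors.
    return sorted(docs, key=lambda doc: score_pair(query, doc), reverse=True)[:top_n]

docs = [
    "Transfer limits for corporate accounts are set in Regulation #42.",
    "Cafeteria menu for the current week.",
]
top = rerank("transfer limit for corporate accounts", docs)
```

The trade-off is cost: the cross-encoder must run once per candidate at query time, which is why it is applied only to the top ~20 results after retrieval rather than to the whole corpus.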
Table 6. Reranking Impact on Metrics
| Metric | Without Reranking | With BGE-Reranker | Improvement |
|---|---|---|---|
| MRR@10 | 0.61 | 0.89 | +46% |
| NDCG@10 | 0.58 | 0.85 | +47% |
| Recall@3 | 0.72 | 0.94 | +31% |
3.3 Structured Citations (Hallucination Prevention)
To minimize hallucinations, a strict system prompt with rules is used:
- Answers based only on the provided context
- Mandatory source citation in the format [Document: name, section X.X]
- Prohibition on inferring information not present in the context
Output is structured via Pydantic models: the response, a list of citations (document/section/page), and a confidence score derived from semantic similarity.
4. Results and Metrics
4.1 Comparative Analysis
Table 7. Key Metrics: OpenAI Wrapper vs Secure RAG
| Metric | OpenAI API (Baseline) | Secure RAG (On-Premise) |
|---|---|---|
| Data Privacy | Data in US | 100% within bank |
| Latency (p50) | 1.2s | 1.8s |
| Latency (p99) | 3.5s | 2.5s (more stable) |
| Faithfulness (RAGAS) | 78% | 92% (+Reranking) |
| Answer Relevancy | 81% | 88% |
| ACL Compliance | Ignored | Native |
| Cost per Query | $0.03 | ~$0 marginal (amortized CapEx) |
| Availability SLA | 99.9% (vendor) | 99.95% (own infra) |
4.2 Economic Impact
Table 8. TCO Analysis (3 years)
| Item | OpenAI API | Self-Hosted |
|---|---|---|
| API Costs (1M queries/month) | $1,080,000 | $0 |
| Infrastructure (4x A100) | $0 | $400,000 (Client-side Hardware Investment) |
| Engineering (setup + support) | $50,000 | $200,000 |
| Compliance Risk | BLOCKING / HIGH ¹ | ZERO |
| Total 3Y TCO | $1,130,000 | $600,000 |
¹ Compliance Risk = BLOCKING: risk of regulatory sanctions, with GDPR fines of up to 4% of global annual turnover. The TCO calculation assumes GPT-4-class models (high intelligence); using smaller models (gpt-4o-mini and analogues) in FinTech is unacceptable due to the critical hallucination risk when working with legal and financial data.
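The API-cost line in Table 8 follows from straightforward arithmetic; a quick check using the figures from the tables (assumed constant over the 3-year horizon):

```python
queries_per_month = 1_000_000
cost_per_query = 0.03       # $ per query, GPT-4-class pricing from Table 7
months = 36                 # 3-year horizon

api_cost = queries_per_month * cost_per_query * months  # $1,080,000
self_hosted = 400_000 + 200_000                         # hardware + engineering
savings = (api_cost + 50_000) - self_hosted             # vs $1,130,000 total API-route TCO
```

At this query volume the self-hosted route breaks even well inside the first year of the 3-year window, before accounting for the compliance risk.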
4.3 Business Results
- Search Time: from 30 minutes to 15 seconds (120x faster)
- L2 Support: ticket processing time reduced by 40%
- Compliance: successfully passed security audit and Penetration Test
- Adoption: 3,500 active users within 3 months
5. Infrastructure and Scaling
5.1 Hardware Requirements
Table 9. Server Specifications
| Component | Configuration | Purpose |
|---|---|---|
| LLM Server | 4x NVIDIA A100 80GB, 512GB RAM | vLLM inference |
| Embedding Server | 1x NVIDIA A10, 64GB RAM | E5-large-v2 |
| Reranker | 1x NVIDIA A10, 64GB RAM | BGE-Reranker |
| Qdrant Cluster | 3 nodes, 128GB RAM, NVMe SSD | Vector storage |
| Elasticsearch | 3 nodes, 64GB RAM, SSD | BM25 index |
5.2 Kubernetes Deployment
For high availability within the banking perimeter, vLLM is deployed with a minimum of two replicas (eliminating the single point of failure). Horizontal scaling is triggered by the vllm_requests_running metric with a threshold of 10 concurrent requests.
6. Conclusions and Recommendations
This deployment confirmed that self-hosted RAG based on Llama 3, Qdrant, and vLLM is a practical path to enterprise-grade AI search in regulated industries.
Key Takeaways:
- Self-Hosted LLM is mandatory for GDPR compliance and financial regulations
- Hybrid Search (BM25 + Vectors) is critical for accurate search across structured documents
- Reranking improves accuracy by 30-50% — a mandatory component
- ACL Pre-Filtering must be at retrieval level, not post-processing
- vLLM with PagedAttention provides production-grade throughput
Recommendation: This architecture is applicable to any organization with data residency requirements: government sector, healthcare, defense, telecom.