
Secure On-Premise RAG: Intelligent Search for Enterprise FinTech (Air-Gapped)

Deploying an autonomous RAG system within a bank's air-gapped perimeter. Complete migration from cloud LLMs to Self-Hosted Llama 3. Hybrid search (BM25 + Vectors) with ACL-based filtering.

  • 100% Data Privacy (zero bytes leave the perimeter)
  • 92% Accuracy via reranking pipeline
  • RBAC: directory service-based search filtering
  • < 0.4 sec Time to first token (instant perceived response)
  • 3–5 sec End-to-end latency (full response for legal queries)

Executive Summary

This document presents the architectural validation and deployment results of an enterprise RAG system (Retrieval-Augmented Generation) within an air-gapped perimeter of a Top-20 bank. The system enables intelligent search across 100,000+ internal documents without transmitting data to external clouds.

Key Business Outcomes:

  • Privacy: 100% of data remains within the perimeter — regulatory compliance achieved
  • Accuracy: 92% Faithfulness (RAGAS) — hallucination minimization is critical for compliance
  • Performance: TTFT < 0.4 sec (instant UI response), E2E 3–5 sec for detailed legal queries on proprietary GPU infrastructure (NVIDIA A100)
  • Security: Active Directory integration — users only see documents they're authorized to access
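The TTFT and E2E figures are consistent with the ~50 tok/s generation rate cited later for vLLM; a back-of-the-envelope check (the answer length is an assumption, not a measured value):

```python
ttft = 0.4           # time to first token, seconds
tok_per_s = 50       # vLLM throughput from Table 5
answer_tokens = 200  # assumed length of a detailed legal answer

# Total latency = time to first token + generation time
e2e = ttft + answer_tokens / tok_per_s
# 0.4 + 4.0 = 4.4 s, inside the reported 3-5 s band
```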

Market Context: Gartner forecasts that by 2026, 80% of enterprise organizations will migrate from public LLM APIs to self-hosted solutions due to regulatory requirements.


1. Problem Statement: Public LLM Risks

1.1 Initial State

The bank utilized full-text search (Elasticsearch) across its corporate knowledge base. When querying "What is the transfer limit for corporate accounts?", the system returned 200+ document links — employees spent up to 30 minutes locating the correct regulation.

1.2 Why OpenAI/Anthropic Cannot Be Used

Table 1. Regulatory Constraints for Financial Sector

Requirement | OpenAI API | Self-Hosted LLM
GDPR / Data Residency | Data transmitted to US | Within perimeter
Financial Regulations | No storage control | Full control
Banking Secrecy Laws | Leak risk | Air-gapped
Audit and Traceability | Black box | Complete logging
SLA and Availability | Vendor-dependent | Own infrastructure

1.3 Limitations of Pure Vector Search

Table 2. Dense Retrieval Constraints

Scenario | Vector Search | Problem
"Account limit 40817..." | Low recall | Account numbers lack semantics
"ISO 20022" | Inaccurate | Abbreviations embed poorly
"Appendix 3 to Regulation #42" | Misses | Exact references require keyword match

Conclusion: Banking documents require Hybrid Search (BM25 + Vectors).


2. Architectural Decisions

2.1 Self-Hosted Inference Stack

[Figure: architecture diagram — User → API → RAG → Search → ML]

Fig. 1. Secure RAG Architecture. The red dashed boundary denotes the Air-Gapped perimeter. All ML components are deployed on the bank's proprietary servers.

2.2 Technology Stack Rationale

2.2.1 LLM Selection: Llama 3 vs Alternatives

Table 3. Self-Hosted LLM Comparison

Model | Parameters | License | Multilingual | VRAM (FP16)
Llama 3 70B | 70B | Meta License | Good | 140 GB
Mistral 7B | 7B | Apache 2.0 | Moderate | 14 GB
Qwen 72B | 72B | Qwen License | Excellent | 144 GB
GPT-J | 6B | Apache 2.0 | Limited | 12 GB

LLM Selection

Selected: Llama 3 70B + AWQ

Balance of quality and GPU requirements. 4-bit quantization (AWQ) fits the 70B model on 4x A100.

  • Strong multilingual capabilities
  • 70B parameters for complex legal texts
  • AWQ quantization with minimal quality loss

Rejected: Mistral 7B

Insufficient quality for legal texts.

  • Critical errors with financial terminology
  • 7B parameters insufficient for complex documents
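The 140 GB FP16 figure and the claim that AWQ fits on 4x A100 can be sanity-checked with weights-only arithmetic (runtime adds KV cache and activation overhead on top):

```python
def weights_gb(params_billion: float, bits: int) -> float:
    # Memory for model weights only: params * (bits / 8) bytes each
    return params_billion * bits / 8

fp16 = weights_gb(70, 16)  # 140.0 GB, matches Table 3
awq4 = weights_gb(70, 4)   # 35.0 GB of weights, well under 4x 80 GB
```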

2.2.2 Vector DB Selection: Qdrant vs Alternatives

Table 4. Vector Database Comparison

Criterion | Pinecone | Weaviate | Qdrant
Self-Hosted | SaaS only | Yes | Yes
Pre-Filtering (ACL) | Limited | Yes | Native
Hybrid Search | No | Yes | BM25 + Dense
Production Ready | Yes | Partial | Yes
License | Proprietary | BSD-3 | Apache 2.0

Vector DB Selection

Selected: Qdrant

Native pre-filtering by metadata, which is critical for ACL. Filtering occurs at the index level, not in post-processing.

  • Self-hosted (Apache 2.0)
  • Native ACL pre-filtering
  • Hybrid Search (BM25 + Dense)

Rejected: Pinecone

SaaS only: data leaves the bank's perimeter.

  • No self-hosted option
  • Limited pre-filtering
  • Proprietary license

2.2.3 Inference Engine: vLLM vs Alternatives

Table 5. Inference Server Comparison

Engine | Throughput (tok/s) | Paged Attention | Continuous Batching
HuggingFace TGI | ~30 | Yes | Yes
vLLM | ~50 | PagedAttention | Yes
NVIDIA Triton | ~45 | Manual | Yes
llama.cpp | ~20 | No | No

Inference Engine Selection

Selected: vLLM

Best throughput (~50 tok/s) via PagedAttention. An OpenAI-compatible API simplifies integration.

  • ~50 tok/s throughput
  • PagedAttention for efficient GPU memory usage
  • OpenAI-compatible API

Rejected: HuggingFace TGI

Lower throughput (~30 tok/s).

  • ~30 tok/s (40% slower)
  • Less efficient GPU memory utilization

3. Reliability and Accuracy Mechanisms

3.1 Hybrid Search Pipeline

[Figure: pipeline diagram — Query → ACL → Search → RRF → Top results]

Fig. 2. Hybrid Search Pipeline. ACL filter is applied at the retrieval stage, not post-filtering — critical for security and performance.

Key Mechanism: ACL filter is applied at the index level (pre-filtering), not after obtaining results. This guarantees that users physically cannot retrieve documents without appropriate access.

from qdrant_client import models

async def secure_hybrid_search(query: str, user_groups: list[str], top_k: int = 20):
    # ACL filter applied at the index level (pre-filtering), not after retrieval
    acl_filter = models.Filter(
        must=[models.FieldCondition(
            key="allowed_groups",
            match=models.MatchAny(any=user_groups)
        )]
    )

    # Dense search with pre-filtering
    dense_results = await qdrant_client.search(
        collection_name="bank_docs",
        query_vector=await embed_query(query),
        query_filter=acl_filter,
        limit=top_k
    )

    # Sparse (BM25) search under the same ACL filter
    # (bm25_search is the project's Elasticsearch wrapper)
    sparse_results = await bm25_search(query, acl_filter, limit=top_k)

    # Merge with Reciprocal Rank Fusion, then rerank the top candidates
    merged = reciprocal_rank_fusion(dense_results, sparse_results, k=60)
    reranked = await reranker.rerank(query, merged[:20])
    return reranked[:3]
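The reciprocal_rank_fusion helper is not part of Qdrant; a minimal self-contained sketch (the function name and k=60 constant come from the pipeline, the toy document IDs are illustrative):

```python
def reciprocal_rank_fusion(*result_lists, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
merged = reciprocal_rank_fusion(dense, sparse)
# doc_b (ranks 2 and 1) edges out doc_a (ranks 1 and 3)
```

Because RRF uses only ranks, not raw scores, it merges BM25 and cosine-similarity results without any score normalization.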

3.2 Reranking for Accuracy Enhancement

Bi-Encoder Problem: Embedding models (E5, BGE) are fast but less accurate — query and document are embedded independently.

Cross-Encoder Solution: BGE-Reranker processes the (query, document) pair together, yielding better relevance understanding.
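In pipeline terms the reranker is a drop-in scoring stage over the fused candidate list. A toy sketch, with a lexical-overlap stub standing in for BGE-Reranker (the stub logic is illustrative only; a real cross-encoder jointly encodes each pair with a transformer):

```python
def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for a real cross-encoder: fake a relevance score
    # with token overlap between the (query, document) pair.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, docs: list[str], top_n: int = 3) -> list[str]:
    # Score every candidate pair, keep the top_n highest-scoring documents
    return sorted(docs, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_n]
```

The interface is the point: scoring each (query, document) pair jointly is O(candidates) model calls, which is why reranking is applied only to the top ~20 fused candidates rather than the full corpus.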

Table 6. Reranking Impact on Metrics

Metric | Without Reranking | With BGE-Reranker | Improvement
MRR@10 | 0.61 | 0.89 | +46%
NDCG@10 | 0.58 | 0.85 | +47%
Recall@3 | 0.72 | 0.94 | +31%

3.3 Structured Citations (Hallucination Prevention)

To minimize hallucinations, a strict system prompt with rules is used:

  • Answers only based on provided context
  • Mandatory source citation in format [Document: name, section X.X]
  • Prohibition on inferring information

Output is structured via Pydantic models: response, list of citations with document/section/page, and confidence score based on semantic similarity.
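A minimal sketch of that output schema, using stdlib dataclasses for a dependency-free illustration (the production system uses Pydantic; the field names here are assumptions based on the description above):

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    document: str   # rendered as [Document: name, section X.X]
    section: str
    page: int

@dataclass
class RAGAnswer:
    response: str
    citations: list[Citation] = field(default_factory=list)
    confidence: float = 0.0  # semantic similarity between answer and sources

answer = RAGAnswer(
    response="The transfer limit for corporate accounts is defined in the cited regulation.",
    citations=[Citation(document="Regulation #42", section="3.1", page=12)],
    confidence=0.93,
)
```

With Pydantic, the same schema additionally validates the LLM's structured output at parse time, so an answer without citations can be rejected before reaching the user.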


4. Results and Metrics

4.1 Comparative Analysis

Table 7. Key Metrics: OpenAI Wrapper vs Secure RAG

Metric | OpenAI API (Baseline) | Secure RAG (On-Premise)
Data Privacy | Data in US | 100% within bank
Latency (p50) | 1.2 s | 1.8 s
Latency (p99) | 3.5 s | 2.5 s (more stable)
Faithfulness (RAGAS) | 78% | 92% (with reranking)
Answer Relevancy | 81% | 88%
ACL Compliance | Ignored | Native
Cost per Query | $0.03 | $0 (CapEx)
Availability SLA | 99.9% (vendor) | 99.95% (own infra)

4.2 Economic Impact

Table 8. TCO Analysis (3 years)

Item | OpenAI API | Self-Hosted
API Costs (1M queries/month) | $1,080,000 | $0
Infrastructure (4x A100) | $0 | $400,000 (client-side hardware investment)
Engineering (setup + support) | $50,000 | $200,000
Compliance Risk | BLOCKING / HIGH ¹ | ZERO
Total 3Y TCO | $1,130,000 | $600,000

¹ Compliance Risk = BLOCKING: Risk of regulatory sanctions, fines up to 6% of revenue under GDPR. TCO calculation is based on GPT-4 class models (High Intelligence). Using simplified models (gpt-4o-mini and analogues) in FinTech is unacceptable due to critical hallucination risk when working with legal and financial data.

4.3 Business Results

  • Search Time: from 30 minutes to 15 seconds (120x faster)
  • L2 Support: ticket processing time reduced by 40%
  • Compliance: successfully passed security audit and Penetration Test
  • Adoption: 3,500 active users within 3 months

5. Infrastructure and Scaling

5.1 Hardware Requirements

Table 9. Server Specifications

Component | Configuration | Purpose
LLM Server | 4x NVIDIA A100 80GB, 512GB RAM | vLLM inference
Embedding Server | 1x NVIDIA A10, 64GB RAM | E5-large-v2
Reranker | 1x NVIDIA A10, 64GB RAM | BGE-Reranker
Qdrant Cluster | 3 nodes, 128GB RAM, NVMe SSD | Vector storage
Elasticsearch | 3 nodes, 64GB RAM, SSD | BM25 index

5.2 Kubernetes Deployment

For high availability inside the banking perimeter, vLLM is deployed with a minimum of 2 replicas (eliminating the single point of failure). Horizontal scaling is driven by the vllm_requests_running metric with a threshold of 10 concurrent requests.
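A hypothetical HorizontalPodAutoscaler manifest matching that description (resource names and namespace are placeholders; the custom metric is assumed to be exposed through a Prometheus adapter):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 2            # SPOF elimination
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_requests_running
        target:
          type: AverageValue
          averageValue: "10"   # scale out above 10 concurrent requests per pod
```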


6. Conclusions and Recommendations

This deployment confirmed that Self-Hosted RAG based on Llama 3, Qdrant, and vLLM is a practical path to enterprise-grade AI search in regulated industries where public LLM APIs are ruled out.

Key Takeaways:

  1. Self-Hosted LLM is mandatory for GDPR compliance and financial regulations
  2. Hybrid Search (BM25 + Vectors) is critical for accurate search across structured documents
  3. Reranking improves accuracy by 30-50% — a mandatory component
  4. ACL Pre-Filtering must be at retrieval level, not post-processing
  5. vLLM with PagedAttention provides production-grade throughput

Recommendation: This architecture is applicable to any organization with data residency requirements: government sector, healthcare, defense, telecom.

Secure On-Premise RAG: Intelligent Search for Enterprise FinTech (Air-Gapped) — Softenq