Executive Summary
This document presents the architectural validation and results of migrating the logistics platform of a federal 3PL operator (fleet of 5,000+ vehicles) from a monolithic PHP architecture to an event-driven architecture (EDA) built on a Go/Kafka/MQTT stack.
Key Business Results:
- Throughput: growth from 800 to 50,000+ RPS — capacity for 10x fleet scaling
- Operating Costs: 6x reduction in mobile traffic costs (~$200K/year savings on a 5,000-vehicle fleet)
- Reliability: transition from 98.5% SLA to 99.9% (from ~130 hours downtime/year to ~9 hours). Architectural readiness for 99.99% with Multi-AZ deployment
- Investment Protection: Legacy ERP (SAP) preserved, load reduced by 85%
Downtime Cost in Logistics: According to Gartner, one hour of downtime for a large logistics enterprise costs $100,000–$300,000. The implemented architecture eliminates cascading failures typical of monolithic systems.
1. Problem Statement: Legacy Architecture Risk Analysis
1.1 Initial System State
The customer operated a distributed monolith on a LAMP stack (PHP 7.4 / MySQL 5.7 / Apache). GPS trackers sent coordinates via direct HTTP POST requests to a REST API, which synchronously wrote the data to the main transactional database.
1.2 Identified Technical Risks
Table 1. Legacy Architecture Risk Matrix
| Risk Category | Manifestation | Business Consequences |
|---|---|---|
| DB Locks | Lock Wait Timeout during telemetry/manager transaction contention | Order processing failures during peak hours |
| Race Conditions | Status conflicts: cargo "Delivered" but "Not Shipped" | Financial discrepancies, customer claims |
| Data Loss | GPS point loss during connection breaks (tunnels, highways) | Route reconstruction impossible, insurance disputes |
| Thundering Herd | Simultaneous reconnection of 5000 devices after network failure | Cascading system failure |
| Vertical Scaling Limit | PHP-FPM: 20-30 MB RAM per request × 1000 workers = 20-30 GB RAM | Exponential infrastructure cost growth |
1.3 Load Quantitative Assessment
Throughput calculation for High-Frequency Telematics:
Polling frequency: 10 Hz (GPS + accelerometer + CAN-bus)
Fleet: 5,000 vehicles
Target growth: 20,000 vehicles
Current load: 10 × 5,000 = 50,000 events/sec
Target load: 10 × 20,000 = 200,000 events/sec
10 Hz Frequency Justification: This isn't just GPS tracking. The system collects:
- Raw accelerometer data — shock detection, potholes, sudden maneuvers for insurance scoring
- CAN-bus data — engine RPM, pedal positions, fuel consumption
- ML driving style scoring — requires granular data for accurate accident reconstruction
This is a competitive advantage: a standard GPS tracker (0.2 Hz) does not provide enough resolution to build driver behavior ML models. For regular monitoring the data is downsampled, but the raw stream is stored for insurance cases and retrospective analysis.
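The downsampling step mentioned above can be illustrated with a minimal sketch. It assumes plain decimation (keeping every N-th sample), which is only one possible strategy; the `Sample` type and the factor of 50 (10 Hz / 0.2 Hz) are illustrative assumptions, not the production implementation.

```go
package main

import "fmt"

// Sample is a hypothetical telemetry point; the real payload layout
// is not specified in this document.
type Sample struct {
	Seq      int // position within the raw 10 Hz stream
	Lat, Lon float64
}

// downsample keeps every factor-th sample, starting with the first one.
// It returns nil when factor is not positive.
func downsample(raw []Sample, factor int) []Sample {
	if factor <= 0 {
		return nil
	}
	out := make([]Sample, 0, (len(raw)+factor-1)/factor)
	for i := 0; i < len(raw); i += factor {
		out = append(out, raw[i])
	}
	return out
}

func main() {
	// 10 seconds of raw data at 10 Hz gives 100 samples.
	raw := make([]Sample, 100)
	for i := range raw {
		raw[i] = Sample{Seq: i}
	}

	// 10 Hz / 0.2 Hz = 50: keep one sample out of every 50.
	monitoring := downsample(raw, 50)
	fmt.Printf("raw=%d monitoring=%d\n", len(raw), len(monitoring))
}
```

The raw slice is left untouched, which mirrors the idea that the full-resolution stream stays available for later analysis while monitoring reads the reduced copy.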
Conclusion: The synchronous blocking I/O model of the original monolith (limit of ~800 RPS) had exhausted its scaling headroom. An architectural transformation was required.
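The load figures in this section can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated in the text (10 Hz, 5,000 and 20,000 vehicles, ~800 RPS legacy limit); the function names are illustrative.

```go
package main

import "fmt"

// eventsPerSecond returns the aggregate ingest rate for a fleet in which
// every vehicle emits samples at the same frequency.
func eventsPerSecond(sampleHz, vehicles int) int {
	return sampleHz * vehicles
}

// overload reports how many times the required rate exceeds a capacity limit.
func overload(requiredRPS, capacityRPS int) float64 {
	return float64(requiredRPS) / float64(capacityRPS)
}

func main() {
	const sampleHz = 10        // polling frequency from the text
	const legacyLimitRPS = 800 // approximate monolith ceiling from the text

	current := eventsPerSecond(sampleHz, 5000)
	target := eventsPerSecond(sampleHz, 20000)

	fmt.Printf("current: %d events/sec (%.1fx the legacy limit)\n",
		current, overload(current, legacyLimitRPS))
	fmt.Printf("target:  %d events/sec (%.1fx the legacy limit)\n",
		target, overload(target, legacyLimitRPS))
}
```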
2. Architectural Solutions
2.1 Design Principles
We separated data flows into a Hot Path (real-time telemetry) and a Cold Path (reporting, ERP synchronization), following the Event-Driven Architecture (EDA) pattern.
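As a rough illustration of the Hot Path / Cold Path split, the sketch below routes events by kind. The event kinds and topic names are hypothetical and chosen only for the example; the actual topic layout is not described in this document.

```go
package main

import "fmt"

// EventKind classifies incoming events. The kinds are illustrative assumptions.
type EventKind int

const (
	Telemetry   EventKind = iota // GPS, accelerometer, CAN-bus samples
	OrderStatus                  // business status changes destined for reporting
	ERPSync                      // updates destined for the legacy ERP
)

// Hypothetical topic names, one per path.
const (
	hotTopic  = "telemetry.raw"
	coldTopic = "business.events"
)

// topicFor sends real-time telemetry to the hot path and everything else
// to the cold path, so slow consumers never sit in front of telemetry.
func topicFor(k EventKind) string {
	if k == Telemetry {
		return hotTopic
	}
	return coldTopic
}

func main() {
	for _, k := range []EventKind{Telemetry, OrderStatus, ERPSync} {
		fmt.Printf("kind=%d -> %s\n", k, topicFor(k))
	}
}
```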
2.2 Target Architecture Components
Key Components:
- MQTT Broker: Message broker for a lightweight IoT protocol with configurable QoS levels
- Apache Kafka: Event streaming platform for durable message storage
- Go Services: High-performance microservices for event processing
- ClickHouse: Columnar database for telemetry analytics
- Anti-Corruption Layer: Protection layer for legacy ERP integration
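To make the Anti-Corruption Layer more concrete, here is a minimal sketch of the translation it performs. Every type, field name, status code, and date format below is a hypothetical stand-in; the real ERP interface is not described in this document.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ShipmentDelivered is a hypothetical domain event from the new platform.
type ShipmentDelivered struct {
	ShipmentID  string
	VehicleID   string
	DeliveredAt time.Time
}

// ERPRecord is a hypothetical flat record in the legacy system's vocabulary.
type ERPRecord struct {
	DocNumber  string
	StatusCode string // assumed legacy status code
	PostedOn   string // assumed legacy date format, YYYYMMDD
}

// toERP translates a domain event into the legacy representation, so that
// legacy codes and formats never leak into the event-driven services.
func toERP(e ShipmentDelivered) (ERPRecord, error) {
	if e.ShipmentID == "" {
		return ERPRecord{}, errors.New("shipment ID is required")
	}
	return ERPRecord{
		DocNumber:  e.ShipmentID,
		StatusCode: "DLV",
		PostedOn:   e.DeliveredAt.UTC().Format("20060102"),
	}, nil
}

func main() {
	ev := ShipmentDelivered{
		ShipmentID:  "SHP-1001",
		VehicleID:   "TRK-42",
		DeliveredAt: time.Date(2024, 3, 15, 10, 30, 0, 0, time.UTC),
	}
	rec, err := toERP(ev)
	if err != nil {
		fmt.Println("translate:", err)
		return
	}
	fmt.Printf("%+v\n", rec)
}
```

Keeping the translation in one function means a change in the ERP format touches a single place, while the event schema used by the other services stays stable.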
2.3 Transactional Outbox Pattern
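To ensure reliable event delivery between services, we implemented the Transactional Outbox pattern. The outbox on its own provides at-least-once delivery; combined with idempotent consumers, each event takes effect exactly once: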
- Business transaction and outbox event written atomically to PostgreSQL
- CDC (Debezium) captures outbox events and publishes to Kafka
- Consumers process events with idempotency keys
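The three steps above can be simulated in memory to show why duplicates are harmless. This is a sketch under stated assumptions: a mutex-guarded `Store` stands in for the PostgreSQL transaction, `Drain` stands in for the Debezium relay, and the idempotency key format is invented for the example. No real database or Kafka client is involved.

```go
package main

import (
	"fmt"
	"sync"
)

// OutboxEvent is a hypothetical outbox row; ID doubles as the idempotency key.
type OutboxEvent struct {
	ID      string
	Payload string
}

// Store stands in for the database. Orders and outbox share one lock, so both
// writes become visible together, like rows committed in a single transaction.
type Store struct {
	mu     sync.Mutex
	orders map[string]string
	outbox []OutboxEvent
}

func NewStore() *Store { return &Store{orders: make(map[string]string)} }

// SetStatus updates business state and appends the outbox event atomically.
func (s *Store) SetStatus(orderID, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.orders[orderID] = status
	s.outbox = append(s.outbox, OutboxEvent{ID: orderID + ":" + status, Payload: status})
}

// Drain plays the role of the CDC relay: it hands over pending events.
func (s *Store) Drain() []OutboxEvent {
	s.mu.Lock()
	defer s.mu.Unlock()
	pending := s.outbox
	s.outbox = nil
	return pending
}

// Consumer applies each event at most once by remembering processed keys.
type Consumer struct {
	seen    map[string]bool
	Applied int
}

func NewConsumer() *Consumer { return &Consumer{seen: make(map[string]bool)} }

// Handle reports whether the event was applied (true) or skipped as a duplicate.
func (c *Consumer) Handle(e OutboxEvent) bool {
	if c.seen[e.ID] {
		return false
	}
	c.seen[e.ID] = true
	c.Applied++
	return true
}

func main() {
	store := NewStore()
	store.SetStatus("order-1", "SHIPPED")
	store.SetStatus("order-1", "DELIVERED")

	events := store.Drain()
	consumer := NewConsumer()
	// Simulate at-least-once delivery: the whole batch arrives twice.
	for attempt := 0; attempt < 2; attempt++ {
		for _, e := range events {
			consumer.Handle(e)
		}
	}
	fmt.Printf("delivered=%d applied=%d\n", 2*len(events), consumer.Applied)
}
```

In a real deployment the set of processed keys would have to be persisted alongside the consumer's own state; the in-memory map here only demonstrates the deduplication logic.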
3. Results and Metrics
3.1 Performance Improvements
| Metric | Before | After | Improvement |
|---|---|---|---|
| Throughput | 800 RPS | 50,000+ RPS | 62× |
| Latency (p99) | 2-5 sec | < 50 ms | 40-100× |
| SLA | 98.5% | 99.9% | ~15× less downtime |
| Traffic Cost | $X/year | $X/6 per year | -83% |
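The downtime figures behind the SLA row follow directly from the availability percentages. The sketch below converts each SLA level mentioned in this document into hours of downtime per year (assuming a 365-day year) and computes the ratio between the old and new levels.

```go
package main

import "fmt"

const hoursPerYear = 365 * 24 // 8,760 hours, ignoring leap years

// downtimeHours converts an availability percentage into the
// yearly downtime budget in hours.
func downtimeHours(availabilityPct float64) float64 {
	return (100 - availabilityPct) / 100 * hoursPerYear
}

func main() {
	for _, sla := range []float64{98.5, 99.9, 99.99} {
		fmt.Printf("%.2f%% -> %.1f hours/year\n", sla, downtimeHours(sla))
	}
	fmt.Printf("98.5%% vs 99.9%%: %.1fx less downtime\n",
		downtimeHours(98.5)/downtimeHours(99.9))
}
```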
3.2 Architectural Benefits
The event-driven architecture enables independent scaling of each component, fault isolation, and seamless addition of new event consumers without affecting existing systems.