How We Cut Threat Response Time to 8ms
The architectural decisions that took ARIA's median response time from 880ms to 8ms — and why latency is a moral commitment.
Eighteen months ago ARIA's median end-to-end response was 880ms. Today it's 8ms. That number isn't a marketing milestone — it's an architectural commitment we made because lateral-movement campaigns are over in seconds, not minutes.
Where the 872ms went
Three places consumed 90% of the budget: (1) network round-trips to the central inference plane, (2) cold starts of inference workers, and (3) serialization overhead between the detection and policy planes. None was shocking on its own; together they made ARIA unusable for real-time blocking.
What changed
- We moved ARIA inference to the edge — every region now runs a quantized variant of the model.
- We collapsed the detection-policy boundary; decisions and enforcement live in the same process.
- We replaced JSON with FlatBuffers for hot-path serialization.
- We rewrote the policy decision point in Rust with zero-allocation hot paths.
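The FlatBuffers change pays off because a reader can pull a field straight out of the byte buffer at a known offset instead of parsing the whole message first, as a JSON parser must. The sketch below illustrates that property with Python's standard `struct` module, not with FlatBuffers itself. The verdict layout (`event_id`, `score`, `action`) and the helper names are hypothetical, and the format is not the FlatBuffers wire format, which adds vtables so schemas can evolve.

```python
import json
import struct

# Hypothetical fixed layout for a detection verdict (NOT the FlatBuffers wire format):
# little-endian u64 event_id, f32 score, u8 action. "<" disables padding, so 13 bytes total.
VERDICT = struct.Struct("<QfB")
SCORE_OFFSET = 8  # score starts right after the 8-byte event_id


def encode_verdict(event_id: int, score: float, action: int) -> bytes:
    """Pack a verdict into a fixed-size binary record."""
    return VERDICT.pack(event_id, score, action)


def read_score(buf: bytes) -> float:
    """Read only the score field in place, without decoding the rest of the record."""
    (score,) = struct.unpack_from("<f", buf, SCORE_OFFSET)
    return score


def encode_verdict_json(event_id: int, score: float, action: int) -> bytes:
    """The JSON equivalent: text that must be fully parsed before any field is usable."""
    return json.dumps({"event_id": event_id, "score": score, "action": action}).encode()


if __name__ == "__main__":
    binary = encode_verdict(42, 0.5, 1)
    text = encode_verdict_json(42, 0.5, 1)
    print(len(binary), len(text))  # sizes of the two encodings
    print(read_score(binary))      # score read directly at its offset
```

Running it prints 13 bytes for the fixed layout against 43 for the JSON text, then the score read in place. Size matters less than the access pattern: the hot path touches one field without a parse step.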
What we gave up
In the interest of honesty: edge inference uses a smaller model than the central one. We accept a ~0.7% lower detection rate on the edge variant in exchange for a roughly 110x latency improvement (880ms to 8ms). ARIA still escalates ambiguous decisions to the central model. The two-stage architecture is a deliberate tradeoff.
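The escalation path amounts to a confidence band: the edge model acts alone when its score is clearly high or clearly low, and only the ambiguous middle pays for a trip to the central model. Here is a minimal sketch, assuming both models return a malicious-probability score between 0 and 1. The thresholds (`low`, `high`, `block_at`), the `Decision` type, and `two_stage_decide` are hypothetical and are not ARIA's actual interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping

# Hypothetical model signature: returns an estimated probability (0.0-1.0) that an event is malicious.
Model = Callable[[Mapping[str, Any]], float]


@dataclass(frozen=True)
class Decision:
    action: str   # "allow" or "block"
    score: float  # score from whichever model made the final call
    stage: str    # "edge" or "central"


def two_stage_decide(
    event: Mapping[str, Any],
    edge_model: Model,
    central_model: Model,
    low: float = 0.2,       # hypothetical: at or below this, the edge allows on its own
    high: float = 0.8,      # hypothetical: at or above this, the edge blocks on its own
    block_at: float = 0.5,  # hypothetical: the central model's blocking threshold
) -> Decision:
    """Decide at the edge when the small model is confident; escalate the ambiguous band."""
    score = edge_model(event)
    if score <= low:
        return Decision("allow", score, "edge")
    if score >= high:
        return Decision("block", score, "edge")
    # Ambiguous: pay the extra round-trip and ask the larger central model.
    central_score = central_model(event)
    action = "block" if central_score >= block_at else "allow"
    return Decision(action, central_score, "central")
```

Widening the band between `low` and `high` sends more traffic to the central model, trading latency for detection rate; narrowing it does the opposite.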
The moral case for low-latency security
Ransomware encrypts at gigabytes per second. Token-replay attacks move in milliseconds. Insider exfiltration runs at line rate. If your security plane operates in seconds, you are not defending — you are documenting what happened.
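A rough way to put numbers on this: the data an attacker can touch before the response lands is the attack's throughput multiplied by the response latency. Assume, for illustration only, ransomware encrypting at 1 GB/s (a round number; the claim above says only "gigabytes per second"), and use decimal units:

$$
D = r \times t_{\text{response}}, \qquad
D_{880\,\text{ms}} = 1\,\text{GB/s} \times 0.880\,\text{s} = 880\,\text{MB}, \qquad
D_{8\,\text{ms}} = 1\,\text{GB/s} \times 0.008\,\text{s} = 8\,\text{MB}.
$$

Under this assumption, the window shrinks by the same factor of 110 as the latency.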