Inference — Independent Journal of Automation & Knowledge Work

OPTIMIZATION

Speculative Decoding at Scale: How DeepSeek-Style Drafting Cuts LLM Latency by 60%

LLM inference is notoriously slow and hardware-intensive because it is bottlenecked by memory bandwidth. Speculative decoding addresses this by having a lightweight draft model propose several tokens that the target model then verifies in a single pass, cutting latency and hosting costs.

BY JAMES OSEI · 9 MIN READ · JUNE 27, 2026

Large language model inference is notoriously slow and resource-heavy. Because autoregressive models generate text sequentially, token by token, and each new token requires a full forward pass over the entire set of model parameters, inference is bounded by memory bandwidth rather than raw compute...


SECURITY

Repo-Jacking the Agent: How Malicious Codebases Can Hijack Your Local AI Coding Tool

BY DEVRAJ MEHTA · 8 MIN READ

A new class of prompt-injection attacks leverages clean-looking git repositories to hijack local autonomous coding tools. When your agent audits a repo, hidden instructions can trigger silent shell executions.

FRONTIER

GPT-5.6 Sol, Terra, and Luna: Everything Developers Need to Know About the Government-Gated Release

BY SARAH CHEN · 11 MIN READ

OpenAI's GPT-5.6 family introduces Sol, Terra, and Luna — but the U.S. government decides who gets access first. Inside the benchmarks, the cybersecurity triggers, and what developers should do right now.

SILICON

Inside Jalapeño: OpenAI's Custom Chip That Could Cut Your API Costs in Half

BY DEVRAJ MEHTA · 9 MIN READ

OpenAI and Broadcom unveil Jalapeño, a purpose-built AI inference ASIC designed in nine months using AI-accelerated chip design. Here's why it matters for every developer building on LLM APIs.

LOOPS

The Rise of Harness Engineering: Why Loop-Based Orchestration Trumps Agent Autonomy

BY ANIKA ROSENBERG · 7 MIN READ

As autonomous coding agents fail to meet production quality standards, software teams are shifting focus from raw model capability to building 'harness loops': wrapping type checks, safety sandboxes, and test runners around LLMs.

The Crisis of Proof: AI in Mathematics and the Battle Against 'Vibe-Coded' Theorems

Mathematicians are rallying behind the Leiden Declaration to defend scientific rigor from neural network hallucinations. Inside the conflict between black-box AI logic and formal verification systems like Lean.

BY SARAH CHEN · 6 MIN READ

GEOPOLITICS

The Sovereign LLM Era: Comparing GPT-5.6 Sol and Anthropic Mythos under US Government Vetting

OpenAI's GPT-5.6 Sol and Anthropic's Mythos AI mark a major pivot: the transition from public model APIs to restricted-access frontier models audited by nation-states. Here is the technical comparison.

BY DEVRAJ MEHTA · 9 MIN READ

DEEP DIVE

Speculative Decoding in Production: How to Cut LLM Latency and GPU Costs by 60%

Autoregressive text generation is slow and expensive. Speculative decoding speeds up inference by running a lightweight 'draft' model alongside your target model. Here is the production-grade architecture and benchmarking code.

BY DEVRAJ MEHTA · 9 MIN READ

FROM THE ARCHIVES

Beyond Cursor & Claude Code: Why the July 2026 MCP Spec is the Real Battleground for Agentic IDEs

BY DEVRAJ MEHTA · JUNE 27, 2026 · 9 MIN READ

“The tools got better than the processes. Now the processes have to catch up.”

— FROM 'THE AUTOMATION PARADOX,' ISSUE NO. 19

BROWSE BY TOPIC

AI Writing Tools Prompt Engineering No-Code Automation LLM Comparisons Workflow Design Personal Productivity Case Studies Opinion Tool Reviews Interviews

The Futures of Work, Decoded.

Thinking carefully about AI, delivered every Thursday.