
Navigating LLM Token Usage: The Hidden Streaming Gotchas


If you're building applications powered by large language models, you know that managing costs is crucial. But are your token usage numbers actually accurate, especially for streaming responses from multiple providers? A metric that looks straightforward can hide real complexity, and the result is silent billing discrepancies that only surface weeks later.

Understanding and accurately tracking token usage across different LLM providers like OpenAI, Anthropic, and Gemini isn't just trivia; it's a fundamental aspect of financial accountability and system optimization. This article dives deep into the non-obvious differences in how these platforms report token consumption, particularly with streaming, and offers practical strategies to ensure your metrics are reliable and your budgets stay on track.

What LLM Token Usage Tracking actually is

LLM token usage tracking is the process of measuring the number of input tokens (from the prompt) and output tokens (generated by the model) consumed during an interaction with a large language model. This metric is fundamental for managing costs, enforcing quotas, and understanding the performance characteristics of your AI-powered applications. Think of it like a car's odometer and fuel gauge, but instead of miles and gallons, we're counting "tokens." The tricky part is that each car manufacturer (LLM provider) puts its gauges in different places and measures things slightly differently.

The core mechanism involves your application sending a request to an LLM API and then parsing the response to extract the reported token counts. While simple for single, non-streaming requests, the challenge intensifies significantly with streaming responses, where data arrives in chunks over time, and the usage information might be fragmented or delayed.

Key components

Here's a concrete, step-by-step example of how this concept plays out in a typical streaming interaction:

  1. Your application sends a prompt to an LLM endpoint, requesting a streaming response.
  2. The LLM processes your prompt. If part of the prompt was previously cached, the provider may reuse the cached portion instead of reprocessing it, which changes how the input tokens are reported.
  3. The LLM begins sending back partial responses (chunks) as it generates output tokens. These chunks primarily contain content, not usage data.
  4. At specific points during the stream (e.g., at the very start, within a delta, or at the very end), the LLM provider sends a chunk containing usage metadata.
  5. Your application must aggregate these usage details across all relevant chunks to reconstruct the total input and output token counts for that request.
  6. Once the stream completes, your system reconciles all collected usage data to derive the final token count, which is then used for cost calculations and monitoring.
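
To make the steps above concrete, here is a minimal Python sketch that folds usage metadata out of a stream of already-parsed events. The event shapes imitate an Anthropic-style stream (usage on a start event and again on a later delta event) and are an assumption for illustration, as are the names `StreamUsage` and `collect_usage`. The sketch also assumes that reported counts are cumulative, so the latest value wins instead of being summed; check your provider's documentation before relying on that.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class StreamUsage:
    """Usage totals reconstructed from one streamed response (hypothetical model)."""
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None


def collect_usage(events: Iterable[dict[str, Any]]) -> StreamUsage:
    """Fold usage metadata from a sequence of already-parsed stream events."""
    usage = StreamUsage()
    for event in events:
        # Assumed shapes: usage sits at the top level of the event
        # or nested under "message". Content-only chunks carry neither.
        found = event.get("usage") or event.get("message", {}).get("usage") or {}
        # Assumption: counts are cumulative, so keep the latest value seen.
        if found.get("input_tokens") is not None:
            usage.input_tokens = found["input_tokens"]
        if found.get("output_tokens") is not None:
            usage.output_tokens = found["output_tokens"]
    return usage


# Illustrative events, not captured from a real API call.
events = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 25, "output_tokens": 1}}},
    {"type": "content_block_delta", "delta": {"text": "Hello"}},
    {"type": "message_delta", "usage": {"output_tokens": 12}},
    {"type": "message_stop"},
]
usage = collect_usage(events)
print(usage)
```

Leaving both fields as `None` until a value is seen lets the caller tell "the provider reported zero" apart from "no usage chunk ever arrived," which matters when a stream is cut off early.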

Why engineers choose it

Implementing robust token usage tracking isn't optional for serious LLM applications; it's a strategic necessity. It underpins several critical operational and financial goals.

The trade-offs you need to know

While essential, comprehensive token usage tracking isn't a silver bullet; it introduces its own set of complexities and challenges. It moves complexity from opaque billing into your application's logic.

When to use it (and when not to)

Deciding when to invest heavily in token usage tracking depends on your project's maturity, scale, and specific requirements.

Use it when:

Avoid it when:

Best practices that make the difference

Navigating the intricacies of LLM token usage tracking successfully requires discipline and a structured approach. These practices will help you avoid the common pitfalls and build a reliable system.

Parse Provider-Specifically, Then Normalize

Resist the urge to create a single, generic parser too early. Start by building dedicated parsers for each provider (OpenAI, Anthropic, Gemini). Understand their specific JSON structures, event types, and where usage information is embedded within streaming chunks. Once you have robust, isolated parsers, then map their outputs to a common, internal data model. This approach ensures you capture all unique nuances before attempting a unified abstraction, preventing subtle bugs from being hidden.
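
One way to structure this is a small registry of per-provider parsers that each return the same record type. The sketch below is illustrative only: `RawUsage`, `PARSERS`, and `parse_usage` are hypothetical names, and the field names reflect my understanding of each provider's usage object (OpenAI Chat Completions, Anthropic Messages, Gemini `usageMetadata`), so verify them against the current API references. It deliberately does no cache adjustment; that belongs in a separate normalization step.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RawUsage:
    """Counts exactly as the provider reported them, before normalization."""
    provider: str
    input_tokens: int
    output_tokens: int


def parse_openai(usage: dict[str, Any]) -> RawUsage:
    # Assumed field names for the Chat Completions usage object.
    return RawUsage("openai", usage["prompt_tokens"], usage["completion_tokens"])


def parse_anthropic(usage: dict[str, Any]) -> RawUsage:
    # Assumed field names for the Messages API usage object.
    return RawUsage("anthropic", usage["input_tokens"], usage["output_tokens"])


def parse_gemini(usage: dict[str, Any]) -> RawUsage:
    # Assumed field names for Gemini's usageMetadata object.
    return RawUsage("gemini", usage["promptTokenCount"], usage["candidatesTokenCount"])


PARSERS: dict[str, Callable[[dict[str, Any]], RawUsage]] = {
    "openai": parse_openai,
    "anthropic": parse_anthropic,
    "gemini": parse_gemini,
}


def parse_usage(provider: str, usage: dict[str, Any]) -> RawUsage:
    """Dispatch to the provider's parser; fail loudly for unknown providers."""
    try:
        parser = PARSERS[provider]
    except KeyError:
        raise ValueError(f"no usage parser registered for provider {provider!r}") from None
    return parser(usage)


print(parse_usage("anthropic", {"input_tokens": 25, "output_tokens": 12}))
```

Because each parser indexes its fields directly, a renamed or missing field raises a `KeyError` at parse time instead of quietly producing a zero.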

Account for Streaming and Cache Explicitly

These are the two biggest "gotchas." For streaming, you must correctly identify and aggregate usage information that arrives in separate chunks: for Anthropic, input tokens at message_start and output tokens at message_delta; for OpenAI, a single final chunk (which Chat Completions sends only when the request opts in through stream_options with include_usage set to true). For caching, know exactly how each provider reports it: does the reported input count include cached tokens (OpenAI) or exclude them (Anthropic)? Your normalization logic must handle these opposite conventions explicitly, or your cost calculations will be off by the size of the cache.
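
A minimal sketch of the cache normalization, under stated assumptions: the field names (`prompt_tokens_details.cached_tokens` for OpenAI, `cache_read_input_tokens` for Anthropic) reflect my reading of the providers' usage objects and should be checked against their documentation, `NormalizedUsage` is a hypothetical internal model, and cache-write tokens are left out to keep the example short. The point is that the same logical request should normalize to the same record regardless of which convention the provider follows.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class NormalizedUsage:
    """Internal model that always stores cached and uncached input separately."""
    uncached_input_tokens: int
    cached_input_tokens: int
    output_tokens: int

    @property
    def total_input_tokens(self) -> int:
        return self.uncached_input_tokens + self.cached_input_tokens


def normalize_openai(usage: dict[str, Any]) -> NormalizedUsage:
    # Inclusive convention: the prompt count already contains cached tokens,
    # so subtract them to get the uncached portion.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return NormalizedUsage(usage["prompt_tokens"] - cached, cached, usage["completion_tokens"])


def normalize_anthropic(usage: dict[str, Any]) -> NormalizedUsage:
    # Exclusive convention: the input count omits cached tokens,
    # which are reported in a separate field.
    cached = usage.get("cache_read_input_tokens") or 0
    return NormalizedUsage(usage["input_tokens"], cached, usage["output_tokens"])


# The same logical request (1000 prompt tokens, 800 of them cached),
# as each convention would report it. Values are illustrative.
openai_usage = normalize_openai({
    "prompt_tokens": 1000,
    "completion_tokens": 50,
    "prompt_tokens_details": {"cached_tokens": 800},
})
anthropic_usage = normalize_anthropic({
    "input_tokens": 200,
    "cache_read_input_tokens": 800,
    "output_tokens": 50,
})
assert openai_usage == anthropic_usage
```

Storing the cached and uncached portions separately, instead of a single input total, means the pricing step can apply a different rate to each without needing to know which provider the record came from.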

Validate Billing Tier Information

LLM providers often have different service tiers (e.g., default, flex, priority), each with different pricing. Crucially, the tier you request might not be the tier you receive. Under load, a priority request could be silently downgraded to default. Always trust the billing tier reported in the provider's response over what you sent in the request, as the reported tier is what you'll actually be charged for. Incorporate this validation into your tracking to ensure accurate cost attribution.
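
The check itself can be small. The sketch below assumes the response body exposes the tier under a `service_tier` key; treat that key, along with the `TierResult` and `resolve_tier` names, as assumptions to adapt to your provider. It records the tier from the response as the billed tier and logs a warning whenever it differs from the one requested.

```python
import logging
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger("usage.tier")


@dataclass(frozen=True)
class TierResult:
    requested: Optional[str]
    billed: Optional[str]  # tier taken from the response, used for cost attribution
    mismatch: bool


def resolve_tier(requested: Optional[str], response: dict[str, Any]) -> TierResult:
    """Prefer the tier reported in the response over the tier that was requested."""
    billed = response.get("service_tier")  # assumed response field name
    mismatch = requested is not None and billed is not None and billed != requested
    if billed is None:
        logger.warning("response did not report a tier (requested=%s)", requested)
    elif mismatch:
        logger.warning("tier changed: requested=%s billed=%s", requested, billed)
    return TierResult(requested=requested, billed=billed, mismatch=mismatch)


# Illustrative response body: a priority request served at the default tier.
result = resolve_tier("priority", {"service_tier": "default"})
print(result)
```

Keeping both values on the record, not just the billed one, lets you later measure how often downgrades happen.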

Robust Error Handling and Assertions

Treat token usage numbers as critical, financial data. Implement strong validation, unit tests, and integration tests for your parsing and aggregation logic. Assert that expected usage fields are present and that calculated totals make sense. A token count that is off by the cache size won't throw an error; it will silently corrupt your financial records. Use logs and alerts to flag unexpected usage patterns or missing data, ensuring quick detection of issues.
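
As a rough illustration of such assertions, the sketch below validates a normalized usage record before it reaches your cost pipeline. The record layout is hypothetical: it assumes an internal convention where `input_tokens` is the total including cached tokens, so a cached count larger than the input count signals that a provider's convention was mixed up somewhere upstream.

```python
from typing import Any

# Hypothetical internal record layout; input_tokens INCLUDES cached tokens.
REQUIRED_FIELDS = ("input_tokens", "cached_input_tokens", "output_tokens")


def validate_usage(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record looks sane."""
    problems: list[str] = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None:
            problems.append(f"missing field: {name}")
        elif isinstance(value, bool) or not isinstance(value, int):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"non-integer field: {name}={value!r}")
        elif value < 0:
            problems.append(f"negative field: {name}={value}")
    # Cross-field check only makes sense once every field is individually valid.
    if not problems and record["cached_input_tokens"] > record["input_tokens"]:
        problems.append("cached_input_tokens exceeds input_tokens")
    return problems


def assert_valid_usage(record: dict[str, Any]) -> None:
    """Raise instead of letting a suspicious record reach cost calculations."""
    problems = validate_usage(record)
    if problems:
        raise ValueError("invalid usage record: " + "; ".join(problems))


assert_valid_usage({"input_tokens": 1000, "cached_input_tokens": 800, "output_tokens": 50})
print(validate_usage({"input_tokens": 200, "cached_input_tokens": 800, "output_tokens": 50}))
```

The second call shows the off-by-cache-size case from the paragraph above: every field looks plausible on its own, and only the cross-field check catches it.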

Wrapping up

The world of Large Language Models offers incredible capabilities, but beneath the surface of seemingly simple APIs lies a surprising amount of complexity, particularly when it comes to accurate token usage tracking. Ignoring these intricacies, especially with streaming responses and varying cache accounting, is a direct path to unexpected bills and unreliable financial data.

As engineers, our role isn't just to integrate new technologies, but to ensure their responsible and predictable operation. By understanding the unique behaviors of each LLM provider, building flexible parsing layers, and prioritizing robust validation, you can transform a potential headache into a well-understood, manageable system. This diligence ensures you maintain accurate cost control and build trust in your LLM-powered applications.

The key takeaway is that the "devil is in the details" when dealing with LLM usage. Proactive, disciplined engineering in this area not only saves money but also provides the foundational data needed for intelligent optimization and scaling. Don't let silent discrepancies become your next production incident; embrace the complexity and master your token metrics.



Navigating LLM Token Usage: The Hidden Streaming Gotchas | Antonio Ferreira