Alibaba's Qwen3.7-Max launches automatic implicit caching, cutting input costs by up to 80%

According to monitoring by Beating, Alibaba's Qwen team has announced that automatic implicit caching is now enabled by default for its flagship model, Qwen3.7-Max, on the Alibaba Cloud Bailian platform. Developers benefit from the cache and the lower cost without modifying code or passing additional parameters. Under the new billing mechanism, the system automatically detects repeated context prefixes across requests. When a cache hit occurs, the input tokens in the hit portion are billed at only 20% of the standard unit price, an 80% saving on that portion.

Implicit caching targets the heavy overhead of long-context and agent workloads. Qwen3.7-Max, with its 1-million-token context window, has to re-read large codebases or knowledge documents repeatedly when running demanding tasks such as autonomous coding. One developer who tested Qwen3.7 reported consuming nearly 1 million tokens in under an hour to build a tank-battle web demo. If an agent is left to run code review and iteration loops autonomously in the background, daily usage can easily reach hundreds of millions of tokens.

Fierce competition on cache pricing was another direct factor behind Alibaba's price cut. DeepSeek V4-Pro had previously attracted a large number of developers with its extremely low cache-hit price. After a permanent price reduction announced at the end of May, DeepSeek V4-Pro's cache-hit price fell to just $0.003625 per million tokens (approximately RMB 0.025), which is 99.17% below its standard input price. Using dedicated tools such as Reasonix, many developers have pushed the cache hit rate of a single session as high as 99%, bringing the running cost of long-session AI agents close to zero.
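To put the "almost zero" claim in numbers, here is a rough Python sketch of the blended input price at a given cache hit rate. It assumes a deliberately simple model that is not from either vendor's billing documentation: every input token is either a cache hit, billed at the discounted ratio, or a miss, billed at the full standard price. The function name `effective_input_multiplier` is illustrative only.

```python
def effective_input_multiplier(hit_rate: float, hit_price_ratio: float) -> float:
    """Blended input price as a fraction of the standard input price.

    hit_rate: share of input tokens served from cache (0.0 to 1.0).
    hit_price_ratio: price of a cached token relative to the standard price.
    Assumption: misses are billed at exactly the standard price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) + hit_rate * hit_price_ratio


# Ratios taken from the figures quoted in the article.
QWEN_IMPLICIT_HIT_RATIO = 0.20       # hit tokens billed at 20% of standard
DEEPSEEK_HIT_RATIO = 1.0 - 0.9917    # hit price is 99.17% below standard

if __name__ == "__main__":
    for name, ratio in [("Qwen3.7-Max implicit", QWEN_IMPLICIT_HIT_RATIO),
                        ("DeepSeek V4-Pro", DEEPSEEK_HIT_RATIO)]:
        blended = effective_input_multiplier(0.99, ratio)
        print(f"{name}: {blended:.4f} of the standard input price")
```

Under this model, a 99% hit rate brings the DeepSeek input bill down to roughly 1.8% of the uncached price. The same hit rate with a 20% hit price would land at about 20.8%. The 99% figure was reported for DeepSeek sessions, so applying it to Qwen3.7-Max here is purely hypothetical.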
In response to this competitive pressure, Qwen3.7-Max not only launched a zero-configuration implicit caching mode but also kept its explicit caching mode, which requires manually declaring the `cache_control` marker. Compared with automatic caching, explicit caching offers more predictable hits, and a hit costs as little as 10% of the standard input unit price (a 90% discount). However, creating the cache for the first time is billed at 125% of the standard input price (a 25% premium), and a cache block lives for only 5 minutes (the timer resets on each hit).
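The trade-off between the two modes can be made concrete with a small Python sketch. It compares the cost of sending the same prefix several times within the cache lifetime, with all prices expressed as a percentage of the standard input price. Two assumptions are built in and are not confirmed by the article: the creation charge is read as 125% of the standard price, and every request after the first is treated as a cache hit, which implicit caching does not guarantee. The helper names are illustrative only.

```python
# Prices as a percentage of the standard input price, from the article.
STANDARD_PCT = 100
IMPLICIT_HIT_PCT = 20       # implicit cache hit
EXPLICIT_HIT_PCT = 10       # explicit cache hit
EXPLICIT_CREATE_PCT = 125   # assumed: creation billed at 125% of standard


def prefix_cost_pct(requests: int, mode: str) -> int:
    """Total cost of sending one shared prefix `requests` times, as a
    percentage of a single uncached send of that prefix."""
    if requests < 1:
        raise ValueError("requests must be at least 1")
    hits = requests - 1  # assumption: every request after the first hits
    if mode == "none":
        return STANDARD_PCT * requests
    if mode == "implicit":
        return STANDARD_PCT + IMPLICIT_HIT_PCT * hits
    if mode == "explicit":
        return EXPLICIT_CREATE_PCT + EXPLICIT_HIT_PCT * hits
    raise ValueError(f"unknown mode: {mode}")


def explicit_break_even() -> int:
    """Smallest request count at which explicit caching is strictly
    cheaper than implicit caching under the assumptions above."""
    n = 1
    while prefix_cost_pct(n, "explicit") >= prefix_cost_pct(n, "implicit"):
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 10):
        print(n, [prefix_cost_pct(n, m) for m in ("none", "implicit", "explicit")])
    print("explicit wins from request", explicit_break_even())
```

Under these assumptions, explicit caching only overtakes implicit caching from the fourth reuse of a prefix onward, and only if those requests arrive before the 5-minute lifetime lapses. For prefixes reused just once or twice, the zero-configuration implicit mode comes out cheaper.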
