QStash has built-in support for calling LLM APIs. This allows you to take advantage of QStash features such as retries, callbacks, and batching while using LLM APIs.

QStash is especially useful for LLM processing because LLM response times are often highly variable. When accessing LLM APIs from serverless runtimes, invocation timeouts are a common issue. QStash offers an HTTP timeout of 2 hours, which is sufficient for most LLM use cases. By using callbacks and workflows, you can easily manage the asynchronous nature of LLM APIs.

QStash LLM API

You can publish (or enqueue) a single LLM request or a batch of LLM requests using all existing QStash features natively. To do this, specify the destination api as llm with a valid provider. The body of the published or enqueued message should contain a valid chat completion request. For these integrations, you must specify the Upstash-Callback header so that you can process the response asynchronously. Note that streaming chat completions cannot be used with them; use the chat API for streaming completions.

All the examples below can be used with Upstash-hosted models as well as OpenAI-compatible LLM providers. Support for LLM providers other than OpenAI and Upstash will be added soon.

Publishing a Chat Completion Request
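
For example, a publish request with the TypeScript client might look like the following sketch. It assumes an upstash() provider helper (analogous to the custom() helper used in the Helicone example below) and a callback URL of your own:

import { Client, upstash } from "@upstash/qstash";

const client = new Client({ token: "<QSTASH_TOKEN>" });

await client.publishJSON({
  api: {
    name: "llm",
    // upstash() is an assumed helper selecting the Upstash-hosted models
    provider: upstash(),
  },
  body: {
    // A standard chat completion request goes in the message body
    model: "meta-llama/Meta-Llama-3-8B-Instruct",
    messages: [
      {
        role: "user",
        content: "Write a haiku about message queues.",
      },
    ],
  },
  // The completion is delivered asynchronously to this callback URL
  callback: "https://example.com/callback",
});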

Enqueueing a Chat Completion Request
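
A sketch of the same request enqueued instead of published, assuming a hypothetical queue named llm-queue and the enqueueJSON method of the TypeScript client:

import { Client, upstash } from "@upstash/qstash";

const client = new Client({ token: "<QSTASH_TOKEN>" });

// "llm-queue" is a hypothetical queue name; use one of your own
await client.queue({ queueName: "llm-queue" }).enqueueJSON({
  api: {
    name: "llm",
    provider: upstash(), // assumed helper for Upstash-hosted models
  },
  body: {
    model: "meta-llama/Meta-Llama-3-8B-Instruct",
    messages: [{ role: "user", content: "Summarize the CAP theorem in two sentences." }],
  },
  callback: "https://example.com/callback",
});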

Sending Chat Completion Requests in Batches
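
Several chat completion requests can also be sent in one batch. A sketch using the TypeScript client's batchJSON method, with the same assumed upstash() provider helper:

import { Client, upstash } from "@upstash/qstash";

const client = new Client({ token: "<QSTASH_TOKEN>" });

// Each element of the batch is an independent chat completion request
await client.batchJSON([
  {
    api: { name: "llm", provider: upstash() },
    body: {
      model: "meta-llama/Meta-Llama-3-8B-Instruct",
      messages: [{ role: "user", content: "Explain HTTP callbacks in one sentence." }],
    },
    callback: "https://example.com/callback-1",
  },
  {
    api: { name: "llm", provider: upstash() },
    body: {
      model: "mistralai/Mistral-7B-Instruct-v0.2",
      messages: [{ role: "user", content: "Explain message queues in one sentence." }],
    },
    callback: "https://example.com/callback-2",
  },
]);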

Retrying After Rate Limit Resets

When rate limits are exceeded, QStash automatically schedules the retry of published or enqueued chat completion tasks based on the reset time of the rate limits. This avoids retrying prematurely, when the request is certain to fail again because the rate limit has not yet reset.

Upstash Hosted Models

You can use Upstash-hosted models for LLM completions. Upstash offers a hosted LLM service compatible with OpenAI, currently supporting the following models:

  • meta-llama/Meta-Llama-3-8B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.2

Based on adoption, we plan to add more models in the future. Upstash-hosted models are priced as part of QStash at $0.30 per 1M tokens. Use the upstash provider to access these models.

Chat API

While the use of publish is encouraged, a synchronous chat API is also available for immediate completions. The chat API can be used with both the Upstash and OpenAI providers.
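
A minimal sketch with the TypeScript client, assuming the SDK exposes the chat API through client.chat().create() and an upstash() provider helper:

import { Client, upstash } from "@upstash/qstash";

const client = new Client({ token: "<QSTASH_TOKEN>" });

// Synchronous completion: the response is returned directly,
// no callback URL is involved
const response = await client.chat().create({
  provider: upstash(), // assumed helper for Upstash-hosted models
  model: "meta-llama/Meta-Llama-3-8B-Instruct",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(response.choices[0].message.content);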

It is also possible to simulate a chat session by passing the history of user and assistant messages to the request.
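
For example, earlier user and assistant turns can be replayed in the messages array (same assumed client.chat().create() call and client as above):

const response = await client.chat().create({
  provider: upstash(),
  model: "meta-llama/Meta-Llama-3-8B-Instruct",
  messages: [
    // Previous turns of the conversation provide context
    { role: "user", content: "My name is Ada." },
    { role: "assistant", content: "Nice to meet you, Ada! How can I help?" },
    // The new message the model should answer
    { role: "user", content: "What is my name?" },
  ],
});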

Chat completion responses can also be delivered as a stream of chunks.
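
A sketch of streaming, assuming the same client.chat().create() call accepts a stream flag and returns an async iterable of OpenAI-style chunks:

const stream = await client.chat().create({
  provider: upstash(),
  model: "meta-llama/Meta-Llama-3-8B-Instruct",
  messages: [{ role: "user", content: "Tell me a short story." }],
  stream: true, // deliver the completion as a stream of chunks
});

// Each chunk carries a small delta of the generated text
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}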

The code snippets above can also be run with other LLM providers such as OpenAI, simply by specifying it as the provider and passing your API key.
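
For instance, switching the chat example above to OpenAI only changes the provider and the API key (openai() is an assumed provider helper, analogous to the custom() helper used in the Helicone example below):

import { Client, openai } from "@upstash/qstash";

const client = new Client({ token: "<QSTASH_TOKEN>" });

const response = await client.chat().create({
  // openai() is an assumed helper; pass your own OpenAI API key
  provider: openai({ token: "<OPENAI_API_KEY>" }),
  model: "gpt-3.5-turbo",
  messages: [{ role: "user", content: "Hello from QStash!" }],
});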

The chat completions endpoint is also compatible with the OpenAI REST API, so you can use a wide variety of tools and libraries with it, including the official OpenAI Python client.

from openai import OpenAI

# Point the OpenAI client at the QStash chat completions endpoint
# and authenticate with your QStash token instead of an OpenAI API key.
client = OpenAI(
    base_url="https://qstash.upstash.io/llm/v1",
    api_key="<QSTASH_TOKEN>",
)

completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Test message from the OpenAI client",
        }
    ],
    model="meta-llama/Meta-Llama-3-8B-Instruct",
)

print(completion)

Analytics via Helicone

Helicone is a powerful observability platform that provides valuable insights into your LLM usage. Integrating Helicone with QStash is straightforward.

To enable Helicone observability in QStash, you simply need to pass your Helicone API key when initializing your model. Here’s how to do it for both custom models and OpenAI:

import { Client, custom } from "@upstash/qstash";

const client = new Client({
  token: "<QSTASH_TOKEN>",
});

await client.publishJSON({
  api: {
    name: "llm",
    provider: custom({
      token: "XXX",
      baseUrl: "https://api.together.xyz",
    }),
    analytics: { name: "helicone", token: process.env.HELICONE_API_KEY! },
  },
  body: {
    model: "meta-llama/Llama-3-8b-chat-hf",
    messages: [
      {
        role: "user",
        content: "hello",
      },
    ],
  },
  callback: "https://oz.requestcatcher.com/",
});
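
And a sketch of the same setup with OpenAI as the provider, assuming an openai() helper that takes your OpenAI API key in place of custom():

import { Client, openai } from "@upstash/qstash";

const client = new Client({
  token: "<QSTASH_TOKEN>",
});

await client.publishJSON({
  api: {
    name: "llm",
    // openai() is an assumed provider helper; pass your OpenAI API key
    provider: openai({ token: "<OPENAI_API_KEY>" }),
    analytics: { name: "helicone", token: process.env.HELICONE_API_KEY! },
  },
  body: {
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: "hello",
      },
    ],
  },
  callback: "https://oz.requestcatcher.com/",
});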