Calling LLM APIs with Python

March 28, 2026
#genai #ai #llm #python

This tutorial covers calling LLM APIs using Python — the same concepts from Calling LLM APIs with JavaScript, but with Python’s SDK and idioms. If you’ve already read the JavaScript version, this will feel familiar. If Python is your primary language, start here.

You should be familiar with What is Generative AI? and Tokens, Context Windows & Model Parameters.

Setup

Prerequisites

  • Python 3.9+
  • An OpenAI API key (sign up at platform.openai.com)

Install the SDK

pip install openai

Set Your API Key

export OPENAI_API_KEY="your-key-here"
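The SDK reads OPENAI_API_KEY from the environment automatically. A small startup guard (a sketch, not part of the SDK) turns a missing key into a clear message instead of an authentication error on your first request:

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running this script")
    return key
```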

Your First API Call

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is a list comprehension in Python?"}],
)

print(response.choices[0].message.content)

Save this as main.py and run it:

python main.py

Understanding the Response

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello"}],
)

print(response.choices[0].message.content)   # The generated text
print(response.choices[0].finish_reason)      # "stop", "length", etc.
print(response.usage.prompt_tokens)           # Tokens in your input
print(response.usage.completion_tokens)       # Tokens in the output
print(response.usage.total_tokens)            # Total
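The usage numbers are what you're billed for. A small helper converts them into a dollar estimate — the prices below are placeholders, so check your provider's current pricing:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Rough cost in dollars for one request, given per-million-token prices."""
    return (prompt_tokens * input_price_per_million
            + completion_tokens * output_price_per_million) / 1_000_000


# Hypothetical prices: $2.50 per 1M input tokens, $10.00 per 1M output tokens
cost = estimate_cost(1200, 300, 2.50, 10.00)
print(f"${cost:.4f}")  # $0.0060
```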

Managing Conversations

The API is stateless — you send the full conversation history with each request:

from openai import OpenAI

client = OpenAI()
messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
]


def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(model="gpt-4o", messages=messages)

    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


print(chat("What is a decorator?"))
print(chat("Show me a simple example."))
# The model remembers the first question because the full history is sent
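Because the full history is resent on every turn, long conversations eventually exceed the context window (and cost more per request). One common pattern, sketched below, keeps the system message and drops the oldest exchanges once the list grows past a limit; production code often trims by token count instead of message count:

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system message (if any) plus the most recent messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Assumes max_messages is larger than the number of system messages
    return system + rest[-(max_messages - len(system)):]
```

Call this on `messages` before each request once the conversation gets long.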

Tuning Parameters

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,       # Near-deterministic output (range is 0–2)
    max_tokens=500,      # Cap the response length, in tokens
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
)

Streaming Responses

Streaming delivers the response incrementally as it is generated, so users see text immediately instead of waiting for the full completion:

stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{"role": "user", "content": "Explain Python generators."}],
)

for chunk in stream:
    text = chunk.choices[0].delta.content
    if text:
        print(text, end="", flush=True)
print()

Error Handling

import time
from openai import OpenAI, RateLimitError, APIStatusError

client = OpenAI()


def safe_chat(messages: list, retries: int = 2):
    for attempt in range(retries + 1):
        try:
            return client.chat.completions.create(model="gpt-4o", messages=messages)
        except RateLimitError:
            wait = 2**attempt
            print(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500:
                wait = 2**attempt
                print(f"Server error. Retrying in {wait}s...")
                time.sleep(wait)
            else:
                raise
    raise RuntimeError("Max retries exceeded")
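The same retry logic generalizes into a decorator. This sketch adds random jitter to the backoff, which helps when many clients hit a rate limit at the same moment — the exception types are passed in, so you can use it with `RateLimitError` or anything else you consider retryable:

```python
import random
import time
from functools import wraps


def with_retries(retryable: tuple, retries: int = 2, base_delay: float = 1.0):
    """Retry a function on the given exceptions with exponential backoff plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except retryable:
                    if attempt == retries:
                        raise  # out of retries: surface the error
                    time.sleep(base_delay * 2**attempt + random.uniform(0, base_delay))
        return wrapper
    return decorator
```

For example, `@with_retries((RateLimitError,))` on a function that calls the API.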

Async Support

Python’s OpenAI SDK has first-class async support, which is useful for web servers and concurrent applications:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()


async def chat(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


async def main():
    # Run multiple calls concurrently
    results = await asyncio.gather(
        chat("What is a list?"),
        chat("What is a dict?"),
        chat("What is a set?"),
    )
    for r in results:
        print(r[:80], "...\n")


asyncio.run(main())

This makes three API calls in parallel instead of sequentially — much faster when you have multiple independent requests.
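With many prompts, an unbounded gather can itself trigger rate limits. A common pattern is capping in-flight requests with asyncio.Semaphore; the sketch below uses a stand-in coroutine where the real async client call would go:

```python
import asyncio


async def fake_chat(prompt: str) -> str:
    """Stand-in for an API call; replace with the real async client call."""
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


async def bounded_gather(prompts: list[str], limit: int = 3) -> list[str]:
    """Run calls concurrently, but with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def one(prompt: str) -> str:
        async with sem:
            return await fake_chat(prompt)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(bounded_gather([f"q{i}" for i in range(10)], limit=3))
print(len(results))  # 10
```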

Building a CLI Chatbot

from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful, concise coding assistant."}]

print('Chat started. Type "quit" to exit.\n')

while True:
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break

    messages.append({"role": "user", "content": user_input})

    stream = client.chat.completions.create(
        model="gpt-4o", stream=True, messages=messages
    )

    print("AI: ", end="")
    reply = ""
    for chunk in stream:
        text = chunk.choices[0].delta.content
        if text:
            print(text, end="", flush=True)
            reply += text
    print("\n")

    messages.append({"role": "assistant", "content": reply})

Using Other Providers

Anthropic

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    messages=[{"role": "user", "content": "What is a decorator?"}],
)

print(response.content[0].text)

OpenAI-Compatible Providers

Many local and cloud providers expose an OpenAI-compatible API:

from openai import OpenAI

# Example: Ollama running locally
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello!"}],
)
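If you switch providers often, a small lookup table keeps the connection details in one place. The entries below are illustrative — base URLs and model names vary by provider and version:

```python
# Hypothetical provider table; adjust URLs and models for your setup.
PROVIDERS = {
    "openai": {"base_url": None, "model": "gpt-4o"},
    "ollama": {"base_url": "http://localhost:11434/v1", "model": "llama3"},
}


def client_kwargs(name: str) -> dict:
    """Return keyword arguments for constructing an OpenAI client for this provider."""
    cfg = PROVIDERS[name]
    kwargs = {}
    if cfg["base_url"]:
        kwargs["base_url"] = cfg["base_url"]
        kwargs["api_key"] = "unused"  # many local servers ignore the key
    return kwargs
```

Then `client = OpenAI(**client_kwargs("ollama"))` and pass `PROVIDERS[name]["model"]` as the model.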

What’s Next?

You’re now calling LLM APIs in both JavaScript and Python. The next section of this series covers RAG — combining LLM calls with your own data to build applications that can answer questions about documents, codebases, and databases. Start with Introduction to RAG.
