Calling LLM APIs with Python
This tutorial covers calling LLM APIs using Python — the same concepts from Calling LLM APIs with JavaScript, but with Python’s SDK and idioms. If you’ve already read the JavaScript version, this will feel familiar. If Python is your primary language, start here.
You should be familiar with What is Generative AI? and Tokens, Context Windows & Model Parameters.
Setup
Prerequisites
- Python 3.9+
- An OpenAI API key (sign up at platform.openai.com)
Install the SDK
pip install openai
Set Your API Key
export OPENAI_API_KEY="your-key-here"
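The SDK reads OPENAI_API_KEY from the environment automatically, so nothing else is needed. Still, a quick sanity check can save a confusing authentication error later; a minimal sketch (the helper name is made up for illustration):

```python
import os

# Fail fast with a clear message if the key is missing. The OpenAI SDK would
# otherwise only raise an authentication error when the first request is made.
def require_api_key(var: str = "OPENAI_API_KEY") -> str:
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running this script")
    return key
```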
Your First API Call
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is a list comprehension in Python?"}],
)
print(response.choices[0].message.content)
Save this as main.py and run it:
python main.py
Understanding the Response
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content) # The generated text
print(response.choices[0].finish_reason) # "stop", "length", etc.
print(response.usage.prompt_tokens) # Tokens in your input
print(response.usage.completion_tokens) # Tokens in the output
print(response.usage.total_tokens) # Total
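In practice, finish_reason is worth checking before trusting a reply: a value of "length" means the model hit the token limit mid-answer. A small hypothetical helper (not part of the SDK) that works with any object shaped like the response above:

```python
def check_complete(response) -> str:
    """Return the reply text, raising if the model ran out of tokens.

    Hypothetical helper: a truncated reply is usually a sign to raise
    max_tokens and retry, or to ask the model to continue.
    """
    choice = response.choices[0]
    if choice.finish_reason == "length":
        raise ValueError("Reply was truncated; raise max_tokens and retry")
    return choice.message.content
```

Usage: `text = check_complete(response)` after any chat completion call.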
Managing Conversations
The API is stateless — you send the full conversation history with each request:
from openai import OpenAI
client = OpenAI()
messages = [
    {"role": "system", "content": "You are a concise Python tutor."},
]
def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply
print(chat("What is a decorator?"))
print(chat("Show me a simple example."))
# The model remembers the first question because the full history is sent
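Because the full history is re-sent every turn, the message list grows without bound and will eventually exceed the model's context window (and inflate costs). A common fix is trimming old turns while always keeping the system message; a sketch of one hypothetical strategy (real applications often count tokens rather than messages):

```python
# Hypothetical trimming strategy: keep the system message plus the most
# recent `keep` messages, dropping the oldest turns first.
def trim_history(messages: list[dict], keep: int = 10) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep:]
```

Call this on the history before each request once the conversation gets long.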
Tuning Parameters
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # Most deterministic setting (output can still vary slightly)
    max_tokens=500,  # Limit response length
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
)
Streaming Responses
Stream the response token by token for a better user experience:
stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{"role": "user", "content": "Explain Python generators."}],
)
for chunk in stream:
    text = chunk.choices[0].delta.content
    if text:
        print(text, end="", flush=True)
print()
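When streaming, you usually still need the complete text afterwards, for example to append it to the conversation history. A hypothetical helper that prints deltas as they arrive and returns the joined result (it accepts any iterable shaped like the SDK's stream chunks):

```python
# Hypothetical helper: print each streamed delta as it arrives and
# return the fully assembled reply at the end.
def collect_stream(chunks) -> str:
    parts = []
    for chunk in chunks:
        text = chunk.choices[0].delta.content
        if text:  # the final chunk's delta content is None
            print(text, end="", flush=True)
            parts.append(text)
    print()
    return "".join(parts)
```

Usage: `reply = collect_stream(stream)` in place of the loop above.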
Error Handling
import time
from openai import OpenAI, RateLimitError, APIStatusError
client = OpenAI()
def safe_chat(messages: list, retries: int = 2):
    for attempt in range(retries + 1):
        try:
            return client.chat.completions.create(model="gpt-4o", messages=messages)
        except RateLimitError:
            wait = 2**attempt
            print(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500:
                wait = 2**attempt
                print(f"Server error. Retrying in {wait}s...")
                time.sleep(wait)
            else:
                raise
    raise RuntimeError("Max retries exceeded")
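A common refinement of the backoff above is adding random jitter, so that many clients rate-limited at the same moment do not all retry in lockstep. A sketch (the helper and its base/cap values are illustrative, not from any SDK):

```python
import random

# Exponential backoff with jitter: double the delay each attempt, cap it,
# then randomize within +/-50% so simultaneous clients spread their retries.
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    return min(cap, base * 2**attempt) * random.uniform(0.5, 1.5)
```

Swap `wait = 2**attempt` for `wait = backoff_delay(attempt)` in the retry loop.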
Async Support
Python’s OpenAI SDK has first-class async support, which is useful for web servers and concurrent applications:
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI()
async def chat(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main():
    # Run multiple calls concurrently
    results = await asyncio.gather(
        chat("What is a list?"),
        chat("What is a dict?"),
        chat("What is a set?"),
    )
    for r in results:
        print(r[:80], "...\n")
asyncio.run(main())
This makes the three API calls concurrently instead of sequentially — much faster when you have multiple independent requests.
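One caveat: firing a large batch through asyncio.gather can itself trigger rate limits. A semaphore caps how many calls run at once; this hypothetical helper wraps gather with such a limit:

```python
import asyncio

# Hypothetical helper: run coroutines concurrently, but at most `limit`
# at a time. Useful when making many API calls without tripping rate limits.
async def bounded_gather(coros, limit: int = 5):
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Usage: `results = await bounded_gather([chat(p) for p in prompts], limit=3)`.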
Building a CLI Chatbot
from openai import OpenAI
client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful, concise coding assistant."}]
print('Chat started. Type "quit" to exit.\n')
while True:
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break

    messages.append({"role": "user", "content": user_input})
    stream = client.chat.completions.create(
        model="gpt-4o", stream=True, messages=messages
    )

    print("AI: ", end="")
    reply = ""
    for chunk in stream:
        text = chunk.choices[0].delta.content
        if text:
            print(text, end="", flush=True)
            reply += text
    print("\n")
    messages.append({"role": "assistant", "content": reply})
Using Other Providers
Anthropic
First, install the SDK: pip install anthropic
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    messages=[{"role": "user", "content": "What is a decorator?"}],
)
print(response.content[0].text)
OpenAI-Compatible Providers
Many local and cloud providers expose an OpenAI-compatible API:
from openai import OpenAI
# Example: Ollama running locally
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello!"}],
)
What’s Next?
You’re now calling LLM APIs in both JavaScript and Python. The next section of this series covers RAG — combining LLM calls with your own data to build applications that can answer questions about documents, codebases, and databases. Start with Introduction to RAG.