WSGI vs ASGI: Serving Python Models with Gunicorn or Uvicorn
You’ve trained a brilliant model, or built a nice web app with FastAPI or Flask in Python. Now comes the less glamorous step: getting predictions to users in real time. How do we translate web requests into something our Python backend can process? We need a web server to run the app so that people or other systems can send HTTP requests like “here’s some data, give me a prediction”.
There are two main types of server interface we can use: WSGI (Web Server Gateway Interface) or ASGI (Asynchronous Server Gateway Interface). This is something I’ve often glossed over in my own work, but I’ve since learned more about the differences and started to understand where one might be better suited than the other.
A quick metaphor that helped me:
- WSGI is like a sit-down restaurant: each table (request) gets its own waiter until the meal is served.
- ASGI is like a food truck with a smart queue: customers place orders, and the kitchen keeps cooking while people wait.
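To make the metaphor concrete, here is roughly what the two interfaces look like with no framework at all. This is a minimal sketch of the callable shapes the WSGI and ASGI specs define (the function names are mine), not production code:
# WSGI: a synchronous callable; the worker is tied up until it returns
def wsgi_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"prediction ready"]

# ASGI: an async callable that exchanges event messages with the server
async def asgi_app(scope, receive, send):
    # (assumes an HTTP scope)
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"prediction ready"})
Flask apps expose the first shape under the hood and FastAPI apps expose the second, which is why each needs a server that speaks its protocol.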
Gunicorn: The WSGI Workhorse
Gunicorn (“Green Unicorn”) speaks WSGI, the classic Python web standard. It’s the go-to choice for traditional frameworks like Flask or Django.
Key traits:
- Multiple workers: Gunicorn launches several independent worker processes. Each worker handles one request at a time (and if one crashes, the others keep running). A config sketch follows the Flask example below.
- CPU-friendly: Great when each request needs real CPU time — for example, a big scikit-learn model doing heavy feature engineering or a PyTorch model making predictions.
- Battle-tested: If your API doesn’t use async/await, Gunicorn is the safe, well-worn path.
For example (using Flask):
# main.py
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
model = joblib.load("model.pkl")  # CPU-heavy model
@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]
    yhat = model.predict([data])
    return jsonify(prediction=yhat[0])
Launch with:
gunicorn -w 4 main:app   # 4 workers, good for multi-core CPUs
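Passing flags on the command line works, but the same settings can also live in a config file. Here is a minimal sketch, assuming the default file name gunicorn.conf.py (which Gunicorn loads automatically from the working directory when you run gunicorn main:app):
# gunicorn.conf.py: sketch of a basic config for sync workers
import multiprocessing

bind = "0.0.0.0:8000"                          # listen address and port
workers = multiprocessing.cpu_count() * 2 + 1  # common rule of thumb for sync workers
timeout = 120                                  # give slow model inference time before the worker is restarted
Worker count is worth tuning: too few and requests queue up, too many and each process loads its own copy of the model and eats its own chunk of RAM.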
Uvicorn: The Async Speedster
Uvicorn speaks ASGI, the newer standard built for asynchronous (non-blocking) code. It shines when your app spends most of its time waiting — for example, making API calls or database queries.
Key traits:
- Event loop: By default Uvicorn runs a single-threaded “event loop”, juggling many requests without blocking.
- I/O-friendly: Perfect for apps built with FastAPI or Starlette, where you can await network calls and keep serving others while one request is waiting (see the sketch after the launch command below).
- Workers optional: You can run multiple event-loop workers if you want more CPU coverage.
For example (using FastAPI):
# main.py
from fastapi import FastAPI
import joblib
app = FastAPI()
model = joblib.load("model.pkl")  # could also call async APIs inside
@app.post("/predict")
async def predict(features: list[float]):
    yhat = model.predict([features])  # note: this blocking call ties up the event loop; for heavy models see the hybrid setup below
    return {"prediction": yhat[0]}
Launch with:
uvicorn main:app --reload   # dev mode; in production drop --reload and add --workers N if CPU-heavy
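To see where the async model pays off, here is a sketch of an endpoint that spends its time waiting on the network rather than the CPU. The httpx client and the feature-store URL are stand-ins I’ve added for illustration, not part of the example above:
# A hypothetical I/O-bound endpoint, added to the same main.py
import httpx  # async HTTP client, assumed to be installed

@app.get("/enrich/{user_id}")
async def enrich(user_id: int):
    # While this request awaits the remote call, the event loop
    # is free to start handling other incoming requests.
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://feature-store.example.com/{user_id}")  # placeholder URL
    return {"user_id": user_id, "features": resp.json()}
A WSGI worker would sit idle for that whole round trip; here the wait costs the server almost nothing.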
But there’s a way to combine these approaches if you’re using FastAPI but need both async I/O and CPU scaling: use Gunicorn workers to run the Uvicorn event loops:
gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app
This gives you Gunicorn’s multi-process power plus Uvicorn’s async support — handy for ML APIs that fetch remote data and crunch numbers.
Which Should You Use for Models?
- CPU-heavy inference (e.g., scikit-learn, PyTorch, TensorFlow): Gunicorn (multiple processes spread the CPU load)
- I/O-heavy tasks (fetching data, chaining APIs, streaming): Uvicorn (async shines when waiting is the bottleneck)
- Hybrid: Run Uvicorn inside Gunicorn to get both multiprocessing and async support, as in the config sketch below
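If the hybrid route is what you need, the worker class can also go in the config file instead of on the command line. A minimal sketch, again assuming the gunicorn.conf.py name from earlier:
# gunicorn.conf.py: hybrid setup, launched with plain `gunicorn main:app`
import multiprocessing

worker_class = "uvicorn.workers.UvicornWorker"  # each worker process runs its own Uvicorn event loop
workers = multiprocessing.cpu_count()           # scale processes to cores for CPU-bound inference
bind = "0.0.0.0:8000"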
For many data scientists, the choice of server is an afterthought. But the right one can mean the difference between a sluggish API and one that scales smoothly as requests pile up.