
WSGI vs ASGI: Serving Python Models with Gunicorn or Uvicorn

You’ve trained a brilliant model, or built a nice web app in Python with Flask or FastAPI. Now comes the less glamorous step: getting predictions to users in real time. How do we translate web requests into something our Python backend can process? We need a web server to run the app so that people or other systems can send HTTP requests like “here’s some data, give me a prediction”.

There are two main choices for the type of server we can use: WSGI (Web Server Gateway Interface) or ASGI (Asynchronous Server Gateway Interface). This is something I’d often glossed over, but I’ve since learned more about the differences and started to understand where one might be better suited than the other.
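
To make the contrast concrete, here’s roughly what each interface looks like with no framework at all. This is a minimal sketch of the two calling conventions, not production code:

# WSGI: a synchronous callable taking (environ, start_response)
# and returning an iterable of byte strings
def wsgi_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from WSGI\n"]

# ASGI: an async callable taking (scope, receive, send) that
# streams the response through awaitable send() events
async def asgi_app(scope, receive, send):
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"hello from ASGI\n"})

A server like Gunicorn calls the first style; a server like Uvicorn drives the second.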

A quick metaphor that helped me: a WSGI worker is like a cashier who serves one customer at a time, start to finish, while an ASGI worker is like a barista who takes the next order while your drink is brewing.

Gunicorn: The WSGI Workhorse

Gunicorn (“Green Unicorn”) speaks WSGI, the classic Python web standard. It’s the go-to choice for traditional frameworks like Flask or Django.

Key traits:

- Synchronous: each worker handles one request at a time, blocking until it finishes.
- Scales with processes: more workers mean more parallel requests (and more memory, since each worker loads its own copy of the app and model).
- Mature and battle-tested, and a natural fit for CPU-bound work like model inference.

For example (using Flask):

# main.py
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")  # CPU-heavy model

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]  # expects a JSON body like {"features": [...]}
    yhat = model.predict([data])     # wrap in a list: predict expects a 2-D array
    return jsonify(prediction=float(yhat[0]))  # cast the numpy scalar so it serializes as JSON

Launch with:

gunicorn -w 4 main:app   # 4 workers, good for multi-core CPUs
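
Once that’s running, you can sanity-check the endpoint from Python. This sketch assumes the requests library is installed; the URL uses gunicorn’s default bind of 127.0.0.1:8000, and the feature values are just placeholders:

import requests

resp = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},  # placeholder feature vector
)
print(resp.json())  # e.g. {"prediction": 0.0}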

Uvicorn: The Async Speedster

Uvicorn speaks ASGI, the newer standard built for asynchronous (non-blocking) code. It shines when your app spends most of its time waiting — for example, making API calls or database queries.
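
For instance, an endpoint like the sketch below spends nearly all its time awaiting the network, so a single event loop can keep many requests in flight at once. The httpx library and the remote URL here are assumptions for illustration:

import httpx
from fastapi import FastAPI

app = FastAPI()

@app.get("/enrich/{user_id}")
async def enrich(user_id: int):
    async with httpx.AsyncClient() as client:
        # while this request waits on the remote API, the event loop
        # is free to start handling other incoming requests
        resp = await client.get(f"https://api.example.com/users/{user_id}")
    return resp.json()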

Key traits:

- Asynchronous: a single event loop juggles many in-flight requests while they wait on I/O.
- Built for async frameworks like FastAPI and Starlette.
- Less help for CPU-bound work: one blocking call stalls every request on that loop.

For example (using FastAPI):

# main.py
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("model.pkl")  # loaded once at startup

@app.post("/predict")
async def predict(features: list[float]):
    # careful: model.predict() blocks, and a blocking call inside an
    # async route stalls the whole event loop (see the sketch below)
    yhat = model.predict([features])
    return {"prediction": float(yhat[0])}  # cast the numpy scalar for JSON

Launch with:

uvicorn main:app --reload   # dev mode; in production, drop --reload and use --workers N (the two don’t mix)
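
About that blocking caveat in the code comment: model.predict() holds up the event loop for everyone while it runs. One workaround is to hand the call to a worker thread with Starlette’s run_in_threadpool (FastAPI is built on Starlette; the route name here is hypothetical):

from starlette.concurrency import run_in_threadpool

@app.post("/predict_offloaded")  # hypothetical extra route for illustration
async def predict_offloaded(features: list[float]):
    # run the blocking predict() in a worker thread so the event loop
    # stays free to serve other requests in the meantime
    yhat = await run_in_threadpool(model.predict, [features])
    return {"prediction": float(yhat[0])}

Alternatively, declare the route with a plain def instead of async def and FastAPI will run it in a threadpool for you.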

But there’s a way to combine these approaches if you’re using FastAPI but need both async I/O and CPU scaling: use Gunicorn worker processes to run Uvicorn event loops:

gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app

This gives you Gunicorn’s multi-process power plus Uvicorn’s async support, which is handy for ML APIs that both fetch remote data and crunch numbers.

Which Should You Use for Models?

For many data scientists, the choice of server is an afterthought, but the right one can mean the difference between a sluggish API and one that scales smoothly as requests pile up. As a rule of thumb: if your endpoint is dominated by CPU-bound inference, Gunicorn’s process-based workers are the simpler fit; if it mostly waits on I/O, Uvicorn’s event loop goes further on less hardware; and if you need both, run Uvicorn workers under Gunicorn as shown above.
