Deploying AI Models with FastAPI
Build production-ready AI APIs
Learn how to deploy AI and machine learning models as scalable, high-performance APIs with FastAPI, following best practices for production environments.
Introduction to FastAPI for AI Deployment
FastAPI is a modern, high-performance web framework for building APIs with Python. Its combination of speed, automatic documentation, and type checking makes it an excellent choice for deploying AI models in production.
Key advantages of FastAPI for AI deployment include:
- High performance, built on Starlette and Pydantic
- Automatic API documentation with Swagger UI
- Type checking and validation with Python type hints
- Asynchronous support for handling concurrent requests
- Easy integration with machine learning frameworks
Setting Up Your Development Environment
Let's start by setting up a development environment for our AI API:
# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
pip install fastapi uvicorn pydantic python-multipart
pip install torch transformers sentence-transformers pillow
Project Structure
A well-organized project structure helps maintain and scale your API:
ai_api/
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI application
│   ├── models/
│   │   ├── __init__.py
│   │   └── ml_models.py     # ML model loading and inference
│   ├── routers/
│   │   ├── __init__.py
│   │   ├── text.py          # Text-related endpoints
│   │   └── image.py         # Image-related endpoints
│   ├── schemas/
│   │   ├── __init__.py
│   │   └── request.py       # Pydantic models for requests/responses
│   └── utils/
│       ├── __init__.py
│       └── helpers.py       # Utility functions
├── tests/
│   ├── __init__.py
│   └── test_api.py          # API tests
├── .env                     # Environment variables
├── Dockerfile               # Docker configuration
├── requirements.txt         # Dependencies
└── README.md                # Documentation
Creating a Basic FastAPI Application
Let's create a simple FastAPI application (app/main.py):
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import uvicorn

# Import routers
from app.routers import text, image

# Create FastAPI app
app = FastAPI(
    title="AI Model API",
    description="API for serving AI models for text and image processing",
    version="1.0.0"
)

# Configure CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, replace with specific origins
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Include routers
app.include_router(text.router, prefix="/api/text", tags=["Text"])
app.include_router(image.router, prefix="/api/image", tags=["Image"])

# Health check endpoint
@app.get("/health", tags=["Health"])
async def health_check():
    return {"status": "healthy"}

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)
Defining Request and Response Models
Define Pydantic models for request and response validation (app/schemas/request.py):
from pydantic import BaseModel, Field
from typing import List, Optional

class TextRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000, description="Input text for processing")
    options: Optional[dict] = Field(default={}, description="Additional options for processing")

class TextResponse(BaseModel):
    result: str = Field(..., description="Processed text result")
    confidence: float = Field(..., ge=0, le=1, description="Confidence score")
    processing_time: float = Field(..., description="Processing time in seconds")

class ImageRequest(BaseModel):
    image_url: Optional[str] = Field(default=None, description="URL of the image to process")
    # Note: For file uploads, we'll use Form and File instead of this model

class ImageResponse(BaseModel):
    results: List[dict] = Field(..., description="List of detection results")
    processing_time: float = Field(..., description="Processing time in seconds")
Loading and Serving AI Models
Let's create a module for loading and serving our AI models (app/models/ml_models.py):
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import time

# Singleton pattern for model loading
class ModelLoader:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(ModelLoader, cls).__new__(cls)
            cls._instance._load_models()
        return cls._instance

    def _load_models(self):
        # Text classification model
        self.sentiment_model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1
        )

        # Text embedding model
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

        # Image classification model
        self.image_model = pipeline(
            "image-classification",
            model="google/vit-base-patch16-224",
            device=0 if torch.cuda.is_available() else -1
        )

    def get_sentiment_model(self):
        return self.sentiment_model

    def get_embedding_model(self):
        return self.embedding_model

    def get_image_model(self):
        return self.image_model

# Functions for model inference
def analyze_sentiment(text):
    start_time = time.time()
    model = ModelLoader().get_sentiment_model()
    result = model(text)[0]
    return {
        "result": result["label"],
        "confidence": result["score"],
        "processing_time": time.time() - start_time
    }

def generate_embeddings(text):
    start_time = time.time()
    model = ModelLoader().get_embedding_model()
    embedding = model.encode(text)
    return {
        "result": embedding.tolist(),
        "confidence": 1.0,  # Embeddings don't have confidence scores
        "processing_time": time.time() - start_time
    }

def classify_image(image_path):
    start_time = time.time()
    model = ModelLoader().get_image_model()
    results = model(image_path)
    return {
        "results": results,
        "processing_time": time.time() - start_time
    }
Creating API Endpoints
Now, let's create endpoints for text processing (app/routers/text.py):
from fastapi import APIRouter, HTTPException, BackgroundTasks
from app.schemas.request import TextRequest, TextResponse
from app.models.ml_models import analyze_sentiment, generate_embeddings
import time

router = APIRouter()

@router.post("/sentiment", response_model=TextResponse)
async def sentiment_analysis(request: TextRequest):
    try:
        result = analyze_sentiment(request.text)
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@router.post("/embeddings", response_model=dict)
async def text_embeddings(request: TextRequest):
    try:
        result = generate_embeddings(request.text)
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Example of a background task
@router.post("/async-process")
async def process_text_async(request: TextRequest, background_tasks: BackgroundTasks):
    # Add task to background
    background_tasks.add_task(process_long_running_task, request.text)
    return {"message": "Processing started in background"}

# Function to be executed in background
def process_long_running_task(text: str):
    # Simulate long-running task
    time.sleep(10)
    # Process text and store results
    result = analyze_sentiment(text)
    # Here you would typically store the result in a database or cache
And for image processing (app/routers/image.py):
from fastapi import APIRouter, HTTPException, UploadFile, File, Form
from app.models.ml_models import classify_image
import shutil
import os
import tempfile
from typing import Optional
import aiohttp
import aiofiles

router = APIRouter()

@router.post("/classify")
async def classify_uploaded_image(
    file: Optional[UploadFile] = File(None),
    image_url: Optional[str] = Form(None)
):
    if file is None and image_url is None:
        raise HTTPException(status_code=400, detail="Either file or image_url must be provided")

    try:
        # Create a temporary file
        with tempfile.NamedTemporaryFile(delete=False) as temp:
            temp_path = temp.name
            if file:
                # Save uploaded file to temp location
                shutil.copyfileobj(file.file, temp)
            elif image_url:
                # Download image from URL
                async with aiohttp.ClientSession() as session:
                    async with session.get(image_url) as response:
                        if response.status != 200:
                            raise HTTPException(status_code=400, detail="Could not download image")
                        async with aiofiles.open(temp_path, 'wb') as f:
                            await f.write(await response.read())

        # Process the image
        result = classify_image(temp_path)

        # Clean up
        os.unlink(temp_path)

        return result
    except HTTPException:
        # Propagate HTTP errors (e.g. the 400 above) without converting them into 500s
        if 'temp_path' in locals() and os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
    except Exception as e:
        # Clean up in case of error
        if 'temp_path' in locals() and os.path.exists(temp_path):
            os.unlink(temp_path)
        raise HTTPException(status_code=500, detail=str(e))
Optimizing Performance
To handle production workloads, we need to optimize our API:
1. Model Optimization
- Quantization: Reduce model size and increase inference speed (see the sketch after this list)
- Model Pruning: Remove unnecessary weights
- Distillation: Create smaller models that mimic larger ones
- Batching: Process multiple requests together
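To make the first item concrete, here is a minimal sketch of post-training dynamic quantization with PyTorch. The small torch.nn.Sequential model is only a stand-in for whatever model you actually serve, and the conversion targets linear layers; always re-check accuracy after quantizing, since sensitivity varies by model.

import torch

# Minimal sketch: post-training dynamic quantization of a placeholder model.
# `model` stands in for any torch.nn.Module you serve (e.g. the PyTorch module
# underlying a Hugging Face pipeline).
model = torch.nn.Sequential(
    torch.nn.Linear(768, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 2),
)

# Convert linear layers to int8 weights; activations are quantized on the fly at inference
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8,
)

# The quantized model is a drop-in replacement for CPU inference
with torch.no_grad():
    output = quantized_model(torch.randn(1, 768))
print(output.shape)  # torch.Size([1, 2])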
2. Asynchronous Processing
FastAPI supports asynchronous request handling:
@router.post("/batch-process") async def batch_process(requests: List[TextRequest]): # Process multiple requests concurrently results = await asyncio.gather(*[process_single_request(req) for req in requests]) return results async def process_single_request(request: TextRequest): # Use asyncio.to_thread for CPU-bound operations result = await asyncio.to_thread(analyze_sentiment, request.text) return result
3. Caching
Implement caching to avoid redundant computations:
from fastapi_cache import FastAPICache
from fastapi_cache.backends.redis import RedisBackend
from fastapi_cache.decorator import cache
from redis import asyncio as aioredis

# Initialize cache in main.py
@app.on_event("startup")
async def startup():
    redis = aioredis.from_url("redis://localhost", encoding="utf8")
    FastAPICache.init(RedisBackend(redis), prefix="fastapi-cache:")

# Use cache decorator
@router.post("/sentiment")
@cache(expire=3600)  # Cache for 1 hour
async def sentiment_analysis(request: TextRequest):
    result = analyze_sentiment(request.text)
    return result
Containerization with Docker
Create a Dockerfile for your API:
FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
And a docker-compose.yml file for local development:
version: '3'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - .:/app
    environment:
      - ENVIRONMENT=development
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
Scaling in Production
For production deployment, consider these scaling strategies:
1. Horizontal Scaling
Deploy multiple instances of your API behind a load balancer:
- Use Kubernetes for orchestration (a minimal manifest sketch follows this list)
- Implement auto-scaling based on CPU/memory usage
- Use a load balancer (e.g., Nginx, Traefik) to distribute traffic
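As a sketch of what this can look like on Kubernetes (assuming the Docker image above has been pushed to a registry under the placeholder name your-registry/ai-api:latest), a Deployment plus a HorizontalPodAutoscaler might be defined as follows; replica counts and resource values are illustrative and should be tuned to your model's footprint.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-api
spec:
  replicas: 3                                  # several instances behind the Service/load balancer
  selector:
    matchLabels:
      app: ai-api
  template:
    metadata:
      labels:
        app: ai-api
    spec:
      containers:
        - name: ai-api
          image: your-registry/ai-api:latest   # placeholder image reference
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 4Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70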
2. Model Serving Platforms
Consider specialized platforms for model serving:
- TorchServe for PyTorch models (an example gateway call follows this list)
- TensorFlow Serving for TensorFlow models
- Triton Inference Server for multiple frameworks
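When one of these model servers handles inference, FastAPI often stays in front as a thin gateway. Below is a minimal sketch that forwards requests to a TorchServe instance over its REST inference API; it assumes TorchServe is running locally on its default inference port (8080), that a model has been registered under the placeholder name sentiment, and that httpx has been added as an extra dependency.

import httpx
from fastapi import APIRouter, HTTPException

router = APIRouter()

# Assumption: TorchServe is reachable at this URL and a model named "sentiment"
# (a placeholder) has been registered with it.
TORCHSERVE_URL = "http://localhost:8080/predictions/sentiment"

@router.post("/sentiment-via-torchserve")
async def sentiment_via_torchserve(payload: dict):
    # Forward the request body to the model server and relay its response
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(TORCHSERVE_URL, json=payload)
    if response.status_code != 200:
        raise HTTPException(status_code=502, detail="Model server error")
    return response.json()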
3. Serverless Deployment
For variable workloads, serverless can be cost-effective:
- AWS Lambda with API Gateway (a Lambda adapter sketch follows this list)
- Google Cloud Functions
- Azure Functions
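As one concrete path, an existing FastAPI app can be exposed on AWS Lambda through the Mangum adapter (an extra dependency, installed with pip install mangum). A minimal sketch, assuming the app object from app/main.py:

from mangum import Mangum

from app.main import app  # the FastAPI app defined earlier

# Mangum translates API Gateway / Lambda events into ASGI requests for FastAPI.
# Point the Lambda handler setting at this object (e.g. "lambda_handler.handler").
handler = Mangum(app)

Keep in mind that heavyweight model dependencies can exceed Lambda packaging limits, so container-image deployment or lighter models are often required for this route.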
Monitoring and Logging
Implement comprehensive monitoring for your API:
from fastapi import FastAPI, Request
import time
import logging
from prometheus_fastapi_instrumentator import Instrumentator

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

# Add middleware for request logging
@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    logger.info(f"Path: {request.url.path} Method: {request.method} Time: {process_time:.4f}s")
    return response

# Set up Prometheus metrics
Instrumentator().instrument(app).expose(app)
Security Best Practices
Implement these security measures for your API:
- Authentication: Use OAuth2 or API keys (an API-key sketch follows this list)
- Rate Limiting: Prevent abuse with request limits
- Input Validation: Validate all inputs with Pydantic
- HTTPS: Always use TLS in production
- CORS: Configure proper Cross-Origin Resource Sharing
- Dependency Updates: Regularly update dependencies
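To make the first item above concrete, here is a minimal sketch of API-key authentication built on FastAPI's dependency system and fastapi.security.APIKeyHeader. The header name, environment variable, and fallback value are placeholders; in production, load keys from a secrets manager.

import os
import secrets

from fastapi import Depends, FastAPI, HTTPException, status
from fastapi.security import APIKeyHeader

# Placeholder header name and key source; in production, load keys from a secrets manager
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)
EXPECTED_API_KEY = os.getenv("API_KEY", "change-me")

async def require_api_key(api_key: str = Depends(api_key_header)):
    # compare_digest avoids leaking information through timing differences
    if not api_key or not secrets.compare_digest(api_key, EXPECTED_API_KEY):
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid or missing API key",
        )
    return api_key

app = FastAPI()

# Protect an endpoint (or an entire router) by declaring the dependency
@app.get("/secure-health", dependencies=[Depends(require_api_key)])
async def secure_health_check():
    return {"status": "healthy"}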
Conclusion
FastAPI provides an excellent framework for deploying AI models in production. By following the best practices outlined in this tutorial, you can build scalable, secure, and high-performance APIs that serve your machine learning models effectively.
Remember that deploying AI in production is an ongoing process that requires monitoring, maintenance, and continuous improvement. As your application grows, you may need to adapt your architecture to meet changing requirements and scale to handle increased traffic.