---
slug: football-stats-pipeline
date: Dec 5, 2025
readTime_en: 10 min read
readTime_pt: 10 min leitura
title_en: Building a real-time football stats pipeline with AI enrichment
title_pt: Pipeline de estatísticas de futebol em tempo real com IA
excerpt_en: How I wired fbref data → scheduled workers → GPT analysis → live API in a weekend, entirely on serverless infrastructure.
excerpt_pt: Como liguei dados do fbref → workers agendados → análise GPT → API em direto num fim de semana, tudo em infraestrutura serverless.
tags: AI, TypeScript, Data Engineering
---

## The goal

I wanted a personal API that could answer questions like: "Which midfielders in the Premier League have the best progressive pass ratio in the last 5 games?" — with fresh data, not a CSV from 2022.

The plan: scrape fbref → process and store → enrich with AI → serve via REST. All on Val.town, zero infrastructure to manage.

## Architecture

```
[fbref scraper cron]
       ↓
[raw events store (Val blob)]
       ↓
[processing worker]  ←  runs every 15 min
       ↓
[AI enrichment]  ←  GPT-4o summaries + tags
       ↓
[HTTP API val]  ←  public REST endpoints
```
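The last box in the diagram is just an HTTP val: a fetch-style handler. A minimal sketch of how one endpoint might look — the route, the data shape, and the in-memory map standing in for blob storage are all invented for illustration, not the real implementation:

```typescript
// Stand-in for the blob-backed store: pathname -> precomputed stats.
// In the real pipeline this lookup would hit the Val blob cache instead.
const cache = new Map<string, unknown>([
  ["/players/top-progressive", [{ player: "Example FC midfielder", ratio: 0.42 }]],
]);

// An HTTP val exposes a fetch-style handler taking a Request
// and returning a Response.
async function handler(req: Request): Promise<Response> {
  const { pathname } = new URL(req.url);
  const hit = cache.get(pathname);
  if (!hit) return new Response("Not found", { status: 404 });
  return Response.json(hit);
}
```

Because every endpoint reduces to "look up a precomputed key", the handler itself stays trivial; all the real work happens upstream in the workers.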

## The scraper

fbref doesn't have an official API, so this involved parsing their HTML tables. The tricky part is rate limiting — scrape too fast and you get blocked. I settled on a minimum 2-second delay plus up to 1 second of random jitter between requests:

```ts
// Wait a random delay (minMs up to minMs + jitter ms) before each request,
// so the scraper never hits fbref at a predictable, aggressive rate.
async function fetchWithDelay(url: string, minMs = 2000, jitter = 1000) {
  const delay = minMs + Math.random() * jitter;
  await new Promise((r) => setTimeout(r, delay));
  return fetch(url, {
    headers: { "User-Agent": "personal-stats-bot/1.0" },
  });
}
```
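The parsing itself can stay dependency-free. A rough sketch of pulling rows out of a stats table — a deliberate simplification, since real fbref tables use `data-stat` attributes and nested markup, and a proper HTML parser would be safer than regexes:

```typescript
// Extract cell text from each <tr> in an HTML fragment.
// Returns one string[] per row; header and body rows are treated alike.
function parseTableRows(html: string): string[][] {
  const rows: string[][] = [];
  const rowMatches = html.match(/<tr[^>]*>[\s\S]*?<\/tr>/g) ?? [];
  for (const row of rowMatches) {
    const cells = [...row.matchAll(/<t[dh][^>]*>([\s\S]*?)<\/t[dh]>/g)]
      // Strip any inner tags (links, spans) and surrounding whitespace.
      .map((m) => m[1].replace(/<[^>]+>/g, "").trim());
    if (cells.length > 0) rows.push(cells);
  }
  return rows;
}
```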

## AI enrichment

After storing raw stats, a separate worker calls the OpenAI API to generate:

- A plain-English summary of a player's recent form
- Automatic tags (e.g. "in-form", "injury-return", "high-press-specialist")
- A form score (0–100)
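To keep the stored output usable, the model's reply has to land in a fixed shape. A sketch of how that validation might look — the field names here are my assumption, not the actual schema:

```typescript
interface Enrichment {
  summary: string;
  tags: string[];
  formScore: number; // clamped to 0–100
}

// Coerce whatever JSON the model returned into the expected shape:
// stringify the summary, drop non-string tags, clamp the score.
function parseEnrichment(raw: string): Enrichment {
  const data = JSON.parse(raw);
  return {
    summary: String(data.summary ?? ""),
    tags: Array.isArray(data.tags)
      ? data.tags.filter((t: unknown) => typeof t === "string")
      : [],
    formScore: Math.min(100, Math.max(0, Number(data.formScore) || 0)),
  };
}
```

Validating at the boundary means a malformed model reply degrades to empty fields instead of poisoning the store.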

The prompt is simple but the output is surprisingly useful for filtering.

## Result

The API now serves ~50 endpoints with sub-100ms response times (mostly from Val blob cache). Total cost: ~$2/month in OpenAI credits. Total infrastructure managed: zero.
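Most of that speed comes from read-through caching in front of the store. This is not Val Town's blob API — just a generic sketch of the pattern the blob cache plays in the pipeline:

```typescript
// Read-through cache with a TTL: the first call (or any call after expiry)
// runs the loader; every other call returns the memoized value.
function cached<T>(ttlMs: number, load: () => Promise<T>) {
  let value: T | undefined;
  let expires = 0;
  return async (): Promise<T> => {
    const now = Date.now();
    if (value === undefined || now >= expires) {
      value = await load();
      expires = now + ttlMs;
    }
    return value;
  };
}
```

With a 15-minute processing cadence, a TTL in the same range means most requests never touch anything slower than the cache.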

The code lives on Val.town at val.town/u/nmsilva — some vals are public if you want to poke around.

---pt---

## O objectivo

Queria uma API pessoal que respondesse a perguntas como: "Quais médios da Premier League têm o melhor rácio de passes progressivos nos últimos 5 jogos?" — com dados frescos, não um CSV de 2022.

O plano: fazer scraping do fbref → processar e armazenar → enriquecer com IA → servir via REST. Tudo no Val.town, zero infraestrutura a gerir.

## Arquitectura

```
[cron scraper fbref]
       ↓
[armazenamento de eventos raw (Val blob)]
       ↓
[worker de processamento]  ←  corre a cada 15 min
       ↓
[enriquecimento com IA]  ←  resumos GPT-4o + tags
       ↓
[HTTP API val]  ←  endpoints REST públicos
```

## O scraper

O fbref não tem uma API oficial, por isso isto envolvia fazer parse das tabelas HTML. A parte complicada é o rate limiting — scraping demasiado rápido e ficas bloqueado. Cheguei a um atraso mínimo de 2 segundos mais até 1 segundo de jitter aleatório entre pedidos:

```ts
// Espera um atraso aleatório (minMs até minMs + jitter ms) antes de cada pedido,
// para que o scraper nunca atinja o fbref a um ritmo previsível e agressivo.
async function fetchWithDelay(url: string, minMs = 2000, jitter = 1000) {
  const delay = minMs + Math.random() * jitter;
  await new Promise((r) => setTimeout(r, delay));
  return fetch(url, {
    headers: { "User-Agent": "personal-stats-bot/1.0" },
  });
}
```

## Enriquecimento com IA

Após armazenar estatísticas raw, um worker separado chama a API da OpenAI para gerar:

- Um resumo em linguagem simples da forma recente de um jogador
- Tags automáticas (ex: "em forma", "regresso de lesão", "especialista em pressing")
- Uma pontuação de forma (0–100)

O prompt é simples mas o output é surpreendentemente útil para filtrar.

## Resultado

A API serve agora ~50 endpoints com tempos de resposta abaixo de 100ms (principalmente cache do Val blob). Custo total: ~$2/mês em créditos OpenAI. Infraestrutura gerida: zero.

O código está no Val.town em val.town/u/nmsilva — alguns vals são públicos se quiseres explorar.
