Download Historical Football Data as CSV with Python
A practical Python guide for downloading historical football matches and match stats to CSV. Covers competition IDs, season IDs, pagination, xG, and rate limits.
If you want a CSV of historical football data, do not start by guessing match IDs. Start with the catalog, resolve the competition and season IDs, then paginate matches and enrich each match with stats.
This guide uses TheStatsAPI's documented REST flow:
GET /football/competitions to find the league or tournament.
GET /football/competitions/{competition_id}/seasons to find historical seasons.
GET /football/matches?competition_id=...&season_id=... to download fixtures and results.
GET /football/matches/{match_id}/stats to add shots, possession, xG, corners, and cards.
The examples below use the Premier League because it has stable historical coverage and xG data.
Set up the Python client
python -m pip install requests pandas
export THESTATSAPI_KEY="your_api_key"
Create client.py:
import os
import time

import requests

API_KEY = os.environ["THESTATSAPI_KEY"]
BASE_URL = "https://api.thestatsapi.com/api"


def get(endpoint, params=None, retries=4):
    """GET an endpoint and return the parsed JSON, retrying on HTTP 429."""
    for attempt in range(retries):
        response = requests.get(
            f"{BASE_URL}{endpoint}",
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Accept": "application/json",
            },
            params=params or {},
            timeout=30,
        )
        if response.status_code == 429:
            # Rate limited: wait 15s, 30s, 45s, ... capped at 60s, then retry.
            time.sleep(min(15 * (attempt + 1), 60))
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Rate limited after {retries} attempts")
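The retry loop above sleeps on a fixed schedule. If the API also sends the standard HTTP Retry-After header on 429 responses (an assumption; this guide does not confirm it), the wait could follow the server's hint instead. A minimal sketch, where retry_delay is a hypothetical helper and not part of the client above:

```python
def retry_delay(headers, attempt, cap=60):
    """Return the number of seconds to wait before the next retry.

    Prefers a numeric Retry-After header when one is present (assumed,
    not confirmed for this API). Otherwise falls back to the same linear
    backoff used in get(): 15s, 30s, 45s, ... capped at `cap`.
    """
    raw = headers.get("Retry-After")
    if raw is not None:
        try:
            return min(max(float(raw), 0.0), cap)
        except ValueError:
            pass  # HTTP-date form or unparseable value: use the fallback
    return min(15 * (attempt + 1), cap)


print(retry_delay({"Retry-After": "7"}, attempt=0))
print(retry_delay({}, attempt=1))
```

Inside get(), time.sleep(retry_delay(response.headers, attempt)) would then replace the fixed sleep.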
Find the competition ID
Search by name, then pick the exact competition row you want. Searching for "Premier League" returns multiple leagues, so check the country field before hard-coding an ID.
import time
from client import get

result = get("/football/competitions", {
    "search": "Premier League",
    "per_page": 10,
})
for competition in result["data"]:
    print(competition["id"], competition["name"], competition["country"])
For the English Premier League, the API returns:
comp_3039 Premier League England
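Because several rows can share a name, it may be safer to select the row in code than to read it off the printed list. The sketch below assumes each row carries the id, name, and country fields printed above. pick_competition is a hypothetical helper, and the second sample row is made up for illustration.

```python
def pick_competition(rows, name, country):
    """Return the single row matching name and country exactly, or raise."""
    hits = [
        row for row in rows
        if row.get("name") == name and row.get("country") == country
    ]
    if len(hits) != 1:
        raise LookupError(
            f"Expected 1 match for {name!r} in {country!r}, got {len(hits)}"
        )
    return hits[0]


# First row taken from the output above; second row is invented sample data.
sample = [
    {"id": "comp_3039", "name": "Premier League", "country": "England"},
    {"id": "comp_0000", "name": "Premier League", "country": "Exampleland"},
]
print(pick_competition(sample, "Premier League", "England")["id"])
```

Raising on zero or multiple matches means a wrong ID fails loudly rather than silently downloading the wrong league.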
Get season IDs
Do not pass plain years like season=2024. Use the season IDs returned by the seasons endpoint.
seasons = get("/football/competitions/comp_3039/seasons", {"per_page": 20})
for season in seasons["data"]:
    print(season["id"], season["name"], season["year"])
Example IDs:
sn_3057848 Premier League 24/25 24/25
sn_606923 Premier League 23/24 23/24
sn_654318 Premier League 22/23 22/23
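To avoid copying IDs by hand, a season label can be mapped to its ID from the seasons payload. This is a sketch only: it assumes the year field is a string such as "24/25", as the printed output suggests, and season_id_for is a hypothetical helper.

```python
def season_id_for(seasons, label):
    """Map a season label such as '24/25' to its ID via the `year` field."""
    by_year = {season["year"]: season["id"] for season in seasons}
    if label not in by_year:
        raise LookupError(f"No season labelled {label!r}; known: {sorted(by_year)}")
    return by_year[label]


# Rows copied from the example IDs above.
sample_seasons = [
    {"id": "sn_3057848", "name": "Premier League 24/25", "year": "24/25"},
    {"id": "sn_606923", "name": "Premier League 23/24", "year": "23/24"},
    {"id": "sn_654318", "name": "Premier League 22/23", "year": "22/23"},
]
print(season_id_for(sample_seasons, "23/24"))
```

With the real client, season_id_for(seasons["data"], "24/25") would stand in for a hard-coded ID.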
Download all matches for a season
List endpoints return a meta block with page, per_page, total, and total_pages. Use that to paginate.
import time

from client import get


def get_all(endpoint, params):
    """Fetch every page of a list endpoint and return the combined rows."""
    rows = []
    page = 1
    while True:
        payload = get(endpoint, {**params, "page": page, "per_page": 100})
        rows.extend(payload["data"])
        if page >= payload["meta"]["total_pages"]:
            break
        page += 1
        time.sleep(2)  # pause between pages to stay under the rate limit
    return rows


matches = get_all("/football/matches", {
    "competition_id": "comp_3039",
    "season_id": "sn_3057848",
    "status": "finished",
})
print(f"Downloaded {len(matches)} finished matches")
Each match includes fields like id, utc_date, home_team, away_team, score, status, season_id, competition_id, xg_available, and odds_available.
Save matches to CSV
import pandas as pd

match_rows = []
for match in matches:
    score = match.get("score") or {}
    match_rows.append({
        "match_id": match["id"],
        "utc_date": match["utc_date"],
        "competition_id": match["competition_id"],
        "season_id": match["season_id"],
        "home_team_id": match["home_team"]["id"],
        "home_team": match["home_team"]["name"],
        "away_team_id": match["away_team"]["id"],
        "away_team": match["away_team"]["name"],
        "home_score": score.get("home"),
        "away_score": score.get("away"),
        "status": match["status"],
        "xg_available": match.get("xg_available"),
        "odds_available": match.get("odds_available"),
    })

pd.DataFrame(match_rows).to_csv("premier-league-2024-25-matches.csv", index=False)
At this point you have a normal fixture/results CSV. For modelling, the next step is match stats.
Add match stats and xG
Each stats call is a separate request, so only call the stats endpoint for matches that need it: those where xg_available is true, or where detailed stats matter for your use case.
def flatten_match_stats(match):
    stats = get(f"/football/matches/{match['id']}/stats")["data"]
    overview = stats["overview"]
    return {
        "match_id": match["id"],
        "home_xg": overview["expected_goals"]["all"]["home"],
        "away_xg": overview["expected_goals"]["all"]["away"],
        "home_np_xg": stats.get("np_expected_goals", {}).get("all", {}).get("home"),
        "away_np_xg": stats.get("np_expected_goals", {}).get("all", {}).get("away"),
        "home_shots": overview["total_shots"]["all"]["home"],
        "away_shots": overview["total_shots"]["all"]["away"],
        "home_shots_on_target": overview["shots_on_target"]["all"]["home"],
        "away_shots_on_target": overview["shots_on_target"]["all"]["away"],
        "home_possession": overview["ball_possession"]["all"]["home"],
        "away_possession": overview["ball_possession"]["all"]["away"],
        "home_corners": overview["corner_kicks"]["all"]["home"],
        "away_corners": overview["corner_kicks"]["all"]["away"],
    }


stat_rows = []
for match in matches:
    if not match.get("xg_available"):
        continue
    stat_rows.append(flatten_match_stats(match))
    time.sleep(2)

pd.DataFrame(stat_rows).to_csv("premier-league-2024-25-match-stats.csv", index=False)
Keep match rows and stat rows in separate CSVs. That makes joins explicit and keeps your raw fixtures useful even when some lower-coverage competitions do not have xG.
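The flattening function above indexes nested keys directly, so a match that lacks one stat would raise KeyError and stop the loop. Where coverage varies, a tolerant lookup may be preferable. The sketch below uses a hypothetical dig helper and made-up sample values; the nesting mirrors the overview structure used above.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts, returning `default` once a key or level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up sample shaped like the overview block used above.
overview = {"expected_goals": {"all": {"home": 1.8, "away": 0.9}}}
print(dig(overview, "expected_goals", "all", "home"))
print(dig(overview, "corner_kicks", "all", "home"))
```

Missing stats then land in the CSV as empty cells instead of aborting the download.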
Join the CSVs in pandas
matches_df = pd.read_csv("premier-league-2024-25-matches.csv")
stats_df = pd.read_csv("premier-league-2024-25-match-stats.csv")
df = matches_df.merge(stats_df, on="match_id", how="left")
df["total_goals"] = df["home_score"] + df["away_score"]
df["xg_diff"] = df["home_xg"] - df["away_xg"]
print(df[["home_team", "away_team", "home_score", "away_score", "home_xg", "away_xg"]].head())
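Since the join is a left merge, matches without a stats row keep NaN in the xG columns. To see which matches those are, pandas can flag them during the merge with its indicator and validate parameters. A small sketch with made-up frames that reuse the column names above:

```python
import pandas as pd

# Made-up frames: match m2 has no stats row.
matches_df = pd.DataFrame({"match_id": ["m1", "m2", "m3"], "home_score": [2, 0, 1]})
stats_df = pd.DataFrame({"match_id": ["m1", "m3"], "home_xg": [1.7, 0.8]})

merged = matches_df.merge(
    stats_df,
    on="match_id",
    how="left",
    validate="one_to_one",  # raises if match_id is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
missing_stats = merged.loc[merged["_merge"] == "left_only", "match_id"].tolist()
print(missing_stats)
```

The validate check also catches accidental duplicate rows, for example after re-running a partial download.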
Bulk download multiple seasons
season_ids = ["sn_3057848", "sn_606923", "sn_654318"]
all_matches = []
for season_id in season_ids:
    all_matches.extend(get_all("/football/matches", {
        "competition_id": "comp_3039",
        "season_id": season_id,
        "status": "finished",
    }))
print(f"Downloaded {len(all_matches)} matches across {len(season_ids)} seasons")
For a large backfill, store progress after each page. If the script stops, resume from the last completed season_id and page.
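One way to store and resume progress is a small JSON checkpoint file. The sketch below is an assumption about how that could look, not part of the API: the file layout and the helper names (save_checkpoint, load_checkpoint, resume_plan) are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, season_id, page):
    """Record the last completed page; write-then-rename avoids a torn file."""
    tmp = f"{path}.tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"season_id": season_id, "page": page}, fh)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return saved progress, or an empty dict when no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def resume_plan(season_ids, checkpoint):
    """Return [(season_id, start_page)], skipping work already completed."""
    done_season = checkpoint.get("season_id")
    if done_season not in season_ids:
        return [(season_id, 1) for season_id in season_ids]
    index = season_ids.index(done_season)
    plan = [(done_season, checkpoint["page"] + 1)]
    plan += [(season_id, 1) for season_id in season_ids[index + 1:]]
    return plan


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "progress.json")
    save_checkpoint(path, "sn_606923", 2)
    plan = resume_plan(["sn_3057848", "sn_606923", "sn_654318"], load_checkpoint(path))
print(plan)
```

A checkpoint only helps if each page's rows are written to disk before the checkpoint is saved. Using it would also mean extending get_all to accept a starting page.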
FAQ
What endpoint should I call first?
Call /football/competitions first, then /football/competitions/{competition_id}/seasons. That gives you stable IDs before you request matches.
Can I download everything in one request?
No. Use pagination. per_page supports up to 100 rows, and large historical downloads should loop through competitions, seasons, and pages.
Where does xG come from?
Use /football/matches/{match_id}/stats for aggregate xG and /football/matches/{match_id}/shotmap for shot-level xG.
Should I save JSON or CSV?
Save both if you are building a production pipeline: raw JSON for replay/debugging, CSV or database tables for analysis.
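For the raw JSON side, one common layout is JSON Lines, with one record per line appended as pages arrive. This is a sketch of that idea; the file name, helper names, and sample record are illustrative assumptions.

```python
import json
import os
import tempfile


def append_jsonl(path, records):
    """Append each record as one JSON line so raw payloads can be replayed."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load every non-empty line back into a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmpdir:
    raw_path = os.path.join(tmpdir, "matches.jsonl")
    append_jsonl(raw_path, [{"id": "m1", "status": "finished"}])  # made-up record
    append_jsonl(raw_path, [{"id": "m2", "status": "finished"}])
    replayed = read_jsonl(raw_path)
print(len(replayed))
```

Appending per page keeps memory use flat and lets the CSVs be rebuilt later without calling the API again.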