A step-by-step tutorial for building a model that predicts job compensation from role, skills, seniority, and location – a powerful capability for job boards, HR-tech products, compensation teams, and candidates. Built with open job-market data available on Kaggle.

The problem: pay is opaque, and it costs everyone

Compensation is one of the highest-stakes, lowest-transparency decisions in the labor market. Candidates accept offers without knowing if they are fair. Recruiters lose deals over misaligned bands. Compensation teams spend weeks assembling benchmarks from expensive, stale surveys. And product teams at job boards and HR-tech companies are repeatedly asked the same question: what should this role pay?

A salary prediction engine answers that question instantly. Given a job’s title, required skills, seniority, work model, and location, it estimates a market-aligned pay range. In this guide you will build one end to end with machine learning – and see how to extend it into a production capability.

What you will build

  • A trained regression model that predicts the midpoint annual salary for a job posting
  • A reusable feature pipeline over skills, seniority, work model, and geography
  • An evaluation you can trust, plus a clear path to production and richer extensions

The whole thing runs in minutes on a laptop. The only prerequisite is a corpus of real postings that include disclosed pay – which is where open job-market data comes in.

The fuel: real postings with disclosed pay

A salary model is only as good as its training data. You need a large, diverse set of real job postings where compensation is disclosed and the role is described in structured terms – skills, seniority, employment type, and location – not just free text. For this guide we use an open dataset of ~112,000 postings across 23 applicant tracking systems, of which roughly 52,000 carry structured salary ranges, all parsed into typed columns. It is free under CC BY 4.0 and available on Kaggle.

Step 1 – Load and frame the target

Load the postings and build a clean prediction target: the midpoint of the annual salary range, log-transformed because pay distributions are heavily right-skewed.

import numpy as np
import pandas as pd

# download the Parquet from the Kaggle dataset page, then:
df = pd.read_parquet("nextgig_jobs_2026-06.parquet")

# keep credible annual salaries only
m = (df["salary_rate_unit"] == "year") & df["salary_min"].notna() & df["salary_max"].notna()
sal = df[m].copy()
sal["salary_mid"] = (sal["salary_min"] + sal["salary_max"]) / 2
# drop implausible midpoints; copy so later column assignments do not hit a view
sal = sal[(sal["salary_mid"] > 15000) & (sal["salary_mid"] < 500000)].copy()
# log1p compresses the right tail; np.expm1 inverts it at prediction time
sal["y"] = np.log1p(sal["salary_mid"])
print(len(sal), "training rows")
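
To see why the log transform helps, here is a minimal sketch on synthetic data. The salaries below are drawn from a made-up log-normal distribution, not from the dataset, and the skewness helper is our own illustration. It compares the skew of the raw and log-transformed target and confirms that np.expm1 recovers the original dollars.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, made-up salaries (assumption: roughly log-normal, like real pay data)
salary_mid = rng.lognormal(mean=11.5, sigma=0.5, size=10_000)

def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

y = np.log1p(salary_mid)        # the training target
raw_skew = skewness(salary_mid)
log_skew = skewness(y)
roundtrip = np.expm1(y)         # back to dollars

print(f"skew raw: {raw_skew:.2f}, skew log: {log_skew:.2f}")
```

A skew near zero on the log scale means the model is not dominated by a handful of very high salaries.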

Step 2 – Engineer features that drive pay

Compensation is driven by what you can do (skills), how senior you are, how you work (remote vs. on-site), and where the job is. Turn those into model features: the most common skills become binary flags, and categorical attributes are one-hot encoded.

import json
from collections import Counter

def parse_list(x):
    # skills are stored as JSON-encoded lists; anything unparseable becomes []
    try: return json.loads(x) if isinstance(x, str) else []
    except Exception: return []

sal["skills"] = sal["skills_required"].apply(parse_list)
# count how often each skill appears and keep the 150 most common
c = Counter()
for s in sal["skills"]: c.update(s)
top_skills = [s for s,_ in c.most_common(150)]
# one binary flag per top skill
for s in top_skills:
    sal[f"skill_{s}"] = sal["skills"].apply(lambda lst: int(s in lst))

# one-hot encode the categorical attributes, treating missing values as 'unknown'
cat = ['experience_level','job_level_normalized','employment_type',
       'work_model','country','function']
X = pd.get_dummies(sal[cat].fillna('unknown'), dummy_na=False)
X = pd.concat([X, sal[[f'skill_{s}' for s in top_skills]]], axis=1)
y = sal['y']
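
The skill-flag step is easier to follow on a tiny example. The sketch below uses a four-row toy frame with made-up values (only the skills_required column name comes from the guide) and builds all flags in one DataFrame instead of adding columns one at a time, which is one way to avoid the fragmentation warnings pandas can emit when many columns are inserted individually.

```python
import json
from collections import Counter

import pandas as pd

# Toy postings (made-up): two valid JSON lists, one missing value, one bad string
toy = pd.DataFrame(
    {"skills_required": ['["Python", "AWS"]', '["Python"]', None, "not json"]}
)

def parse_list(x):
    """Parse a JSON-encoded list; fall back to an empty list on any problem."""
    try:
        return json.loads(x) if isinstance(x, str) else []
    except Exception:
        return []

skills = toy["skills_required"].apply(parse_list)

counts = Counter()
for lst in skills:
    counts.update(lst)
top_skills = [s for s, _ in counts.most_common(2)]

# Build every flag column at once, then attach them in a single step if needed
flags = pd.DataFrame(
    {f"skill_{s}": skills.apply(lambda lst, s=s: int(s in lst)) for s in top_skills},
    index=toy.index,
)
print(flags)
```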

Step 3 – Train the model

Gradient-boosted trees are an excellent default for this kind of tabular, mixed-feature problem. Train on 80% of the data and hold out 20% to evaluate.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
model = HistGradientBoostingRegressor(max_iter=500, learning_rate=0.06)
model.fit(Xtr, ytr)

Step 4 – Evaluate in real dollars

Invert the log transform so the error is interpretable as dollars, then report mean absolute error in dollars and R-squared on the log scale the model was trained on.

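pred = np.expm1(model.predict(Xte))
true = np.expm1(yte)
# f-strings support the thousands separator; '%,.0f' with the % operator raises ValueError
print(f'MAE : ${mean_absolute_error(true, pred):,.0f}')
# R-squared is computed on the log scale, matching the training target
print(f'R2  : {r2_score(np.log1p(true), np.log1p(pred)):.3f}')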

Inspect feature importance or SHAP values to see which skills and levels move pay the most – often the most valuable output of the whole exercise for compensation and talent teams.

Step 5 – Predict for a new role

Score an unseen posting by building the same feature vector and calling the model.

def predict_salary(experience_level, job_level, employment_type, work_model,
                   country, function, skills):
    # start from an all-zero row that has exactly the training columns
    row = {c: 0 for c in X.columns}
    # switch on the one-hot column for each categorical value;
    # values never seen in training are silently ignored
    for col in ['experience_level_' + experience_level,
                'job_level_normalized_' + job_level,
                'employment_type_' + employment_type,
                'work_model_' + work_model,
                'country_' + country, 'function_' + function]:
        if col in row: row[col] = 1
    for s in skills:
        if f'skill_{s}' in row: row[f'skill_{s}'] = 1
    xv = pd.DataFrame([row])[X.columns]
    # invert the log transform to return dollars
    return float(np.expm1(model.predict(xv))[0])

print(predict_salary('Senior', 'Senior', 'Full-time', 'Remote', 'United States',
                     'Engineering', ['Python', 'Kubernetes', 'AWS']))
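
An alternative way to line up a new posting with the training columns is to one-hot encode it and reindex against the training matrix. This is a sketch on a toy frame with made-up values, not the guide's method: categories the model never saw (France below) are dropped, and columns missing from the new row are filled with zeros.

```python
import pandas as pd

# Toy training frame (made-up values) and its one-hot encoding
train = pd.DataFrame({
    "work_model": ["Remote", "On-site", "Hybrid"],
    "country": ["United States", "Germany", "United States"],
})
X = pd.get_dummies(train)

# A new posting with one unseen category
new = pd.DataFrame({"work_model": ["Remote"], "country": ["France"]})

# Align to the training columns: unseen columns are dropped, missing ones become 0
xv = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0).astype(int)
print(xv.iloc[0].to_dict())
```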

From notebook to production

  • Wrap the model in an API endpoint and return a range (e.g., P25-P75) rather than a single number.
  • Add geospatial features (latitude/longitude) and cost-of-living adjustments for location precision.
  • Blend in TF-IDF or embeddings over the role summary to capture nuance the structured fields miss.
  • Retrain on a fresh snapshot periodically so the model tracks the market.
  • Calibrate uncertainty so the product can say how confident the estimate is.

The same playbook, other high-value use cases

The load-features-train pattern generalizes. With the same structured job data you can build:

Skills-demand intelligence

Quantify which skills are rising or fading, cluster skill bundles by role, and power workforce-planning dashboards.
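
As a minimal sketch of the "rising or fading" idea, you can compare the share of postings that mention each skill across two snapshots. The two lists below are made-up stand-ins for parsed skill lists from an earlier and a later snapshot, and skill_share is a hypothetical helper.

```python
from collections import Counter

# Hypothetical parsed skill lists from two snapshots (made-up data)
earlier = [["Python", "SQL"], ["Java"], ["Python", "AWS"], ["SQL"]]
later = [["Python", "AWS"], ["Python", "Kubernetes"], ["AWS"], ["Python", "SQL"]]

def skill_share(postings):
    """Fraction of postings that mention each skill at least once."""
    counts = Counter(skill for skills in postings for skill in set(skills))
    return {skill: n / len(postings) for skill, n in counts.items()}

before, after = skill_share(earlier), skill_share(later)

# Positive values are rising skills, negative values are fading ones
change = {s: after.get(s, 0.0) - before.get(s, 0.0) for s in set(before) | set(after)}
for skill, delta in sorted(change.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{skill:<12}{delta:+.2f}")
```

Comparing snapshots sidesteps the sparse date_posted field, at the cost of coarser time resolution.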

Seniority and title normalization

Train classifiers that map messy titles to a clean taxonomy and infer seniority – a core need in recruiting pipelines.

Remote-vs-onsite and category classifiers

Predict work model, function, or whether a role is an ‘AI job’ to enrich search and filtering.

Resume-to-job matching

Use skills, responsibilities, and geography as features for semantic matching and recommendations.

LLM extraction fine-tuning

Use the paired posting/structured-output format to distill a large model’s extraction behavior into a fast, cheap, self-hosted one.

Get the data and start building

The open dataset used in this guide is free under CC BY 4.0 and available on Kaggle.

A note on responsible use: salaries are disclosed on a subset of roles, date_posted is sparse, and geography is approximate – treat outputs as market estimates, not ground truth. The dataset is built for modeling and research, not for re-scraping or powering a live job board.
