Using TF-IDF and k-nearest neighbors to automate compensation survey matching in HR

Portfolio | Python | NLP | Human Resources

February 2024


HR professionals routinely match jobs across multiple compensation surveys to determine pay ranges for related job codes. The process is manual, time-consuming, and heavily dependent on human judgment, and it scales poorly as the number of roles and surveys grows. One practical way to modernize it is to build a k-nearest neighbors model with scikit-learn that fuzzy-matches jobs automatically and attaches a quantified confidence score to every result.

Here's how it works.


Setup

Start by importing the relevant libraries. The core tools are scikit-learn for the vectorizer and nearest neighbors model, pandas for data wrangling, and ftfy for cleaning encoding artifacts common in HR data exports.

import pandas as pd
import numpy as np
import regex as re
from ftfy import fix_text
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sklearn.utils import shuffle
import seaborn as sns
import warnings

pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format
np.set_printoptions(suppress=True)
warnings.filterwarnings("ignore")



Load and Preview the Survey Data

We load two compensation survey CSVs, reading them with Microsoft's cp1252 (Windows-1252) encoding, which is common in exports from HR systems, and preview a random sample (shown here for Survey A).

survey_a = pd.read_csv('survey_a_jobs.csv', encoding='cp1252')
survey_b = pd.read_csv('survey_b_jobs.csv', encoding='cp1252')

preview_a = shuffle(survey_a)
preview_a['Description'] = preview_a['Description'].str[:50]
preview_a = preview_a.head(n=3)
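
To see why the encoding argument matters, here is a minimal sketch using only the standard library. The job title is a made-up example, not a row from either survey; it contains an en dash, a character that cp1252 stores as a single byte that is not valid UTF-8 on its own.

```python
# Hypothetical job title containing an en dash, as a Windows export might write it.
title = "Sr. Analyst \u2013 Payroll"
raw_bytes = title.encode("cp1252")  # the en dash becomes the single byte 0x96

# Decoding with the matching codec round-trips cleanly.
decoded_ok = raw_bytes.decode("cp1252")

# Decoding the same bytes as UTF-8 fails, because 0x96 is not a valid start byte.
try:
    raw_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(decoded_ok, utf8_failed)
```

If a file fails to load with cp1252, or loads with visibly garbled characters, the export probably used a different encoding and the argument should be adjusted.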

A quick count of jobs per family in each survey checks that the two are structured similarly, a prerequisite before running any matching model. The plot below covers Survey A; the same call on survey_b gives the comparison.

g = sns.catplot(x="Family", data=survey_a, kind="count",
                order=survey_a.Family.value_counts().index,
                height=3, aspect=2)
g.set_axis_labels("", "Count of Jobs - Survey A")
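
The same check can be done numerically. The sketch below is one possible approach, not part of the original workflow: it puts each survey's family shares side by side and flags families that appear in only one survey. The two small dataframes are invented stand-ins for survey_a and survey_b.

```python
import pandas as pd

# Invented stand-ins for the real surveys; only the "Family" column is needed.
survey_a = pd.DataFrame({"Family": ["Finance", "Finance", "IT", "HR"]})
survey_b = pd.DataFrame({"Family": ["Finance", "IT", "IT", "Legal"]})

# Share of jobs per family in each survey, aligned on the family name.
family_share = pd.concat(
    [
        survey_a["Family"].value_counts(normalize=True).rename("survey_a"),
        survey_b["Family"].value_counts(normalize=True).rename("survey_b"),
    ],
    axis=1,
).fillna(0.0)

# Families missing from one survey cannot be matched well across surveys.
only_in_one = sorted(family_share[(family_share == 0).any(axis=1)].index)

print(family_share)
print(only_in_one)
```

Families that show up in only one survey are worth a look before matching, since their jobs have no natural counterpart on the other side.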

Build the Match Column and N-Gram Function

We create a composite match column by concatenating the job title and description. This gives the model more signal than job title alone, especially for roles with generic titles but distinctive descriptions.

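The n-gram function cleans and normalizes the string before generating character trigrams (overlapping three-character slices). Trigram overlap makes the match tolerant of typos, abbreviations, and small wording differences. Note that it measures lexical similarity, not the semantic similarity that embedding-based vector databases target.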

survey_a['match_column'] = survey_a['Job Title'] + '_' + survey_a['Description']
survey_b['match_column'] = survey_b['Job Title'] + '_' + survey_b['Description']

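def ngrams(string, n=3):
    """Clean a job string and split it into overlapping character n-grams."""
    string = str(string)
    string = fix_text(string)  # repair mojibake left over from the export
    string = string.encode("ascii", errors="ignore").decode()  # drop non-ASCII characters
    string = string.lower()
    # Strip punctuation that carries no matching signal (hyphens included)
    chars_to_remove = [")", "(", ".", "|", "[", "]", "{", "}", "'", ":", "-"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and').replace(',', ' ')
    string = string.replace('_', ' ')  # the title/description separator becomes a space
    string = string.title()  # capitalize each word
    string = re.sub(' +', ' ', string).strip()  # collapse repeated spaces
    string = ' ' + string + ' '  # pad so word boundaries form their own n-grams
    string = re.sub(r'[,-./]|\sBD', r'', string)  # drop leftover separators and the ' BD' token
    # Slide a window of width n across the cleaned string
    grams = zip(*[string[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]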
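
To make the trigram idea concrete, here is a stripped-down sketch that skips the cleaning steps and only does the padding and windowing. The helper name char_ngrams and the sample title are illustrative, not part of the original code.

```python
def char_ngrams(text, n=3):
    """Pad the text with spaces and return its overlapping character n-grams."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("Hr Analyst")
print(grams)
# [' Hr', 'Hr ', 'r A', ' An', 'Ana', 'nal', 'aly', 'lys', 'yst', 'st ']
```

Two titles that differ by a typo or an abbreviation still share most of these slices, which is what lets the model match them.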

Vectorize and Run the Nearest Neighbors Model

This step follows the same vectorize-then-search pattern used by vector databases in generative AI applications, with TF-IDF weights standing in for learned embeddings. Term frequency / inverse document frequency (TF-IDF) converts each job string into a numeric vector, and k-nearest neighbors then finds the closest match in Survey A for every job in Survey B.

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)

unique_a = survey_a['match_column'].unique().astype('U')
unique_b = survey_b['match_column'].unique().astype('U')

tfidf = vectorizer.fit_transform(unique_a)
nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)

def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

distances, indices = getNearestN(unique_b)
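
The pipeline above can be tried end to end on a toy example. This is a sketch with invented job titles and a simplified analyzer (char_trigrams is a hypothetical stand-in for the ngrams function). It also checks one useful property: TfidfVectorizer L2-normalizes each row by default and NearestNeighbors defaults to Euclidean distance, so a distance d corresponds to a cosine similarity of 1 - d**2 / 2.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def char_trigrams(text):
    """Simplified analyzer: lowercase, pad, and slice into character trigrams."""
    padded = " " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# Invented titles standing in for the two surveys.
titles_a = ["Software Engineer", "Payroll Specialist", "HR Business Partner"]
titles_b = ["Sr Software Engineer", "Payroll Analyst"]

vectorizer = TfidfVectorizer(analyzer=char_trigrams)
tfidf_a = vectorizer.fit_transform(titles_a)        # fit the vocabulary on Survey A
nbrs = NearestNeighbors(n_neighbors=1).fit(tfidf_a)

tfidf_b = vectorizer.transform(titles_b)            # reuse that vocabulary for Survey B
distances, indices = nbrs.kneighbors(tfidf_b)

for query, dist, idx in zip(titles_b, distances[:, 0], indices[:, 0]):
    print(f"{query!r} -> {titles_a[idx]!r} (distance {dist:.2f})")

# With unit-length rows, Euclidean distance maps directly to cosine similarity.
cosine_from_distance = 1 - distances[:, 0] ** 2 / 2
cosine_direct = (tfidf_b @ tfidf_a.T).toarray()[np.arange(len(titles_b)), indices[:, 0]]
```

Because of that relationship, the raw distances always fall between 0 (identical trigram profile) and about 1.41 (no trigrams in common), which gives an absolute yardstick alongside the batch-relative score computed later.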

Build the Results and Normalize the Confidence Score

We collect the matches into a dataframe, then min-max normalize the distances and invert them so that higher values indicate better matches (raw distance runs the other way: the smaller it is, the more similar the two jobs).

matches = []
for i, j in enumerate(indices):
    temp = [round(distances[i][0], 2), unique_a[j][0], unique_b[i]]
    matches.append(temp)

matches = pd.DataFrame(matches)
matches = matches.rename({0: "match_confidence",
                           1: "match_column_a",
                           2: "match_column_b"}, axis=1)

# Merge back with original survey data
matches_a = survey_a.merge(matches, left_on='match_column',
                            right_on='match_column_a', how='left')
# Rename Survey B's key first so the merge does not create two 'match_column_b' columns
matches_all = matches_a.merge(
    survey_b.rename(columns={'match_column': 'match_column_b'}),
    on='match_column_b', how='left', suffixes=('', '_b'))

# Normalize: min-max scale the distances, then flip so higher = better match
dist = matches_all['match_confidence']
matches_all['match_confidence'] = round(
    1 - (dist - dist.min()) / (dist.max() - dist.min()),
    3
)
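
The flip is easy to get wrong, so here is a small worked sketch on three made-up distances. The helper name distance_to_confidence is hypothetical, and returning 1.0 when every distance is identical is an assumption made only to avoid dividing by zero.

```python
import pandas as pd

def distance_to_confidence(distances: pd.Series) -> pd.Series:
    """Min-max scale distances to [0, 1] and invert, so 1.0 is the closest match."""
    spread = distances.max() - distances.min()
    if spread == 0:
        # Assumption: with no spread there is nothing to rank, so treat all as equal.
        return pd.Series(1.0, index=distances.index)
    return (1 - (distances - distances.min()) / spread).round(3)

raw = pd.Series([0.2, 0.6, 1.0])   # made-up distances: smaller means more similar
confidence = distance_to_confidence(raw)
print(confidence.tolist())
```

The smallest distance maps to 1.0 and the largest to 0.0. The score is relative to the batch: a confidence of 1.0 means "best match in this run", not "perfect match".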

Output

The final file gives HR teams a ranked list of cross-survey job matches with a normalized confidence score — enabling them to accept high-confidence matches automatically and focus manual review on borderline cases.

# Column names follow the survey files used above; 'Job Code' is assumed to exist in both
preview = matches_all[["Job Code", "Job Title", "Description",
                         "Job Code_b", "Job Title_b", "Description_b",
                         "match_confidence"]]

preview = preview.sort_values(by=['match_confidence'], ascending=False)
preview.to_csv('best_match_score.csv', index=False)
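
One way to act on the score is to bucket matches by threshold. The sketch below is illustrative only: the cutoffs of 0.5 and 0.8 and the sample rows are hypothetical, and real thresholds should be calibrated against matches a compensation analyst has already reviewed.

```python
import pandas as pd

# Hypothetical cutoffs; tune them on a hand-reviewed sample.
NEEDS_REVIEW = 0.5
AUTO_ACCEPT = 0.8

# Made-up scored matches standing in for the real output.
scored = pd.DataFrame({
    "Job Title": ["Payroll Analyst", "HR Generalist", "Data Engineer"],
    "match_confidence": [0.95, 0.62, 0.31],
})

# Bins are right-inclusive: [0, 0.5], (0.5, 0.8], (0.8, 1.0].
scored["action"] = pd.cut(
    scored["match_confidence"],
    bins=[0.0, NEEDS_REVIEW, AUTO_ACCEPT, 1.0],
    labels=["no match", "manual review", "auto-accept"],
    include_lowest=True,
)

print(scored)
```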

With this output, pay ranges from Survey B can be mapped to Survey A job codes, with manual review reserved for the lower-confidence matches. That cuts the time HR teams spend on manual survey matching.

Thanks to Josh Taylor for the foundational approach.
