How a multi-method NLP pipeline achieved 80%+ match rates across tens of millions of loan records

Case Study | Data Engineering | NLP | Python


The mortgage industry is rife with messy, free-text data. Sources include county recorders, servicers, originators, and handwritten paper forms of many kinds. Each source though names institutions differently. A single lender can have multiple titles. For example, "Wells Fargo Bank, NA," "Wells Fargo," or "WFB" might be used.

As a result, mortgage data platforms spend an inordinate amount of time standardizing and normalizing millions of these records. Inconsistent names make it hard to track transactions correctly. One leading mortgage data company faced this challenge and came to Terlina for a solution. Their platform collects transaction data from public records and needed a way to automatically match lender names to their master list of verified companies. Analytical data products, reporting, and market-share calculations all relied on accurate data.

Here's a sample of what the raw lender_beneficiary_source field actually looks like coming in:

lender_beneficiary_source (raw)
us bank na
jpmorgan chase bank na
washington state employees cu
wells fargo bank na
tum two morgage
jackson flagstar centre office
secretary of veterans affairs, an office
yakim

The Solution: A Cascading Multi-Method Pipeline

To solve this problem, the Terlina team designed a process for handling this naming variation gracefully. It designed and built a Python-based matching pipeline that runs transactions through three sequential stages, each progressively more sophisticated. The script assigns a confidence score to each candidate match and scales to handle large datasets without requiring match-by-match manual review. The approach prioritizes efficiency: fast, high-confidence methods run first, reserving the more compute-intensive methods for the harder cases that fall through.

Stage 1: Exact Matching

After normalizing both datasets (lowercasing, stripping whitespace and special characters), the pipeline performs an inner join using Dask for distributed processing. Matches found here receive a similarity score of 1.0, resolving a substantial share of total transactions at essentially zero compute cost.

# Load with Dask for scalable processing
transactions_raw = dd.read_csv('dev_transactions.csv')
companies_raw = dd.read_csv('dev_lenders.csv')

# Normalize: lowercase and strip whitespace
def preprocess(df, column):
    df[column] = df[column].str.lower().str.strip()
    return df

transactions = preprocess(transactions_raw, 'lender_beneficiary_source')
companies = preprocess(companies_raw, 'name')

# Exact join and score
exact_match = dd.merge(
    transactions, companies,
    left_on='lender_beneficiary_source',
    right_on='name',
    how='inner'
).compute()

exact_match['similarity_score'] = 1

Result — exact matches (score = 1.0):

lender_beneficiary_source Matched Company Name Score
homestreet bank homestreet bank 1.000
adcom group inc adcom group inc 1.000
washington state bank washington state bank 1.000
directors mortgage inc directors mortgage inc. 1.000
global cu global credit union 1.000

These records are resolved and removed from the pipeline. Everything else moves to Stage 2.


Stage 2: Fuzzy String Matching

Unmatched transactions go through FuzzyWuzzy's token sort ratio algorithm. By tokenizing and sorting each name before comparison, this method handles word-order variation well — "First National Bank of Oklahoma" and "Oklahoma First National Bank" would score near-perfectly. The algorithm only accepts a match above a configurable threshold (90 by default), keeping precision high.

from fuzzywuzzy import process, fuzz

def fuzzy_match(row, choices, scorer=fuzz.token_sort_ratio, threshold=90):
    best_match, best_score = process.extractOne(
        row['lender_beneficiary_source'], choices, scorer=scorer
    )
    if best_score >= threshold:
        return pd.Series([best_match, best_score])
    else:
        return pd.Series([None, 0])

choices = unmatched_companies['name'].tolist()
unmatched_transactions[['best_match_name', 'similarity_score']] = \
    unmatched_transactions.apply(fuzzy_match, choices=choices, axis=1)

Result — fuzzy matches (score ≥ 0.90):

lender_beneficiary_source Matched Company Name Score
united american mortgage corp united american mortgage corporation 0.937
priority home lending llc priority home lending, llc 0.980
lennar mortgage llc lennar mortgage, llc 0.970
pennymac loan services llc pennymac loan services, llc 0.980
pmac lending services inc park lending services, inc. 0.900

These are the "almost exact" matches — legal suffix variations, minor abbreviations, punctuation differences — that exact joining misses but fuzzy scoring catches cleanly.


Stage 3: TF-IDF Cosine Similarity with Character N-Grams

The toughest cases — abbreviated names, partial matches, OCR-corrupted strings — go to a TF-IDF vectorizer configured with character-level trigrams rather than word tokens. This is a key design choice: character n-grams capture sub-word patterns that word-level models miss entirely. Cosine similarity scores then determine the best candidate for each remaining unmatched transaction.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import regex as re

# Character-level trigram generator
def ngrams(string, n=3):
    string = re.sub(r'[)(",.|[\]{}\'&,-./]', '', string.lower().strip())
    string = re.sub(' +', ' ', string).strip()
    return [''.join(gram) for gram in zip(*[string[i:] for i in range(n)])]

def tfidf_cosine_similarity(unmatched_transactions, unmatched_companies):
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)

    tfidf_transactions = vectorizer.fit_transform(
        unmatched_transactions['lender_beneficiary_source']
    )
    tfidf_companies = vectorizer.transform(unmatched_companies['name'])

    cosine_similarities = cosine_similarity(tfidf_transactions, tfidf_companies)
    best_matches = cosine_similarities.argmax(axis=1)
    best_scores = cosine_similarities.max(axis=1)

    return pd.DataFrame({
        'lender_beneficiary_source': unmatched_transactions['lender_beneficiary_source'].values,
        'best_match_name': unmatched_companies.iloc[best_matches]['name'].values,
        'similarity_score': best_scores
    })

tfidf_matches = tfidf_cosine_similarity(unmatched_after_fuzzy, unmatched_companies)

Result — TF-IDF matches (score 0.70–0.89):

lender_beneficiary_source Matched Company Name Score
jpmorgan chase bank na jpmorgan chase bank, national association 0.865
wells fargo bank na wells fargo bank, national association 0.867
cornerstone home lending cornerstone lending inc. 0.857
potlatch no 1 fcu potlatch no.1 financial credit union 0.519
third fsla of cleveland bank of cleveland 0.726

The character trigram approach is what allows the pipeline to connect jpmorgan chase bank na to its full legal name — a match that purely word-based methods would score much lower.


Stage 4: Combine and Save

After all three stages, results are concatenated into a single output with a consistent similarity score column across methods, then saved for downstream use.

final_result = pd.concat(
    [exact_match, fuzzy_wuzzy, tfidf_matches],
    ignore_index=True
)

final_result.to_csv('lender_matches.csv', index=False)

Final combined output (sample):

lender_beneficiary_source Matched Company Name Score Method
homestreet bank homestreet bank 1.000 Exact
lennar mortgage llc lennar mortgage, llc 0.970 Fuzzy
wells fargo bank na wells fargo bank, national association 0.867 TF-IDF
cornerstone home lending cornerstone lending inc. 0.857 TF-IDF
jackson flagstar centre office d.j. jackson, inc. 0.464 Unresolved

Levenshtein: juice not worth the squeeze. The pipeline includes scaffolding for Levenshtein distance scoring, which is excellent for detecting misspellings. In reality though, computing pairwise Levenshtein across large datasets is memory-intensive for marginal gain once TF-IDF is in play.


Results

The final version of the pipeline successfully resolves matches for over 80% of lender beneficiary sources in the transaction dataset. That represents a substantial improvement over what exact matches alone achieve, and dramatically reduces the volume of records requiring manual review or write-off.

Score Band Range Count (sample) Notes
Exact 1.0 152 High confidence, no review needed
Fuzzy 0.90–0.99 157 High confidence, spot-check recommended
TF-IDF 0.70–0.89 37 Review borderline cases
Low / Unresolved < 0.70 23 Manual review or write-off

Each match gets a similarity score, and results are output in both CSV and JSON formats — giving the client's data team flexibility to apply their own downstream thresholds, audit borderline matches, or feed scores into other analytical models.


Lessons Learned

Cascade design works better than a single method. Run easy cases through fast methods. This keeps compute costs low. You can then tune each stage's threshold separately.

Character n-grams are often overlooked for financial entity names. Most NLP tutorials use word tokens. However, matching company names greatly benefits from character-level analysis. After all, abbreviations and legal suffixes are pretty common.

Scoring and transparency are as important as the match rate. A similarity score with each match helps users decide which ones to trust. It also creates a natural audit trail.

Start distributed. Use Dask instead of pandas from the beginning. This avoids difficult rewrites as data volumes increase.

Vector databases offer enormous potential in data pipelines like this. An "approximate nearest neighbor" approach using a similar method to LLMs remains an option for future enhancements. However, the overhead and compute power required to run them often outweighs more traditional NLP approaches like this.


If your organization is dealing with messy entity matching challenges in financial, healthcare, or operational data, I'd be glad to talk through the problem. Reach out at hello@terlina.com.

← Back to Blog