How a multi-method NLP pipeline achieved 80%+ match rates across tens of millions of loan records
Case Study | Data Engineering | NLP | Python
The mortgage industry is rife with messy, free-text data. Sources include county recorders, servicers, originators, and handwritten paper forms of many kinds. Each source though names institutions differently. A single lender can have multiple titles. For example, "Wells Fargo Bank, NA," "Wells Fargo," or "WFB" might be used.
As a result, mortgage data platforms spend an inordinate amount of time standardizing and normalizing millions of these records. Inconsistent names make it hard to track transactions correctly. One leading mortgage data company faced this challenge and came to Terlina for a solution. Their platform collects transaction data from public records and needed a way to automatically match lender names to their master list of verified companies. Analytical data products, reporting, and market-share calculations all relied on accurate data.
Here's a sample of what the raw lender_beneficiary_source field actually looks like coming in:
| lender_beneficiary_source (raw) |
|---|
us bank na |
jpmorgan chase bank na |
washington state employees cu |
wells fargo bank na |
tum two morgage |
jackson flagstar centre office |
secretary of veterans affairs, an office |
yakim |
The Solution: A Cascading Multi-Method Pipeline
To solve this problem, the Terlina team designed a process for handling this naming variation gracefully. It designed and built a Python-based matching pipeline that runs transactions through three sequential stages, each progressively more sophisticated. The script assigns a confidence score to each candidate match and scales to handle large datasets without requiring match-by-match manual review. The approach prioritizes efficiency: fast, high-confidence methods run first, reserving the more compute-intensive methods for the harder cases that fall through.
Stage 1: Exact Matching
After normalizing both datasets (lowercasing, stripping whitespace and special characters), the pipeline performs an inner join using Dask for distributed processing. Matches found here receive a similarity score of 1.0, resolving a substantial share of total transactions at essentially zero compute cost.
# Load with Dask for scalable processing
transactions_raw = dd.read_csv('dev_transactions.csv')
companies_raw = dd.read_csv('dev_lenders.csv')
# Normalize: lowercase and strip whitespace
def preprocess(df, column):
df[column] = df[column].str.lower().str.strip()
return df
transactions = preprocess(transactions_raw, 'lender_beneficiary_source')
companies = preprocess(companies_raw, 'name')
# Exact join and score
exact_match = dd.merge(
transactions, companies,
left_on='lender_beneficiary_source',
right_on='name',
how='inner'
).compute()
exact_match['similarity_score'] = 1
Result — exact matches (score = 1.0):
| lender_beneficiary_source | Matched Company Name | Score |
|---|---|---|
homestreet bank |
homestreet bank | 1.000 |
adcom group inc |
adcom group inc | 1.000 |
washington state bank |
washington state bank | 1.000 |
directors mortgage inc |
directors mortgage inc. | 1.000 |
global cu |
global credit union | 1.000 |
These records are resolved and removed from the pipeline. Everything else moves to Stage 2.
Stage 2: Fuzzy String Matching
Unmatched transactions go through FuzzyWuzzy's token sort ratio algorithm. By tokenizing and sorting each name before comparison, this method handles word-order variation well — "First National Bank of Oklahoma" and "Oklahoma First National Bank" would score near-perfectly. The algorithm only accepts a match above a configurable threshold (90 by default), keeping precision high.
from fuzzywuzzy import process, fuzz
def fuzzy_match(row, choices, scorer=fuzz.token_sort_ratio, threshold=90):
best_match, best_score = process.extractOne(
row['lender_beneficiary_source'], choices, scorer=scorer
)
if best_score >= threshold:
return pd.Series([best_match, best_score])
else:
return pd.Series([None, 0])
choices = unmatched_companies['name'].tolist()
unmatched_transactions[['best_match_name', 'similarity_score']] = \
unmatched_transactions.apply(fuzzy_match, choices=choices, axis=1)
Result — fuzzy matches (score ≥ 0.90):
| lender_beneficiary_source | Matched Company Name | Score |
|---|---|---|
united american mortgage corp |
united american mortgage corporation | 0.937 |
priority home lending llc |
priority home lending, llc | 0.980 |
lennar mortgage llc |
lennar mortgage, llc | 0.970 |
pennymac loan services llc |
pennymac loan services, llc | 0.980 |
pmac lending services inc |
park lending services, inc. | 0.900 |
These are the "almost exact" matches — legal suffix variations, minor abbreviations, punctuation differences — that exact joining misses but fuzzy scoring catches cleanly.
Stage 3: TF-IDF Cosine Similarity with Character N-Grams
The toughest cases — abbreviated names, partial matches, OCR-corrupted strings — go to a TF-IDF vectorizer configured with character-level trigrams rather than word tokens. This is a key design choice: character n-grams capture sub-word patterns that word-level models miss entirely. Cosine similarity scores then determine the best candidate for each remaining unmatched transaction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import regex as re
# Character-level trigram generator
def ngrams(string, n=3):
string = re.sub(r'[)(",.|[\]{}\'&,-./]', '', string.lower().strip())
string = re.sub(' +', ' ', string).strip()
return [''.join(gram) for gram in zip(*[string[i:] for i in range(n)])]
def tfidf_cosine_similarity(unmatched_transactions, unmatched_companies):
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tfidf_transactions = vectorizer.fit_transform(
unmatched_transactions['lender_beneficiary_source']
)
tfidf_companies = vectorizer.transform(unmatched_companies['name'])
cosine_similarities = cosine_similarity(tfidf_transactions, tfidf_companies)
best_matches = cosine_similarities.argmax(axis=1)
best_scores = cosine_similarities.max(axis=1)
return pd.DataFrame({
'lender_beneficiary_source': unmatched_transactions['lender_beneficiary_source'].values,
'best_match_name': unmatched_companies.iloc[best_matches]['name'].values,
'similarity_score': best_scores
})
tfidf_matches = tfidf_cosine_similarity(unmatched_after_fuzzy, unmatched_companies)
Result — TF-IDF matches (score 0.70–0.89):
| lender_beneficiary_source | Matched Company Name | Score |
|---|---|---|
jpmorgan chase bank na |
jpmorgan chase bank, national association | 0.865 |
wells fargo bank na |
wells fargo bank, national association | 0.867 |
cornerstone home lending |
cornerstone lending inc. | 0.857 |
potlatch no 1 fcu |
potlatch no.1 financial credit union | 0.519 |
third fsla of cleveland |
bank of cleveland | 0.726 |
The character trigram approach is what allows the pipeline to connect jpmorgan chase bank na to its full legal name — a match that purely word-based methods would score much lower.
Stage 4: Combine and Save
After all three stages, results are concatenated into a single output with a consistent similarity score column across methods, then saved for downstream use.
final_result = pd.concat(
[exact_match, fuzzy_wuzzy, tfidf_matches],
ignore_index=True
)
final_result.to_csv('lender_matches.csv', index=False)
Final combined output (sample):
| lender_beneficiary_source | Matched Company Name | Score | Method |
|---|---|---|---|
homestreet bank |
homestreet bank | 1.000 | Exact |
lennar mortgage llc |
lennar mortgage, llc | 0.970 | Fuzzy |
wells fargo bank na |
wells fargo bank, national association | 0.867 | TF-IDF |
cornerstone home lending |
cornerstone lending inc. | 0.857 | TF-IDF |
jackson flagstar centre office |
d.j. jackson, inc. | 0.464 | Unresolved |
Levenshtein: juice not worth the squeeze. The pipeline includes scaffolding for Levenshtein distance scoring, which is excellent for detecting misspellings. In reality though, computing pairwise Levenshtein across large datasets is memory-intensive for marginal gain once TF-IDF is in play.
Results
The final version of the pipeline successfully resolves matches for over 80% of lender beneficiary sources in the transaction dataset. That represents a substantial improvement over what exact matches alone achieve, and dramatically reduces the volume of records requiring manual review or write-off.
| Score Band | Range | Count (sample) | Notes |
|---|---|---|---|
| Exact | 1.0 | 152 | High confidence, no review needed |
| Fuzzy | 0.90–0.99 | 157 | High confidence, spot-check recommended |
| TF-IDF | 0.70–0.89 | 37 | Review borderline cases |
| Low / Unresolved | < 0.70 | 23 | Manual review or write-off |
Each match gets a similarity score, and results are output in both CSV and JSON formats — giving the client's data team flexibility to apply their own downstream thresholds, audit borderline matches, or feed scores into other analytical models.
Lessons Learned
Cascade design works better than a single method. Run easy cases through fast methods. This keeps compute costs low. You can then tune each stage's threshold separately.
Character n-grams are often overlooked for financial entity names. Most NLP tutorials use word tokens. However, matching company names greatly benefits from character-level analysis. After all, abbreviations and legal suffixes are pretty common.
Scoring and transparency are as important as the match rate. A similarity score with each match helps users decide which ones to trust. It also creates a natural audit trail.
Start distributed. Use Dask instead of pandas from the beginning. This avoids difficult rewrites as data volumes increase.
Vector databases offer enormous potential in data pipelines like this. An "approximate nearest neighbor" approach using a similar method to LLMs remains an option for future enhancements. However, the overhead and compute power required to run them often outweighs more traditional NLP approaches like this.
If your organization is dealing with messy entity matching challenges in financial, healthcare, or operational data, I'd be glad to talk through the problem. Reach out at hello@terlina.com.