Fuzzy Match Invoices using Python & AI

⚡ TL;DR

GitHub Copilot enables Internal Auditors to automate detailed invoice reconciliation by generating Python scripts for fuzzy matching. This workflow reduces manual sampling risk and identifies near-match anomalies in minutes using the `thefuzz` library.

For Internal Auditors, the era of random sampling and manual VLOOKUPs is ending. One of the most common auditing pain points is reconciling invoice numbers between internal ledgers and bank statements or vendor lists. Human error often results in typos (e.g., "INV-2023" vs "INV-2O23") that exact-match functions in Excel miss entirely. By leveraging GitHub Copilot, auditors can deploy powerful Python scripts to perform "fuzzy matching"—identifying text strings that are approximately equal—without needing a Computer Science degree.

⏱️ Time to Complete: 15 minutes | 📊 Difficulty: Intermediate | 🛠️ Tool: GitHub Copilot & VS Code

Why This Workflow Matters

Traditional Excel matching fails when data isn't perfect, leaving "near-matches" unchecked and potentially hiding duplicate payments or fraud. This workflow allows an Internal Auditor to test 100% of a population for anomalies rather than relying on a small sample. You will move from checking 50 invoices manually to analyzing 50,000 invoices automatically, catching subtle errors that would otherwise cost the company money.
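To make the "near-match" idea concrete, here is a minimal sketch that compares the two invoice numbers from the introduction. It uses `difflib` from the Python standard library as a stand-in so it runs without any installs; the helper name `similarity` is hypothetical, and `thefuzz` may score some pairs slightly differently.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stdlib stand-in for a thefuzz ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


ledger_id = "INV-2023"
bank_id = "INV-2O23"  # the letter O was typed instead of a zero

exact_match = ledger_id == bank_id            # what an exact lookup sees
fuzzy_score = similarity(ledger_id, bank_id)  # what a fuzzy comparison sees

print("Exact match:", exact_match)
print("Fuzzy score:", fuzzy_score)
```

The exact comparison reports no match, while the fuzzy score stays high enough to flag the pair for review.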

Prerequisites

  • Visual Studio Code (VS Code): Installed with the Python extension.
  • GitHub Copilot: Active subscription and extension installed in VS Code.
  • Python Installed: Basic Python installation on your machine.
  • Test Data: Two CSV files (e.g., ledger.csv and bank.csv) containing invoice numbers.

Step-by-Step Guide

Step 1: Set Up Your Project Environment

First, we need to create a folder for your audit analytics project and install the necessary Python libraries for data manipulation and string matching.

pip install pandas thefuzz openpyxl

Open your VS Code terminal (Ctrl+`) and paste the command above to install the required libraries.

Step 2: Prepare Your Prompt for Copilot

Open a new file named fuzzy_match_invoices.py. We will use a structured, step-by-step prompt to ensure GitHub Copilot understands the audit objective, the file structures, and the desired output format.

Copilot prompt:

"""
Act as a Senior Data Analyst for Internal Audit.
Write a Python script to perform the following tasks:
1. Load two CSV files: 'ledger.csv' and 'bank.csv'.
2. Compare the 'InvoiceID' column in ledger.csv against the 'Description' column in bank.csv.
3. Use the 'thefuzz' library to perform fuzzy matching (token_sort_ratio).
4. Identify matches with a similarity score greater than 85.
5. Create a new DataFrame containing: Ledger_Invoice, Bank_Description, and Similarity_Score.
6. Export the results to 'potential_matches.csv'.
Add comments explaining each step for non-technical auditors.
"""

Step 3: Generate and Refine the Code

Paste the prompt above into your Python file (inside the triple quotes) and press enter. GitHub Copilot will generate the code block. Review the code to ensure it references the correct column names for your specific files. If the column names in your CSVs differ (e.g., "Inv_Num" instead of "InvoiceID"), you can highlight the variable name and ask Copilot to rename it.
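For orientation while reviewing, the logic Copilot is asked for can be sketched as follows. This is not what Copilot will produce: its output will most likely use pandas and `thefuzz` and will vary between runs. The sketch below sticks to the standard library (`csv` and `difflib`) so it runs anywhere, and every function name in it is hypothetical. `token_sort_score` is a simplified stand-in that does not reproduce the preprocessing `thefuzz` applies; if `thefuzz` is installed, its `fuzz.token_sort_ratio` could be passed as `scorer` instead.

```python
import csv
from difflib import SequenceMatcher


def token_sort_score(a: str, b: str) -> int:
    """Simplified token-sort ratio: sort the words, then score 0-100."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


def find_matches(ledger_ids, bank_descriptions, threshold=85, scorer=token_sort_score):
    """Compare every ledger invoice with every bank description."""
    matches = []
    for invoice in ledger_ids:
        for description in bank_descriptions:
            score = scorer(invoice, description)
            if score > threshold:  # keep only likely matches
                matches.append({
                    "Ledger_Invoice": invoice,
                    "Bank_Description": description,
                    "Similarity_Score": score,
                })
    return matches


def read_column(path, column):
    """Read one column of a CSV file as a list of text values."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def write_matches(matches, path="potential_matches.csv"):
    """Export the matches with the three columns named in the prompt."""
    fields = ["Ledger_Invoice", "Bank_Description", "Similarity_Score"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(matches)


# Tiny in-memory example (made-up values, not real audit data)
sample_ledger = ["INV-2023", "INV-0042"]
sample_bank = ["INV-2O23", "WIRE TRANSFER 9981"]
sample_matches = find_matches(sample_ledger, sample_bank)
print(sample_matches)
```

With real files, the two lists would come from `read_column("ledger.csv", "InvoiceID")` and `read_column("bank.csv", "Description")`, and the result would be passed to `write_matches`.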

Step 4: Execute the Script

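Run the script. Python will compare every row in your ledger against every row in the bank statement, a process that would take hours in Excel. Small files finish in seconds; because the number of comparisons is the ledger row count multiplied by the bank row count, larger files take noticeably longer. The output file potential_matches.csv will contain only the high-probability matches that require your human judgment.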

Pro Tips

  • Normalize Data First: Ask Copilot to add a step that converts all text to uppercase and removes spaces before matching. This improves accuracy significantly.
  • Adjust Thresholds: If you are getting too many false positives, raise the similarity score in the prompt from 85 to 90 or 95.
  • Cross-Check Dates: Enhance the script by asking Copilot to "only match if the invoice dates are within 5 days of each other" to reduce false positives.
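The normalization and date-window tips above can be sketched as two small helpers. The names `normalize` and `within_days` are hypothetical, and the 5-day default simply mirrors the example wording in the tip.

```python
from datetime import date


def normalize(text: str) -> str:
    """Uppercase the text and remove all whitespace before matching."""
    return "".join(text.upper().split())


def within_days(first: date, second: date, max_days: int = 5) -> bool:
    """Return True when the two dates are at most max_days apart."""
    return abs((first - second).days) <= max_days


print(normalize(" inv- 2023 "))
print(within_days(date(2023, 3, 1), date(2023, 3, 6)))
```

A candidate pair would then be scored only when `within_days` returns True, and both strings would go through `normalize` first. Note that stripping every space leaves a single token, so a token-sorting scorer then behaves like a plain ratio; this suits compact invoice IDs more than free-text descriptions.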

Common Mistakes to Avoid

  • Ignoring Data Types: Ensure invoice numbers are treated as strings (text), not integers, or leading zeros (00123) might get stripped, causing mismatches.
  • Overlooking File Paths: Ensure your CSV files are in the folder you run the script from (usually the same folder as the script), or use full file paths; otherwise the script will fail to load the data.
  • Blind Trust: Always manually verify a sample of the matches. Fuzzy matching is a tool for lead generation, not final conclusion.
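The leading-zero problem above is easy to reproduce. The sketch below uses the standard-library `csv` module and made-up sample data; in a pandas-based script, the equivalent safeguard would be to ask Copilot to pass `dtype=str` to `read_csv` for the invoice column.

```python
import csv
import io

# Made-up sample data standing in for a ledger file
raw = "InvoiceID,Amount\n00123,250.00\n00456,99.90\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Kept as text: the leading zeros survive
as_text = [row["InvoiceID"] for row in rows]

# Converted to integers and back: the leading zeros are lost
as_number = [str(int(row["InvoiceID"])) for row in rows]

print(as_text)
print(as_number)
```

Once the zeros are gone, "123" no longer equals "00123", and its fuzzy score against the original also drops.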

Frequently Asked Questions

Q: Do I need to know how to write Python code from scratch?

A: No. With GitHub Copilot, your role shifts from "coder" to "reviewer." You provide the audit logic in English, and Copilot handles the syntax. You only need to know how to run the script.

Q: Is my sensitive audit data safe with GitHub Copilot?

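A: Copilot works from the code and files open in your editor, not from the data inside CSV files sitting on disk. Avoid hardcoding data into the script, and avoid opening sensitive CSVs in the editor while Copilot is active, since open files can be used as context. Confirm with your organization that it uses a GitHub Copilot Business or Enterprise plan and check its data-handling settings, so your prompts aren't used for model training.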

Q: Can this handle millions of records?

A: Yes, but for very large datasets (millions of rows), the basic loop might be slow. You can ask Copilot to "optimize this script for large datasets using vectorization or rapidfuzz" to speed it up.
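A rough way to see why the basic loop slows down is to count the comparisons it performs. In the sketch below the function names are hypothetical, and the throughput of one million comparisons per second is purely an illustrative assumption, not a measured figure for `thefuzz` or `rapidfuzz`.

```python
def comparison_count(ledger_rows: int, bank_rows: int) -> int:
    """Every ledger row is compared with every bank row."""
    return ledger_rows * bank_rows


def estimated_minutes(ledger_rows: int, bank_rows: int, comparisons_per_second: float) -> float:
    """Estimate run time in minutes for an assumed comparison speed."""
    return comparison_count(ledger_rows, bank_rows) / comparisons_per_second / 60


small = comparison_count(500, 500)
large = comparison_count(50_000, 50_000)

print(f"{small:,} comparisons for 500 x 500 rows")
print(f"{large:,} comparisons for 50,000 x 50,000 rows")
print(f"{estimated_minutes(50_000, 50_000, 1_000_000):.1f} minutes at the assumed speed")
```

Because the work grows with the product of the two row counts, timing the script on a small sample first gives a realistic speed to plug in before running the full population.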

🎯 Key Takeaways

  • Test 100% of invoice populations instead of relying on random sampling.
  • Instantly detect 'fat-finger' data entry errors and potential duplicate payments hidden by typos.
  • Generate complex Python logic using plain English audit instructions without prior coding experience.