Document Classification & Invoice Extraction API

Project Overview

This project is a Document Classification & Invoice Extraction API using NLP and deep learning. It classifies documents into categories like invoices, budgets, and emails using a DistilBERT model. If a document is classified as an invoice, the API extracts relevant fields such as invoice number, date, amount, and vendor.

Installation & Setup

Prerequisites

Python 3.8+
FastAPI
PyTorch
Tesseract OCR
Transformers (Hugging Face)
Uvicorn

Models, Algorithms, and Techniques Used

1. Document Classification

The classifier uses DistilBERT, a transformer-based deep learning model, fine-tuned for text classification.

DistilBERT (Distilled BERT) is a lighter and faster version of BERT (Bidirectional Encoder Representations from Transformers). It is designed to retain most of BERT’s accuracy while being smaller and more efficient.

Key Features:

Smaller Size: 40% fewer parameters than BERT.
Faster Inference: About 60% faster while retaining roughly 97% of BERT's language-understanding performance, as reported by the DistilBERT authors.
Uses Knowledge Distillation: Trained by compressing knowledge from BERT.
Retains BERT’s Bi-directional Attention: Ensures high-quality text understanding.

Why Use DistilBERT?

Faster classification with reduced computational cost.
Lower memory consumption, making it ideal for real-time applications like our API.
Pretrained models are available on Hugging Face, making it easy to fine-tune for specific tasks.

2. Invoice Extraction

Regular Expressions (Regex): Extracts structured information such as invoice numbers, dates, amounts, and vendor names from text.

Pattern Matching: Patterns are designed to identify invoice-specific terms like Invoice #, dates in various formats, and currency values.

Text Preprocessing: Removes unwanted characters and normalizes text before extraction.
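
The three techniques above can be sketched together in a few lines of Python. This is a minimal illustration, not the project's actual code: the pattern set, the field names, and the helper names `clean_text` and `extract_invoice_fields` are all assumptions, and real invoices will need broader patterns.

```python
import re

# Hypothetical patterns for illustration; the project's own patterns may differ.
PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:#|no\.?|number)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})\b"),
    "amount": re.compile(
        r"(?:total|amount\s+due|amount)\s*[:\-]?\s*[$€£]?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
    "vendor": re.compile(r"(?:vendor|from|seller)\s*:\s*(.+)", re.IGNORECASE),
}


def clean_text(text):
    """Replace control characters (except newlines) and collapse runs of spaces."""
    text = re.sub(r"[\x00-\x09\x0b-\x1f\x7f]", " ", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def extract_invoice_fields(text):
    """Return the first match for each field, or None when a field is not found."""
    text = clean_text(text)
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


sample = "Vendor: Acme Supplies Ltd\nInvoice # INV-2024-001\nDate: 12/03/2024\nTotal: $1,250.00"
fields = extract_invoice_fields(sample)
print(fields)
```

Returning None for a missing field, rather than raising, lets the caller decide how to handle partially readable invoices.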

Steps

1. Clone the repository

git clone https://github.com/pranj4/Document-Classifier-/

2. Create a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

3. Install dependencies:

pip install -r requirements.txt

4. Run the FastAPI server:

cd document-classification-api
uvicorn api.main:app --reload  # assumption: the FastAPI instance in api/main.py is named "app"

Folder Structure

.venv/: Contains the virtual environment for Python dependencies.
api/: Handles API-related logic.
- invoice_extractor.py: Extracts data from invoices.
- main.py: Entry point for running the API.
- model_loader.py: Utility to load pre-trained or fine-tuned models for API use.
config/: Contains configuration files for the project.
data/: Placeholder for storing raw and processed datasets.
model/: Directory for storing machine learning models.
src/: Core logic for the project.
- results/: Stores outputs, logs, or evaluation metrics from experiments.
- saved_model/: Contains pre-trained or fine-tuned model files.
- autolabel.py: Automates the labeling of datasets.
- create_dataset.py: Generates and processes datasets for training.
- explore.py: Helps explore data and perform analyses.
- extract_invoice_data.py: Specialized script for extracting and processing invoice data.
- extract_text.py: Extracts textual information from input data.
- fine_tuned.py: Handles fine-tuning of machine learning models.
- preprocess.py: Prepares datasets for training (e.g., cleaning and tokenizing data).
- train_model.py: Script to train machine learning models.
.gitignore: Specifies files and directories to be excluded from version control.
README.md: Documentation for the project.
test.py & test2.py: Scripts for testing various components of the project.

Dataset Exploration and Preprocessing

1. Install Kaggle

pip install kaggle

2. Get a Kaggle API key and store it in your project.

3. Download the dataset and move it into your project and explore it.

4. Extract the text from the images, label each document (invoice, email, bill, etc.), and clean the text for training (remove unwanted characters and symbols) using preprocess.py and extract_text.py.

Convert images to text using OCR.

Autolabel each document as Invoice, Bill, Email, etc.

Preprocess to normalise the data.

Export the data to a CSV file and view it.
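
The post-OCR part of this pipeline (normalising, autolabeling, and writing a CSV) might look roughly like the sketch below. It assumes the OCR text is already available as strings. The keyword lists, label names, and function names are hypothetical and do not reproduce autolabel.py or preprocess.py.

```python
import csv
import io
import re

# Hypothetical keyword lists; the rules in autolabel.py are not shown in this README.
KEYWORDS = {
    "invoice": ("invoice", "amount due", "bill to"),
    "email": ("subject:", "from:", "to:"),
    "budget": ("budget", "forecast", "expenditure"),
}


def normalize(text):
    """Lowercase, replace characters outside a small allowed set, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s.,:#$/@-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def autolabel(text):
    """Return the label whose keywords occur most often, or 'unknown' if none match."""
    scores = {label: sum(text.count(k) for k in kws) for label, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def to_csv(ocr_texts):
    """Write one cleaned, labeled row per document and return the CSV as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text", "label"])
    for raw in ocr_texts:
        cleaned = normalize(raw)
        writer.writerow([cleaned, autolabel(cleaned)])
    return buffer.getvalue()


rows = to_csv(["INVOICE #42\nAmount due: $10.00", "Subject: Hi\nFrom: a@b.com"])
print(rows)
```

Keyword counting is only a rough first pass; labels produced this way are worth spot-checking before training on them.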

Model Training (train_model.py) and Evaluation on Metrics such as Accuracy, Precision, and F1 Score

Used DistilBertForSequenceClassification from Hugging Face.
Tokenized text using DistilBertTokenizer.
Trained on labeled data with PyTorch.
Saved trained model for later inference.
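
Once the model is saved, inference could be sketched as follows. The label order in `LABELS` and the default `model_dir` path are assumptions (the path is guessed from the folder structure), and `pick_label` and `classify_text` are hypothetical helper names. The torch and transformers imports sit inside `classify_text` so the pure-Python part can be tried without them.

```python
import math

# Hypothetical label order; it must match the order used when the model was trained.
LABELS = ["invoice", "budget", "email"]


def pick_label(logits, labels=LABELS):
    """Apply softmax to raw logits and return the top label with its probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def classify_text(text, model_dir="src/saved_model"):
    """Load the saved model and classify one document (requires torch and transformers)."""
    import torch
    from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained(model_dir)
    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return pick_label(logits)


label, confidence = pick_label([2.0, 0.5, -1.0])
print(label, round(confidence, 3))
```

In a long-running API, the tokenizer and model would normally be loaded once at startup rather than on every call.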

Evaluation

From the scores displayed in the above images, the following can be inferred:

Accuracy (90.67%): The model correctly classified approximately 90.67% of the test samples. This is a strong performance, indicating that the model is well-trained.

Precision (91.16%): Of all the predictions the model made for a given class, 91.16% were correct, averaged across classes. This suggests a low false positive rate, meaning the model is reliable when it assigns a document to a category.

Recall (90.67%): The model correctly identified 90.67% of the documents that actually belong to each class, averaged across classes. Recall is identical to accuracy here, which is what support-weighted averaging produces, so these scores are probably weighted averages.

F1 Score (90.72%): The F1 Score is the harmonic mean of precision and recall. A score of 90.72% indicates a well-balanced model that favors neither precision nor recall.

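The training loss (0.4387) is higher than the evaluation loss (0.2541). This does not necessarily mean the model performs better on the validation set than on the training set. Training loss is typically averaged over the whole run with dropout active, while evaluation loss is measured on the final weights with dropout disabled. Regularization techniques or a relatively small dataset may also contribute.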
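
To make the metrics concrete, the sketch below computes accuracy and support-weighted precision, recall, and F1 in plain Python. The README does not state which averaging was used, so weighted averaging is an assumption, and the labels and predictions are made-up toy data rather than the project's results.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 across all classes."""
    n = len(y_true)
    support = Counter(y_true)    # true documents per class
    predicted = Counter(y_pred)  # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n  # each class contributes in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data for illustration only.
y_true = ["invoice", "invoice", "email", "email", "budget", "budget"]
y_pred = ["invoice", "email", "email", "email", "budget", "invoice"]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

With this averaging, weighted recall always equals accuracy, because each class's recall is weighted by its share of the test set.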

Invoice Data Extraction (extract_invoice_data.py) and Saving to a CSV File

FastAPI Integration

The trained model is integrated into a REST API built with FastAPI. The API lets users upload a document, classifies it, and extracts the relevant details if it is an invoice.
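
The classify-then-extract flow can be sketched independently of FastAPI. In the sketch below the classifier and extractor are passed in as plain callables. The response keys `category` and `invoice_data`, and the stand-in functions, are assumptions for illustration and may not match what main.py returns.

```python
from typing import Callable, Dict, Optional


def classify_and_extract(
    text: str,
    classify: Callable[[str], str],
    extract: Callable[[str], Dict[str, Optional[str]]],
) -> dict:
    """Classify a document and run field extraction only when it is an invoice."""
    label = classify(text)
    response = {"category": label}
    if label.lower() == "invoice":
        response["invoice_data"] = extract(text)
    return response


# Stand-ins for the real model and regex extractor, for illustration only.
def fake_classify(text):
    return "invoice" if "invoice" in text.lower() else "email"


def fake_extract(text):
    return {"invoice_number": "INV-1"}


invoice_response = classify_and_extract("Invoice # INV-1", fake_classify, fake_extract)
email_response = classify_and_extract("Subject: hello", fake_classify, fake_extract)
print(invoice_response)
print(email_response)
```

Keeping this logic separate from the route handler makes it testable without starting the server.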

API Endpoints

1. Home Endpoint

URL: /

Method: GET

Response:

{ "message": "Document Classification & Invoice Extraction API is Running!" }

2. Document Classification & Extraction

URL: /classify_and_extract

Method: POST

Request: Upload a text file

Test with curl

curl -X 'POST' 'http://127.0.0.1:8000/classify_and_extract' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@invoice.pdf'

Replace invoice.pdf with the path to a text file of your own to test.
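
The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the server is running locally on port 8000 and expects the upload in a form field named `file`, as in the curl command. `build_multipart` and `classify_file` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field, filename, content, content_type="text/plain"):
    """Encode one file as a multipart/form-data body; returns (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def classify_file(path, url="http://127.0.0.1:8000/classify_and_extract"):
    """POST a local file to the running API and return the decoded JSON response."""
    with open(path, "rb") as handle:
        body, header = build_multipart("file", os.path.basename(path), handle.read())
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": header, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the server to be running):
# print(classify_file("invoice.txt"))
```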

Future Improvements

Improve classification accuracy with a larger dataset.
Deploy the API using Docker & AWS Lambda.
Extend the extraction model to support more document types.
