AI Large Document processor using Longformer LLM, FastAPI

This project uses the Longformer transformer model to check whether texts (like website content) comply with specific policies. The project is designed to automate compliance checks, especially for long documents, by splitting texts into smaller chunks, classifying each chunk as compliant or non-compliant, and providing feedback on specific non-compliant terms.

Why Longformer

We chose Longformer for this project because it is designed to handle long documents more efficiently than standard transformer models. Its sliding-window attention lets it process long sequences (up to 4,096 tokens in the standard pretrained checkpoints) without the memory and compute overhead of full self-attention models like BERT, whose cost grows quadratically with sequence length. This makes it a good fit for tasks with large text inputs, such as compliance checking across long documents or website content, where the broader context matters. It also runs locally on my machine without issues.
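To make the efficiency argument concrete, here is a rough back-of-the-envelope sketch that counts how many token pairs each attention scheme has to score. The window size of 512 is an assumption for illustration, and the count ignores Longformer's global attention tokens, so treat the numbers as an upper-bound estimate rather than a measurement.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Token pairs scored by full self-attention: every token attends to every token."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Upper bound on pairs scored when each token attends to at most `window` neighbours."""
    return seq_len * min(window, seq_len)


if __name__ == "__main__":
    window = 512  # assumed window size, for illustration only
    for n in (512, 4096):
        full = full_attention_pairs(n)
        local = sliding_window_pairs(n, window)
        print(f"n={n}: full={full:,} windowed<={local:,} ratio={full / local:.0f}x")
```

Under these assumptions, a 4,096-token input needs about 8x fewer pair scores with windowed attention than with full attention, and the gap widens as the input grows.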

Brief Summary

First, scrape the policy and target website URLs and retrieve their text content
Divide both texts into chunks, starting with the policy, which is summarized
Validate every chunk of the target website text against the summarized policy
If any chunk is found to be non-compliant, treat the whole website as non-compliant
Selected Longformer because it runs locally and has a decent context window
Trained Longformer on test data generated with the help of ChatGPT
Trained on both positive and negative cases, and tried to cover some edge cases as well
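The first step (turning a scraped page into plain text) could look roughly like the sketch below. It uses only the standard library's `html.parser`; the class and function names are hypothetical and the repository may well use a different scraping library. Fetching the page over HTTP is left out.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse all runs of whitespace so the page becomes one flat string.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html: str) -> str:
    """Return the text content of an HTML document as a single line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Note that this keeps navigation and footer text too, which is consistent with the menu items visible in the sample response chunks.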

Corners Cut / Out of Scope

Didn't scan all subdomains
Didn't train a cloud-hosted model (didn't want to exceed an API limit); ran Longformer locally instead
Didn't train on a larger dataset, which would likely catch more nuances
Didn't refine the report to highlight the exact term causing the problem

Key Components

FastAPI: API framework for handling requests and responses.
Longformer Model: A pre-trained transformer model, fine-tuned for sequence classification to detect compliance.
Chunking: Long texts are divided into smaller chunks to be processed within model limitations.
Reporting: Non-compliant chunks are returned; pinpointing the specific offending terms is not yet implemented.
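As an illustration of the chunking component, here is a minimal sketch. The sample response in this README shows a chunk ending in "Ch" and the next starting with "arles Meyer", which suggests fixed-size cuts at character offsets; that is an inference, not something confirmed by the code. Both function names and the `chunk_size` parameter are hypothetical. The second function is an alternative that avoids splitting words.

```python
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_text_on_whitespace(text: str, chunk_size: int) -> List[str]:
    """Pack whole words into chunks of at most chunk_size characters.

    A single word longer than chunk_size is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Character counts are only a proxy for the model's token limit, so a real implementation would probably size chunks with the tokenizer instead.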

How it Works

Policy Text & Target Text: The model compares policy guidelines with the target text (e.g., content from a website).
Chunking & Tokenization: Large texts are split into smaller chunks, tokenized, and processed by the model.
Inference: The model predicts whether each chunk is compliant or non-compliant based on fine-tuned training data.
Reporting: Non-compliant chunks are returned for easier identification. The chunk size is currently large, so the report does not pinpoint the exact offending text.
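The inference and reporting steps can be sketched as a loop over chunks that collects the flagged ones into the response shape shown later in this README. The classifier is passed in as a plain callable so the sketch runs without the model; `build_report`, `marker_stub`, and the label strings are hypothetical names, and the stub is a toy stand-in, not the fine-tuned Longformer.

```python
from typing import Callable, Dict, List

# Hypothetical label names; the real model's label mapping may differ.
COMPLIANT = "compliant"
NON_COMPLIANT = "non-compliant"


def build_report(chunks: List[str], classify: Callable[[str], str]) -> Dict:
    """Classify every chunk and assemble a response in the sample response shape."""
    mistakes = [
        {"chunk": chunk, "error": "Non-compliant content found"}
        for chunk in chunks
        if classify(chunk) == NON_COMPLIANT
    ]
    # One flagged chunk is enough to mark the whole site as non-compliant.
    return {"result": {"isCompliant": not mistakes, "mistakes": mistakes}}


def marker_stub(chunk: str) -> str:
    """Toy stand-in classifier for demonstration only: flags a literal marker."""
    return NON_COMPLIANT if "FLAGGED" in chunk else COMPLIANT


if __name__ == "__main__":
    print(build_report(["plain text", "some FLAGGED text"], marker_stub))
```

In the real pipeline, `classify` would wrap tokenization and a forward pass through the fine-tuned model.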

Usage

Train the Model: Fine-tune the Longformer model with policy text and labeled examples of compliant/non-compliant text.
Inference: Run compliance checks via the FastAPI app by sending the policy text and target website content.
Result: The response contains a boolean isCompliant indicating whether the target is compliant, and a mistakes array listing the non-compliant chunks.
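Before fine-tuning, the labeled examples need to cover both classes. The sketch below shows one way to check that and to hold out a validation set. Everything here is an assumption for illustration: the `LabeledExample` type, the 1/0 label encoding, and the 80/20 split are not taken from the repository, and the fine-tuning call itself is omitted.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LabeledExample:
    text: str
    label: int  # assumed encoding: 1 = compliant, 0 = non-compliant


def check_balance(examples: List[LabeledExample]) -> Counter:
    """Count examples per label and fail if either class is missing."""
    counts = Counter(ex.label for ex in examples)
    missing = {0, 1} - set(counts)
    if missing:
        raise ValueError(f"no examples for label(s): {sorted(missing)}")
    return counts


def train_val_split(
    examples: List[LabeledExample], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[LabeledExample], List[LabeledExample]]:
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

A check like this catches the case where generated data accidentally contains only positive or only negative examples.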

Actual Request & Response

{
    "policy_website_url": "https://docs.stripe.com/treasury/marketing-treasury",
    "target_website_url": "https://mercury.com"
}

{
    "result": {
        "isCompliant": false,
        "mistakes": [
            {
                "chunk": "Online Business Banking For Startups | Simplified Financial Workflows Products Resources About Pricing Log In Log In Open Account Log In Log In Open Account Open Menu Products Resources About Pricing  
                 --- truncated ----
                achieving PMF from AMAs. I highly recommend it. Ch",
                "error": "Non-compliant content found"
            },
            {
                "chunk": "arles Meyer Founder , My Better AI Building trust as a finance leader Read the Story Carolynn Levy, inventor of the SAFE Read the Story Sending international wires through SWIFT Read the Story Pricing 
                 --- truncated ---
                FDIC-insured bank . Banking services provided by Choice Financial Group , Column N.A. , and Evolve Bank & Trust , Members FDIC. Deposit insurance covers the failure of an insured bank.",
                "error": "Non-compliant content found"
            }
        ]
    }
}
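A client for the request and response above might look like the following sketch, using only the standard library. The README does not state the route or port of the FastAPI app, so `api_url` is left as a parameter the caller must supply; the function names are hypothetical.

```python
import json
import urllib.request


def build_request(api_url: str, policy_url: str, target_url: str) -> urllib.request.Request:
    """Build the POST request with the two-field JSON body shown above."""
    payload = {"policy_website_url": policy_url, "target_website_url": target_url}
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response_body: str, preview_chars: int = 60) -> str:
    """Turn a JSON response into a short, readable summary."""
    result = json.loads(response_body)["result"]
    lines = [f"compliant: {result['isCompliant']}"]
    for i, mistake in enumerate(result["mistakes"], start=1):
        lines.append(f"{i}. {mistake['error']}: {mistake['chunk'][:preview_chars]}")
    return "\n".join(lines)


def check(api_url: str, policy_url: str, target_url: str) -> str:
    """Send the request to a running instance of the app and summarize the reply."""
    with urllib.request.urlopen(build_request(api_url, policy_url, target_url)) as resp:
        return summarize(resp.read().decode("utf-8"))
```

Because the chunks are large, the summary previews only the first few characters of each flagged chunk.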

souravendra/expert-octo-rotary-phone-sei

Folders and files

Latest commit

History

Repository files navigation

AI Large Document processor using Longformer LLM, FastAPI

Why Longformer

Brief Summary

Corners Cut/ Out of Scope

Key Components

How it Works

Usage

Actual Request & Response

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages