This project uses the Longformer transformer model to check whether texts (like website content) comply with specific policies. The project is designed to automate compliance checks, especially for long documents, by splitting texts into smaller chunks, classifying each chunk as compliant or non-compliant, and providing feedback on specific non-compliant terms.
We chose Longformer for this project because it is specifically designed to handle long documents more efficiently than standard transformer models. It uses a sliding window attention mechanism, allowing it to process long sequences (up to 4096 tokens or more) without the memory and computational overhead typical of full self-attention models like BERT. This makes it ideal for tasks involving large text inputs, such as compliance checking across extensive documents or website content, where understanding the broader context is crucial. Also it can be run locally on my system without issue.
- First scrape the target website urls and retrieve the text content
- Divide both into chunks, starting by summarizing the policy
- Now against the summarized policy, start validating every chunk of target website text
- if any chunk is found to be non-compliant, consider the whole website to be non-compliant
- Selected Longformer because I can run it locally and also it has a decent context window
- Trained Longformer using some test data I generated with the help of ChatGPT
- Made sure to train with both positive and negative cases and tried to cover some edge cases as well
- Didn't scan all subdomains
- Didn't train a cloud model (didnt want to exceed an API limit), ran Longformer locally
- Didn't train on a larger dataset which would perhaps catch more nuances
- Didn't optimise the report so it would highlight the exact term that was causing the problem
- FastAPI: API framework for handling requests and responses.
- Longformer Model: A pre-trained transformer model, fine-tuned for sequence classification to detect compliance.
- Chunking: Long texts are divided into smaller chunks to be processed within model limitations.
- Reporting: Non-compliant terms are identified and chunks are returned.
- Policy Text & Target Text: The model compares policy guidelines with the target text (e.g., content from a website).
- Chunking & Tokenization: Large texts are split into smaller chunks, tokenized, and processed by the model.
- Inference: The model predicts whether each chunk is compliant or non-compliant based on fine-tuned training data.
- Reporting: Non-compliant chunks are returned for easier identification. (currently chunk size is pretty big so not pinpointing the exact mistakes)
- Train the Model: Fine-tune the Longformer model with policy text and labeled examples of compliant/non-compliant text.
- Inference: Run compliance checks via the FastAPI app by sending the policy text and target website content.
- Result: Get a boolean
isCompliant
that tells me if its compliant or not and an array of chunks with results.
{
"policy_website_url": "https://docs.stripe.com/treasury/marketing-treasury",
"target_website_url": "https://mercury.com"
}
{
"result": {
"isCompliant": false,
"mistakes": [
{
"chunk": "Online Business Banking For Startups | Simplified Financial Workflows Products Resources About Pricing Log In Log In Open Account Log In Log In Open Account Open Menu Products Resources About Pricing
--- truncated ----
achieving PMF from AMAs. I highly recommend it. Ch",
"error": "Non-compliant content found"
},
{
"chunk": "arles Meyer Founder , My Better AI Building trust as a finance leader Read the Story Carolynn Levy, inventor of the SAFE Read the Story Sending international wires through SWIFT Read the Story Pricing
--- truncated ---
FDIC-insured bank . Banking services provided by Choice Financial Group , Column N.A. , and Evolve Bank & Trust , Members FDIC. Deposit insurance covers the failure of an insured bank.",
"error": "Non-compliant content found"
}
]
}
}