Merge pull request #142 from raju-rangan/main

feat: Orchestrate an intelligent document processing workflow using tool-use on Amazon Bedrock

jonathancaevans authored Jan 15, 2025
2 parents 101bb16 + ca0a261 commit 4c2365c

Showing 19 changed files with 1,872 additions and 0 deletions.
10 changes: 10 additions & 0 deletions medical-idp/.gitignore
@@ -0,0 +1,10 @@
.idea
grobid/__pycache__
__pycache__
.venv
.vscode
temp/*
*.DS_Store
*/.ipynb_checkpoints
.ipynb_checkpoints/*
Untitled*
8 changes: 8 additions & 0 deletions medical-idp/.streamlit/config.toml
@@ -0,0 +1,8 @@
[logger]
level = "info"

[browser]
gatherUsageStats = true

[ui]
hideTopBar = true
52 changes: 52 additions & 0 deletions medical-idp/README.md
@@ -0,0 +1,52 @@
# Orchestrate an intelligent document processing workflow using tool-use on Amazon Bedrock

## Solution Overview

This intelligent document processing solution leverages Amazon Bedrock to orchestrate a sophisticated workflow for handling multi-page healthcare documents with mixed content types. At the core of this solution is Amazon Bedrock's Converse API with its powerful tool-use capabilities, which enables foundation models to interact with external functions and APIs as part of their response generation.
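In code, a Converse API tool-use request pairs the conversation with a `toolConfig` that describes the functions the model may call. The sketch below is illustrative only: the tool name `extract_patient_info` and its schema are hypothetical and not part of this repository; only the model ID and the request shape follow the Bedrock Converse API.

```python
# Minimal sketch of a Converse API tool-use request body.
# The tool name and schema are hypothetical examples, not code from this repo.

def build_tool_use_request(user_text):
    """Build the keyword arguments for a bedrock-runtime converse() call."""
    tool_config = {
        "tools": [{
            "toolSpec": {
                "name": "extract_patient_info",  # hypothetical tool
                "description": "Extract structured fields from a patient intake form.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "patient_name": {"type": "string"},
                            "date_of_birth": {"type": "string"},
                        },
                        "required": ["patient_name"],
                    }
                },
            }
        }]
    }
    messages = [{"role": "user", "content": [{"text": user_text}]}]
    return {
        "modelId": "anthropic.claude-3-haiku-20240307-v1:0",
        "messages": messages,
        "toolConfig": tool_config,
    }

# With boto3, the request would be sent roughly as:
#   client = boto3.client("bedrock-runtime")
#   response = client.converse(**build_tool_use_request("..."))
# When the model decides to call a tool, the response contains a
# "toolUse" content block that the application executes and answers
# with a "toolResult" block on the next turn.
```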

The solution employs a strategic multi-model approach, optimizing for both performance and cost by selecting the most appropriate model for each task:

* **Claude 3 Haiku**: Serves as the workflow orchestrator due to its lower latency and cost-effectiveness. Its strong reasoning and tool-use abilities make it ideal for:
- Coordinating the overall document processing pipeline
- Making routing decisions for different document types
- Invoking appropriate processing functions
- Managing the workflow state

* **Claude 3.5 Sonnet (v2)**: Handles vision-intensive tasks, where its superior visual reasoning excels at:
- Interpreting complex document layouts and structure
- Extracting text from tables and forms
- Processing medical charts and handwritten notes
- Converting unstructured visual information into structured data
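The multi-model split above amounts to a simple routing rule: vision-heavy work goes to Sonnet, orchestration goes to Haiku. A minimal sketch, assuming made-up task names (only the two model IDs are real public Bedrock identifiers; the routing function itself is illustrative, not code from this repository):

```python
# Illustrative task-to-model routing; task names are invented for
# this sketch, only the model IDs are real Bedrock identifiers.
ORCHESTRATOR_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
VISION_MODEL = "anthropic.claude-3-5-sonnet-20241022-v2:0"

# Tasks that benefit from stronger visual reasoning
VISION_TASKS = {"layout_analysis", "table_extraction", "handwriting_ocr"}

def pick_model(task: str) -> str:
    """Route vision-heavy tasks to Sonnet, everything else to Haiku."""
    return VISION_MODEL if task in VISION_TASKS else ORCHESTRATOR_MODEL
```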

![Orchestration Flow](static/flow_diagram.webp)

## Use Case and Dataset

For our example use case, we'll examine a patient intake process at a healthcare institution. The workflow processes a patient health information package containing three distinct document types that demonstrate the varying complexity in document processing:

1. **Structured Document**: A new patient intake form with standardized fields for personal information, medical history, and current symptoms. This form follows a consistent layout with clearly defined fields and checkboxes, making it an ideal example of a structured document.

2. **Semi-structured Document**: A health insurance card that contains essential coverage information. While insurance cards generally contain similar information (policy number, group ID, coverage dates), they come from different providers with varying layouts and formats, showing the semi-structured nature of these documents.

3. **Unstructured Document**: A handwritten doctor's note from an initial consultation, containing free-form observations, preliminary diagnoses, and treatment recommendations. This represents the most challenging category of unstructured documents, where information isn't confined to any predetermined format or structure.

The example document can be downloaded [here](docs/new-patient-registration.pdf).

## Solution Setup
1. Set up an Amazon SageMaker domain using the instructions in the [quick setup guide](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html)
2. Launch Studio, then create and launch a JupyterLab space using the instructions in the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl-user-guide-create-space.html)
3. Follow the instructions in the documentation to [create a guardrail](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-create.html). Focus on adding “Sensitive Information Filters” that mask personally identifiable information (PII) and protected health information (PHI).
4. Clone the code from the aws-samples GitHub repository
`git clone <repo-url>`
5. Change directory to the root of the cloned repository by running
`cd medical-idp`
6. Install dependencies by running
`pip install -r requirements.txt`
7. Update setup.sh with the guardrail ID you created in step 3. Then set the environment variables by running
`source setup.sh`
8. Finally, start the Streamlit application by running
`streamlit run app.py`

Now you are ready to explore the intelligent document processing workflow using Amazon Bedrock.

> ⚠️ **WARNING**: This codebase demonstrates intelligent document processing capabilities using Claude models and references medical documents as an example. Any medical or healthcare-related analysis, diagnosis, or decision-making without proper review and validation by qualified medical professionals is done at your own risk. Neither AWS nor the authors assume any liability for such use.
Empty file added medical-idp/config.py
Empty file.
Binary file added medical-idp/docs/new-patient-registration.pdf
Binary file not shown.
11 changes: 11 additions & 0 deletions medical-idp/requirements.txt
@@ -0,0 +1,11 @@
pyarrow
python-dotenv
beautifulsoup4
lxml
grobid-client-python
watchdog
streamlit-pdf-viewer
requests
PyMuPDF
Pillow
boto3
16 changes: 16 additions & 0 deletions medical-idp/setup.sh
@@ -0,0 +1,16 @@
#!/bin/bash

# Set the Guardrail ID
export GUARDRAIL_ID="<TODO: ENTER GUARDRAIL ID HERE>"

# Set the Guardrail Version
export GUARDRAIL_VERSION="DRAFT" # Change this to use a specific version of the guardrail

# Print the values to confirm
echo "GUARDRAIL_ID set to: $GUARDRAIL_ID"
echo "GUARDRAIL_VERSION set to: $GUARDRAIL_VERSION"

# Optionally, you can add more environment variables here if needed

# Note: Running this script directly won't affect your current shell session.
# You need to source it for the variables to be available in your current session.
Binary file added medical-idp/static/favicon.ico
Binary file not shown.
Binary file added medical-idp/static/flow_diagram.webp
Binary file not shown.
148 changes: 148 additions & 0 deletions medical-idp/streamlit_app.py
@@ -0,0 +1,148 @@
import os
from hashlib import blake2b
from tempfile import NamedTemporaryFile
import subprocess

import dotenv
from streamlit_pdf_viewer import pdf_viewer
from PIL import Image
dotenv.load_dotenv(override=True)
import boto3
import json
import streamlit as st

from utils.processor import FileProcessor, ToolSpec

region_name = "us-east-1"

# Create a Boto3 session
session = boto3.Session(region_name=region_name)

# Create a Bedrock client
bedrock_client = session.client("bedrock")

# Create an instance of FileProcessor
processor = FileProcessor()

title = "From Conversation to Automation"

css = '''
<style>
    [data-testid="column"] {
        overflow-y: auto;
    }
    .column-2 {
        max-height: 80vh;
        overflow-y: auto;
    }
</style>
'''


if 'tmp_file' not in st.session_state:
    st.session_state['tmp_file'] = None

if 'doc_id' not in st.session_state:
    st.session_state['doc_id'] = None

if 'button_enabled' not in st.session_state:
    st.session_state['button_enabled'] = False

if 'binary' not in st.session_state:
    st.session_state['binary'] = None

im = Image.open("static/favicon.ico")
st.set_page_config(
    page_title=title,
    page_icon=im,
    initial_sidebar_state="expanded",
    menu_items={
        'About': "A demo to showcase intelligent document processing using tools in Amazon Bedrock"
    },
    layout="wide"
)
st.markdown(css, unsafe_allow_html=True)
with st.sidebar:
    st.header("Documentation")
    st.markdown("[Amazon Bedrock](https://aws.amazon.com/bedrock/)")
    st.markdown(
        """Upload doctor's notes and see the Anthropic Claude model's multi-modal capability to extract information""")

    st.header("Inference Options")
    enable_guardrails = st.toggle('Use Guardrails', value=False, disabled=not st.session_state['button_enabled'],
                                  help="When enabled, uses a guardrail to detect and block PII in the request and response.")
    temp = st.slider(label="Temperature", min_value=0.0, max_value=1.0, step=0.1, value=0.0, help="Temperature controls the level of randomness in the model's output")
    maxTokens = st.slider(label="Max Output Tokens", min_value=50, max_value=2048, value=2000, help="Max output tokens controls the size of the model's output")

    resolution_boost = st.slider(label="Resolution boost", min_value=1, max_value=10, value=1)
    width = st.slider(label="PDF width", min_value=100, max_value=1000, value=800)



def new_file():
    st.session_state['doc_id'] = None
    st.session_state['button_enabled'] = True
    st.session_state['binary'] = None
    st.session_state['tmp_file'] = None

col1, col2 = st.columns([6, 4])

with col1:
    st.title(title)
    st.subheader("Connecting foundation models to external tools.")
    process_button = st.button("Process Document", disabled=not st.session_state['button_enabled'])
    uploaded_file = st.file_uploader("Upload a document",
                                     type=("pdf"),
                                     on_change=new_file,
                                     help="Process patient documents using generative AI")

if uploaded_file:
    if not st.session_state['binary']:
        with st.spinner('Reading file...'):
            binary = uploaded_file.getvalue()
            tmp_file = NamedTemporaryFile(suffix='.pdf', delete=False)
            tmp_file.write(bytearray(binary))
            st.session_state["tmp_file"] = tmp_file.name
            st.session_state['binary'] = binary

    with st.spinner("Rendering PDF document"):
        pdf_viewer(
            input=st.session_state['binary'],
            width=width,
            pages_vertical_spacing=10,
            resolution_boost=resolution_boost
        )

with col2:
    st.markdown('<div class="column-2">', unsafe_allow_html=True)  # Start of scrollable column
    st.subheader("Output")
    st.markdown("Output from the foundation model")

if process_button:
    if st.session_state['tmp_file']:
        placeholder = st.empty()
        with st.spinner("Processing the document..."):
            toolspecs = [ToolSpec.DOCUMENT_PROCESSING_PIPELINE]  # Always include the main DOCUMENT_PROCESSING_PIPELINE tool
            toolspecs.append(ToolSpec.DOC_NOTES)
            toolspecs.append(ToolSpec.NEW_PATIENT_INFO)
            toolspecs.append(ToolSpec.INSURANCE_FORM)

            tmp_file = st.session_state['tmp_file']

            prompt = ("1. Extract, 2. save, and 3. summarize the information from the patient information package located at " + tmp_file + ". " +
                      "The package might contain various types of documents, including insurance cards. Extract and save information from all documents provided. " +
                      "Perform any preprocessing or classification of the file provided prior to the extraction. " +
                      "Set the enable_guardrails parameter to " + str(enable_guardrails) + ". " +
                      "At the end, list all the tools that you had access to. Give an explanation of why each tool was used; if a tool was not used, explain why not. " +
                      "Think step by step.")
            processor.process_file(prompt=prompt,
                                   placeholder=placeholder,
                                   enable_guardrails=enable_guardrails,
                                   temperature=temp,
                                   maxTokens=maxTokens,
                                   toolspecs=toolspecs)


141 changes: 141 additions & 0 deletions medical-idp/tools/document_classifier.py
@@ -0,0 +1,141 @@
import json
from utils.constants import ModelIDs
from utils.bedrockutility import BedrockUtils

UNKNOWN_TYPE = "UNK"
DOCUMENT_TYPES = ["INTAKE_FORM", "INSURANCE_CARD", "DOC_NOTES", UNKNOWN_TYPE]

class DocumentClassifier:
    def __init__(self, file_handler):
        self.sonnet_3_5_bedrock_utils = BedrockUtils(model_id=ModelIDs.anthropic_claude_3_5_sonnet)
        self.sonnet_3_0_bedrock_utils = BedrockUtils(model_id=ModelIDs.anthropic_claude_3_sonnet)
        self.haiku_bedrock_utils = BedrockUtils(model_id=ModelIDs.anthropic_claude_3_haiku)
        self.meta_32_util = BedrockUtils(model_id=ModelIDs.meta_llama_32_model_id)
        self.file_handler = file_handler

    def classify_documents(self, input_data):
        """Classify documents based on their content."""
        return self.categorize_document(input_data['document_paths'])

    def categorize_document(self, file_paths):
        """
        Categorize documents based on their content.
        """
        try:
            if len(file_paths) == 1:
                # Single file handling
                binary_data, media_type = self.file_handler.get_binary_for_file(file_paths[0])
                if binary_data is None or media_type is None:
                    return []

                message_content = [
                    {"image": {"format": media_type, "source": {"bytes": data}}}
                    for data in binary_data
                ]
            else:
                # Multiple file handling
                binary_data_array = []
                for file_path in file_paths:
                    binary_data, media_type = self.file_handler.get_binary_for_file(file_path)
                    if binary_data is None or media_type is None:
                        continue
                    # Only use the first page for classification in the multiple-file case
                    binary_data_array.append((binary_data[0], media_type))

                if not binary_data_array:
                    return []

                message_content = [
                    {"image": {"format": media_type, "source": {"bytes": data}}}
                    for data, media_type in binary_data_array
                ]

            message_list = [{
                "role": 'user',
                "content": [
                    *message_content,
                    {"text": "What type of document is in each of these images?"}
                ]
            }]

            # Create system message with instructions
            data = {"file_paths": file_paths}
            files = json.dumps(data, indent=2)
            system_message = self._create_system_message(files)

            response = self.sonnet_3_0_bedrock_utils.invoke_bedrock(
                message_list=message_list,
                system_message=system_message
            )
            response_message = [response['output']['message']]
            return response_message

        except Exception as e:
            print(f"An error occurred: {str(e)}")
            return []

    def _create_system_message(self, files):
        """
        Create a system message for document classification in a doctor's consultation package.
        """
        return [{
            "text": f'''
            <task>
            You are a medical document processing agent. You have perfect vision.
            You meticulously analyze the images and categorize them based on these document types:
            <document_types>INTAKE_FORM, INSURANCE_CARD, DOC_NOTES</document_types>
            </task>
            <input_files>
            {files}
            </input_files>
            <instructions>
            1. Categorize each file into one of the document types.
            2. Use 'UNK' for unknown document types.
            3. Look at all the sections on each page and associate <topics> with them.
            4. For example, if `Patient Information` is on the page, the topics will include `PATIENT_INFO`;
               if `Medical History` is on the page, they will include `MEDICAL_HISTORY`.
               If none of the listed topics is found on the page, just return UNK.
            5. Only include the topics that are relevant to the particular file.
            6. Ensure that there is no confusion between the section number and the file path.
            7. Your output should be an array with one element per file,
               with the following attributes for each element: `category`, `file_path`, and `topics`.
            </instructions>
            <topics>
            PATIENT_INFO,
            MEDICAL_HISTORY,
            CURRENT_MEDICATIONS,
            ALLERGIES,
            VITAL_SIGNS,
            CHIEF_COMPLAINT,
            PHYSICAL_EXAMINATION,
            DIAGNOSIS,
            TREATMENT_PLAN,
            INSURANCE_DETAILS,
            UNK
            </topics>
            <important>
            Do not include any text outside the JSON object in your response.
            Your entire response should be parseable as a single JSON object.
            </important>
            <example_output>
            [
                {{
                    "category": "INTAKE_FORM",
                    "file_path": "temporary/file/path.png",
                    "topics": [
                        "PATIENT_INFO",
                        "MEDICAL_HISTORY",
                        "CURRENT_MEDICATIONS",
                        "ALLERGIES"
                    ]
                }}
            ]
            </example_output>
            '''
        }]
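Because the system prompt requires the model's entire response to be parseable JSON, a downstream caller would typically parse and sanity-check it before use. The sketch below is a hypothetical validation step (the helper name and the fallback behavior are assumptions, not code from this commit):

```python
import json

# Mirrors the categories listed in the classifier's system prompt
DOCUMENT_TYPES = {"INTAKE_FORM", "INSURANCE_CARD", "DOC_NOTES", "UNK"}

def parse_classification(response_text):
    """Parse the model's JSON output and coerce unexpected labels to 'UNK'."""
    items = json.loads(response_text)
    for item in items:
        if item.get("category") not in DOCUMENT_TYPES:
            item["category"] = "UNK"  # fall back on unexpected labels
    return items
```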
