Auto-Categorization Guide

Version: 1.0
Last Updated: November 28, 2025

Overview

The auto-categorization system classifies each uploaded document to decide whether it needs full AI analysis. In the 50-document scenario under Resource Savings, this cuts API calls by 74% while ensuring critical documents are always analyzed.

How It Works

Two-Stage Process

1. Filename Pattern Matching (free, instant, ~80% accuracy)
   - Uses 60+ regex patterns to match common document types
   - Returns category and confidence score
   - No API calls required
2. AI Scan Fallback (optional, when confidence < 0.8)
   - Analyzes first page with Gemini 2.0 Flash
   - Provides reasoning and higher confidence
   - Costs 1 API call per ambiguous document

Decision Flow
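
Putting the two stages together, the decision for each upload can be sketched as follows. This is a minimal illustration that assumes the behavior described in this guide (a 0.8 threshold, critical and important documents auto-analyzed). `Decision`, `decide`, and the `ai_scan` callback are hypothetical names, not part of the real service.

```python
from dataclasses import dataclass
from typing import Callable, Optional

AUTO_ANALYZE = {"critical", "important"}  # categories queued on upload, per this guide
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class Decision:
    category: str
    confidence: float
    method: str
    auto_analyze: bool


def decide(
    filename_category: str,
    filename_confidence: float,
    ai_scan: Optional[Callable[[], tuple[str, float]]] = None,
) -> Decision:
    """Hypothetical helper mirroring the flow: filename first, AI scan only if needed."""
    category, confidence, method = filename_category, filename_confidence, "filename_pattern"

    # Stage 2 runs only for low-confidence filename results and only if a scanner is available
    if confidence < CONFIDENCE_THRESHOLD and ai_scan is not None:
        ai_category, ai_confidence = ai_scan()
        if ai_confidence > confidence:
            category, confidence, method = ai_category, ai_confidence, "ai_scan"

    return Decision(category, confidence, method, category in AUTO_ANALYZE)


print(decide("critical", 0.95))
print(decide("uncategorized", 0.0, ai_scan=lambda: ("critical", 0.92)))
print(decide("noise", 0.85))
```

The first call is settled by the filename alone, the second falls through to the (stubbed) AI scan, and the third is stored without analysis.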

Categories

Critical (Auto-Analyzed)

Purpose: Documents essential for safety, legal compliance, or major financial decisions.

Filename Patterns:

Regex

(?i)inspection.*report
(?i)home.*inspection
(?i)property.*inspection
(?i)structural.*report
(?i)pest.*inspection
(?i)termite.*report
(?i)seller.*disclosure
(?i)property.*disclosure
(?i)natural.*hazard.*disclosure
(?i)lead.*based.*paint
(?i)environmental.*disclosure
(?i)material.*facts.*disclosure

Example Filenames:

Home_Inspection_Report_123_Main_St.pdf
Seller_Property_Disclosure_Statement.pdf
Pest_Termite_Inspection_2024.pdf
Natural_Hazard_Disclosure_Form.pdf
Lead_Based_Paint_Disclosure.pdf

Auto-Analysis: ✅ Yes (queued immediately on upload)

API Impact: Full analysis (10-50 Gemini Vision calls per document)

Important (Auto-Analyzed)

Purpose: Documents providing significant financial or legal value.

Filename Patterns:

Regex

(?i)loan.*estimate
(?i)closing.*disclosure
(?i)mortgage.*application
(?i)pre.*approval
(?i)financing.*terms
(?i)lending.*disclosure
(?i)appraisal.*report
(?i)property.*valuation
(?i)comparative.*market.*analysis
(?i)cma.*report
(?i)title.*report
(?i)preliminary.*title
(?i)title.*commitment

Example Filenames:

Loan_Estimate_Wells_Fargo.pdf
Closing_Disclosure_Final.pdf
Appraisal_Report_456_Oak_Ave.pdf
Preliminary_Title_Report.pdf
Mortgage_Pre_Approval_Letter.pdf

Auto-Analysis: ✅ Yes (queued immediately on upload)

API Impact: Full analysis (5-20 Gemini Vision calls per document)

Optional (On-Demand)

Purpose: Documents analyzed only when the user specifically asks about them.

Filename Patterns:

Regex

(?i)purchase.*agreement
(?i)sales.*contract
(?i)listing.*agreement
(?i)addendum
(?i)amendment
(?i)buyer.*representation
(?i)agency.*disclosure
(?i)escrow.*instructions
(?i)wire.*transfer.*authorization

Example Filenames:

Purchase_Agreement_Residential.pdf
Sales_Contract_Signed.pdf
Addendum_A_Repairs.pdf
Buyer_Representation_Agreement.pdf
Escrow_Instructions.pdf

Auto-Analysis: ❌ No (analyzed when user asks)

API Impact: None unless triggered by user question

Trigger Example:

User: "What does the purchase agreement say about contingencies?"
System: Detects "purchase" + "agreement" keywords
        → Finds unanalyzed Purchase_Agreement.pdf
        → Queues for background analysis
        → Notifies user analysis is in progress

Noise (Not Analyzed)

Purpose: Administrative documents that don't require AI analysis.

Filename Patterns:

Regex

(?i)receipt
(?i)acknowledgement
(?i)acknowledgment
(?i)notice.*to.*perform
(?i)contingency.*removal
(?i)hoa.*rules
(?i)cc&r
(?i)covenants.*conditions.*restrictions
(?i)calendar
(?i)schedule
(?i)checklist

Example Filenames:

Receipt_Earnest_Money.pdf
Acknowledgement_Of_Receipt.pdf
HOA_Rules_And_Regulations.pdf
CC&Rs_Community_Association.pdf
Closing_Checklist.pdf

Auto-Analysis: ❌ No (metadata only)

API Impact: Zero (never analyzed)
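
To make the four categories concrete, the sketch below checks a filename against a few patterns from each list. The category order (critical first, noise last) and the `first_match` helper are assumptions for illustration. This guide does not state how the real categorizer resolves a filename that matches more than one category.

```python
import re

# A few patterns copied from the lists above; the real service has 60+.
PATTERNS = {
    "critical": [r"(?i)inspection.*report", r"(?i)home.*inspection", r"(?i)seller.*disclosure"],
    "important": [r"(?i)loan.*estimate", r"(?i)closing.*disclosure", r"(?i)title.*report"],
    "optional": [r"(?i)purchase.*agreement", r"(?i)addendum", r"(?i)escrow.*instructions"],
    "noise": [r"(?i)receipt", r"(?i)cc&r", r"(?i)checklist"],
}


def first_match(filename: str) -> tuple[str, list[str]]:
    """Return the first category with a matching pattern, plus the patterns that matched.

    Assumption: categories are tried in dict order (critical -> noise).
    """
    for category, patterns in PATTERNS.items():
        matched = [p for p in patterns if re.search(p, filename)]
        if matched:
            return category, matched
    return "uncategorized", []


for name in [
    "Home_Inspection_Report_123_Main_St.pdf",
    "Closing_Disclosure_Final.pdf",
    "Addendum_A_Repairs.pdf",
    "CC&Rs_Community_Association.pdf",
    "document.pdf",
]:
    print(name, "->", first_match(name))
```

Under the assumed order, a name such as `Inspection_Report_Receipt.pdf` matches both a critical and a noise pattern and is reported as critical.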

Implementation Details

Service Location

File: /backend/app/services/document_categorizer.py

Class Structure

Python

from app.services.document_categorizer import get_document_categorizer

categorizer = get_document_categorizer()

# Categorize by filename only (fast)
result = categorizer.categorize_by_filename("Home_Inspection_Report.pdf")

# Categorize with AI fallback (if low confidence)
result = await categorizer.categorize(
    filename="unclear_document.pdf",
    first_page_text=extracted_text,  # Optional
    gemini_client=gemini_client       # Optional
)

# Check if should auto-analyze
should_analyze = categorizer.should_auto_analyze(result.category)

CategorizationResult Object

Python

@dataclass
class CategorizationResult:
    category: str           # critical, important, optional, noise, uncategorized
    confidence: float       # 0.0 to 1.0
    method: str            # filename_pattern, ai_scan, manual
    matched_patterns: list # Regex patterns that matched
    reasoning: str         # Human-readable explanation
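
As a usage sketch, the result can be turned into the `categorization_metadata` dictionary that the upload endpoint stores (see Integration Examples) with `dataclasses.asdict`. The class is redefined here only so the snippet runs on its own, and the `to_metadata` helper is hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class CategorizationResult:
    category: str
    confidence: float
    method: str
    matched_patterns: list
    reasoning: str


def to_metadata(result: CategorizationResult) -> dict:
    """Hypothetical helper: every field except `category` goes into the metadata blob."""
    data = asdict(result)
    data.pop("category")  # stored in its own column in the upload example
    return data


result = CategorizationResult(
    category="critical",
    confidence=0.95,
    method="filename_pattern",
    matched_patterns=["(?i)inspection.*report"],
    reasoning="Filename matches critical patterns: (?i)inspection.*report",
)
metadata = to_metadata(result)
print(metadata)
```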

Example Results

High-Confidence Filename Match

Python

result = categorizer.categorize_by_filename("Home_Inspection_Report_2024.pdf")

# Output:
# category: "critical"
# confidence: 0.95
# method: "filename_pattern"
# matched_patterns: ["(?i)inspection.*report", "(?i)home.*inspection"]
# reasoning: "Filename matches critical patterns: (?i)inspection.*report, (?i)home.*inspection"

AI Scan Fallback

Python

result = await categorizer.categorize(
    filename="property_doc_scan_001.pdf",
    first_page_text="SELLER'S REAL PROPERTY DISCLOSURE STATEMENT...",
    gemini_client=gemini
)

# Output (after AI analysis):
# category: "critical"
# confidence: 0.92
# method: "ai_scan"
# matched_patterns: ["ai_analysis"]
# reasoning: "Document is a seller property disclosure form required by law, contains material facts about property condition"

Integration Examples

API Endpoint (properties.py)

Python

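from fastapi import BackgroundTasks, UploadFile

from app.services.document_categorizer import get_document_categorizer

@router.post("/{property_id}/upload")
async def upload_documents(
    property_id: str,
    files: list[UploadFile],
    background_tasks: BackgroundTasks,  # Injected by FastAPI; used below to queue analysis
):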
    categorizer = get_document_categorizer()

    for file in files:
        # Categorize document
        categorization = await categorizer.categorize(
            filename=file.filename,
            first_page_text=None,  # Fast upload, skip AI scan
            gemini_client=None
        )

        # Store in database
        db.insert({
            "category": categorization.category,
            "categorization_metadata": {
                "method": categorization.method,
                "confidence": categorization.confidence,
                "matched_patterns": categorization.matched_patterns,
                "reasoning": categorization.reasoning,
            }
        })

        # Queue for analysis if critical/important
        if categorizer.should_auto_analyze(categorization.category):
            background_tasks.add_task(analyze_document, file)

Chat Endpoint (chat.py)

Python

async def check_and_trigger_on_demand_analysis(property_id, query):
    """Detect when user asks about unanalyzed documents"""

    # Get unanalyzed documents
    docs = db.query(
        "SELECT id, original_filename, category "
        "FROM analyses "
        "WHERE property_id = ? AND is_analyzed = false",
        property_id
    )

    # Keyword matching
    query_lower = query.lower()
    doc_keywords = {
        "purchase": ["purchase", "agreement", "contract"],
        "loan": ["loan", "mortgage", "financing"],
        "disclosure": ["disclosure", "seller"],
        # ... more keywords
    }

    # Find relevant docs
    for doc in docs:
        filename = doc["original_filename"].lower()
        for doc_type, keywords in doc_keywords.items():
            if any(kw in query_lower for kw in keywords) and doc_type in filename:
                # Queue for analysis
                queue_analysis(doc["id"])
                return
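
The matching step can be exercised without a database. The sketch below pulls the same keyword logic into a pure function. `find_docs_to_analyze` and the sample rows are hypothetical, and unlike the endpoint above it returns every matching document instead of stopping at the first.

```python
DOC_KEYWORDS = {
    "purchase": ["purchase", "agreement", "contract"],
    "loan": ["loan", "mortgage", "financing"],
    "disclosure": ["disclosure", "seller"],
}


def find_docs_to_analyze(query: str, docs: list[dict]) -> list[str]:
    """Return ids of docs whose filename contains a document type the query mentions."""
    query_lower = query.lower()
    # Document types the query refers to through any of their keywords, e.g. {"purchase"}
    wanted = {
        doc_type
        for doc_type, keywords in DOC_KEYWORDS.items()
        if any(kw in query_lower for kw in keywords)
    }
    return [
        doc["id"]
        for doc in docs
        if any(doc_type in doc["original_filename"].lower() for doc_type in wanted)
    ]


docs = [
    {"id": "a1", "original_filename": "Purchase_Agreement.pdf"},
    {"id": "b2", "original_filename": "Mortgage_Pre_Approval_Letter.pdf"},
]
print(find_docs_to_analyze("What does the purchase agreement say about contingencies?", docs))
print(find_docs_to_analyze("What are the mortgage terms?", docs))
```

The second query returns nothing: the filename must contain the type key itself (`loan`), so a question about the mortgage does not find `Mortgage_Pre_Approval_Letter.pdf`. The same limitation applies to the endpoint logic above.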

Adding Custom Patterns

1. Define New Patterns

Edit /backend/app/services/document_categorizer.py:

Python

class DocumentCategorizer:
    CRITICAL_PATTERNS = [
        # Existing patterns...
        r"(?i)radon.*test",           # Add radon test reports
        r"(?i)asbestos.*inspection",  # Add asbestos inspections
    ]

2. Test Patterns

Python

import re

pattern = r"(?i)radon.*test"
filename = "Radon_Test_Results_2024.pdf"

if re.search(pattern, filename):
    print("Pattern matches!")
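
It also helps to test filenames that should not match. Because `.*` spans any characters and the keywords are matched as substrings, a pattern can fire on unrelated names. The sketch below shows one such case and a tighter variant. The second filename and the lookaround-based pattern are illustrative assumptions, not patterns from the service.

```python
import re

loose = r"(?i)radon.*test"
# Require that "test" is not embedded in a longer word such as "Latest".
# \b is not enough here because "_" counts as a word character.
strict = r"(?i)radon.*(?<![a-z])test(?![a-z])"

for filename in ["Radon_Test_Results_2024.pdf", "Radon_Mitigation_Latest_Quote.pdf"]:
    print(filename, bool(re.search(loose, filename)), bool(re.search(strict, filename)))
```

The loose pattern matches both names, while the strict one matches only the first. The trade-off is that the strict variant also rejects names like `Radon_Testing.pdf`, so choose based on the filenames you actually receive.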

3. Deploy

Restart the API server:

Shell

docker-compose restart api

Tuning Confidence Thresholds

Current Threshold: 0.8

Documents with confidence < 0.8 trigger AI scan (if available).

Adjust threshold in categorizer.categorize():

Python

async def categorize(self, filename, first_page_text=None, gemini_client=None):
    result = self.categorize_by_filename(filename)

    # The AI scan runs only when confidence < CONFIDENCE_THRESHOLD, so:
    # Higher threshold = more AI scans (higher accuracy, higher cost)
    # Lower threshold = fewer AI scans (lower accuracy, lower cost)
    CONFIDENCE_THRESHOLD = 0.8  # Adjust here

    if result.confidence < CONFIDENCE_THRESHOLD and first_page_text and gemini_client:
        ai_result = await self.categorize_by_ai_scan(filename, first_page_text, gemini_client)
        if ai_result.confidence > result.confidence:
            result = ai_result

    return result

Threshold Recommendations:

0.7: Conservative (fewer AI scans, may miss some categorizations)
0.8: Balanced (current default)
0.9: More aggressive AI scanning (best accuracy, 10-20% more API calls)
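
The effect of the threshold follows directly from the `confidence < CONFIDENCE_THRESHOLD` check in the code above: the higher the threshold, the more documents fall below it and are sent to the AI scan. The sketch below counts this for a made-up set of filename confidences. The sample values and `count_ai_scans` are hypothetical.

```python
def count_ai_scans(confidences: list[float], threshold: float) -> int:
    """Number of documents whose filename confidence is below the threshold."""
    return sum(1 for c in confidences if c < threshold)


sample = [0.95, 0.85, 0.75, 0.60, 0.0]  # invented filename-match confidences

for threshold in (0.7, 0.8, 0.9):
    print(threshold, count_ai_scans(sample, threshold))
```

For this sample the scan count rises from 2 to 3 to 4 as the threshold moves from 0.7 to 0.9.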

Performance Metrics

Categorization Speed

Method	Speed	API Cost	Accuracy
Filename Patterns	<1ms	Free	~80%
AI Scan (Gemini 2.0 Flash)	500-1500ms	1 call	~95%
Hybrid (filename + AI fallback)	1-1500ms	0-1 calls	~90%

Resource Savings

Scenario: 50 documents uploaded

Without categorization:

All 50 documents analyzed
50 docs × 10 images/doc × 1 Gemini call = 500 API calls

With categorization:

50 filename checks (free)
~10 AI scans for ambiguous docs = 10 calls
~7 critical docs × 10 images = 70 calls
~5 optional docs (when user asks) × 10 images = 50 calls
Total: 130 API calls (74% reduction)
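
The arithmetic behind the 74% figure, using the numbers from the scenario above (the variable names are illustrative):

```python
IMAGES_PER_DOC = 10  # one Gemini call per page image, as in the scenario

without = 50 * IMAGES_PER_DOC   # every document fully analyzed
ai_scans = 10                   # one call per ambiguous document
critical = 7 * IMAGES_PER_DOC   # auto-analyzed on upload
optional = 5 * IMAGES_PER_DOC   # analyzed when the user asks
with_categorization = ai_scans + critical + optional

reduction = 1 - with_categorization / without
print(without, with_categorization, f"{reduction:.0%}")
```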

Testing

Unit Tests

Python

# tests/test_document_categorizer.py

import pytest
from app.services.document_categorizer import get_document_categorizer

@pytest.fixture
def categorizer():
    return get_document_categorizer()

def test_critical_inspection_report(categorizer):
    result = categorizer.categorize_by_filename("Home_Inspection_Report.pdf")
    assert result.category == "critical"
    assert result.confidence >= 0.9

def test_important_loan_estimate(categorizer):
    result = categorizer.categorize_by_filename("Loan_Estimate_2024.pdf")
    assert result.category == "important"
    assert result.confidence >= 0.9

def test_optional_purchase_agreement(categorizer):
    result = categorizer.categorize_by_filename("Purchase_Agreement.pdf")
    assert result.category == "optional"
    assert result.confidence >= 0.8

def test_noise_receipt(categorizer):
    result = categorizer.categorize_by_filename("Receipt_Earnest_Money.pdf")
    assert result.category == "noise"
    assert result.confidence >= 0.8

def test_uncategorized_generic_name(categorizer):
    result = categorizer.categorize_by_filename("document.pdf")
    assert result.category == "uncategorized"
    assert result.confidence == 0.0

@pytest.mark.asyncio
async def test_ai_scan_fallback(categorizer, gemini_client):
    result = await categorizer.categorize(
        filename="scan_001.pdf",
        first_page_text="SELLER'S PROPERTY DISCLOSURE STATEMENT...",
        gemini_client=gemini_client
    )
    assert result.category == "critical"
    assert result.method == "ai_scan"
    assert result.confidence > 0.8

Integration Tests

Shell

# Test multi-file upload with categorization
curl -X POST "http://localhost:8000/v1/properties/{id}/upload" \
  -H "Authorization: Bearer hi_test_dev_key_12345" \
  -F "files=@tests/fixtures/Home_Inspection_Report.pdf" \
  -F "files=@tests/fixtures/Loan_Estimate.pdf" \
  -F "files=@tests/fixtures/Receipt.pdf" \
  | jq '.documents[] | {filename, category, confidence}'

# Expected output:
# {
#   "filename": "Home_Inspection_Report.pdf",
#   "category": "critical",
#   "confidence": 0.95
# }
# {
#   "filename": "Loan_Estimate.pdf",
#   "category": "important",
#   "confidence": 0.90
# }
# {
#   "filename": "Receipt.pdf",
#   "category": "noise",
#   "confidence": 0.85
# }

Troubleshooting

Problem: Documents Miscategorized

Check pattern matching:

Python

import re
from app.services.document_categorizer import DocumentCategorizer

cat = DocumentCategorizer()
filename = "unclear_document.pdf"

# Test each pattern set
for pattern in cat.CRITICAL_PATTERNS:
    if re.search(pattern, filename):
        print(f"Matches CRITICAL: {pattern}")

# ... repeat for IMPORTANT, OPTIONAL, NOISE

Solution: Add or refine patterns in document_categorizer.py

Problem: Too Many AI Scans

Check confidence distribution:

SQL

SELECT
  categorization_metadata->>'method' as method,
  COUNT(*) as count
FROM analyses
GROUP BY categorization_metadata->>'method';

-- Expected:
-- filename_pattern: ~80%
-- ai_scan: ~20%

Solution: Lower confidence threshold or improve filename patterns

Problem: AI Scan Errors

Check logs:

Shell

docker-compose logs api | grep "AI categorization failed"

Common causes:

Gemini API rate limit
Invalid first page text
Network timeout

Solution:

Python

# Graceful fallback to filename categorization
try:
    ai_result = await self.categorize_by_ai_scan(...)
except Exception as e:
    # Same message that the grep command above searches for
    logger.error(f"AI categorization failed: {e}")
    ai_result = None  # Use filename result as fallback
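
A self-contained sketch of this fallback, runnable with `asyncio`. The `failing_ai_scan` stub, the `categorize_with_fallback` function, and the dictionary-shaped results are stand-ins invented for the example. The real service returns `CategorizationResult` objects and calls Gemini.

```python
import asyncio
import logging

logger = logging.getLogger("categorizer_demo")


async def failing_ai_scan(filename: str, first_page_text: str) -> dict:
    """Stub that simulates a Gemini outage."""
    raise TimeoutError("network timeout")


async def categorize_with_fallback(filename, first_page_text, filename_result, ai_scan):
    """Try the AI scan; on any error, log it and keep the filename-based result."""
    try:
        ai_result = await ai_scan(filename, first_page_text)
    except Exception as e:
        logger.error("AI categorization failed: %s", e)
        return filename_result
    # Prefer the AI result only when it is more confident
    if ai_result["confidence"] > filename_result["confidence"]:
        return ai_result
    return filename_result


filename_result = {"category": "uncategorized", "confidence": 0.0, "method": "filename_pattern"}
result = asyncio.run(
    categorize_with_fallback("scan_001.pdf", "SELLER'S PROPERTY DISCLOSURE", filename_result, failing_ai_scan)
)
print(result["method"])
```

Because the stub always raises, the upload still gets a category (here `uncategorized`) instead of failing outright.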

Best Practices

1. Use Descriptive Filenames

✅ Good: Home_Inspection_Report_123_Main_St_2024.pdf

❌ Bad: scan.pdf, document1.pdf, file.pdf

2. Test New Patterns

Always test new regex patterns before deploying:

Python

import re

pattern = r"(?i)new.*pattern"
test_files = [
    "New_Pattern_Document.pdf",  # Should match
    "other_file.pdf",            # Should not match
]

for filename in test_files:
    matches = bool(re.search(pattern, filename))
    print(f"{filename}: {matches}")

3. Monitor Categorization Accuracy

SQL

-- Check categorization distribution
SELECT category, COUNT(*) as count
FROM analyses
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY category
ORDER BY count DESC;

-- Expected distribution (varies by use case):
-- critical: 10-20%
-- important: 10-20%
-- optional: 20-30%
-- noise: 20-40%
-- uncategorized: <10%

4. Log Ambiguous Cases

Python

if result.confidence < 0.7:
    logger.warning(
        f"Low confidence categorization: {filename} "
        f"-> {result.category} (confidence: {result.confidence})"
    )

Support

Add Pattern Request: Open GitHub issue with example filenames
Report Miscategorization: Email support@homeinsightai.com with filename
Custom Categories: Contact enterprise@homeinsightai.com