Auto-Categorization Guide
Version: 1.0
Last Updated: November 28, 2025
Overview
The auto-categorization system classifies uploaded documents by filename, with an optional AI scan for ambiguous cases, to decide which ones need full AI analysis. In the 50-document scenario under Resource Savings, this cuts API calls by 74% while critical and important documents are still analyzed automatically.
How It Works
Two-Stage Process
- Filename Pattern Matching (free, instant, ~80% accuracy)
  - Uses 60+ regex patterns to match common document types
  - Returns category and confidence score
  - No API calls required
- AI Scan Fallback (optional, when confidence < 0.8)
  - Analyzes first page with Gemini 2.0 Flash
  - Provides reasoning and higher confidence
  - Costs 1 API call per ambiguous document
Decision Flow
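The two-stage process above can be sketched as a single routing function. This is a minimal illustration, not the service's code: the abbreviated pattern table, the fixed 0.9 confidence for a pattern hit, and the names `match_filename` and `decide` are assumptions, and the AI scan is represented by a plain `ai_category` argument instead of a Gemini call.

```python
import re
from typing import Optional

# Hypothetical, abbreviated pattern table; the real service uses 60+ patterns.
PATTERNS = {
    "critical": [r"(?i)inspection.*report", r"(?i)seller.*disclosure"],
    "important": [r"(?i)loan.*estimate", r"(?i)appraisal.*report"],
    "optional": [r"(?i)purchase.*agreement", r"(?i)addendum"],
    "noise": [r"(?i)receipt", r"(?i)checklist"],
}
AUTO_ANALYZE = {"critical", "important"}
CONFIDENCE_THRESHOLD = 0.8


def match_filename(filename: str) -> tuple[str, float]:
    """Stage 1: return (category, confidence) from filename patterns alone."""
    for category, patterns in PATTERNS.items():
        if any(re.search(p, filename) for p in patterns):
            return category, 0.9  # assumed confidence for any pattern hit
    return "uncategorized", 0.0


def decide(filename: str, ai_category: Optional[str] = None) -> dict:
    """Stage 1, then optional stage 2, then the auto-analysis decision.

    ai_category stands in for the AI scan result; None means no scan is available.
    """
    category, confidence = match_filename(filename)
    method = "filename_pattern"
    if confidence < CONFIDENCE_THRESHOLD and ai_category is not None:
        category, method = ai_category, "ai_scan"
    return {
        "category": category,
        "method": method,
        "auto_analyze": category in AUTO_ANALYZE,
    }


print(decide("Home_Inspection_Report.pdf"))
print(decide("scan_001.pdf", ai_category="critical"))
```

A confident filename match never reaches the second stage, which is where the API savings come from.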
Categories
Critical (Auto-Analyzed)
Purpose: Documents essential for safety, legal compliance, or major financial decisions.
Filename Patterns:
(?i)inspection.*report
(?i)home.*inspection
(?i)property.*inspection
(?i)structural.*report
(?i)pest.*inspection
(?i)termite.*report
(?i)seller.*disclosure
(?i)property.*disclosure
(?i)natural.*hazard.*disclosure
(?i)lead.*based.*paint
(?i)environmental.*disclosure
(?i)material.*facts.*disclosure
Example Filenames:
Home_Inspection_Report_123_Main_St.pdf
Seller_Property_Disclosure_Statement.pdf
Pest_Termite_Inspection_2024.pdf
Natural_Hazard_Disclosure_Form.pdf
Lead_Based_Paint_Disclosure.pdf
Auto-Analysis: ✅ Yes (queued immediately on upload)
API Impact: Full analysis (10-50 Gemini Vision calls per document)
Important (Auto-Analyzed)
Purpose: Documents providing significant financial or legal value.
Filename Patterns:
(?i)loan.*estimate
(?i)closing.*disclosure
(?i)mortgage.*application
(?i)pre.*approval
(?i)financing.*terms
(?i)lending.*disclosure
(?i)appraisal.*report
(?i)property.*valuation
(?i)comparative.*market.*analysis
(?i)cma.*report
(?i)title.*report
(?i)preliminary.*title
(?i)title.*commitment
Example Filenames:
Loan_Estimate_Wells_Fargo.pdf
Closing_Disclosure_Final.pdf
Appraisal_Report_456_Oak_Ave.pdf
Preliminary_Title_Report.pdf
Mortgage_Pre_Approval_Letter.pdf
Auto-Analysis: ✅ Yes (queued immediately on upload)
API Impact: Full analysis (5-20 Gemini Vision calls per document)
Optional (On-Demand)
Purpose: Documents analyzed only when user specifically asks about them.
Filename Patterns:
(?i)purchase.*agreement
(?i)sales.*contract
(?i)listing.*agreement
(?i)addendum
(?i)amendment
(?i)buyer.*representation
(?i)agency.*disclosure
(?i)escrow.*instructions
(?i)wire.*transfer.*authorization
Example Filenames:
Purchase_Agreement_Residential.pdf
Sales_Contract_Signed.pdf
Addendum_A_Repairs.pdf
Buyer_Representation_Agreement.pdf
Escrow_Instructions.pdf
Auto-Analysis: ❌ No (analyzed when user asks)
API Impact: None unless triggered by user question
Trigger Example:
User: "What does the purchase agreement say about contingencies?"
System: Detects "purchase" + "agreement" keywords
→ Finds unanalyzed Purchase_Agreement.pdf
→ Queues for background analysis
→ Notifies user analysis is in progress
Noise (Not Analyzed)
Purpose: Administrative documents that don't require AI analysis.
Filename Patterns:
(?i)receipt
(?i)acknowledgement
(?i)acknowledgment
(?i)notice.*to.*perform
(?i)contingency.*removal
(?i)hoa.*rules
(?i)cc&r
(?i)covenants.*conditions.*restrictions
(?i)calendar
(?i)schedule
(?i)checklist
Example Filenames:
Receipt_Earnest_Money.pdf
Acknowledgement_Of_Receipt.pdf
HOA_Rules_And_Regulations.pdf
CC&Rs_Community_Association.pdf
Closing_Checklist.pdf
Auto-Analysis: ❌ No (metadata only)
API Impact: Zero (never analyzed)
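Because a filename can match patterns from more than one category, it can help to check which categories claim a given name. The sketch below uses a subset of the patterns and example filenames listed above; the helper name `categories_matching` is hypothetical, and how the real service resolves a filename that matches several categories is not documented here.

```python
import re

# Subset of the patterns and example filenames from the category lists above.
CATEGORY_PATTERNS = {
    "critical": [r"(?i)home.*inspection", r"(?i)seller.*disclosure", r"(?i)lead.*based.*paint"],
    "important": [r"(?i)loan.*estimate", r"(?i)closing.*disclosure", r"(?i)preliminary.*title"],
    "optional": [r"(?i)purchase.*agreement", r"(?i)addendum", r"(?i)escrow.*instructions"],
    "noise": [r"(?i)receipt", r"(?i)hoa.*rules", r"(?i)cc&r", r"(?i)checklist"],
}
EXAMPLES = {
    "critical": ["Home_Inspection_Report_123_Main_St.pdf", "Lead_Based_Paint_Disclosure.pdf"],
    "important": ["Loan_Estimate_Wells_Fargo.pdf", "Preliminary_Title_Report.pdf"],
    "optional": ["Addendum_A_Repairs.pdf", "Escrow_Instructions.pdf"],
    "noise": ["HOA_Rules_And_Regulations.pdf", "CC&Rs_Community_Association.pdf"],
}


def categories_matching(filename: str) -> list[str]:
    """Return every category with at least one pattern that matches the filename."""
    return [
        category
        for category, patterns in CATEGORY_PATTERNS.items()
        if any(re.search(p, filename) for p in patterns)
    ]


for expected, filenames in EXAMPLES.items():
    for name in filenames:
        print(f"{name}: expected={expected}, matched={categories_matching(name)}")
```

A name such as `Addendum_Receipt.pdf` matches both the optional and noise lists, so filenames like it are worth testing whenever patterns change.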
Implementation Details
Service Location
File: /backend/app/services/document_categorizer.py
Class Structure
from app.services.document_categorizer import get_document_categorizer
categorizer = get_document_categorizer()
# Categorize by filename only (fast)
result = categorizer.categorize_by_filename("Home_Inspection_Report.pdf")
# Categorize with AI fallback (if low confidence)
result = await categorizer.categorize(
    filename="unclear_document.pdf",
    first_page_text=extracted_text,  # Optional
    gemini_client=gemini_client      # Optional
)
# Check if should auto-analyze
should_analyze = categorizer.should_auto_analyze(result.category)
CategorizationResult Object
from dataclasses import dataclass

@dataclass
class CategorizationResult:
    category: str           # critical, important, optional, noise, uncategorized
    confidence: float       # 0.0 to 1.0
    method: str             # filename_pattern, ai_scan, manual
    matched_patterns: list  # Regex patterns that matched
    reasoning: str          # Human-readable explanation
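As an illustration of how such a result maps onto storage, the sketch below converts it into the `categorization_metadata` shape used by the upload endpoint (method, confidence, matched patterns, reasoning), with the category kept as its own field. The dataclass is redefined so the snippet is self-contained, and the helper name `to_metadata` is hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class CategorizationResult:
    category: str
    confidence: float
    method: str
    matched_patterns: list
    reasoning: str


def to_metadata(result: CategorizationResult) -> dict:
    """Return every field except the category, which is stored in its own column."""
    data = asdict(result)
    data.pop("category")
    return data


example = CategorizationResult(
    category="critical",
    confidence=0.95,
    method="filename_pattern",
    matched_patterns=["(?i)inspection.*report"],
    reasoning="Filename matches critical patterns: (?i)inspection.*report",
)
print(to_metadata(example))
```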
Example Results
High-Confidence Filename Match
result = categorizer.categorize_by_filename("Home_Inspection_Report_2024.pdf")
# Output:
# category: "critical"
# confidence: 0.95
# method: "filename_pattern"
# matched_patterns: ["(?i)inspection.*report", "(?i)home.*inspection"]
# reasoning: "Filename matches critical patterns: (?i)inspection.*report, (?i)home.*inspection"
AI Scan Fallback
result = await categorizer.categorize(
    filename="property_doc_scan_001.pdf",
    first_page_text="SELLER'S REAL PROPERTY DISCLOSURE STATEMENT...",
    gemini_client=gemini
)
# Output (after AI analysis):
# category: "critical"
# confidence: 0.92
# method: "ai_scan"
# matched_patterns: ["ai_analysis"]
# reasoning: "Document is a seller property disclosure form required by law, contains material facts about property condition"
Integration Examples
API Endpoint (properties.py)
from fastapi import BackgroundTasks, UploadFile

from app.services.document_categorizer import get_document_categorizer

# Assumes FastAPI; router, db, and analyze_document are defined elsewhere.
@router.post("/{property_id}/upload")
async def upload_documents(
    property_id: str,
    files: list[UploadFile],
    background_tasks: BackgroundTasks,  # needed for add_task() below
):
    categorizer = get_document_categorizer()
    for file in files:
        # Categorize document
        categorization = await categorizer.categorize(
            filename=file.filename,
            first_page_text=None,  # Fast upload, skip AI scan
            gemini_client=None
        )
        # Store in database
        db.insert({
            "category": categorization.category,
            "categorization_metadata": {
                "method": categorization.method,
                "confidence": categorization.confidence,
                "matched_patterns": categorization.matched_patterns,
                "reasoning": categorization.reasoning,
            }
        })
        # Queue for analysis if critical/important
        if categorizer.should_auto_analyze(categorization.category):
            background_tasks.add_task(analyze_document, file)
Chat Endpoint (chat.py)
async def check_and_trigger_on_demand_analysis(property_id, query):
    """Detect when user asks about unanalyzed documents"""
    # Get unanalyzed documents
    docs = db.query(
        "SELECT id, original_filename, category "
        "FROM analyses "
        "WHERE property_id = ? AND is_analyzed = false",
        property_id
    )
    # Keyword matching
    query_lower = query.lower()
    doc_keywords = {
        "purchase": ["purchase", "agreement", "contract"],
        "loan": ["loan", "mortgage", "financing"],
        "disclosure": ["disclosure", "seller"],
        # ... more keywords
    }
    # Find relevant docs
    for doc in docs:
        filename = doc["original_filename"].lower()
        for doc_type, keywords in doc_keywords.items():
            if any(kw in query_lower for kw in keywords) and doc_type in filename:
                # Queue for analysis
                queue_analysis(doc["id"])
                return
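The keyword-matching step can be tried in isolation, without a database or queue. The sketch below is a simplified variant rather than the endpoint's code: the function name `find_docs_to_analyze` is hypothetical, it takes filenames directly, and it returns every matching document instead of stopping at the first one.

```python
DOC_KEYWORDS = {
    "purchase": ["purchase", "agreement", "contract"],
    "loan": ["loan", "mortgage", "financing"],
    "disclosure": ["disclosure", "seller"],
}


def find_docs_to_analyze(query: str, unanalyzed_filenames: list[str]) -> list[str]:
    """Return filenames whose document type is mentioned in the user's query."""
    query_lower = query.lower()
    # Document types for which at least one keyword appears in the query.
    wanted_types = [
        doc_type
        for doc_type, keywords in DOC_KEYWORDS.items()
        if any(kw in query_lower for kw in keywords)
    ]
    # A document is relevant when its filename contains a wanted type name.
    return [
        name
        for name in unanalyzed_filenames
        if any(doc_type in name.lower() for doc_type in wanted_types)
    ]


docs = ["Purchase_Agreement.pdf", "Loan_Estimate.pdf", "Sales_Contract_Signed.pdf"]
print(find_docs_to_analyze("What does the purchase agreement say about contingencies?", docs))
print(find_docs_to_analyze("Tell me about the contract", docs))
```

The second query shows a limitation of matching the type name against the filename: "contract" selects the purchase type, but `Sales_Contract_Signed.pdf` is not found because its name does not contain "purchase".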
Adding Custom Patterns
1. Define New Patterns
Edit /backend/app/services/document_categorizer.py:
class DocumentCategorizer:
    CRITICAL_PATTERNS = [
        # Existing patterns...
        r"(?i)radon.*test",           # Add radon test reports
        r"(?i)asbestos.*inspection",  # Add asbestos inspections
    ]
2. Test Patterns
import re

pattern = r"(?i)radon.*test"
filename = "Radon_Test_Results_2024.pdf"
if re.search(pattern, filename):
    print("Pattern matches!")
3. Deploy
Restart the API server:
docker-compose restart api
Tuning Confidence Thresholds
Current Threshold: 0.8
Documents with confidence < 0.8 trigger AI scan (if available).
Adjust threshold in categorizer.categorize():
async def categorize(self, filename, first_page_text=None, gemini_client=None):
    result = self.categorize_by_filename(filename)
    # The AI scan runs only when filename confidence is below the threshold, so:
    # Higher threshold = more AI scans (higher accuracy, higher cost)
    # Lower threshold = fewer AI scans (lower accuracy, lower cost)
    CONFIDENCE_THRESHOLD = 0.8  # Adjust here
    if result.confidence < CONFIDENCE_THRESHOLD and first_page_text and gemini_client:
        ai_result = await self.categorize_by_ai_scan(filename, first_page_text, gemini_client)
        if ai_result.confidence > result.confidence:
            result = ai_result
    return result
Threshold Recommendations:
- 0.7: Conservative (fewer AI scans, lower cost, but more weak filename matches are accepted unchecked)
- 0.8: Balanced (current default)
- 0.9: More aggressive AI scanning (best accuracy, 10-20% more API calls)
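Since a scan is triggered only when `result.confidence < CONFIDENCE_THRESHOLD`, raising the threshold can only add scans and lowering it can only remove them. The sketch below counts scans for a made-up set of filename confidences; the sample values and the helper name `count_ai_scans` are assumptions for illustration.

```python
def count_ai_scans(confidences: list[float], threshold: float) -> int:
    """Count documents whose filename confidence falls below the threshold."""
    return sum(1 for confidence in confidences if confidence < threshold)


# Hypothetical filename-stage confidences for five uploads.
sample = [0.0, 0.6, 0.75, 0.85, 0.95]

for threshold in (0.7, 0.8, 0.9):
    print(f"threshold={threshold}: {count_ai_scans(sample, threshold)} AI scans")
```

Running the same count over real stored confidences before changing the threshold gives a rough estimate of the added or saved API calls.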
Performance Metrics
Categorization Speed
| Method | Speed | API Cost | Accuracy |
|---|---|---|---|
| Filename Patterns | <1ms | Free | ~80% |
| AI Scan (Gemini 2.0 Flash) | 500-1500ms | 1 call | ~95% |
| Hybrid (filename + AI fallback) | 1-1500ms | 0-1 calls | ~90% |
Resource Savings
Scenario: 50 documents uploaded
Without categorization:
- All 50 documents analyzed
- 50 docs × 10 images/doc × 1 Gemini call = 500 API calls
With categorization:
- 50 filename checks (free)
- ~10 AI scans for ambiguous docs = 10 calls
- ~7 critical docs × 10 images = 70 calls
- ~5 optional docs (when user asks) × 10 images = 50 calls
- Total: 130 API calls (74% reduction)
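The arithmetic behind the 74% figure can be reproduced directly. The document counts are the approximate ones from the scenario above, and the variable names are illustrative.

```python
IMAGES_PER_DOC = 10  # one Gemini call per image, as in the scenario

# Without categorization: every document is fully analyzed.
calls_without = 50 * IMAGES_PER_DOC

# With categorization: AI scans + critical docs + optional docs on demand.
ai_scans = 10
critical_calls = 7 * IMAGES_PER_DOC
optional_calls = 5 * IMAGES_PER_DOC
calls_with = ai_scans + critical_calls + optional_calls

reduction = 1 - calls_with / calls_without
print(f"{calls_without} -> {calls_with} calls ({reduction:.0%} reduction)")
```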
Testing
Unit Tests
# tests/test_document_categorizer.py
import pytest
from app.services.document_categorizer import get_document_categorizer

@pytest.fixture
def categorizer():
    return get_document_categorizer()

def test_critical_inspection_report(categorizer):
    result = categorizer.categorize_by_filename("Home_Inspection_Report.pdf")
    assert result.category == "critical"
    assert result.confidence >= 0.9

def test_important_loan_estimate(categorizer):
    result = categorizer.categorize_by_filename("Loan_Estimate_2024.pdf")
    assert result.category == "important"
    assert result.confidence >= 0.9

def test_optional_purchase_agreement(categorizer):
    result = categorizer.categorize_by_filename("Purchase_Agreement.pdf")
    assert result.category == "optional"
    assert result.confidence >= 0.8

def test_noise_receipt(categorizer):
    result = categorizer.categorize_by_filename("Receipt_Earnest_Money.pdf")
    assert result.category == "noise"
    assert result.confidence >= 0.8

def test_uncategorized_generic_name(categorizer):
    result = categorizer.categorize_by_filename("document.pdf")
    assert result.category == "uncategorized"
    assert result.confidence == 0.0

# gemini_client is assumed to be a fixture defined elsewhere (e.g. conftest.py)
@pytest.mark.asyncio
async def test_ai_scan_fallback(categorizer, gemini_client):
    result = await categorizer.categorize(
        filename="scan_001.pdf",
        first_page_text="SELLER'S PROPERTY DISCLOSURE STATEMENT...",
        gemini_client=gemini_client
    )
    assert result.category == "critical"
    assert result.method == "ai_scan"
    assert result.confidence > 0.8
Integration Tests
# Test multi-file upload with categorization
curl -X POST "http://localhost:8000/v1/properties/{id}/upload" \
-H "Authorization: Bearer hi_test_dev_key_12345" \
-F "files=@tests/fixtures/Home_Inspection_Report.pdf" \
-F "files=@tests/fixtures/Loan_Estimate.pdf" \
-F "files=@tests/fixtures/Receipt.pdf" \
| jq '.documents[] | {filename, category, confidence}'
# Expected output:
# {
# "filename": "Home_Inspection_Report.pdf",
# "category": "critical",
# "confidence": 0.95
# }
# {
# "filename": "Loan_Estimate.pdf",
# "category": "important",
# "confidence": 0.90
# }
# {
# "filename": "Receipt.pdf",
# "category": "noise",
# "confidence": 0.85
# }
Troubleshooting
Problem: Documents Miscategorized
Check pattern matching:
import re
from app.services.document_categorizer import DocumentCategorizer

cat = DocumentCategorizer()
filename = "unclear_document.pdf"
# Test each pattern set
for pattern in cat.CRITICAL_PATTERNS:
    if re.search(pattern, filename):
        print(f"Matches CRITICAL: {pattern}")
# ... repeat for IMPORTANT, OPTIONAL, NOISE
Solution: Add or refine patterns in document_categorizer.py
Problem: Too Many AI Scans
Check confidence distribution:
SELECT
    categorization_metadata->>'method' AS method,
    COUNT(*) AS count
FROM analyses
GROUP BY categorization_metadata->>'method';
-- Expected:
-- filename_pattern: ~80%
-- ai_scan: ~20%
Solution: Lower the confidence threshold (fewer filename results fall below it, so fewer scans run) or improve filename patterns
Problem: AI Scan Errors
Check logs:
docker-compose logs api | grep "AI categorization failed"
Common causes:
- Gemini API rate limit
- Invalid first page text
- Network timeout
Solution:
# Graceful fallback to filename categorization
try:
    ai_result = await self.categorize_by_ai_scan(...)
except Exception as e:
    # Same message that the log check above greps for
    logger.error(f"AI categorization failed: {e}")
    # Use filename result as fallback
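A slightly fuller sketch of the same fallback, covering the timeout case as well, is shown below. It is an illustration under assumptions, not the service's code: `scan_with_fallback`, the 5-second default timeout, and the two stand-in scan coroutines are hypothetical, and the AI scan is passed in as a zero-argument coroutine function.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def scan_with_fallback(ai_scan, filename_result, timeout_s: float = 5.0):
    """Await ai_scan(); on timeout or any error, return filename_result instead."""
    try:
        return await asyncio.wait_for(ai_scan(), timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.warning("AI categorization timed out after %.1fs", timeout_s)
    except Exception as exc:
        logger.error("AI categorization failed: %s", exc)
    return filename_result


async def failing_scan():
    """Stand-in for a scan that hits a rate limit."""
    raise RuntimeError("rate limit exceeded")


async def slow_scan():
    """Stand-in for a scan that takes 0.2 seconds to answer."""
    await asyncio.sleep(0.2)
    return "ai_result"


print(asyncio.run(scan_with_fallback(failing_scan, "filename_result")))
print(asyncio.run(scan_with_fallback(slow_scan, "filename_result", timeout_s=0.01)))
```

Both calls print the filename result, so an upload is never blocked by a failed or slow scan.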
Best Practices
1. Use Descriptive Filenames
✅ Good: Home_Inspection_Report_123_Main_St_2024.pdf
❌ Bad: scan.pdf, document1.pdf, file.pdf
2. Test New Patterns
Always test new regex patterns before deploying:
import re

pattern = r"(?i)new.*pattern"
test_files = [
    "New_Pattern_Document.pdf",  # Should match
    "other_file.pdf",            # Should not match
]
for filename in test_files:
    matches = bool(re.search(pattern, filename))
    print(f"{filename}: {matches}")
3. Monitor Categorization Accuracy
-- Check categorization distribution
SELECT category, COUNT(*) as count
FROM analyses
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY category
ORDER BY count DESC;
-- Expected distribution (varies by use case):
-- critical: 10-20%
-- important: 10-20%
-- optional: 20-30%
-- noise: 20-40%
-- uncategorized: <10%
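Once the category values are pulled out of the database, the same check can be automated. The sketch below computes each category's share and flags the case where uncategorized documents reach the 10% mark from the expected distribution above; the function names and the sample data are assumptions.

```python
from collections import Counter


def category_shares(categories: list[str]) -> dict[str, float]:
    """Return each category's fraction of the total, or {} for an empty list."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}


def uncategorized_too_high(categories: list[str], limit: float = 0.10) -> bool:
    """True when the uncategorized share is at or above the limit."""
    return category_shares(categories).get("uncategorized", 0.0) >= limit


# Hypothetical week of uploads: 20 documents, 3 of them uncategorized.
week = (
    ["critical"] * 4 + ["important"] * 3 + ["optional"] * 5
    + ["noise"] * 5 + ["uncategorized"] * 3
)
print(category_shares(week))
print("Review patterns:", uncategorized_too_high(week))
```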
4. Log Ambiguous Cases
if result.confidence < 0.7:
    logger.warning(
        f"Low confidence categorization: {filename} "
        f"-> {result.category} (confidence: {result.confidence})"
    )
Support
- Add Pattern Request: Open GitHub issue with example filenames
- Report Miscategorization: Email support@homeinsightai.com with filename
- Custom Categories: Contact enterprise@homeinsightai.com