Database Schema Updates - Phase 2

Migration: 004_multi_file_auto_categorization.sql
Applied: November 28, 2025
Purpose: Support multi-file uploads, auto-categorization, and on-demand analysis


Overview

Phase 2 introduces database schema changes to support intelligent document management with auto-categorization and selective analysis. This migration adds new columns to track document categories, analysis status, and automated property-level counters.


Schema Changes

1. Analyses Table - New Columns

The analyses table now includes the following columns:

| Column | Type | Default | Description |
|--------|------|---------|-------------|
| category | TEXT | 'uncategorized' | Document importance category |
| is_analyzed | BOOLEAN | false | Whether full analysis (vision + embeddings) is complete |
| categorization_metadata | JSONB | '{}'::jsonb | Details about how the document was categorized |
| original_filename | TEXT | NULL | Original uploaded filename for display |
| file_size_bytes | BIGINT | NULL | File size in bytes |
| error_message | TEXT | NULL | Processing error if analysis failed |

Category Values

The category column is constrained to these values:

SQL
CHECK (category IN ('critical', 'important', 'optional', 'noise', 'uncategorized'))

Category Definitions:

  • critical: Inspection reports, property disclosures, structural reports (auto-analyzed)
  • important: Loan documents, appraisals, title reports (auto-analyzed)
  • optional: Purchase agreements, contracts, forms (analyzed on-demand)
  • noise: Receipts, acknowledgements, HOA rules (not analyzed)
  • uncategorized: Documents that couldn't be categorized

Categorization Metadata Structure

The categorization_metadata JSONB field stores details about the categorization process:

JSON
{
  "method": "filename_pattern",
  "confidence": 0.95,
  "matched_patterns": [
    "(?i)inspection.*report",
    "(?i)home.*inspection"
  ],
  "reasoning": "Filename matches critical patterns",
  "timestamp": "2025-11-28T10:30:00Z"
}

Fields:

  • method: How document was categorized (filename_pattern, ai_scan, manual)
  • confidence: Confidence score 0.0-1.0
  • matched_patterns: Regex patterns that matched (for filename_pattern method)
  • reasoning: Human-readable explanation
  • timestamp: When categorization occurred
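Because categorization_metadata is JSONB, these fields can be queried directly with PostgreSQL's JSON operators. A minimal sketch against the schema above (the 0.8 threshold is an arbitrary example, not a documented cutoff):

```sql
-- Find low-confidence filename-based categorizations for manual review
SELECT id, original_filename, category,
       categorization_metadata->>'reasoning' AS reasoning,
       (categorization_metadata->>'confidence')::numeric AS confidence
FROM analyses
WHERE categorization_metadata->>'method' = 'filename_pattern'
  AND (categorization_metadata->>'confidence')::numeric < 0.8
ORDER BY confidence ASC;
```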

Analysis Status Flow

Documents go through the following states:

1. Upload → is_analyzed: false, status: 'processing'
2. Categorization → category: 'critical', categorization_metadata: {...}
3. Background Analysis → is_analyzed: true, status: 'completed'
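Each transition corresponds to a straightforward SQL statement. A sketch of the full lifecycle (the literal IDs and filename are placeholders):

```sql
-- 1. Upload: row created, not yet analyzed
INSERT INTO analyses (id, property_id, original_filename, status, is_analyzed)
VALUES ('ana_123', 'prop_456', 'home_inspection.pdf', 'processing', false);

-- 2. Categorization: category and metadata recorded
UPDATE analyses
SET category = 'critical',
    categorization_metadata = jsonb_build_object('method', 'filename_pattern', 'confidence', 0.95)
WHERE id = 'ana_123';

-- 3. Background analysis completes
UPDATE analyses
SET is_analyzed = true, status = 'completed'
WHERE id = 'ana_123';
```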

Example Queries:

SQL
-- Find all unanalyzed critical documents for a property
SELECT id, original_filename, category
FROM analyses
WHERE property_id = '550e8400-e29b-41d4-a716-446655440000'
  AND is_analyzed = false
  AND category IN ('critical', 'important');

-- Get analysis completion rate for a property
SELECT
  COUNT(*) as total_docs,
  COUNT(*) FILTER (WHERE is_analyzed = true) as analyzed_docs,
  ROUND(100.0 * COUNT(*) FILTER (WHERE is_analyzed = true) / NULLIF(COUNT(*), 0), 2) as completion_rate
FROM analyses
WHERE property_id = '550e8400-e29b-41d4-a716-446655440000';

2. Properties Table - New Columns

The properties table now includes denormalized counters for performance:

| Column | Type | Default | Description |
|--------|------|---------|-------------|
| total_documents | INTEGER | 0 | Total number of documents for this property |
| analyzed_documents | INTEGER | 0 | Count of documents with is_analyzed = true |
| critical_documents | INTEGER | 0 | Count of critical/important documents |

These counters are automatically maintained by database triggers (see below).

Example Queries:

SQL
-- Get properties with unanalyzed critical documents
SELECT id, address, total_documents, analyzed_documents, critical_documents
FROM properties
WHERE critical_documents > analyzed_documents;

-- Get properties sorted by document count
SELECT id, address, total_documents, analyzed_documents
FROM properties
ORDER BY total_documents DESC
LIMIT 10;

Database Triggers

Auto-Update Property Counters

A trigger automatically updates property-level counters whenever documents are added, updated, or deleted.

Trigger: trigger_update_property_counts
Function: update_property_document_counts()
Fires: AFTER INSERT, UPDATE, DELETE on analyses

How It Works

SQL
-- Trigger fires on any change to analyses table
INSERT INTO analyses (...) VALUES (...);  -- Trigger updates property counts
UPDATE analyses SET is_analyzed = true WHERE id = 'ana_123';  -- Trigger updates
DELETE FROM analyses WHERE id = 'ana_123';  -- Trigger updates

The trigger function:

  1. Identifies which property was affected
  2. Recalculates all three counters from the analyses table
  3. Updates the properties table
  4. Sets updated_at to current timestamp

Performance Note: The trigger uses COUNT queries which are fast for moderate document counts (<1000 per property). For high-volume properties, consider async counter updates.
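One async-friendly alternative is to apply incremental deltas instead of recounting. The sketch below is illustrative only and is not part of migration 004; a complete version would also need to adjust analyzed_documents and critical_documents on UPDATE, and handle rows whose property_id changes:

```sql
-- Hypothetical incremental variant: adjust the total by +/-1 instead of recounting
CREATE OR REPLACE FUNCTION update_property_document_counts_incremental()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE properties
    SET total_documents = total_documents + 1,
        updated_at = NOW()
    WHERE id = NEW.property_id;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE properties
    SET total_documents = total_documents - 1,
        updated_at = NOW()
    WHERE id = OLD.property_id;
  END IF;
  RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;
```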

Example Trigger Execution

SQL
-- Before: property has no documents
SELECT total_documents, analyzed_documents FROM properties WHERE id = 'prop_123';
-- Result: 0, 0

-- Upload 3 documents
INSERT INTO analyses (property_id, category, is_analyzed) VALUES
  ('prop_123', 'critical', false),
  ('prop_123', 'important', false),
  ('prop_123', 'optional', false);

-- After: trigger automatically updated counts
SELECT total_documents, analyzed_documents, critical_documents FROM properties WHERE id = 'prop_123';
-- Result: 3, 0, 2

-- Background analysis completes for 2 documents
UPDATE analyses SET is_analyzed = true WHERE property_id = 'prop_123' AND category IN ('critical', 'important');

-- After: trigger updated analyzed count
SELECT total_documents, analyzed_documents, critical_documents FROM properties WHERE id = 'prop_123';
-- Result: 3, 2, 2

Indexes

New indexes optimize common query patterns:

Single-Column Indexes

SQL
-- Fast lookup by property
CREATE INDEX idx_analyses_property_id ON analyses(property_id);

-- Filter by category
CREATE INDEX idx_analyses_category ON analyses(category);

-- Filter by analysis status
CREATE INDEX idx_analyses_is_analyzed ON analyses(is_analyzed);

Composite Index

SQL
-- Optimize multi-condition queries
CREATE INDEX idx_analyses_property_category ON analyses(property_id, category, is_analyzed);

Query Optimization Examples:

SQL
-- Uses idx_analyses_property_category
SELECT * FROM analyses
WHERE property_id = 'prop_123'
  AND category = 'critical'
  AND is_analyzed = false;

-- Uses idx_analyses_property_id
SELECT * FROM analyses
WHERE property_id = 'prop_123'
ORDER BY created_at DESC;

-- Uses idx_analyses_category
SELECT category, COUNT(*) FROM analyses
GROUP BY category;

Migration SQL

Full migration file: /backend/migrations/004_multi_file_auto_categorization.sql

Key Sections

1. Add Columns

SQL
ALTER TABLE analyses
  ADD COLUMN IF NOT EXISTS category TEXT DEFAULT 'uncategorized'
    CHECK (category IN ('critical', 'important', 'optional', 'noise', 'uncategorized')),
  ADD COLUMN IF NOT EXISTS is_analyzed BOOLEAN DEFAULT false,
  ADD COLUMN IF NOT EXISTS categorization_metadata JSONB DEFAULT '{}'::jsonb,
  ADD COLUMN IF NOT EXISTS original_filename TEXT,
  ADD COLUMN IF NOT EXISTS file_size_bytes BIGINT,
  ADD COLUMN IF NOT EXISTS error_message TEXT;

2. Create Indexes

SQL
CREATE INDEX IF NOT EXISTS idx_analyses_property_id ON analyses(property_id);
CREATE INDEX IF NOT EXISTS idx_analyses_category ON analyses(category);
CREATE INDEX IF NOT EXISTS idx_analyses_is_analyzed ON analyses(is_analyzed);
CREATE INDEX IF NOT EXISTS idx_analyses_property_category ON analyses(property_id, category, is_analyzed);

3. Add Property Counters

SQL
ALTER TABLE properties
  ADD COLUMN IF NOT EXISTS total_documents INTEGER DEFAULT 0,
  ADD COLUMN IF NOT EXISTS analyzed_documents INTEGER DEFAULT 0,
  ADD COLUMN IF NOT EXISTS critical_documents INTEGER DEFAULT 0;

4. Create Trigger Function

SQL
CREATE OR REPLACE FUNCTION update_property_document_counts()
RETURNS TRIGGER AS $$
DECLARE
  target_property_id UUID;
BEGIN
  -- Determine which property to update
  IF TG_OP = 'DELETE' THEN
    target_property_id := OLD.property_id;
  ELSE
    target_property_id := NEW.property_id;
  END IF;

  -- Update all counters atomically
  UPDATE properties
  SET
    total_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = target_property_id),
    analyzed_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = target_property_id AND is_analyzed = true),
    critical_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = target_property_id AND category IN ('critical', 'important')),
    updated_at = NOW()
  WHERE id = target_property_id;

  RETURN COALESCE(NEW, OLD);  -- return value is ignored for AFTER triggers; NEW is NULL on DELETE
END;
$$ LANGUAGE plpgsql;

5. Attach Trigger

SQL
CREATE TRIGGER trigger_update_property_counts
  AFTER INSERT OR UPDATE OR DELETE ON analyses
  FOR EACH ROW
  EXECUTE FUNCTION update_property_document_counts();

6. Backfill Existing Data

SQL
-- Backfill existing documents (ADD COLUMN ... DEFAULT fills old rows with the
-- defaults, so match on the default value rather than NULL)
UPDATE analyses
SET
  category = 'critical',
  is_analyzed = (status = 'completed'),
  categorization_metadata = jsonb_build_object('method', 'migration_backfill', 'timestamp', NOW())
WHERE category = 'uncategorized';

-- Backfill property counters
UPDATE properties p
SET
  total_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = p.id),
  analyzed_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = p.id AND is_analyzed = true),
  critical_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = p.id AND category IN ('critical', 'important'))
WHERE id IN (SELECT DISTINCT property_id FROM analyses WHERE property_id IS NOT NULL);

Rollback Strategy

To rollback this migration:

SQL
-- 1. Drop trigger
DROP TRIGGER IF EXISTS trigger_update_property_counts ON analyses;
DROP FUNCTION IF EXISTS update_property_document_counts();

-- 2. Drop indexes
DROP INDEX IF EXISTS idx_analyses_property_id;
DROP INDEX IF EXISTS idx_analyses_category;
DROP INDEX IF EXISTS idx_analyses_is_analyzed;
DROP INDEX IF EXISTS idx_analyses_property_category;

-- 3. Remove columns from properties
ALTER TABLE properties
  DROP COLUMN IF EXISTS total_documents,
  DROP COLUMN IF EXISTS analyzed_documents,
  DROP COLUMN IF EXISTS critical_documents;

-- 4. Remove columns from analyses
ALTER TABLE analyses
  DROP COLUMN IF EXISTS category,
  DROP COLUMN IF EXISTS is_analyzed,
  DROP COLUMN IF EXISTS categorization_metadata,
  DROP COLUMN IF EXISTS original_filename,
  DROP COLUMN IF EXISTS file_size_bytes,
  DROP COLUMN IF EXISTS error_message;

Warning: Rollback will delete all categorization data. Ensure you have a database backup before rolling back.


Performance Considerations

Index Maintenance

  • Indexes are automatically maintained by PostgreSQL
  • No manual REINDEX required for normal operations
  • Consider VACUUM ANALYZE after bulk inserts
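For example, after a bulk backfill of both tables:

```sql
-- Reclaim dead tuples and refresh planner statistics after a bulk load
VACUUM ANALYZE analyses;
VACUUM ANALYZE properties;
```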

Trigger Performance

  • Trigger fires on EVERY row change
  • For bulk operations, consider disabling trigger temporarily:
SQL
-- Disable trigger for bulk insert
ALTER TABLE analyses DISABLE TRIGGER trigger_update_property_counts;

-- Bulk insert
INSERT INTO analyses (...) SELECT ... FROM ...;

-- Re-enable trigger
ALTER TABLE analyses ENABLE TRIGGER trigger_update_property_counts;

-- Manually update counts once
UPDATE properties p
SET total_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = p.id),
    analyzed_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = p.id AND is_analyzed = true),
    critical_documents = (SELECT COUNT(*) FROM analyses WHERE property_id = p.id AND category IN ('critical', 'important'));

Query Optimization

Use the composite index for common queries:

SQL
-- Optimized: uses composite index
EXPLAIN ANALYZE
SELECT * FROM analyses
WHERE property_id = 'prop_123'
  AND category IN ('critical', 'important')
  AND is_analyzed = false;

-- Expected: Index Scan using idx_analyses_property_category

Monitoring

Check Migration Status

SQL
-- Verify columns exist
SELECT column_name, data_type, column_default
FROM information_schema.columns
WHERE table_name = 'analyses'
  AND column_name IN ('category', 'is_analyzed', 'categorization_metadata', 'original_filename', 'file_size_bytes', 'error_message');

-- Verify indexes exist
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'analyses'
  AND indexname LIKE 'idx_analyses_%';

-- Verify trigger exists
SELECT trigger_name, event_manipulation, event_object_table
FROM information_schema.triggers
WHERE trigger_name = 'trigger_update_property_counts';

Data Quality Checks

SQL
-- Check for uncategorized documents
SELECT COUNT(*), category
FROM analyses
GROUP BY category;

-- Check for documents missing filenames
SELECT COUNT(*) FROM analyses WHERE original_filename IS NULL;

-- Check for properties with incorrect counters
SELECT p.id, p.address,
  p.total_documents as stored_count,
  COUNT(a.id) as actual_count
FROM properties p
LEFT JOIN analyses a ON a.property_id = p.id
GROUP BY p.id
HAVING p.total_documents != COUNT(a.id);

Best Practices

1. Always Set Category on Insert

SQL
-- Good: category specified
INSERT INTO analyses (id, property_id, category, is_analyzed)
VALUES ('ana_123', 'prop_456', 'critical', false);

-- Acceptable: uses default
INSERT INTO analyses (id, property_id)
VALUES ('ana_123', 'prop_456');  -- category defaults to 'uncategorized'

2. Update is_analyzed After Processing

SQL
-- Mark document as analyzed after background processing completes
UPDATE analyses
SET
  is_analyzed = true,
  status = 'completed'
WHERE id = 'ana_123';

-- Trigger automatically updates property counters

3. Store Categorization Details

SQL
-- Store rich categorization metadata
UPDATE analyses
SET
  category = 'critical',
  categorization_metadata = jsonb_build_object(
    'method', 'filename_pattern',
    'confidence', 0.95,
    'matched_patterns', ARRAY['(?i)inspection.*report'],
    'reasoning', 'Filename matches critical patterns',
    'timestamp', NOW()
  )
WHERE id = 'ana_123';

4. Handle Errors Gracefully

SQL
-- Record processing errors
UPDATE analyses
SET
  status = 'failed',
  error_message = 'Gemini Vision API timeout after 30s',
  is_analyzed = false
WHERE id = 'ana_123';

Support

For database-related questions:

  • Migration Issues: Check /backend/migrations/ directory
  • Query Performance: Use EXPLAIN ANALYZE
  • Data Integrity: Run data quality checks above
  • Supabase Console: https://app.supabase.com/project/omwmzlmmdrkfhylvgkcy
Home Insight AI - Developer Portal