[Feature] LinkedIn Cache System #73

@EddyDavies

Description

Please see the specification below.

Specification

Overview

Create a caching layer for LinkedIn profile data to improve performance, reduce BrightData API costs, and provide better data management for repeated LinkedIn profile requests.

Database Schema

New Table: linkedin_cache

CREATE TABLE linkedin_cache (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  
  -- URL Management
  linkedin_url text UNIQUE NOT NULL,
  normalized_url text UNIQUE NOT NULL, -- For consistent lookups
  
  -- BrightData Integration
  brightdata_snapshot_id text,
  
  -- Extracted Key Fields (for quick queries without parsing JSON)
  full_name text,
  headline text,
  location text,
  connections integer,
  current_company text,
  current_title text,
  
  -- Data Counts
  experience_count integer DEFAULT 0,
  skills_count integer DEFAULT 0,
  education_count integer DEFAULT 0,
  
  -- Data Availability Flags
  profile_accessible boolean DEFAULT true,
  has_experience boolean DEFAULT false,
  has_skills boolean DEFAULT false,
  has_education boolean DEFAULT false,
  has_activity boolean DEFAULT false,
  
  -- Structured Data Extracts (for easy access without parsing full JSON)
  experience_data jsonb,    -- Array of experience objects with company, title, duration, description
  education_data jsonb,     -- Array of education objects with school, degree, years, location
  skills_data jsonb,        -- Array of skill strings or skill objects
  
  -- Raw Data Storage
  raw_data jsonb,           -- Original BrightData response
  processed_data jsonb,     -- Our standardized LinkedInData format
  
  -- Quality & Metadata
  data_quality_score integer, -- 1-100 based on completeness
  scraped_at timestamptz DEFAULT now(),
  last_accessed_at timestamptz DEFAULT now(),
  access_count integer DEFAULT 1,
  
  -- Error Tracking
  last_error text,
  error_count integer DEFAULT 0,
  
  -- Timestamps
  created_at timestamptz DEFAULT now(),
  updated_at timestamptz DEFAULT now()
);
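For application code reading this table, a minimal TypeScript shape for a cache row can mirror the key columns above. This is an illustrative sketch (the type and helper names are assumptions, and only a subset of columns is shown), not a definitive client model:

```typescript
// Minimal shape for a linkedin_cache row; only a subset of the
// columns above is modeled here.
interface LinkedInCacheRow {
  linkedin_url: string;
  normalized_url: string;
  profile_accessible: boolean;
  processed_data: unknown | null; // our standardized LinkedInData format
  data_quality_score: number | null;
  scraped_at: string; // timestamptz serialized as an ISO string
}

// A usable cache hit is an accessible profile with processed data stored.
function isUsableHit(row: LinkedInCacheRow): boolean {
  return row.profile_accessible && row.processed_data !== null;
}
```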

Indexes

-- linkedin_url and normalized_url already have implicit B-tree indexes
-- from their UNIQUE constraints, so no explicit indexes are needed for them.
CREATE INDEX idx_linkedin_cache_brightdata_snapshot_id ON linkedin_cache(brightdata_snapshot_id);
CREATE INDEX idx_linkedin_cache_scraped_at ON linkedin_cache(scraped_at);
CREATE INDEX idx_linkedin_cache_profile_accessible ON linkedin_cache(profile_accessible);
CREATE INDEX idx_linkedin_cache_data_quality_score ON linkedin_cache(data_quality_score);

-- GIN indexes for JSONB structured data queries
CREATE INDEX idx_linkedin_cache_experience_data ON linkedin_cache USING gin(experience_data);
CREATE INDEX idx_linkedin_cache_education_data ON linkedin_cache USING gin(education_data);
CREATE INDEX idx_linkedin_cache_skills_data ON linkedin_cache USING gin(skills_data);

Structured Data Formats

Experience Data Format

[
  {
    "company": "Google",
    "title": "Senior Software Engineer", 
    "duration": "2021 - Present",
    "location": "San Francisco, CA",
    "description": "Leading development of...",
    "company_exists": true
  },
  {
    "company": "Meta",
    "title": "Software Engineer",
    "duration": "2019 - 2021", 
    "location": "Menlo Park, CA",
    "description": "Developed scalable...",
    "company_exists": true
  }
]

Education Data Format

[
  {
    "school": "Stanford University",
    "degree": "Bachelor of Science in Computer Science",
    "years": "2015 - 2019",
    "location": "Stanford, CA",
    "school_exists": true
  },
  {
    "school": "MIT",
    "degree": "Master of Science in Computer Science", 
    "years": "2019 - 2021",
    "location": "Cambridge, MA",
    "school_exists": true
  }
]

Skills Data Format

[
  "JavaScript",
  "Python", 
  "React",
  "Node.js",
  "AWS",
  "Docker"
]

Data Quality Principles

  1. No Fake Data Generation: Never create or infer data that wasn't explicitly provided
  2. Missing Data Marking: Clearly mark when experience/education data is missing vs empty
  3. Presence Validation: Use has_experience, has_education, has_skills flags to indicate data availability
  4. Null vs Empty:
    • null = data section not available/accessible
    • [] = data section exists but is empty
    • [...] = data section exists with content
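The null-vs-empty distinction above can be enforced in one small helper when interpreting cached sections. A sketch (the function name is illustrative):

```typescript
type Section = unknown[] | null;

// Distinguish "section not available" from "section present but empty",
// per the Null vs Empty rule: null = not available, [] = empty, [...] = content.
function sectionState(data: Section): "missing" | "empty" | "present" {
  if (data === null) return "missing";
  return data.length === 0 ? "empty" : "present";
}
```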

Functions & Utilities

1. URL Normalization

CREATE FUNCTION normalize_linkedin_url(url text) RETURNS text

Normalizes LinkedIn URLs for consistent lookups by:

  • Converting to lowercase
  • Removing protocol (http/https)
  • Removing www prefix
  • Removing trailing slashes
  • Normalizing multiple slashes
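The same rules will likely be needed client-side so lookups hit the cache before any database round trip. A TypeScript sketch of the normalization steps above (the function name is an assumption; the SQL function remains the source of truth):

```typescript
// Mirrors the normalization rules listed above.
function normalizeLinkedInUrl(url: string): string {
  return url
    .toLowerCase()                 // converting to lowercase
    .replace(/^https?:\/\//, "")   // removing protocol (http/https)
    .replace(/^www\./, "")         // removing www prefix
    .replace(/\/{2,}/g, "/")       // normalizing multiple slashes
    .replace(/\/+$/, "");          // removing trailing slashes
}
```

If both the SQL and TypeScript versions exist, a shared test fixture of URL pairs helps keep the two implementations in sync.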

2. Data Quality Scoring

CREATE FUNCTION calculate_linkedin_data_quality(cache_row linkedin_cache) RETURNS integer

Calculates quality score (1-100) based on:

  • Basic Info (40pts): name (10), headline (10), location (10), connections (10)
  • Experience (30pts): has_experience (20), current_company (5), current_title (5)
  • Skills (15pts): has_skills and skills_count > 0
  • Education (10pts): has_education and education_count > 0
  • Activity (5pts): has_activity data
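The rubric above sums to 100 points. A TypeScript sketch of the scoring (field names are illustrative; the SQL function is the authoritative implementation):

```typescript
interface ProfileFacts {
  fullName?: string; headline?: string; location?: string; connections?: number;
  hasExperience: boolean; currentCompany?: string; currentTitle?: string;
  hasSkills: boolean; skillsCount: number;
  hasEducation: boolean; educationCount: number;
  hasActivity: boolean;
}

// Applies the 100-point rubric above: 40 basic info, 30 experience,
// 15 skills, 10 education, 5 activity.
function dataQualityScore(p: ProfileFacts): number {
  let score = 0;
  if (p.fullName) score += 10;
  if (p.headline) score += 10;
  if (p.location) score += 10;
  if (p.connections != null) score += 10;
  if (p.hasExperience) score += 20;
  if (p.currentCompany) score += 5;
  if (p.currentTitle) score += 5;
  if (p.hasSkills && p.skillsCount > 0) score += 15;
  if (p.hasEducation && p.educationCount > 0) score += 10;
  if (p.hasActivity) score += 5;
  return score;
}
```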

3. Upsert Function

CREATE FUNCTION upsert_linkedin_cache(
  p_linkedin_url text,
  p_brightdata_snapshot_id text DEFAULT NULL,
  p_raw_data jsonb DEFAULT NULL,
  p_processed_data jsonb DEFAULT NULL,
  p_profile_accessible boolean DEFAULT true,
  p_error_message text DEFAULT NULL
) RETURNS uuid

Handles inserting new records or updating existing ones with conflict resolution.

Key Responsibilities:

  • Extract structured data from p_processed_data into separate JSONB columns
  • Set data availability flags based on actual data presence
  • Calculate data quality score based on completeness
  • Handle conflict resolution for duplicate URLs
  • Never generate fake data - only store what was actually scraped

Data Extraction Logic:

-- Extract experience data (jsonb_typeof guards against a JSON null value,
-- which is NOT NULL in SQL and would make jsonb_array_length raise an error)
IF jsonb_typeof(p_processed_data->'experience') = 'array' THEN
  experience_data := p_processed_data->'experience';
  experience_count := jsonb_array_length(experience_data);
  has_experience := experience_count > 0;
  
  -- Extract current company/title from first experience entry
  IF has_experience THEN
    current_company := experience_data->0->>'company';
    current_title := experience_data->0->>'title';
  END IF;
END IF;

-- Extract education data
IF jsonb_typeof(p_processed_data->'education') = 'array' THEN
  education_data := p_processed_data->'education';
  education_count := jsonb_array_length(education_data);
  has_education := education_count > 0;
END IF;

-- Extract skills data
IF jsonb_typeof(p_processed_data->'skills') = 'array' THEN
  skills_data := p_processed_data->'skills';
  skills_count := jsonb_array_length(skills_data);
  has_skills := skills_count > 0;
END IF;
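If the same extraction is ever needed in application code (e.g. before calling the upsert), the experience branch translates directly. A hedged TypeScript sketch, with illustrative names:

```typescript
interface ExtractedExperience {
  experienceData: unknown[] | null;
  experienceCount: number;
  hasExperience: boolean;
  currentCompany: string | null;
  currentTitle: string | null;
}

// TypeScript equivalent of the PL/pgSQL experience branch above.
function extractExperience(processed: any): ExtractedExperience {
  const out: ExtractedExperience = {
    experienceData: null, experienceCount: 0, hasExperience: false,
    currentCompany: null, currentTitle: null,
  };
  const exp = processed?.experience;
  if (Array.isArray(exp)) {                 // mirrors the jsonb array check
    out.experienceData = exp;
    out.experienceCount = exp.length;
    out.hasExperience = exp.length > 0;
    if (out.hasExperience) {
      // Current company/title come from the first experience entry
      out.currentCompany = exp[0]?.company ?? null;
      out.currentTitle = exp[0]?.title ?? null;
    }
  }
  return out;
}
```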

Querying Structured Data

Experience Queries

-- Find profiles with experience at specific company
SELECT full_name, linkedin_url 
FROM linkedin_cache 
WHERE experience_data @> '[{"company": "Google"}]';

-- Find profiles with current title containing "Engineer"
SELECT full_name, current_title 
FROM linkedin_cache 
WHERE current_title ILIKE '%Engineer%';

-- Get all experience for a profile
SELECT experience_data 
FROM linkedin_cache 
WHERE normalized_url = normalize_linkedin_url('...');

Education Queries

-- Find profiles from specific school
SELECT full_name, linkedin_url 
FROM linkedin_cache 
WHERE education_data @> '[{"school": "Stanford University"}]';

-- Find profiles with Computer Science degrees
SELECT full_name, education_data 
FROM linkedin_cache 
WHERE education_data @> '[{"degree": "Bachelor of Science in Computer Science"}]';

Skills Queries

-- Find profiles with specific skill
SELECT full_name, linkedin_url 
FROM linkedin_cache 
WHERE skills_data @> '["JavaScript"]';

-- Find profiles with multiple skills
SELECT full_name, skills_data 
FROM linkedin_cache 
WHERE skills_data @> '["JavaScript", "React"]';

Data Quality Queries

-- Profiles with missing critical data
SELECT full_name, linkedin_url, data_quality_score
FROM linkedin_cache 
WHERE has_experience = false OR has_education = false
ORDER BY data_quality_score DESC;

-- High quality profiles with complete data
SELECT full_name, linkedin_url, data_quality_score
FROM linkedin_cache 
WHERE data_quality_score >= 80 AND has_experience = true AND has_education = true;

API Integration Points

Cache Lookup Flow

  1. Before BrightData API Call:

    // Check cache first
    const cachedData = await getCachedLinkedInData(linkedinUrl);
    if (cachedData && isDataFresh(cachedData)) {
      return cachedData;
    }
  2. After Successful Scraping:

    // Save successful result
    await saveLinkedInDataToCache(
      linkedinUrl, 
      processedData, 
      snapshotId, 
      rawData, 
      true
    );
  3. After Failed Scraping:

    // Save failure to prevent retry
    await saveLinkedInDataToCache(
      linkedinUrl, 
      null, 
      snapshotId, 
      undefined, 
      false, 
      errorMessage
    );
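The isDataFresh check referenced in step 1 could combine the time-based and error-based invalidation rules. A minimal sketch, assuming the TTL defaults from the configuration section (the signature is an assumption, not a settled API):

```typescript
const DAY_MS = 86_400_000;

// Freshness check: successful results live longer than failed attempts
// (LINKEDIN_CACHE_TTL_DAYS vs LINKEDIN_CACHE_ERROR_TTL_DAYS).
function isDataFresh(
  row: { scraped_at: string; error_count: number },
  ttlDays = 30,
  errorTtlDays = 7,
): boolean {
  const ageMs = Date.now() - new Date(row.scraped_at).getTime();
  const limitDays = row.error_count > 0 ? errorTtlDays : ttlDays;
  return ageMs < limitDays * DAY_MS;
}
```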

Cache Strategy

Cache Hit Scenarios

  1. Perfect Match: Exact normalized URL exists with good quality data
  2. Accessible but Empty: Profile was accessible but returned minimal data
  3. Known Inaccessible: Profile marked as private/blocked - avoid retry

Cache Miss Scenarios

  1. New URL: Never scraped before
  2. Stale Data: Data older than configured threshold
  3. Failed Previous Attempts: But retry allowed after cooldown period

Cache Invalidation

  • Time-based: Data older than X days (configurable)
  • Manual: Admin can force refresh specific profiles
  • Error-based: Failed attempts have shorter cache time

Performance Benefits

Cost Reduction

  • Avoid duplicate BrightData API calls for same LinkedIn URLs
  • Reuse existing snapshots when available
  • Cache failed attempts to prevent wasted retry cycles

Speed Improvement

  • Instant response for cached profiles
  • No polling wait time for cached data
  • Reduced database queries with extracted key fields

Data Management

  • Track profile accessibility patterns
  • Monitor data quality across profiles
  • Analytics on LinkedIn profile completion rates

Implementation Phases

Phase 1: Core Cache Infrastructure

  • Create database table and functions
  • Add URL normalization utilities
  • Implement basic cache lookup functions

Phase 2: API Integration

  • Update processLinkedInUrl() to check cache first
  • Update API route to save results to cache
  • Handle cache hits in processing pipeline

Phase 3: Cache Management

  • Add cache refresh capabilities
  • Implement data quality monitoring
  • Add admin interface for cache management

Configuration Options

# Environment variables
LINKEDIN_CACHE_TTL_DAYS=30        # Cache lifetime (days) for successful results
LINKEDIN_CACHE_ERROR_TTL_DAYS=7   # Cache lifetime (days) for failed attempts
LINKEDIN_CACHE_MIN_QUALITY=60     # Minimum quality score to serve cached data
LINKEDIN_CACHE_ENABLED=true       # Enable/disable caching system
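A small loader can centralize parsing of these variables with the defaults above. An illustrative sketch (the function name is an assumption; in the app it would be called with process.env):

```typescript
// Parses the cache configuration variables, falling back to the
// defaults shown in the spec when a variable is unset.
function loadCacheConfig(env: Record<string, string | undefined>) {
  return {
    ttlDays: Number(env.LINKEDIN_CACHE_TTL_DAYS ?? 30),
    errorTtlDays: Number(env.LINKEDIN_CACHE_ERROR_TTL_DAYS ?? 7),
    minQuality: Number(env.LINKEDIN_CACHE_MIN_QUALITY ?? 60),
    enabled: (env.LINKEDIN_CACHE_ENABLED ?? "true") === "true",
  };
}
```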

Monitoring & Analytics

Metrics to Track

  • Cache hit rate vs miss rate
  • Data quality distribution
  • Profile accessibility rates
  • BrightData API cost savings
  • Response time improvements

Alerts

  • Cache hit rate drops below threshold
  • High error rates for specific domains
  • Data quality scores trending down
  • Cache storage approaching limits

Security Considerations

Row Level Security (RLS)

  • Service role: Full access for API operations
  • Authenticated users: Read-only for admin interfaces
  • Public: No access

Data Privacy

  • Cache respects LinkedIn data usage terms
  • Failed attempts don't store personal data
  • Configurable cache retention periods
  • GDPR compliance for data deletion

Testing Strategy

Unit Tests

  • URL normalization edge cases
  • Data quality scoring accuracy
  • Upsert function conflict handling
  • Cache lookup performance

Integration Tests

  • End-to-end cache flow with real LinkedIn URLs
  • BrightData API integration with caching
  • Error handling and retry logic
  • Database triggers and functions

Performance Tests

  • Cache lookup speed vs database queries
  • Memory usage with large cached datasets
  • Concurrent access patterns
  • Cache invalidation performance
