[Feature] LinkedIn Cache System #73

@EddyDavies

Description

Please see the specification below.

Specification

Overview

Create a caching layer for LinkedIn profile data to improve performance, reduce BrightData API costs, and provide better data management for repeated LinkedIn profile requests.

Database Schema

New Table: linkedin_cache

CREATE TABLE linkedin_cache (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  
  -- URL Management
  linkedin_url text UNIQUE NOT NULL,
  normalized_url text UNIQUE NOT NULL, -- For consistent lookups
  
  -- BrightData Integration
  brightdata_snapshot_id text,
  
  -- Extracted Key Fields (for quick queries without parsing JSON)
  full_name text,
  headline text,
  location text,
  connections integer,
  current_company text,
  current_title text,
  
  -- Data Counts
  experience_count integer DEFAULT 0,
  skills_count integer DEFAULT 0,
  education_count integer DEFAULT 0,
  
  -- Data Availability Flags
  profile_accessible boolean DEFAULT true,
  has_experience boolean DEFAULT false,
  has_skills boolean DEFAULT false,
  has_education boolean DEFAULT false,
  has_activity boolean DEFAULT false,
  
  -- Structured Data Extracts (for easy access without parsing full JSON)
  experience_data jsonb,    -- Array of experience objects with company, title, duration, description
  education_data jsonb,     -- Array of education objects with school, degree, years, location
  skills_data jsonb,        -- Array of skill strings or skill objects
  
  -- Raw Data Storage
  raw_data jsonb,           -- Original BrightData response
  processed_data jsonb,     -- Our standardized LinkedInData format
  
  -- Quality & Metadata
  data_quality_score integer, -- 1-100 based on completeness
  scraped_at timestamptz DEFAULT now(),
  last_accessed_at timestamptz DEFAULT now(),
  access_count integer DEFAULT 1,
  
  -- Error Tracking
  last_error text,
  error_count integer DEFAULT 0,
  
  -- Timestamps
  created_at timestamptz DEFAULT now(),
  updated_at timestamptz DEFAULT now()
);
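For application code reading this table, a minimal TypeScript shape for a cache row can mirror the key columns above. This is an illustrative sketch (the type and helper names are assumptions, and only a subset of columns is shown), not a definitive client model:

```typescript
// Minimal shape for a linkedin_cache row; only a subset of the
// columns above is modeled here.
interface LinkedInCacheRow {
  linkedin_url: string;
  normalized_url: string;
  profile_accessible: boolean;
  processed_data: unknown | null; // our standardized LinkedInData format
  data_quality_score: number | null;
  scraped_at: string; // timestamptz serialized as an ISO string
}

// A usable cache hit is an accessible profile with processed data stored.
function isUsableHit(row: LinkedInCacheRow): boolean {
  return row.profile_accessible && row.processed_data !== null;
}
```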

Indexes

-- linkedin_url and normalized_url already have implicit B-tree indexes
-- from their UNIQUE constraints, so no explicit indexes are needed for them.
CREATE INDEX idx_linkedin_cache_brightdata_snapshot_id ON linkedin_cache(brightdata_snapshot_id);
CREATE INDEX idx_linkedin_cache_scraped_at ON linkedin_cache(scraped_at);
CREATE INDEX idx_linkedin_cache_profile_accessible ON linkedin_cache(profile_accessible);
CREATE INDEX idx_linkedin_cache_data_quality_score ON linkedin_cache(data_quality_score);

-- GIN indexes for JSONB structured data queries
CREATE INDEX idx_linkedin_cache_experience_data ON linkedin_cache USING gin(experience_data);
CREATE INDEX idx_linkedin_cache_education_data ON linkedin_cache USING gin(education_data);
CREATE INDEX idx_linkedin_cache_skills_data ON linkedin_cache USING gin(skills_data);

Structured Data Formats

Experience Data Format

[
  {
    "company": "Google",
    "title": "Senior Software Engineer", 
    "duration": "2021 - Present",
    "location": "San Francisco, CA",
    "description": "Leading development of...",
    "company_exists": true
  },
  {
    "company": "Meta",
    "title": "Software Engineer",
    "duration": "2019 - 2021", 
    "location": "Menlo Park, CA",
    "description": "Developed scalable...",
    "company_exists": true
  }
]

Education Data Format

[
  {
    "school": "Stanford University",
    "degree": "Bachelor of Science in Computer Science",
    "years": "2015 - 2019",
    "location": "Stanford, CA",
    "school_exists": true
  },
  {
    "school": "MIT",
    "degree": "Master of Science in Computer Science", 
    "years": "2019 - 2021",
    "location": "Cambridge, MA",
    "school_exists": true
  }
]

Skills Data Format

[
  "JavaScript",
  "Python", 
  "React",
  "Node.js",
  "AWS",
  "Docker"
]

Data Quality Principles

  1. No Fake Data Generation: Never create or infer data that wasn't explicitly provided
  2. Missing Data Marking: Clearly mark when experience/education data is missing vs empty
  3. Presence Validation: Use has_experience, has_education, has_skills flags to indicate data availability
  4. Null vs Empty:
    • null = data section not available/accessible
    • [] = data section exists but is empty
    • [...] = data section exists with content
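The null-vs-empty distinction above can be enforced in one small helper when interpreting cached sections. A sketch (the function name is illustrative):

```typescript
type Section = unknown[] | null;

// Distinguish "section not available" from "section present but empty",
// per the Null vs Empty rule: null = not available, [] = empty, [...] = content.
function sectionState(data: Section): "missing" | "empty" | "present" {
  if (data === null) return "missing";
  return data.length === 0 ? "empty" : "present";
}
```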

Functions & Utilities

1. URL Normalization

CREATE FUNCTION normalize_linkedin_url(url text) RETURNS text

Normalizes LinkedIn URLs for consistent lookups by:

  • Converting to lowercase
  • Removing protocol (http/https)
  • Removing www prefix
  • Removing trailing slashes
  • Normalizing multiple slashes
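The same rules will likely be needed client-side so lookups hit the cache before any database round trip. A TypeScript sketch of the normalization steps above (the function name is an assumption; the SQL function remains the source of truth):

```typescript
// Mirrors the normalization rules listed above.
function normalizeLinkedInUrl(url: string): string {
  return url
    .toLowerCase()                 // converting to lowercase
    .replace(/^https?:\/\//, "")   // removing protocol (http/https)
    .replace(/^www\./, "")         // removing www prefix
    .replace(/\/{2,}/g, "/")       // normalizing multiple slashes
    .replace(/\/+$/, "");          // removing trailing slashes
}
```

If both the SQL and TypeScript versions exist, a shared test fixture of URL pairs helps keep the two implementations in sync.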

2. Data Quality Scoring

CREATE FUNCTION calculate_linkedin_data_quality(cache_row linkedin_cache) RETURNS integer

Calculates quality score (1-100) based on:

  • Basic Info (40pts): name (10), headline (10), location (10), connections (10)
  • Experience (30pts): has_experience (20), current_company (5), current_title (5)
  • Skills (15pts): has_skills and skills_count > 0
  • Education (10pts): has_education and education_count > 0
  • Activity (5pts): has_activity data
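The rubric above sums to 100 points. A TypeScript sketch of the scoring (field names are illustrative; the SQL function is the authoritative implementation):

```typescript
interface ProfileFacts {
  fullName?: string; headline?: string; location?: string; connections?: number;
  hasExperience: boolean; currentCompany?: string; currentTitle?: string;
  hasSkills: boolean; skillsCount: number;
  hasEducation: boolean; educationCount: number;
  hasActivity: boolean;
}

// Applies the 100-point rubric above: 40 basic info, 30 experience,
// 15 skills, 10 education, 5 activity.
function dataQualityScore(p: ProfileFacts): number {
  let score = 0;
  if (p.fullName) score += 10;
  if (p.headline) score += 10;
  if (p.location) score += 10;
  if (p.connections != null) score += 10;
  if (p.hasExperience) score += 20;
  if (p.currentCompany) score += 5;
  if (p.currentTitle) score += 5;
  if (p.hasSkills && p.skillsCount > 0) score += 15;
  if (p.hasEducation && p.educationCount > 0) score += 10;
  if (p.hasActivity) score += 5;
  return score;
}
```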

3. Upsert Function

CREATE FUNCTION upsert_linkedin_cache(
  p_linkedin_url text,
  p_brightdata_snapshot_id text DEFAULT NULL,
  p_raw_data jsonb DEFAULT NULL,
  p_processed_data jsonb DEFAULT NULL,
  p_profile_accessible boolean DEFAULT true,
  p_error_message text DEFAULT NULL
) RETURNS uuid

Handles inserting new records or updating existing ones with conflict resolution.

Key Responsibilities:

  • Extract structured data from p_processed_data into separate JSONB columns
  • Set data availability flags based on actual data presence
  • Calculate data quality score based on completeness
  • Handle conflict resolution for duplicate URLs
  • Never generate fake data - only store what was actually scraped

Data Extraction Logic:

-- Extract experience data (jsonb_typeof guards against a JSON null value,
-- which is NOT NULL in SQL and would make jsonb_array_length raise an error)
IF jsonb_typeof(p_processed_data->'experience') = 'array' THEN
  experience_data := p_processed_data->'experience';
  experience_count := jsonb_array_length(experience_data);
  has_experience := experience_count > 0;
  
  -- Extract current company/title from first experience entry
  IF has_experience THEN
    current_company := experience_data->0->>'company';
    current_title := experience_data->0->>'title';
  END IF;
END IF;

-- Extract education data
IF jsonb_typeof(p_processed_data->'education') = 'array' THEN
  education_data := p_processed_data->'education';
  education_count := jsonb_array_length(education_data);
  has_education := education_count > 0;
END IF;

-- Extract skills data
IF jsonb_typeof(p_processed_data->'skills') = 'array' THEN
  skills_data := p_processed_data->'skills';
  skills_count := jsonb_array_length(skills_data);
  has_skills := skills_count > 0;
END IF;
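If the same extraction is ever needed in application code (e.g. before calling the upsert), the experience branch translates directly. A hedged TypeScript sketch, with illustrative names:

```typescript
interface ExtractedExperience {
  experienceData: unknown[] | null;
  experienceCount: number;
  hasExperience: boolean;
  currentCompany: string | null;
  currentTitle: string | null;
}

// TypeScript equivalent of the PL/pgSQL experience branch above.
function extractExperience(processed: any): ExtractedExperience {
  const out: ExtractedExperience = {
    experienceData: null, experienceCount: 0, hasExperience: false,
    currentCompany: null, currentTitle: null,
  };
  const exp = processed?.experience;
  if (Array.isArray(exp)) {                 // mirrors the jsonb array check
    out.experienceData = exp;
    out.experienceCount = exp.length;
    out.hasExperience = exp.length > 0;
    if (out.hasExperience) {
      // Current company/title come from the first experience entry
      out.currentCompany = exp[0]?.company ?? null;
      out.currentTitle = exp[0]?.title ?? null;
    }
  }
  return out;
}
```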

Querying Structured Data

Experience Queries

-- Find profiles with experience at specific company
SELECT full_name, linkedin_url 
FROM linkedin_cache 
WHERE experience_data @> '[{"company": "Google"}]';

-- Find profiles with current title containing "Engineer"
SELECT full_name, current_title 
FROM linkedin_cache 
WHERE current_title ILIKE '%Engineer%';

-- Get all experience for a profile
SELECT experience_data 
FROM linkedin_cache 
WHERE normalized_url = normalize_linkedin_url('...');

Education Queries

-- Find profiles from specific school
SELECT full_name, linkedin_url 
FROM linkedin_cache 
WHERE education_data @> '[{"school": "Stanford University"}]';

-- Find profiles with Computer Science degrees
SELECT full_name, education_data 
FROM linkedin_cache 
WHERE education_data @> '[{"degree": "Bachelor of Science in Computer Science"}]';

Skills Queries

-- Find profiles with specific skill
SELECT full_name, linkedin_url 
FROM linkedin_cache 
WHERE skills_data @> '["JavaScript"]';

-- Find profiles with multiple skills
SELECT full_name, skills_data 
FROM linkedin_cache 
WHERE skills_data @> '["JavaScript", "React"]';

Data Quality Queries

-- Profiles with missing critical data
SELECT full_name, linkedin_url, data_quality_score
FROM linkedin_cache 
WHERE has_experience = false OR has_education = false
ORDER BY data_quality_score DESC;

-- High quality profiles with complete data
SELECT full_name, linkedin_url, data_quality_score
FROM linkedin_cache 
WHERE data_quality_score >= 80 AND has_experience = true AND has_education = true;

API Integration Points

Cache Lookup Flow

  1. Before BrightData API Call:

    // Check cache first
    const cachedData = await getCachedLinkedInData(linkedinUrl);
    if (cachedData && isDataFresh(cachedData)) {
      return cachedData;
    }
  2. After Successful Scraping:

    // Save successful result
    await saveLinkedInDataToCache(
      linkedinUrl, 
      processedData, 
      snapshotId, 
      rawData, 
      true
    );
  3. After Failed Scraping:

    // Save failure to prevent retry
    await saveLinkedInDataToCache(
      linkedinUrl, 
      null, 
      snapshotId, 
      undefined, 
      false, 
      errorMessage
    );
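The isDataFresh check referenced in step 1 could combine the time-based and error-based invalidation rules. A minimal sketch, assuming the TTL defaults from the configuration section (the signature is an assumption, not a settled API):

```typescript
const DAY_MS = 86_400_000;

// Freshness check: successful results live longer than failed attempts
// (LINKEDIN_CACHE_TTL_DAYS vs LINKEDIN_CACHE_ERROR_TTL_DAYS).
function isDataFresh(
  row: { scraped_at: string; error_count: number },
  ttlDays = 30,
  errorTtlDays = 7,
): boolean {
  const ageMs = Date.now() - new Date(row.scraped_at).getTime();
  const limitDays = row.error_count > 0 ? errorTtlDays : ttlDays;
  return ageMs < limitDays * DAY_MS;
}
```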

Cache Strategy

Cache Hit Scenarios

  1. Perfect Match: Exact normalized URL exists with good quality data
  2. Accessible but Empty: Profile was accessible but returned minimal data
  3. Known Inaccessible: Profile marked as private/blocked - avoid retry

Cache Miss Scenarios

  1. New URL: Never scraped before
  2. Stale Data: Data older than configured threshold
  3. Failed Previous Attempts: But retry allowed after cooldown period

Cache Invalidation

  • Time-based: Data older than X days (configurable)
  • Manual: Admin can force refresh specific profiles
  • Error-based: Failed attempts have shorter cache time

Performance Benefits

Cost Reduction

  • Avoid duplicate BrightData API calls for same LinkedIn URLs
  • Reuse existing snapshots when available
  • Cache failed attempts to prevent wasted retry cycles

Speed Improvement

  • Instant response for cached profiles
  • No polling wait time for cached data
  • Reduced database queries with extracted key fields

Data Management

  • Track profile accessibility patterns
  • Monitor data quality across profiles
  • Analytics on LinkedIn profile completion rates

Implementation Phases

Phase 1: Core Cache Infrastructure

  • Create database table and functions
  • Add URL normalization utilities
  • Implement basic cache lookup functions

Phase 2: API Integration

  • Update processLinkedInUrl() to check cache first
  • Update API route to save results to cache
  • Handle cache hits in processing pipeline

Phase 3: Cache Management

  • Add cache refresh capabilities
  • Implement data quality monitoring
  • Add admin interface for cache management

Configuration Options

# Environment variables
LINKEDIN_CACHE_TTL_DAYS=30        # Cache lifetime (days) for successful results
LINKEDIN_CACHE_ERROR_TTL_DAYS=7   # Cache lifetime (days) for failed attempts
LINKEDIN_CACHE_MIN_QUALITY=60     # Minimum quality score to serve cached data
LINKEDIN_CACHE_ENABLED=true       # Enable/disable caching system
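A small loader can centralize parsing of these variables with the defaults above. An illustrative sketch (the function name is an assumption; in the app it would be called with process.env):

```typescript
// Parses the cache configuration variables, falling back to the
// defaults shown in the spec when a variable is unset.
function loadCacheConfig(env: Record<string, string | undefined>) {
  return {
    ttlDays: Number(env.LINKEDIN_CACHE_TTL_DAYS ?? 30),
    errorTtlDays: Number(env.LINKEDIN_CACHE_ERROR_TTL_DAYS ?? 7),
    minQuality: Number(env.LINKEDIN_CACHE_MIN_QUALITY ?? 60),
    enabled: (env.LINKEDIN_CACHE_ENABLED ?? "true") === "true",
  };
}
```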

Monitoring & Analytics

Metrics to Track

  • Cache hit rate vs miss rate
  • Data quality distribution
  • Profile accessibility rates
  • BrightData API cost savings
  • Response time improvements

Alerts

  • Cache hit rate drops below threshold
  • High error rates for specific domains
  • Data quality scores trending down
  • Cache storage approaching limits

Security Considerations

Row Level Security (RLS)

  • Service role: Full access for API operations
  • Authenticated users: Read-only for admin interfaces
  • Public: No access

Data Privacy

  • Cache respects LinkedIn data usage terms
  • Failed attempts don't store personal data
  • Configurable cache retention periods
  • GDPR compliance for data deletion

Testing Strategy

Unit Tests

  • URL normalization edge cases
  • Data quality scoring accuracy
  • Upsert function conflict handling
  • Cache lookup performance

Integration Tests

  • End-to-end cache flow with real LinkedIn URLs
  • BrightData API integration with caching
  • Error handling and retry logic
  • Database triggers and functions

Performance Tests

  • Cache lookup speed vs database queries
  • Memory usage with large cached datasets
  • Concurrent access patterns
  • Cache invalidation performance
