-
Notifications
You must be signed in to change notification settings - Fork 4
Open
Milestone
Description
Please see the below specification.
Specification
Overview
Create a caching layer for LinkedIn profile data to improve performance, reduce BrightData API costs, and provide better data management for repeated LinkedIn profile requests.
Database Schema
New Table: linkedin_cache
CREATE TABLE linkedin_cache (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
-- URL Management
linkedin_url text UNIQUE NOT NULL,
normalized_url text UNIQUE NOT NULL, -- For consistent lookups
-- BrightData Integration
brightdata_snapshot_id text,
-- Extracted Key Fields (for quick queries without parsing JSON)
full_name text,
headline text,
location text,
connections integer,
current_company text,
current_title text,
-- Data Counts
experience_count integer DEFAULT 0,
skills_count integer DEFAULT 0,
education_count integer DEFAULT 0,
-- Data Availability Flags
profile_accessible boolean DEFAULT true,
has_experience boolean DEFAULT false,
has_skills boolean DEFAULT false,
has_education boolean DEFAULT false,
has_activity boolean DEFAULT false,
-- Structured Data Extracts (for easy access without parsing full JSON)
experience_data jsonb, -- Array of experience objects with company, title, duration, description
education_data jsonb, -- Array of education objects with school, degree, years, location
skills_data jsonb, -- Array of skill strings or skill objects
-- Raw Data Storage
raw_data jsonb, -- Original BrightData response
processed_data jsonb, -- Our standardized LinkedInData format
-- Quality & Metadata
data_quality_score integer, -- 1-100 based on completeness
scraped_at timestamptz DEFAULT now(),
last_accessed_at timestamptz DEFAULT now(),
access_count integer DEFAULT 1,
-- Error Tracking
last_error text,
error_count integer DEFAULT 0,
-- Timestamps
created_at timestamptz DEFAULT now(),
updated_at timestamptz DEFAULT now()
);Indexes
CREATE INDEX idx_linkedin_cache_normalized_url ON linkedin_cache(normalized_url);
CREATE INDEX idx_linkedin_cache_linkedin_url ON linkedin_cache(linkedin_url);
CREATE INDEX idx_linkedin_cache_brightdata_snapshot_id ON linkedin_cache(brightdata_snapshot_id);
CREATE INDEX idx_linkedin_cache_scraped_at ON linkedin_cache(scraped_at);
CREATE INDEX idx_linkedin_cache_profile_accessible ON linkedin_cache(profile_accessible);
CREATE INDEX idx_linkedin_cache_data_quality_score ON linkedin_cache(data_quality_score);
-- GIN indexes for JSONB structured data queries
CREATE INDEX idx_linkedin_cache_experience_data ON linkedin_cache USING gin(experience_data);
CREATE INDEX idx_linkedin_cache_education_data ON linkedin_cache USING gin(education_data);
CREATE INDEX idx_linkedin_cache_skills_data ON linkedin_cache USING gin(skills_data);Structured Data Formats
Experience Data Format
[
{
"company": "Google",
"title": "Senior Software Engineer",
"duration": "2021 - Present",
"location": "San Francisco, CA",
"description": "Leading development of...",
"company_exists": true
},
{
"company": "Meta",
"title": "Software Engineer",
"duration": "2019 - 2021",
"location": "Menlo Park, CA",
"description": "Developed scalable...",
"company_exists": true
}
]
Education Data Format
[
{
"school": "Stanford University",
"degree": "Bachelor of Science in Computer Science",
"years": "2015 - 2019",
"location": "Stanford, CA",
"school_exists": true
},
{
"school": "MIT",
"degree": "Master of Science in Computer Science",
"years": "2019 - 2021",
"location": "Cambridge, MA",
"school_exists": true
}
]
Skills Data Format
[
"JavaScript",
"Python",
"React",
"Node.js",
"AWS",
"Docker"
]
Data Quality Principles
- No Fake Data Generation: Never create or infer data that wasn't explicitly provided
- Missing Data Marking: Clearly mark when experience/education data is missing vs empty
- Presence Validation: Use
has_experience,has_education,has_skillsflags to indicate data availability - Null vs Empty:
null= data section not available/accessible[]= data section exists but is empty[...]= data section exists with content
Functions & Utilities
1. URL Normalization
CREATE FUNCTION normalize_linkedin_url(url text) RETURNS textNormalizes LinkedIn URLs for consistent lookups by:
- Converting to lowercase
- Removing protocol (http/https)
- Removing www prefix
- Removing trailing slashes
- Normalizing multiple slashes
2. Data Quality Scoring
CREATE FUNCTION calculate_linkedin_data_quality(cache_row linkedin_cache) RETURNS integerCalculates quality score (1-100) based on:
- Basic Info (40pts): name (10), headline (10), location (10), connections (10)
- Experience (30pts): has_experience (20), current_company (5), current_title (5)
- Skills (15pts): has_skills and skills_count > 0
- Education (10pts): has_education and education_count > 0
- Activity (5pts): has_activity data
3. Upsert Function
CREATE FUNCTION upsert_linkedin_cache(
p_linkedin_url text,
p_brightdata_snapshot_id text DEFAULT NULL,
p_raw_data jsonb DEFAULT NULL,
p_processed_data jsonb DEFAULT NULL,
p_profile_accessible boolean DEFAULT true,
p_error_message text DEFAULT NULL
) RETURNS uuidHandles inserting new records or updating existing ones with conflict resolution.
Key Responsibilities:
- Extract structured data from
p_processed_datainto separate JSONB columns - Set data availability flags based on actual data presence
- Calculate data quality score based on completeness
- Handle conflict resolution for duplicate URLs
- Never generate fake data - only store what was actually scraped
Data Extraction Logic:
-- Extract experience data
IF p_processed_data->'experience' IS NOT NULL THEN
experience_data := p_processed_data->'experience';
experience_count := jsonb_array_length(experience_data);
has_experience := experience_count > 0;
-- Extract current company/title from first experience entry
IF has_experience THEN
current_company := experience_data->0->>'company';
current_title := experience_data->0->>'title';
END IF;
END IF;
-- Extract education data
IF p_processed_data->'education' IS NOT NULL THEN
education_data := p_processed_data->'education';
education_count := jsonb_array_length(education_data);
has_education := education_count > 0;
END IF;
-- Extract skills data
IF p_processed_data->'skills' IS NOT NULL THEN
skills_data := p_processed_data->'skills';
skills_count := jsonb_array_length(skills_data);
has_skills := skills_count > 0;
END IF;Querying Structured Data
Experience Queries
-- Find profiles with experience at specific company
SELECT full_name, linkedin_url
FROM linkedin_cache
WHERE experience_data @> '[{"company": "Google"}]';
-- Find profiles with current title containing "Engineer"
SELECT full_name, current_title
FROM linkedin_cache
WHERE current_title ILIKE '%Engineer%';
-- Get all experience for a profile
SELECT experience_data
FROM linkedin_cache
WHERE normalized_url = normalize_linkedin_url('...');Education Queries
-- Find profiles from specific school
SELECT full_name, linkedin_url
FROM linkedin_cache
WHERE education_data @> '[{"school": "Stanford University"}]';
-- Find profiles with Computer Science degrees
SELECT full_name, education_data
FROM linkedin_cache
WHERE education_data @> '[{"degree": "Bachelor of Science in Computer Science"}]';Skills Queries
-- Find profiles with specific skill
SELECT full_name, linkedin_url
FROM linkedin_cache
WHERE skills_data @> '["JavaScript"]';
-- Find profiles with multiple skills
SELECT full_name, skills_data
FROM linkedin_cache
WHERE skills_data @> '["JavaScript", "React"]';Data Quality Queries
-- Profiles with missing critical data
SELECT full_name, linkedin_url, data_quality_score
FROM linkedin_cache
WHERE has_experience = false OR has_education = false
ORDER BY data_quality_score DESC;
-- High quality profiles with complete data
SELECT full_name, linkedin_url, data_quality_score
FROM linkedin_cache
WHERE data_quality_score >= 80 AND has_experience = true AND has_education = true;API Integration Points
Cache Lookup Flow
-
Before BrightData API Call:
// Check cache first const cachedData = await getCachedLinkedInData(linkedinUrl); if (cachedData && isDataFresh(cachedData)) { return cachedData; }
-
After Successful Scraping:
// Save successful result await saveLinkedInDataToCache( linkedinUrl, processedData, snapshotId, rawData, true );
-
After Failed Scraping:
// Save failure to prevent retry await saveLinkedInDataToCache( linkedinUrl, null, snapshotId, undefined, false, errorMessage );
Cache Strategy
Cache Hit Scenarios
- Perfect Match: Exact normalized URL exists with good quality data
- Accessible but Empty: Profile was accessible but returned minimal data
- Known Inaccessible: Profile marked as private/blocked - avoid retry
Cache Miss Scenarios
- New URL: Never scraped before
- Stale Data: Data older than configured threshold
- Failed Previous Attempts: But retry allowed after cooldown period
Cache Invalidation
- Time-based: Data older than X days (configurable)
- Manual: Admin can force refresh specific profiles
- Error-based: Failed attempts have shorter cache time
Performance Benefits
Cost Reduction
- Avoid duplicate BrightData API calls for same LinkedIn URLs
- Reuse existing snapshots when available
- Cache failed attempts to prevent wasted retry cycles
Speed Improvement
- Instant response for cached profiles
- No polling wait time for cached data
- Reduced database queries with extracted key fields
Data Management
- Track profile accessibility patterns
- Monitor data quality across profiles
- Analytics on LinkedIn profile completion rates
Implementation Phases
Phase 1: Core Cache Infrastructure
- Create database table and functions
- Add URL normalization utilities
- Implement basic cache lookup functions
Phase 2: API Integration
- Update
processLinkedInUrl()to check cache first - Update API route to save results to cache
- Handle cache hits in processing pipeline
Phase 3: Cache Management
- Add cache refresh capabilities
- Implement data quality monitoring
- Add admin interface for cache management
Configuration Options
// Environment variables
LINKEDIN_CACHE_TTL_DAYS=30 // Cache lifetime for successful results
LINKEDIN_CACHE_ERROR_TTL_DAYS=7 // Cache lifetime for failed attempts
LINKEDIN_CACHE_MIN_QUALITY=60 // Minimum quality score to serve cached data
LINKEDIN_CACHE_ENABLED=true // Enable/disable caching systemMonitoring & Analytics
Metrics to Track
- Cache hit rate vs miss rate
- Data quality distribution
- Profile accessibility rates
- BrightData API cost savings
- Response time improvements
Alerts
- Cache hit rate drops below threshold
- High error rates for specific domains
- Data quality scores trending down
- Cache storage approaching limits
Security Considerations
Row Level Security (RLS)
- Service role: Full access for API operations
- Authenticated users: Read-only for admin interfaces
- Public: No access
Data Privacy
- Cache respects LinkedIn data usage terms
- Failed attempts don't store personal data
- Configurable cache retention periods
- GDPR compliance for data deletion
Testing Strategy
Unit Tests
- URL normalization edge cases
- Data quality scoring accuracy
- Upsert function conflict handling
- Cache lookup performance
Integration Tests
- End-to-end cache flow with real LinkedIn URLs
- BrightData API integration with caching
- Error handling and retry logic
- Database triggers and functions
Performance Tests
- Cache lookup speed vs database queries
- Memory usage with large cached datasets
- Concurrent access patterns
- Cache invalidation performance
Metadata
Metadata
Assignees
Labels
No labels