This project provides insights into job market trends, leveraging data from the JobSpy library and analytics powered by the Gemini Pro API. It's designed to help users understand the demand for various technology skills, geographical job distribution, and more.
Jobs Analyzer uses a sophisticated data pipeline to fetch, analyze, and visualize job listing data. Here's how it works:
- Data Acquisition: The data is sourced from the JobSpy library, which aggregates job listings from various platforms.
- Data Analysis: The analysis is powered by the Gemini Pro API, offering deep insights into job market trends and skill demands.
- Data Pipeline: The entire pipeline is run using Plombery on a free VM provided by Oracle Cloud, ensuring cost-effective scalability.
- Data Storage: Processed data is stored in a free PostgreSQL instance hosted on NeonDB and mirrored into Elasticsearch for fast search and analytical workloads.
- Data Distribution: Each pipeline run exports the latest job inventory to Cloudflare R2 (`me-data-jobs` bucket) as a public JSON feed (see the loading sketch after this list).
- Dashboard: The results are visualized through a dashboard hosted on Streamlit Cloud, available at https://jobs-analyzer.streamlit.app/.
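As a quick illustration of how the public feed can be consumed, here is a minimal sketch. The URL matches the default export target documented further below; the column layout of `jobs.json` is an assumption, so treat this as a starting point only.

```python
# Minimal sketch: load the exported jobs feed into a DataFrame for ad-hoc analysis.
# The feed URL is the default public export target; the JSON schema is assumed.
import pandas as pd

FEED_URL = (
    "https://6d9a56e137a3328cc52e48656dd30d91.r2.cloudflarestorage.com"
    "/me-data-jobs/jobs.json"
)

jobs = pd.read_json(FEED_URL)
print(jobs.head())
```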
To explore the job market trends and insights, visit the Jobs Analyzer dashboard. For developers interested in contributing or setting up their own instance of the pipeline, refer to the setup instructions below.
- Clone this repository to your local machine or cloud environment.
- Ensure you have access to the JobSpy library and Gemini Pro API.
- Set up a free VM on Oracle Cloud and configure it according to the project requirements.
- Create a free PostgreSQL database on NeonDB and configure the connection parameters in the project.
- Provision an Elasticsearch cluster (or use an existing one) and set `host`, `username`, `password`, and `jobs_index` in `src/plombery/config/config.ini`.
- Set up a Cloudflare R2 bucket (default: `me-data-jobs`) and update the credentials plus `JOBS_BUCKET`/`JOBS_EXPORT_KEY` values in `src/plombery/config/config.ini` (a configuration-loading sketch follows this list).
- Deploy the Streamlit dashboard to Streamlit Cloud, using the provided configuration files.
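To sanity-check your configuration wiring, the sketch below shows one way the keys named above could be read with `configparser`. The section names (`elasticsearch`, `cloudflare`) are assumptions; match them to the actual layout of `src/plombery/config/config.ini` in this repository.

```python
# Sketch only: reading the configuration keys named in the steps above.
# Section names here are assumptions, not the repository's guaranteed layout.
from configparser import ConfigParser

config = ConfigParser()
config.read("src/plombery/config/config.ini")

# Elasticsearch connection settings.
es_host = config.get("elasticsearch", "host")
es_user = config.get("elasticsearch", "username")
es_password = config.get("elasticsearch", "password")
jobs_index = config.get("elasticsearch", "jobs_index")

# Cloudflare R2 export target for the jobs feed.
jobs_bucket = config.get("cloudflare", "JOBS_BUCKET", fallback="me-data-jobs")
jobs_export_key = config.get("cloudflare", "JOBS_EXPORT_KEY", fallback="jobs.json")
```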
- The Crswtch scraper lives in `src/plombery/crswth_crs.py`. It mirrors the Dubizzle flow: scrape, enrich with detail-page data, and index into Elasticsearch.
- Configure the behaviour in the `[crstch]` section of `src/plombery/config/config.ini` (listing URL template, page count, back-off timings, and ES index name).
- Ensure the Elasticsearch config contains `carswitch_index`, or set `es_index` inside the `[crstch]` block.
- Run the task manually with `plombery run crswtch_pipeline` or let the scheduled trigger (06:00 Asia/Dubai) execute it daily.
- The pipeline includes an export step that writes `crswth_listings.json` and uploads it to Cloudflare R2 at `data/crswth_listings.json` within the configured bucket.
- The scraper attempts to capture posted/published timestamps (`created`/`published`/`posted`/`added`/`discountAppliedAt`) when present, normalizes them to `*_iso` fields, and includes them in both ES and the JSON export.
- For efficiency, listings that already exist in ES (matched by document `_id == id`) are skipped to avoid re-fetching the detail page, as sketched below.
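A minimal sketch of that skip check, assuming the official `elasticsearch` Python client; the index name and the `fetch_detail_page` helper are placeholders rather than the repository's actual API.

```python
# Sketch of the skip-if-already-indexed check; names marked below are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # connection details come from config.ini
CRSWTCH_INDEX = "crswtch-listings"            # value of carswitch_index / es_index (assumed)

def fetch_detail_page(url: str) -> dict:
    """Hypothetical stand-in for the real detail-page scraper."""
    return {"url": url}

def enrich_new_listings(listings: list[dict]) -> list[dict]:
    """Fetch detail pages only for listings not already indexed (matched by _id == id)."""
    fresh = []
    for listing in listings:
        if es.exists(index=CRSWTCH_INDEX, id=listing["id"]):
            continue  # indexed on a previous run; skip the expensive detail fetch
        listing["details"] = fetch_detail_page(listing["url"])
        fresh.append(listing)
    return fresh
```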
- The residential scraper lives in `src/plombery/allsopp_crs.py`. It now harvests both sales and lettings inventories, enriches each fresh record with detail-page data, indexes them into Elasticsearch, and ships a combined JSON snapshot to Cloudflare R2.
- Configure the behaviour via the `[allsopp]` block in `src/plombery/config/config.ini`. Use `listing_url`/`pages`/`es_index` for sales and `lettings_listing_url`/`lettings_pages`/`lettings_es_index` for rentals; the delay/retry parameters are shared across both modes.
- Each mode short-circuits pagination when it encounters an ID already present in the respective Elasticsearch index, avoiding redundant detail fetches when older properties resurface (see the sketch after this list).
- Raw CSV dumps are kept under `saved_data/allsopp/<segment>/page_<n>.csv`, while the merged `allsopp_listings.json` (with a `listing_category` flag of `sales` or `lettings`) is written locally and uploaded to `data/allsopp_listings.json` in Cloudflare R2 using the `[cloudflare]` `PROP_BUCKET` (falling back to `BUCKET`) credentials.
- Run the full pipeline with `plombery run allsopp_pipeline` or rely on the scheduled trigger at 05:30 Asia/Dubai.
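A sketch of the short-circuit behaviour, again using the `elasticsearch` client; `fetch_page` and the record shape are hypothetical, and the index name comes from your `[allsopp]` configuration.

```python
# Sketch of short-circuit pagination: stop as soon as a previously indexed ID reappears.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def fetch_page(url_template: str, page: int) -> list[dict]:
    """Hypothetical stand-in for the real listing-page scraper."""
    return []

def harvest(url_template: str, pages: int, index: str) -> list[dict]:
    harvested = []
    for page in range(1, pages + 1):
        for record in fetch_page(url_template, page):
            if es.exists(index=index, id=record["id"]):
                return harvested  # older inventory resurfaced; nothing new beyond this
            harvested.append(record)
    return harvested
```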
- The Dubai Land Department open-data scraper lives in `src/plombery/dld_open_data.py`. It now scrapes the public “Real Estate Data” webpage (Next.js) instead of the legacy CKAN API and still supports the Transactions, Rents, Projects, Valuations, Land, Building, Unit, Broker, and Developer tabs.
- Configure tab slugs, primary date columns, and Elasticsearch indices inside the `[dld_open_data]` section of `src/plombery/config/config.ini`. The default `page_url` points at https://dubailand.gov.ae/en/open-data/real-estate-data/, while `lookback_days` and optional per-dataset `*_buffer_days` control the incremental windows used when deriving the `FromDate` filters.
- The scraper persists the most recent date per dataset in `saved_data/dld_open_data/state.json`, subtracting a small buffer (default three days) on every run to guard against late-arriving records. Artefacts are stored under `saved_data/dld_open_data/<dataset>/`, and records are indexed with `_dataset`, `_source_url`, and `_extracted_at_iso` metadata for downstream consumers.
- Run `plombery run dld_open_data_pipeline` to ingest immediately or rely on the built-in trigger (06:00 Asia/Dubai). The scraper now automatically retries when the website serves a temporary reCAPTCHA challenge ("I'm not a robot"), backing off between attempts before ultimately raising a `RecaptchaBlockedError` if the block persists.
- To minimise the chance of challenges in the first place, the HTTP client now relies on `curl_cffi` impersonation profiles with HTTP/2 enabled and realistic `sec-ch-*` headers/user agents, keeping the pipeline's TLS fingerprint aligned with how a real Chrome 124 browser negotiates connections.
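The following is a sketch of that retry behaviour, assuming a `curl_cffi` version that ships the `chrome124` impersonation profile; the attempt count, back-off, and detection string are illustrative assumptions rather than the scraper's exact values.

```python
# Sketch of retry-with-backoff around a curl_cffi impersonated request.
import time
from curl_cffi import requests

class RecaptchaBlockedError(RuntimeError):
    """Raised when the site keeps serving the reCAPTCHA interstitial."""

def fetch_with_backoff(url: str, attempts: int = 4, base_delay: float = 10.0) -> str:
    for attempt in range(attempts):
        # Impersonate a recent Chrome build (TLS fingerprint, HTTP/2, sec-ch-* headers).
        response = requests.get(url, impersonate="chrome124")
        if "I'm not a robot" not in response.text:
            return response.text
        time.sleep(base_delay * (attempt + 1))  # back off before retrying
    raise RecaptchaBlockedError(f"Still blocked after {attempts} attempts: {url}")
```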
- Run `python -m pytest tests/test_crswtch_parser.py` and `python -m pytest tests/test_allsopp_parser.py` (or `python -m unittest`) to validate the vehicle and property parsers against the embedded fixtures under `tests/fixtures`.
- UI helpers leveraged by the Streamlit dashboards are covered in `tests/test_streamlit_ui.py`; run `python -m pytest tests/test_streamlit_ui.py` to confirm the filtering logic stays intact.
- Helper utilities for the DLD scraper are covered in `tests/test_dld_open_data.py`.
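For orientation, a fixture-backed parser test generally follows the shape sketched below; `parse_listing_page` and the fixture filename are hypothetical placeholders, not names from this repository.

```python
# Illustrative shape of a fixture-backed parser test; names here are placeholders.
from pathlib import Path

FIXTURES = Path("tests/fixtures")

def parse_listing_page(html: str) -> list[dict]:
    """Hypothetical stand-in for the parser under test."""
    return [{"id": "123"}] if "listing" in html else []

def test_parser_extracts_ids():
    html = (FIXTURES / "sample_listing.html").read_text(encoding="utf-8")
    listings = parse_listing_page(html)
    assert listings, "expected at least one parsed listing"
    assert all("id" in item for item in listings)
```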
- The pipeline writes a `jobs.json` snapshot to Cloudflare R2 using the bucket/key defined in `JOBS_BUCKET` and `JOBS_EXPORT_KEY`.
- The default public URL is https://6d9a56e137a3328cc52e48656dd30d91.r2.cloudflarestorage.com/me-data-jobs/jobs.json.
- Update `JOBS_CACHE_CONTROL` if you need different CDN caching behaviour.
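A minimal sketch of this export step, assuming `boto3` against R2's S3-compatible endpoint; the endpoint, credentials, and cache-control value are placeholders, while the bucket and key defaults mirror the settings above.

```python
# Sketch of the R2 export: upload the snapshot as JSON with a cache-control header.
import json
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # placeholder
    aws_access_key_id="<R2_ACCESS_KEY_ID>",
    aws_secret_access_key="<R2_SECRET_ACCESS_KEY>",
)

def export_jobs(jobs: list[dict], bucket: str = "me-data-jobs", key: str = "jobs.json") -> None:
    """Upload the latest snapshot, honouring the configured cache policy."""
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(jobs).encode("utf-8"),
        ContentType="application/json",
        CacheControl="public, max-age=3600",  # stand-in for JOBS_CACHE_CONTROL (assumed value)
    )
```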
Contributions to Jobs Analyzer are welcome! Whether it's adding new features, improving the data analysis, or suggesting UI enhancements for the dashboard, feel free to fork this repository and submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.