GlassDoor support (fix and re-enable) #87

Open
PaulMcInnis opened this issue Aug 21, 2020 · 9 comments

Comments

@PaulMcInnis
Owner

Issue

Description

Currently we get the second page of Glassdoor results via the URL of the 2 button, but this no longer works because it redirects to the first page. This is the case whether or not we use the webdriver.

Steps to Reproduce

  1. navigate to https://www.glassdoor.ca/Job/waterloo-python-jobs-SRCH_IL.0,8_IC2280158_KO9,15.htm?radius=12&p=2

Expected behavior

We get to the second page of jobs

Actual behavior

We are redirected to the first page during the GET, so every page of jobs is a duplicate of the first page, which produces loads of TF-IDF duplicate detection hits.

If you click the 2 button yourself, you get a toast about subscribing to email notifications, after which you are navigated to the second page.
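
A minimal sketch of how the redirect could be detected before the duplicate filter sees the page, assuming the page index travels in the `p` query parameter as in the URL above. The helper name `page_param_survived` is hypothetical and not part of JobFunnel.

```python
from urllib.parse import parse_qs, urlsplit


def page_param_survived(requested_url: str, final_url: str, param: str = "p") -> bool:
    """Return True if `param` is present with the same value in both URLs.

    Assumption: the page index is carried in the `p` query parameter,
    as in the reproduction URL in this issue.
    """
    requested = parse_qs(urlsplit(requested_url).query).get(param)
    final = parse_qs(urlsplit(final_url).query).get(param)
    return requested is not None and requested == final
```

If the redirect is a plain HTTP redirect, the final URL would be available as `response.url` when using `requests`; a `False` result means the page should be skipped rather than scraped as if it were page 2.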

Environment

@PaulMcInnis
Owner Author

If anyone has knowledge of React or JavaScript, I would super appreciate the help!

@Zenahr
Contributor

Zenahr commented Aug 28, 2020

I made an isolated script that does just that.
You can find it here: https://github.com/Zenahr/selenium-glassdoor-page-jumper
Feel free to use it. (I made it specifically for JobFunnel and this issue but I haven't made a PR yet since I ran into a lot of import issues)

@PaulMcInnis
Owner Author

This is great! Thanks to your effort I can add back the Glassdoor scraper.

I will reference this issue and your username in the driver logic if you like. Otherwise you are welcome to contribute on the eventual PR.

@PaulMcInnis
Owner Author

PaulMcInnis commented Sep 13, 2020

Leaving a note to myself here that we can use Travis CI to run Selenium if we follow the steps here: https://docs.travis-ci.com/user/gui-and-headless-browsers/

They have an API for this.

@PaulMcInnis PaulMcInnis removed their assignment Sep 30, 2020
@PaulMcInnis
Owner Author

Unassigning myself for now because I want to see whether there is a way to avoid using a web driver. I dislike the latency introduced by this approach, but I do recognise that there may not be another way.

In the near future I'm going to focus more on squashing bugs with the current engine around status updates and duplicates.

@PaulMcInnis PaulMcInnis modified the milestones: 3.0.1, 3.0.2 Oct 11, 2020
@PaulMcInnis PaulMcInnis changed the title Unable to get second page of job results from GlassDoor GlassDoor support (fix and re-enable) Nov 25, 2021
@PaulMcInnis PaulMcInnis modified the milestones: 3.0.3, 4.0 Nov 25, 2021
@datatalking

@PaulMcInnis does this issue need help on the backend, integrating the script that @Zenahr created to sidestep the Glassdoor captcha? I only recently found this repo and am still learning to use it, but I can help contribute.

@sammytheindi
Collaborator

So the Glassdoor scraper seems completely broken at the moment. I am going to use this issue just to share my initial thoughts on it. I think @PaulMcInnis and other contributors have already figured a lot of this out, so this may just be more like documentation. Glassdoor seems to have a more obfuscated method of requesting jobs than Indeed:

Here is an example request for software engineering jobs in London:

"https://www.glassdoor.ca/Job/london-england-uk-software-engineer-jobs-SRCH_IL.0,17_IC2671300_KO18,35.htm"

  • The C in IC is the location type
  • The final number is the length of the whole input. For example, 35 is the length of london-england-uk-software-engineer
  • The penultimate number, 18, is the length up to the end of the location input, including the trailing hyphen. For example, 18 is the length of london-england-uk-
  • The number following IC/IS is the location code. We can get this from the location search
  • The number after SRCH_IL.0, does not seem to matter. I am going to stick with 17 for now
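
The offsets above can be sketched as a small URL builder. This is an illustration only: `build_search_url` and its parameter names are hypothetical, and the slugs are assumed to be already lower-cased and hyphenated. In both URLs quoted in this thread, the number after `IL.0,` equals the length of the location slug (17 for london-england-uk, 8 for waterloo), so the sketch computes it that way even though it may not matter.

```python
def build_search_url(
    location_slug: str,
    keyword_slug: str,
    location_code: int,
    location_type: str = "C",
    base: str = "https://www.glassdoor.ca",
) -> str:
    """Assemble a Glassdoor job-search URL from its parts (hypothetical helper)."""
    # Offset where the location slug ends, e.g. 17 for "london-england-uk".
    loc_end = len(location_slug)
    # Offset just past the hyphen that follows the location slug, e.g. 18.
    kw_start = loc_end + 1
    # Length of "<location>-<keywords>", e.g. 35.
    kw_end = kw_start + len(keyword_slug)
    path = (
        f"{location_slug}-{keyword_slug}-jobs-"
        f"SRCH_IL.0,{loc_end}_I{location_type}{location_code}"
        f"_KO{kw_start},{kw_end}.htm"
    )
    return f"{base}/Job/{path}"
```

Calling `build_search_url("london-england-uk", "software-engineer", 2671300)` reproduces the example URL above, and `build_search_url("waterloo", "python", 2280158)` reproduces the path from the original reproduction steps.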

The location code is currently the part I am trying to work around. It is obtained through a call to

"https://www.glassdoor.ca/autocomplete/location?locationTypeFilters=CITY,STATE,COUNTRY&caller=jobs&term=San%20Francisco,%20CA"

This will give a response with the codes, something akin to

[
    {
        "id": 57297,
        "label": "London",
        "locationId": 7297,
        "metroId": -1,
        "stateId": 7297,
        "countryId": 2,
        "locationType": "S",
        "locationName": "London",
        "longName": "London, UK",
        "cityName": null,
        "stateName": "London",
        "countryName": "United Kingdom",
        "stateAbbreviation": "H",
        "country2LetterIso": "GB"
    },
    {
        "id": 57412,
        "label": "Greater London",
        "locationId": 7412,
        "metroId": -1,
        "stateId": 7412,
        "countryId": 2,
        "locationType": "S",
        "locationName": "Greater London",
        "longName": "Greater London, UK",
        "cityName": null,
        "stateName": "Greater London",
        "countryName": "United Kingdom",
        "stateAbbreviation": "35",
        "country2LetterIso": "GB"
    }
]

We can just take the id of the first one. This page is unfortunately protected by Cloudflare. I checked the mobile application, and it requires a login to get data, unlike Indeed. This is not ideal, as everyone would then need to put in their own credentials. I tried cloudscraper and it works sometimes, but not reliably. Any thoughts?
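
A rough sketch of the lookup described above, leaving the Cloudflare problem aside: it builds the autocomplete URL and picks the first match out of a response body that has already been fetched. The function names are hypothetical, the query parameters are copied from the URL above, and the real endpoint is assumed to return a valid JSON array shaped like the sample. Whether `id` or `locationId` is the value that belongs in the search URL is not settled in this thread; the sketch follows the comment and returns `id`.

```python
import json
from typing import Optional, Tuple
from urllib.parse import quote, urlencode

AUTOCOMPLETE_URL = "https://www.glassdoor.ca/autocomplete/location"


def build_autocomplete_url(term: str) -> str:
    """Build the location autocomplete URL for a free-text search term."""
    params = {
        "locationTypeFilters": "CITY,STATE,COUNTRY",
        "caller": "jobs",
        "term": term,
    }
    # quote (not quote_plus) encodes spaces as %20; commas are left as-is.
    return f"{AUTOCOMPLETE_URL}?{urlencode(params, quote_via=quote, safe=',')}"


def first_location(response_text: str) -> Optional[Tuple[int, str]]:
    """Return (id, locationType) of the first match, or None if there is none."""
    matches = json.loads(response_text)
    if not matches:
        return None
    first = matches[0]
    return first["id"], first["locationType"]
```

The actual HTTP request is deliberately left out, since how to get past Cloudflare reliably is the open question here.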

Once we have this issue solved, it seems like we can make calls directly to https://www.glassdoor.ca/graph and obtain results as JSON.

@PaulMcInnis
Owner Author

PaulMcInnis commented Sep 13, 2024

Maybe an auth token might honestly be the way through this - but I'd also like to point out that they actually have an API https://www.glassdoor.ca/developer/jobsApiActions.htm (not clear if this can be used here or not)

I think that with Glassdoor, using real credentials and accepting a fairly slow scrape (potentially with some user interaction at the start) that actually works is better than not having it at all.

@sammytheindi
Collaborator

API looks interesting, but I'm not sure it will work for our use case. The parameters show that every Glassdoor API call requires a partner key and id, so you would likely need to be an API partner. Would be happy to hear otherwise if anyone has any experience/further knowledge on this. The documentation seems very sparse.


You may be right regarding using real credentials. I will keep that approach as a backup strategy. For now I have a few ideas I am experimenting with, and I will share results here when I get them.
