misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation #85

PaulMcInnis · 2020-08-01T20:20:25Z

Description

Currently we are using the JobFunnel class for to much, I want to break it down into the following:

Job(object):
    def __init__(self, title: str, company: str, location: str, tags: List[str], post_date: datetime.date, key_id: str, url: str) -> None:
        ...

Scraper(ABC):

    @abstractmethod
    def scrape(self) -> List[Job]:
        pass
    

main():
    
    # instantiate scrapers

    # run filter on list of Job

    # dump pickle

    # writeout CSV

Note: if I get to it, I'd also like our filters to be an ABC.

Steps to Reproduce

This is a structural technical debt issue. (n/a)

Expected behavior

Abstract base class should not be halfway abstract, Need seperation between JobFunnel and main() and inherited scrapers.

Actual behavior

JobFunnel being monolithic and half-abstract has allowed us to implement three script-like scrapers which share too many methods, without an actual Job object.

Environment

n/a

Current Status:

Future work:

Google jobs scraper
Ycombinator job scraper
Assess the update experience from V2.0 --> V3.0, provide a guide
cut a release
Add WAGE scraping to Indeed
Add REMOTE scraping to Indeed
Add REMOTE scraping to Monster

The text was updated successfully, but these errors were encountered:

PaulMcInnis · 2020-08-01T20:21:37Z

Since I have a long weekend and some beers ahead of me, I am going to take this on.

If you would like some design input, please speak up ASAP.

The result should be easier to work with, but without creating breaking changes.

🍻

PaulMcInnis · 2020-08-01T23:14:39Z

I am working on the branch ABCJobFunnel

thebigG · 2020-08-01T23:56:54Z

This is awesome, Paul! Much needed improvement. When I was doing Indeed testing, I had similar thoughts about our current model in regards to, how you say, misused abstractions+ monolithic model. What I sort of was working on which is similar is here--> thebigG@dc6b317

Not sure if it will help because your ABC implementation looks well thought out and polished.

I'm really excited to see this change being merged into master! I don't know how long this implementation will take to finish but I'll hold off testing, at least for the scrapers, until these architecture changes come through so we don't end up having tests we'll have to refactor entirely within the month 😅 .

Really excited for these changes 🚀

thebigG · 2020-08-02T00:24:00Z

I'm curious about something; you want to have a class per scraper-per-country? Meaning if Scraper X supports countries A and B, with this new model, we would have class BaseXScraper, class XScraperA and class XScraperB? Am I understanding the thinking with this model correctly? I ask because, at least to me, this model makes a lot of sense. I like it a lot because it really just handles Internationalization, at least at the ABC abstract level, very well. And this might be the way to go to handle internalization because it's hard to assume(just like we've talked about in the past regarding this issue) that job sites will play nicely across different countries when scraping.

Hope this question makes sense.

Cheers!

PaulMcInnis · 2020-08-02T00:46:55Z

Yeah, defo check out the code as it is rn in the scrapers dir for details, but ive ended up with a BaseIndeedScraper(Scraper), with all the logic and then to define the headers and locales i have very simple inherited classes for canada and american indeed, though in this case i think they are the same.

Defo recommend holding off on making changes for a bit, if you have some time we can work together later to re-up any missing coverage, it should be easier to test alot of this.

thebigG · 2020-08-02T01:01:12Z

Yeah, definitely! I have a better grip of my schedule now so I should be available throughout the week. I'll be merging the ABC branch with my testing branch(https://github.com/thebigG/JobFunnel/tree/testing) and update tests accordingly with your changes 👍 .

PaulMcInnis · 2020-08-03T23:38:58Z

OK, i'm just about happy with it, leaving FIXMEs as I go. I've got all the functionality going for indeed right now.

I've built it so that there are no changes to the header of CSV's so as not to suprise people upgrading, we can always do a major upgrade but we ought to provide an upgrade script.

Unfortunately it's a pretty hefty change rn, will make diffs complex for any open branches.

thebigG · 2020-08-05T01:23:00Z

I saw that you removed config_factory and change_nested_dict which makes a lot of sense because those functions are only used by testing suite. If you don't mind, I'll be placing those functions insise conftest.py so the testing suite can use since all of our testing depend on it. This is all assuming that the way(for the most part) the command line arguments are parsed won't change, aside from some code re-organization(?).

PaulMcInnis · 2020-08-05T01:25:19Z

Bringing monster up to speed, opted to keep existing CLI args and YAML for now.

In the future we will want to have our schema validation using the localization somehow.

I want to have settings be such that you specify a country and language somehow so that we choose the scraper (if it exists) vs specifying the scraper directly with a convoluted name. Might require some registries or something.

Still work to do! I will definitely appreciate help getting testing back up to speed once the PR is in a better state.

thebigG · 2020-08-05T01:44:06Z

I'll be changing the testing suite little by little so that, hopefully, when the ABC re-modeling is done, the testing for the new code base is up to speed with the changes.

PaulMcInnis · 2020-08-05T22:03:26Z

👌 I'm taking a little longer than I wanted, still redesigning a bit, want to get the right level of automation and flexibility.

Enjoying working on jobfunnel right now though, this is a good challenge.

PaulMcInnis · 2020-08-06T23:31:22Z

ok. since I need to add locale to the settings YAML i'm going to also make some improvements to how we handle defaults, and validate.

PaulMcInnis · 2020-08-10T13:12:19Z

Ok, I am now happy with the BaseScraper and IndeedScraper.X design.

I recommend checking those out for a clearer idea of how we can test with them, it's pretty flexible but also it makes writing the scrapers pretty quick with loads of validation.

big TODO:s rn are to get Glassdoor and Monster going, and to update all the documentation.

Open to feedback ofc.

it'll be a new major revision update because I have redesigned the YAML and CLI to better reflect the localization that we now support.

thebigG · 2020-08-11T01:23:39Z

Awesome. Will be updating the testing suite ASAP.

thebigG · 2020-08-11T02:15:14Z

Looking through the code, there is a lot going on; a lot has changed. So it'll definitely take me sometime to catch up 😅. I suspect that testing is going to change drastically with all these changes, so when I'm all caught up, I'll be updating the testing suite to adhere to the new OOP model. It's looking very good by the way! Really like the encapsulation implemented; it'll definitely allow us to extend and maintain JobFunnel with far more ease going forward.

PaulMcInnis · 2020-08-11T02:39:58Z

Yeah its a big one - really I should have separated it out a bit more but i didn't wanna compromise. Would super appreciate testing help, indeed and monster are now in a good state - take your time though. Ill fix the build and implement some basic testing before i open the PR. This will give us internationalization finally too 👍

PaulMcInnis · 2020-08-13T01:50:06Z

Status:
(moved checklist to main desc for visibility)

PaulMcInnis · 2020-08-20T01:45:13Z

Making progress, just a bit slow because I need to fix the Static glassdoor scraping, which seems to yield a high number of duplicates, still investigating, but I added Wage scraping while I was working on it.

Getting faster at using chrome's inspect elements with beautiful soup 4 to make short work of issues.

Also added a remote field since with COVID remote work is becoming a new norm and we should support scraping it.

thebigG · 2020-08-21T01:21:14Z

The new code base is making a lot of sense. I will very soon start testing the new parser, which is on cli.py. Yeah, the remote field makes a lot of sense.

PaulMcInnis · 2020-08-21T01:23:46Z

RE: dynamic and static glassdoor implementation, it makes more sense for us to use the webdriver only when we either get CAPTCHA'd, to bypass the captcha, or to deal with issue #87 . Due to that issue I've also disabled Glassdoor for now.

I've also added REMOTE and WAGE jobfelds which as this information is available on indeed and monster.

PaulMcInnis · 2020-08-21T01:24:32Z

@thebigG

The new code base is making a lot of sense. I will very soon start testing the new parser, which is on cli.py. Yeah, the remote field makes a lot of sense.

Glad to hear it! I think this will make it alot easier to maintain and we can see about getting more internationalized scrapers once this goes in.

FYI I Just pushed some further fixes for cli.py which further improve the schema defaults and cli interaction

PaulMcInnis · 2020-08-22T02:55:28Z

ending up spending some time with TFIDF vectorizer again, need to streamline filtering in general. I'm also improving the error handling in this method as well.

Ideally we want to be able to prempt individual jobs before we are getting/setting any delayed stuff. This is accomplishable but warrants a class-based TFIDF tool.

I also went through my todo list and captured everything, this is getting really big but I am aiming to open a PR by the end of this weekend, as most of the heavy lifting is now done.

Review the TODO for an idea of what will still be changing.

PaulMcInnis · 2020-08-24T12:45:26Z

Big update to filtering, now we emit lists of DuplicateJob which allows us to handle updates to our master CSV when we see new duplicates.

PaulMcInnis · 2020-08-24T22:39:36Z

Reviewed the current open issues, this also addresses / makes possible to fix:
#83 - added preemption for job get/set when an attribute fails a check.
#12 - added warnings and handling for empty search results
#60, #45, #37 - added support for localised scrapers with custom URLs and customisation of scraping without needing to copy/re-write lots of code (via inheritance)

PaulMcInnis · 2020-08-25T14:02:37Z

Identified a major bug which is that we need to manage getting and setting job fields which have dependencies.

i.e. worker 1 needs Job.KEY_ID to be populated so it can add a Job.DESCRIPTION, but worker 2 is getting or setting other things.

Im going to try and resolve this using asyncio so that workers can perform non-blocking waits for their needed fields to be populated.

Ideally we would have clear dependencies configured so we could calculate a dependency map and work from leaves-first, but I think await will suffice.

PaulMcInnis · 2020-08-27T13:53:14Z

In resolving this synchronisation issue, I will also need to perform on-the-fly delaying, and a queue for get requests so that they can be handled in one place. This will have the advantage of not requiring the scraper to define what get/set operations require delaying and should let us run at maximum speed (i.e. delays are exactly the minimum).

The remaining issue is to somehow share Job amongst the workers so that when we are setting a job field with another job field we aren't needing to re-calculate the pre-requisite - so far the only solution I can think of is some kind of queue or event with callbacks for emitting worker results, or a job-wise cache with access guarded by mutex.

thebigG · 2020-08-28T00:59:32Z

I have already started testing. Like I said before starting to test cli.py and will take it from there.

Love the new JobFunnel architecture by the way, great job 🚀

PaulMcInnis · 2020-08-29T01:37:45Z

@thebigG FYI, I've just fixed a bug with the CLI parser where YAML paths were not being respected.

PaulMcInnis · 2020-08-29T03:20:22Z

Almost there, just need to do some testing to discover all the bugs I don't know about yet.

Feel free to try it out anyone, if you find a bug post about it here.

PaulMcInnis · 2020-08-29T16:16:04Z

fixed another bug I found where non-existant config keys were being accessed, pushed already @thebigG

Starting to wrap up the work now, but I need to add a Session class that will handle respectul delaying so that workers cannot perform staggered delays.

PaulMcInnis · 2020-09-05T16:29:25Z

OK looks like there is still CLI parsing bugs... going to need to write some actual unit tests to fix this properly. Will spend some time on this in the coming week so we can prepare to merge.

Interested in getting this out within next 2 weeks so that we can fix broken scrapers on current release.

PaulMcInnis added the enhancement label Aug 1, 2020

PaulMcInnis self-assigned this Aug 1, 2020

PaulMcInnis added this to the 2.0.0 milestone Aug 7, 2020

PaulMcInnis changed the title ~~misuse of abstract base classes + monolithic JobFunnel class~~ misuse of abstract base classes + monolithic JobFunnel class + improved schema validation + localization Aug 7, 2020

PaulMcInnis changed the title ~~misuse of abstract base classes + monolithic JobFunnel class + improved schema validation + localization~~ misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation Aug 7, 2020

thebigG mentioned this issue Aug 18, 2020

[Feature Request] Export as JSON #86

Closed

PaulMcInnis mentioned this issue Aug 21, 2020

GlassDoor support (fix and re-enable) #87

Open

PaulMcInnis modified the milestone: 2.0.0 Aug 21, 2020

PaulMcInnis mentioned this issue Aug 29, 2020

JobFunnel 3.0 with localization, ABC and improved scraping #90

Merged

17 tasks

PaulMcInnis linked a pull request Aug 29, 2020 that will close this issue

JobFunnel 3.0 with localization, ABC and improved scraping #90

Merged

17 tasks

PaulMcInnis closed this as completed in #90 Sep 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation #85

misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation #85

PaulMcInnis commented Aug 1, 2020 •

edited

Loading

PaulMcInnis commented Aug 1, 2020

PaulMcInnis commented Aug 1, 2020

thebigG commented Aug 1, 2020

thebigG commented Aug 2, 2020 •

edited

Loading

PaulMcInnis commented Aug 2, 2020 •

edited

Loading

thebigG commented Aug 2, 2020

PaulMcInnis commented Aug 3, 2020 •

edited

Loading

thebigG commented Aug 5, 2020

PaulMcInnis commented Aug 5, 2020 •

edited

Loading

thebigG commented Aug 5, 2020

PaulMcInnis commented Aug 5, 2020

PaulMcInnis commented Aug 6, 2020

PaulMcInnis commented Aug 10, 2020 •

edited

Loading

thebigG commented Aug 11, 2020

thebigG commented Aug 11, 2020

PaulMcInnis commented Aug 11, 2020 via email •

edited

Loading

PaulMcInnis commented Aug 13, 2020 •

edited

Loading

PaulMcInnis commented Aug 20, 2020 •

edited

Loading

thebigG commented Aug 21, 2020

PaulMcInnis commented Aug 21, 2020

PaulMcInnis commented Aug 21, 2020 •

edited

Loading

PaulMcInnis commented Aug 22, 2020 •

edited

Loading

PaulMcInnis commented Aug 24, 2020

PaulMcInnis commented Aug 24, 2020

PaulMcInnis commented Aug 25, 2020

PaulMcInnis commented Aug 27, 2020

thebigG commented Aug 28, 2020 •

edited

Loading

PaulMcInnis commented Aug 29, 2020

PaulMcInnis commented Aug 29, 2020

PaulMcInnis commented Aug 29, 2020

PaulMcInnis commented Sep 5, 2020

misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation #85

misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation #85

Comments

PaulMcInnis commented Aug 1, 2020 • edited Loading

Description

Steps to Reproduce

Expected behavior

Actual behavior

Environment

Current Status:

Future work:

PaulMcInnis commented Aug 1, 2020

PaulMcInnis commented Aug 1, 2020

thebigG commented Aug 1, 2020

thebigG commented Aug 2, 2020 • edited Loading

PaulMcInnis commented Aug 2, 2020 • edited Loading

thebigG commented Aug 2, 2020

PaulMcInnis commented Aug 3, 2020 • edited Loading

thebigG commented Aug 5, 2020

PaulMcInnis commented Aug 5, 2020 • edited Loading

thebigG commented Aug 5, 2020

PaulMcInnis commented Aug 5, 2020

PaulMcInnis commented Aug 6, 2020

PaulMcInnis commented Aug 10, 2020 • edited Loading

thebigG commented Aug 11, 2020

thebigG commented Aug 11, 2020

PaulMcInnis commented Aug 11, 2020 via email • edited Loading

PaulMcInnis commented Aug 13, 2020 • edited Loading

PaulMcInnis commented Aug 20, 2020 • edited Loading

thebigG commented Aug 21, 2020

PaulMcInnis commented Aug 21, 2020

PaulMcInnis commented Aug 21, 2020 • edited Loading

PaulMcInnis commented Aug 22, 2020 • edited Loading

PaulMcInnis commented Aug 24, 2020

PaulMcInnis commented Aug 24, 2020

PaulMcInnis commented Aug 25, 2020

PaulMcInnis commented Aug 27, 2020

thebigG commented Aug 28, 2020 • edited Loading

PaulMcInnis commented Aug 29, 2020

PaulMcInnis commented Aug 29, 2020

PaulMcInnis commented Aug 29, 2020

PaulMcInnis commented Sep 5, 2020

PaulMcInnis commented Aug 1, 2020 •

edited

Loading

thebigG commented Aug 2, 2020 •

edited

Loading

PaulMcInnis commented Aug 2, 2020 •

edited

Loading

PaulMcInnis commented Aug 3, 2020 •

edited

Loading

PaulMcInnis commented Aug 5, 2020 •

edited

Loading

PaulMcInnis commented Aug 10, 2020 •

edited

Loading

PaulMcInnis commented Aug 11, 2020 via email •

edited

Loading

PaulMcInnis commented Aug 13, 2020 •

edited

Loading

PaulMcInnis commented Aug 20, 2020 •

edited

Loading

PaulMcInnis commented Aug 21, 2020 •

edited

Loading

PaulMcInnis commented Aug 22, 2020 •

edited

Loading

thebigG commented Aug 28, 2020 •

edited

Loading