misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation #85
Comments
Since I have a long weekend and some beers ahead of me, I am going to take this on. If you would like some design input, please speak up ASAP. The result should be easier to work with, but without creating breaking changes. 🍻
I am working on the branch
This is awesome, Paul! Much needed improvement. When I was doing Indeed testing, I had similar thoughts about our current model with regards to, as you say, the misused abstractions + monolithic model. What I was working on, which is similar, is here --> thebigG@dc6b317 Not sure if it will help, because your ABC implementation looks well thought out and polished. I'm really excited to see this change being merged into Really excited for these changes 🚀
I'm curious about something: you want to have a class per scraper, per country? Meaning if Scraper X supports countries A and B, with this new model, we would have Hope this question makes sense. Cheers!
Yeah, definitely check out the code as it is right now in the scrapers dir for details, but I've ended up with a BaseIndeedScraper(Scraper) with all the logic, and then to define the headers and locales I have very simple inherited classes for Canadian and American Indeed, though in this case I think they are the same. I definitely recommend holding off on making changes for a bit; if you have some time we can work together later to re-up any missing coverage. It should be easier to test a lot of this.
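The base-plus-locale-subclass shape described above might look roughly like the following. This is a minimal sketch, not the branch's actual code: the class names, `domain` property, and headers are illustrative assumptions.

```python
from abc import ABC, abstractmethod


class BaseIndeedScraper(ABC):
    """All shared Indeed scraping logic lives in the base class."""

    @property
    @abstractmethod
    def domain(self) -> str:
        """Country-specific top-level domain, e.g. 'ca' or 'com'."""

    @property
    def headers(self) -> dict:
        # Shared headers; only the pieces that differ per country vary.
        return {
            "User-Agent": "Mozilla/5.0",
            "Host": f"www.indeed.{self.domain}",
        }

    def search_url(self, query: str) -> str:
        return f"https://www.indeed.{self.domain}/jobs?q={query}"


class IndeedScraperCAEng(BaseIndeedScraper):
    """Canadian Indeed: only locale-specific details are defined here."""
    domain = "ca"


class IndeedScraperUSAEng(BaseIndeedScraper):
    """American Indeed: identical logic, different domain."""
    domain = "com"
```

So to the question above: yes, one tiny subclass per scraper-per-country, but the subclasses carry only locale data while the shared logic stays in one place.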
Yeah, definitely! I have a better grip on my schedule now, so I should be available throughout the week. I'll be merging the
OK, I'm just about happy with it, leaving FIXMEs as I go. I've got all the functionality going for Indeed right now. I've built it so that there are no changes to the headers of CSVs, so as not to surprise people upgrading; we can always do a major upgrade, but we ought to provide an upgrade script. Unfortunately it's a pretty hefty change right now, and will make diffs complex for any open branches.
I saw that you removed
Bringing Monster up to speed; opted to keep the existing CLI args and YAML for now. In the future we will want to have our schema validation use the localization somehow. I want the settings to be such that you specify a country and a language, so that we choose the scraper (if it exists), vs specifying the scraper directly with a convoluted name. Might require some registries or something. Still work to do! I will definitely appreciate help getting testing back up to speed once the PR is in a better state.
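The "some registries" idea above could be sketched as a decorator-populated lookup table keyed by (provider, country, language). This is a hypothetical illustration: the registry shape, key tuple, and class names are assumptions, not the branch's design.

```python
# Maps (provider, country, language) -> scraper class.
SCRAPER_REGISTRY = {}


def register_scraper(provider: str, country: str, language: str):
    """Class decorator that records a scraper under its locale key."""
    def wrap(cls):
        SCRAPER_REGISTRY[(provider, country, language)] = cls
        return cls
    return wrap


@register_scraper("indeed", "CA", "en")
class IndeedScraperCAEng:
    """Stub standing in for a real locale-specific scraper."""
    pass


def get_scraper(provider: str, country: str, language: str):
    """Resolve a scraper from user-facing settings, or fail clearly."""
    try:
        return SCRAPER_REGISTRY[(provider, country, language)]
    except KeyError:
        raise NotImplementedError(
            f"No {provider} scraper for {country}/{language}"
        )
```

With something like this, the YAML only needs a country and language, and the convoluted class name never leaks into user config.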
I'll be changing the testing suite little by little so that, hopefully, when the
👌 I'm taking a little longer than I wanted; still redesigning a bit, want to get the right level of automation and flexibility. Enjoying working on jobfunnel right now though, this is a good challenge.
OK. Since I need to add
OK, I am now happy with the BaseScraper and IndeedScraper.X design. I recommend checking those out for a clearer idea of how we can test with them; it's pretty flexible, but it also makes writing the scrapers pretty quick, with loads of validation. Big TODOs right now are to get Glassdoor and Monster going, and to update all the documentation. Open to feedback, of course. It'll be a new major revision update, because I have redesigned the YAML and CLI to better reflect the localization that we now support.
Awesome. Will be updating the testing suite ASAP.
Looking through the code, there is a lot going on; a lot has changed. So it'll definitely take me some time to catch up 😅. I suspect that testing is going to change drastically with all these changes, so when I'm all caught up, I'll be updating the testing suite to adhere to the new OOP model. It's looking very good, by the way! Really like the encapsulation; it'll definitely allow us to extend and maintain JobFunnel with far more ease going forward.
Yeah, it's a big one - really I should have separated it out a bit more, but I didn't wanna compromise. Would super appreciate testing help; Indeed and Monster are now in a good state - take your time though. I'll fix the build and implement some basic testing before I open the PR. This will give us internationalization finally too 👍
Status:
Making progress, just a bit slow because I need to fix the static Glassdoor scraping, which seems to yield a high number of duplicates; still investigating, but I added wage scraping while I was working on it. Getting faster at using Chrome's inspect-element with BeautifulSoup 4 to make short work of issues. Also added a remote field, since with COVID remote work is becoming the new norm and we should support scraping it.
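The inspect-element-plus-BeautifulSoup workflow mentioned above amounts to finding the tag/class for a field in devtools, then pulling it out with `find`. A rough sketch of the wage and remote fields: the HTML snippet and class names here are invented stand-ins, not Glassdoor's real markup.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for one scraped job-listing card.
html = """
<li class="jobListing">
  <span class="job-title">Data Engineer</span>
  <span class="salary-estimate">$70K-$90K (Employer est.)</span>
  <span class="job-location">Remote</span>
</li>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("li", class_="jobListing")


def extract_wage(card) -> str:
    """Wage text if the listing shows one, else empty string."""
    tag = card.find("span", class_="salary-estimate")
    return tag.get_text(strip=True) if tag else ""


def extract_remote(card) -> bool:
    """True when the location field flags the posting as remote."""
    tag = card.find("span", class_="job-location")
    return bool(tag) and "remote" in tag.get_text().lower()
```

Guarding every `find` against `None` like this is what keeps one missing field from killing a whole scrape page.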
The new code base is making a lot of sense. I will very soon start testing the new parser, which is on
RE: the dynamic and static Glassdoor implementations, it makes more sense for us to use the webdriver only when we either get CAPTCHA'd, to bypass the captcha, or to deal with issue #87. Due to that issue I've also disabled Glassdoor for now. I've also added
Glad to hear it! I think this will make it a lot easier to maintain, and we can see about getting more internationalized scrapers once this goes in. FYI, I just pushed some further fixes for
Ended up spending some time with the TFIDF vectorizer again; I need to streamline filtering in general. I'm also improving the error handling in this method. Ideally we want to be able to preempt individual jobs before we are getting/setting any delayed stuff. This is achievable, but warrants a class-based TFIDF tool. I also went through my todo list and captured everything; this is getting really big, but I am aiming to open a PR by the end of this weekend, as most of the heavy lifting is now done. Review the TODO for an idea of what will still be changing.
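A "class-based TFIDF tool" as described above might look something like this. Note this is a self-contained toy: JobFunnel uses scikit-learn's `TfidfVectorizer`, whereas this sketch hand-rolls a tiny TF-IDF cosine similarity so the shape of the class is visible; the name `TfidfFilter` and the 0.75 threshold are assumptions.

```python
import math
import re
from collections import Counter


class TfidfFilter:
    """Flags new job descriptions too similar to ones already seen."""

    def __init__(self, existing_descriptions, max_similarity=0.75):
        self.docs = [self._tokens(d) for d in existing_descriptions]
        self.max_similarity = max_similarity
        n = len(self.docs)
        # Smoothed inverse document frequency over the existing corpus.
        df = Counter(t for doc in self.docs for t in set(doc))
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

    @staticmethod
    def _tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def _vector(self, tokens):
        tf = Counter(tokens)
        return {t: c * self.idf.get(t, 1.0) for t, c in tf.items()}

    def is_duplicate(self, description) -> bool:
        """True if description's cosine similarity to any known doc
        meets or exceeds the threshold."""
        new = self._vector(self._tokens(description))
        for doc in self.docs:
            old = self._vector(doc)
            dot = sum(new[t] * old.get(t, 0.0) for t in new)
            norm = (math.sqrt(sum(v * v for v in new.values()))
                    * math.sqrt(sum(v * v for v in old.values())))
            if norm and dot / norm >= self.max_similarity:
                return True
        return False
```

Wrapping the vectorizer in a class like this is what makes per-job preemption possible: the filter can be consulted before any delayed get/set work is queued for a job.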
Big update to filtering: we now emit lists of DuplicateJob, which allows us to handle updates to our master CSV when we see new duplicates.
Reviewed the current open issues; this also addresses / makes it possible to fix:
Identified a major bug, which is that we need to manage getting and setting job fields which have dependencies, i.e. worker 1 needs I'm going to try and resolve this using Ideally we would have clear dependencies configured so we could calculate a dependency map and work leaves-first, but I think await will suffice.
In resolving this synchronisation issue, I will also need to perform on-the-fly delaying, and add a queue for GET requests so that they can be handled in one place. This will have the advantage of not requiring the scraper to define which get/set operations require delaying, and should let us run at maximum speed (i.e. delays are exactly the minimum). The remaining issue is how to share Job amongst the workers, so that when we are setting a job field from another job field we don't need to re-calculate the prerequisite - so far the only solutions I can think of are some kind of queue or event with callbacks for emitting worker results, or a job-wise cache with access guarded by a mutex.
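Of the two options above, the mutex-guarded job-wise cache is the simpler to sketch. This is a hypothetical illustration, not the branch's code; the class and method names are invented.

```python
import threading


class JobFieldCache:
    """Shared cache of computed job fields, guarded by a mutex.

    The first worker to need (job_id, field) computes it; later
    workers reuse the cached value instead of re-fetching it.
    """

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def get_or_compute(self, job_id, field, compute):
        key = (job_id, field)
        with self._lock:
            if key not in self._values:
                # Computed at most once per (job, field) pair.
                self._values[key] = compute()
            return self._values[key]
```

One deliberate trade-off in this sketch: holding the lock during `compute()` serializes prerequisite computation, which guarantees no duplicate work but blocks other workers; a per-key lock would allow more parallelism at the cost of complexity.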
I have already started testing. Like I said before starting to test. Love the new JobFunnel architecture by the way, great job 🚀
@thebigG FYI, I've just fixed a bug with the CLI parser where YAML paths were not being respected.
Almost there; just need to do some testing to discover all the bugs I don't know about yet. Feel free to try it out, anyone - if you find a bug, post about it here.
Fixed another bug I found where non-existent config keys were being accessed; pushed already @thebigG. Starting to wrap up the work now, but I need to add a Session class that will handle respectful delaying, so that workers cannot perform staggered delays.
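The respectful-delaying Session idea could be as small as a lock plus a timestamp: every worker asks the shared object for its turn before issuing a GET, so the minimum delay holds globally rather than per worker. A minimal sketch under those assumptions (the class name and API are invented):

```python
import threading
import time


class DelayedSession:
    """Serializes outgoing GETs so every request respects a minimum
    delay, no matter which worker thread issues it."""

    def __init__(self, min_delay_sec: float):
        self.min_delay_sec = min_delay_sec
        self._lock = threading.Lock()
        self._last_request = 0.0

    def wait_turn(self):
        # Workers call this right before a GET; it blocks until
        # enough time has passed since the last request by anyone.
        with self._lock:
            now = time.monotonic()
            wait = self._last_request + self.min_delay_sec - now
            if wait > 0:
                time.sleep(wait)
            self._last_request = time.monotonic()
```

Because the timestamp lives in one shared place, two workers can never each think "my last request was a while ago" and fire simultaneously, which is exactly the staggered-delay failure mode mentioned above.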
OK, looks like there are still CLI parsing bugs... going to need to write some actual unit tests to fix this properly. Will spend some time on this in the coming week so we can prepare to merge. Interested in getting this out within the next 2 weeks, so that we can fix the broken scrapers on the current release.
Description
Currently we are using the JobFunnel class for too much; I want to break it down into the following:
Note: if I get to it, I'd also like our filters to be an ABC.
Steps to Reproduce
This is a structural technical debt issue. (n/a)
Expected behavior
The abstract base class should not be halfway abstract; we need separation between JobFunnel, main(), and the inherited scrapers.
Actual behavior
JobFunnel being monolithic and half-abstract has allowed us to implement three script-like scrapers which share too many methods, without an actual Job object.
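The "actual Job object" the issue calls for is a natural fit for a dataclass that every scraper produces, decoupling the jobs from the scraper that found them. A sketch under assumptions: the field names here (key_id, locale, wage, remote, etc.) echo fields discussed in this thread but are not the final schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Job:
    """A single posting, independent of which scraper produced it."""
    key_id: str                      # stable ID used for de-duplication
    title: str
    company: str
    locale: str = "en_CA"            # drives localization-aware handling
    wage: str = ""                   # raw wage text, if scraped
    remote: bool = False             # new remote field (COVID-era norm)
    post_date: Optional[date] = None
    tags: List[str] = field(default_factory=list)
```

With a shared Job type, filters, caching, and CSV writing can all operate on one structure instead of three script-like scrapers each inventing their own.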
Environment
n/a
Current Status:
Job Object
Support for Internationalization
BaseScraper with get/set scraping logic
New YAML and CLI implemented
Schema Validation with Cerberus
Caching
Filtering with lists
Indeed
Monster
GlassDoorStatic (works but seems to have bugs, so fixing this)
Wage Scraping
GlassDoor Dynamic/Driven
Duplicates list file support
Integrate TFIDF similarity filter (special case filter)
Prevent writing out empty CSVs in --no-scrape mode
Prevent delayed get/set for jobs which fail filters
Fix multi-page Monster scraping
handle duplicated jobs special case
Make JobFilter class
Add TAG scraping to Monster
Implement job filtering as own class
Fix paths from -s YAML being overwritten with defaults from the CLI
Fix concurrency issue with dependencies for get/set
Monkey / general usability testing
Update main README
Update other READMEs + tutorials
Add versioning to cache files (i.e. a wrapper for dict with metadata)
Review various FIXMES in-code
Fix build (Travis CI)
Test setup.py
Fix demo GIF
Document how to write new scrapers with localization
Future work: