- edit caching with decorator pattern
- add all google search params to config
- write functional tests
- add sqlalchemy support for results
- add better proxy handling
- extend parsing functionality
- update readme
- prevent parsing the config twice

04.11.2014:
- Refactor code, change docstrings to google format https://google-styleguide.googlecode.com/svn/trunk/pyguide.html#Commentss [done]

15.11.2014:
- add shell access with sqlalchemy session [done]
- test selenium mode thoroughly [done]
- double check selectors
- add alternative selectors
- Add gevent support
- make all modes workable through proxies [done for http and sel]
- update README [done]
- write blog post that illustrates usage of GoogleScraper
- some testing
- release version 0.2.0 on the cheeseshop
- released version 0.1.5 on PyPI [done]

11.12.2014
- JSON output is still slightly corrupt
- CSV output is probably also not ideal.
- Improve documentation following the Google style guide
- Maybe add other search engines!
- finally implement async mode!!!

30.12.2014:
- Fixed issue #45 [done]

02.01.2015:
- Check output verbosity levels and modify them. [done]

13.01.2015:
- Handle sigint. Then close all open files (csv, json).
15.01.2015:
- Implement JSON static tests [done]
- Implement CSV static tests [done]
- Catch Resource warnings in testing [done]
- Add no_results_selectors for all search engines [done]
- add test for no_results_selectors parsing [done]
- Add page number selectors for all search engines [done]
- add static tests [done]
- add fabfile (google a basic template) for []
    - adding & committing and uploading to master []
    - push to the cheeseshop []
    - add function in fabfile that pushes to cheeseshop only after all tests were successful []
- Add functionality that distinguishes the page number of serp pages when caching []
- implement async mode [done]
- read 20 minutes about the asyncio built-in module and decide whether it fits my needs [done]

18.01.2015
- add four different examples:
    - a basic usage [done]
    - using selenium mode [done]
    - using http mode [done]
    - using async mode [done]
    - scraping with a keywords.py module
    - scraping images [done]
    - finding plagiarized content [done]
- Add dynamic tests for selenium mode:
    - Add event: No results for this query.
    - Test Impossible query:
        -> Cannot have next_results page
        -> No results [done]
        -> But still save serp page. [done]
        -> add to missed keywords []
- What is the best way to detect that the page loaded????
    -> Research, read about selenium
- Add test for duckduckgo
- Fix: If there's no internet connection, "Malicious request detected" is shown. Show "no internet connection" instead.
- FIGURE OUT: WHY THE HELLO DOES DUCKDUCKGO NOT WORK IN PHANTOMJS?

05.10.2015
- Switch configuration from INI format to plain python code [Done]
- recode parse logic for configuration [Done]
    Command Line Settings > Command Line Configuration File > Builtin Configuration File
- rebuild logging system. Create a dedicated logger for each submodule. [Done]
    Set the loglevel for each logger to the value which was specified in the configuration [Done]
    => Logging only reports events. Results are printed according to a dedicated option in the config file.
- write tests for all search engines and for all major modes in the source directory. Enable a flag which runs the tests automatically. Distinguish between long tests and short ones.
- Look at where some big open source python projects store their tests (pelican, requests)

30.11.2015
- Find good resources to learn how to test code correctly [DONE: 12min], found the following links:
    - http://docs.python-guide.org/en/latest/writing/tests/ ==> LEARNED:
        - put test suites that require some complex data structures to load (such as websites to scrape) in separate test suites
        - run all (fast) tests before committing code
        - run all (including slow ones) before pushing code to master
        - use tox for testing the code with multiple interpreter configurations
        - mock lets you monkey patch functionality in the code such that it returns whatever you want
    - http://codeutopia.net/blog/2015/04/11/what-are-unit-testing-integration-testing-and-functional-testing/ ==> LEARNED:
        - unit tests don't make use of external resources such as databases or network
        - code that is hard to unit test is often poorly designed
        - integration test: tests how parts of the system work together
        - functional tests: test the complete functionality of the system
        - only a small number of functional tests are required: They make sure the app works as a whole.
        - "testing common user interactions"
        - functional tests are validated in the same way as a user who uses the tool.
        - unit/integration tests are validated with code
        - don't make them too fine grained!
    - https://code.google.com/p/robotframework/wiki/HowToWriteGoodTestCases ==> LEARNED:
        - never sleep in the code: safety margins take too long in your code (use polls instead)
    - http://blog.agilistic.nl/how-writing-unit-tests-force-you-to-write-good-code-and-6-bad-arguments-why-you-shouldnt/ ==> LEARNED:
        - Classes should be loosely coupled
        - avoid cascade of changes when changing one class
        - maximize encapsulation in classes
        - classes should have one responsibility
        - avoid large and tightly coupled classes
        - unit test should test the function/class without any dependencies
        - unit test tests one thing
        - avoid like the PEST: tightly coupled functions/classes, difficult to understand classes/functions, functions that do many things, not intuitive classes/functions (bad interface)
    - http://www.toptal.com/python/an-introduction-to-mocking-in-python ==> LEARNED:
        - instead of testing a function's effects, we can mock the underlying operating system api and check that an os function was called with certain parameters. This enables us to verify that os code was called with the correct parameters.
    - http://pytest.org/ ==> LEARNED:
        - How pytest can be invoked: http://pytest.org/latest/usage.html
        - pytest can yield more information in the traceback with the -l option
        - pytest can be called within python: http://pytest.org/latest/usage.html
        - what the directory structure for tests should look like: http://pytest.org/latest/goodpractises.html
- Read and understand the test links collected in the previous task. [Done: 75min + 25min]
- Add hook to run unit tests before committing code [Done: 9 min]: Found pre-commit hook that checks pep8 stuff and that runs unit tests here: https://gist.githubusercontent.com/snim2/6444684/raw/c7f1ec75c3cc0306bd8f36faee7dd201902528e8/pre-commit.py

--- 12 + 100 + 9 + 5 = 126min ---

1.12.2015
- Read that again: http://pytest.org/latest/example/parametrize.html [Done: 9min], did not really learn anything. It is about metaprogramming in test suites, I guess.
- Create virtualenv in project directory. [Done: 5min]
- Add hook that runs all tests before pushing to master [Done: 11min], Hook is a pre-commit hook and will execute all tests found in the directory tests/
- See whether existing test suites do work and fix all issues there. [Started: 122min], integration tests do work. Functional tests fail, because there is an issue in GoogleScraper. Update: Both integration and functional tests do work.

--- 9 + 5 + 11 + 122 = 147min ---

2.12
- Find out why the test test_google_with_phantomjs_and_json_output fails. Why is it not possible to scrape 3 pages with Google in selenium mode? [Done: 42min]: Because the next page element cannot be located in phantomjs mode for some reason.
- Why can't phantomjs locate the next page? [Done: 46min]:
    - Check version of phantomjs: 1.9.0 is my version
    - Newest version of phantomjs: 2.0, but it is too hard to install/compile
    - Reason that search is interrupted: Exception is thrown in line
- Read about worker and job patterns (consumer-producer patterns) in python. Learn about queue patterns. Read the following resources:
    - http://www.bogotobogo.com/python/Multithread/python_multithreading_Synchronization_Producer_Consumer_using_Queue.php
    - https://pymotw.com/2/Queue/
    - http://www.informit.com/articles/article.aspx?p=1850445&seqNum=8
    - http://codefudge.com/2015/09/scraping-alchemist-celery-selenium-phantomjs-and-tor.html
- read about casperJS and evaluate whether it might be interesting for GoogleScraper

3.12
- Make functional tests work again [Done: 120min]
    - Fix bug in `GoogleScraper -q 'apples' -s google -m selenium --sel-browser phantomjs -p 10`

7.12
- test that serp rank is cumulative among pages [Done: 10min]
    Rank testing doesn't make any sense.
    Reasons:
        - ranks start again in different types of serp results (ads vs normal)
        - results aren't ordered by rank in json or csv output
        - ranks don't need to be cumulative, since their absolute rank can be recalculated by multiplying with the page number.
- fix functional test issues of `test_all_search_engines_in_http_mode`
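
A small sketch of the rank note above (7.12), under assumptions that are not in the notes: every serp page holds the same number of results (`results_per_page`, default 10 here), and both ranks and page numbers start at 1. The function name `absolute_rank` is made up. Strictly speaking this is an offset by full pages rather than a plain multiplication.

```python
def absolute_rank(rank_on_page, page_number, results_per_page=10):
    """Convert a per-page rank into a rank across all pages.

    Assumes 1-based ranks and page numbers and a fixed page size.
    """
    if rank_on_page < 1 or page_number < 1 or results_per_page < 1:
        raise ValueError("rank, page number and page size must be >= 1")
    # All results on earlier pages come first, then the rank on this page.
    return (page_number - 1) * results_per_page + rank_on_page
```

This would have to be applied per result type, since ads and normal results restart their ranks separately.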