Add database #226

Merged 97 commits on Nov 17, 2023

Commits
e1eb1c1
Add subpackage 'database' - create and connect to database and databa…
Crinibus Sep 22, 2023
8042c5b
Add method to class 'Format' in 'format_to_new.py' - migrate from jso…
Crinibus Sep 22, 2023
83dfd00
Replace function 'clean_records_data' with 'clean_datapoints'
Crinibus Sep 22, 2023
7cd5d2c
Ignore lint error E501 for line with api link to Elgiganten
Crinibus Sep 22, 2023
7a95a6f
Create file 'database/functions' - add function 'get_product_by_prod…
Crinibus Sep 22, 2023
22a1fd3
Don't import Product and DataPoint as ...DB in format_to_new.py
Crinibus Sep 28, 2023
f0dd3fe
Rename Product and Datapoint to ProductInfo and DataPointInfo
Crinibus Sep 28, 2023
d253f92
Add category to database model Product
Crinibus Sep 28, 2023
d9d991f
Rename columns from 'productId' to 'product_code' for database Produc…
Crinibus Sep 28, 2023
6960524
Add sqlmodel to requirements.txt
Crinibus Oct 1, 2023
3fb5343
Delete methods in class Filemanager and create class FilemanagerLegac…
Crinibus Oct 1, 2023
6437784
Add short_url column to database Product
Crinibus Oct 1, 2023
ec23bc6
Update method 'from_json_to_db' in class Format (format_to_new.py) to…
Crinibus Oct 1, 2023
543e3a4
Add database functions - 'delete_all', 'get_product_by_product_code',…
Crinibus Oct 1, 2023
b428789
Update add_product.py to use database
Crinibus Oct 1, 2023
dcf61db
Add database functions - 'get_all_products', 'get_all_datapoints', 'g…
Crinibus Oct 1, 2023
be5ddb8
Update delete_data.py to use database
Crinibus Oct 1, 2023
aa2f09e
Add functions 'get_master_products' and 'get_products_from_master_pro…
Crinibus Oct 1, 2023
7840bf0
Move two database functions to the top of functions.py
Crinibus Oct 1, 2023
bf7179e
Reorder import from '.functions' in database/__init__.py
Crinibus Oct 1, 2023
675716d
Delete database functions 'add_product' and 'add_datapoint' and repla…
Crinibus Oct 1, 2023
788cf97
Fix database functions 'get_products_by_product_codes', 'get_products…
Crinibus Oct 1, 2023
4e94f02
Add logging to start of function 'delete'
Crinibus Oct 1, 2023
9843b89
Add database functions 'get_datapoints_by_categories', 'get_datapoint…
Crinibus Oct 1, 2023
51d10a6
Update reset_data.py to use database
Crinibus Oct 1, 2023
598882c
Update README to reflect changes of what reset does - deletes only da…
Crinibus Oct 1, 2023
c5138c1
Add database path to Filemanager and use it to create and connect to …
Crinibus Oct 2, 2023
a2235b0
Update ProductInfo - change field 'is_up_to_date' to a calculated pro…
Crinibus Oct 2, 2023
a635a27
Add database function 'get_all_products_with_datapoints'
Crinibus Oct 2, 2023
364a2cb
Add database function 'get_product_infos_from_products' and update 'g…
Crinibus Oct 2, 2023
a808ee1
Update visualize.py to use database
Crinibus Oct 2, 2023
47a6bbe
Add database function 'get_products_by_names_fuzzy'
Crinibus Oct 2, 2023
1d26869
Add database function 'get_all_unique_categories'
Crinibus Oct 2, 2023
544882e
Move database function 'get_all_unique_categories' further up in the …
Crinibus Oct 2, 2023
c2f1808
Update search_data.py to use database
Crinibus Oct 2, 2023
3706910
Delete import of List from typing - instead use built-in list as type
Crinibus Oct 2, 2023
aaa0c5d
Update function 'clean_datapoints' to use database functions instead …
Crinibus Oct 3, 2023
cf739de
Add database functions 'get_all_unique_domains', 'get_products_by_dom…
Crinibus Oct 3, 2023
ec216ac
Add static method 'scraper_to_db_product' to class Format
Crinibus Oct 3, 2023
2fd3fa6
Add static method 'db_product_to_scraper' to class Format
Crinibus Oct 3, 2023
f4a976f
Use parameter 'isActive' in method 'Format.scraper_to_db_product'
Crinibus Oct 3, 2023
77cb2c8
Move static method 'get_user_product_name' from class Format to class…
Crinibus Oct 3, 2023
13b79a1
Delete import of Config in Format.py
Crinibus Oct 3, 2023
42010e4
Add static method 'db_products_to_scrapers' to class Format
Crinibus Oct 3, 2023
ba9e007
Add database function 'get_all_active_products'
Crinibus Oct 3, 2023
9676eb1
Update database functions 'get_products_by_domains' and 'get_all_prod…
Crinibus Oct 10, 2023
4564cb2
Update scraper/__init__.py - add import of class Format and module da…
Crinibus Oct 10, 2023
ec0eb09
Update database function 'get_all_products' - add parameter to only s…
Crinibus Oct 10, 2023
c8b6ec4
Update database function 'get_all_products_with_datapoints' - add par…
Crinibus Oct 10, 2023
f07d007
Delete database function 'get_all_active_products'
Crinibus Oct 11, 2023
c4b9272
Add database functions 'group_products_by_domains' and 'group_product…
Crinibus Oct 11, 2023
61c8642
Update print_products.py to use database
Crinibus Oct 11, 2023
62fde49
Fix 'print_latest_datapoints_for_products' - group by name
Crinibus Oct 11, 2023
ff128b4
Add parameter 'categories' to function 'print_latest_datapoints'
Crinibus Oct 11, 2023
81ce2c8
Fix function 'get_master_products'
Crinibus Oct 11, 2023
fc21444
Rename functions 'add_new_product' and 'add_new_datapoint' to '..._to…
Crinibus Oct 11, 2023
a92310d
Add column 'created_at' to Product and DataPoint database tables
Crinibus Oct 11, 2023
f9fd311
Add condition to function 'add_new_datapoint_with_scraper'
Crinibus Oct 11, 2023
be855a4
Update main.py to use database
Crinibus Oct 11, 2023
d22d4cb
Update scrape.py - delete logic to save to json
Crinibus Oct 11, 2023
786f6a9
Delete misc.py
Crinibus Oct 14, 2023
eff2a46
Remove pandas from requirements.txt
Crinibus Oct 14, 2023
2166538
Add section to README about how data is stored after V3.0.0
Crinibus Oct 14, 2023
abb125b
Rename database column "isActive" to "is_active"
Crinibus Oct 14, 2023
3267877
Rename function 'active_existing_product' to 'set_existing_product_is…
Crinibus Oct 14, 2023
0663fb7
Update function 'print_all_products' - don't print product_code inste…
Crinibus Oct 14, 2023
da60462
Update function 'print_all_products' - add marker to indicate which p…
Crinibus Oct 14, 2023
a28102a
Update section "View all products" in README with new product activat…
Crinibus Oct 14, 2023
26ef5f4
Add function 'update_products_is_active_with_product_codes'
Crinibus Oct 14, 2023
f997d39
Add arguments --activate and --deactivate
Crinibus Oct 14, 2023
254c9dd
Fix missing rename of Product 'isActive' to 'is_active'
Crinibus Oct 14, 2023
9f83b78
Fix function 'search_product_name'
Crinibus Oct 14, 2023
9a8f5ee
Rename function 'search_categories' to 'search_category'
Crinibus Oct 14, 2023
c43704c
Update search_data.py - search for all search terms at the start and …
Crinibus Oct 14, 2023
8994e1d
Delete function 'search_category' and update function 'search_categor…
Crinibus Oct 14, 2023
1052416
Update database function 'get_product_infos_from_products' - handle i…
Crinibus Oct 14, 2023
4ef60aa
Update function 'get_products_by_names_fuzzy' - rename variable 'name…
Crinibus Oct 15, 2023
b3dac76
Update parameter type hint in function 'add_all'
Crinibus Oct 15, 2023
216ef11
Add logging to delete functions in delete_data.py
Crinibus Oct 15, 2023
149e02f
Rename and update method Config.get_user_product_names
Crinibus Oct 27, 2023
d6c1f35
Fix function 'scrape_with_threads'
Crinibus Oct 27, 2023
f5ce7c7
Delete function 'get_product_data'
Crinibus Nov 9, 2023
f0ff51c
Use build-in 'dict' type hint instead of 'from typing import Dict''
Crinibus Nov 9, 2023
fbdfbe5
Fix tests/test_website_handlers.py - update import of class 'Info' an…
Crinibus Nov 14, 2023
e9d5662
Delete comment
Crinibus Nov 14, 2023
1ec172a
Set attribute 'product_info' in class 'Scraper' to None in __init__ m…
Crinibus Nov 14, 2023
d38b767
Update method 'scrape_info' in class 'Scraper' to return product info
Crinibus Nov 14, 2023
95313a7
Update tests\test_add_product.py to reflect changes of database update
Crinibus Nov 14, 2023
df950bd
Add configurations for ruff in pyproject.toml
Crinibus Nov 14, 2023
30c0045
Update test_objects.json
Crinibus Nov 17, 2023
ed75b3c
Update EbayHandler
Crinibus Nov 17, 2023
add6ca4
Add if condition to function 'print_latest_datapoint'
Crinibus Nov 17, 2023
02255a3
Update README.md
Crinibus Nov 17, 2023
7ef14ec
Merge branch 'master' into add-database
Crinibus Nov 17, 2023
fccaf5b
Update main.py - create database and tables at the start of the script
Crinibus Nov 17, 2023
ecae5a3
Update README - reword the a sentence in section "UPDATE TO HOW DATA …
Crinibus Nov 17, 2023
2c8a945
Update README - Add a note to section "UPDATE TO HOW DATA IS STORED I…
Crinibus Nov 17, 2023
45 changes: 42 additions & 3 deletions README.md
@@ -57,6 +57,30 @@ In version v2.3.0, I have added the column ```short_url``` to ```products.csv```.
</p>
</details>

<details><summary><h2>UPDATE TO HOW DATA IS STORED IN V3.0.0</h2></summary>
<p>

In version v3.0.0, I have changed where data is stored from a JSON file to a SQLite database. If you have data from before v3.0.0, run the following commands in an interactive Python session to add the data from records.json to the database (**OBS: Pandas is required**):
```
>>> from scraper.format_to_new import Format
>>> Format.from_json_to_db()
```

<br/>

**NOTE:** This will replace the content in the database with what is in records.json. That means if you have products and/or datapoints in the database but not in records.json, they will be deleted.


<br/>

OBS: If you don't have Pandas installed, run this command:
```
pip3 install pandas
```
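
For reference, here is a minimal sketch of the idea behind such a migration: read the old records.json layout and write rows to a SQLite database via sqlmodel. The model fields, column names, and file paths below are illustrative assumptions, not the project's actual models (those live in the `database` subpackage):
```
# Illustrative sketch only - field names and paths are assumptions,
# not the actual models from scraper/database.
import json

from sqlmodel import Field, Session, SQLModel, create_engine


class Product(SQLModel, table=True):
    product_code: str = Field(primary_key=True)
    name: str
    category: str
    is_active: bool = True


class DataPoint(SQLModel, table=True):
    id: int | None = Field(default=None, primary_key=True)
    product_code: str = Field(foreign_key="product.product_code")
    date: str
    price: float
    currency: str


engine = create_engine("sqlite:///products.db")
SQLModel.metadata.create_all(engine)

# Pre-v3.0.0 records.json layout (as seen in the old add_product.py):
# {category: {product_name: {website_name: {"info": {...}, "datapoints": [...]}}}}
with open("records.json") as file, Session(engine) as session:
    records = json.load(file)
    for category, products in records.items():
        for name, websites in products.items():
            for website, payload in websites.items():
                info = payload["info"]
                session.add(Product(product_code=info["id"], name=name, category=category))
                for point in payload["datapoints"]:
                    session.add(
                        DataPoint(
                            product_code=info["id"],
                            date=point["date"],
                            price=point["price"],
                            currency=info.get("currency", ""),
                        )
                    )
    session.commit()
```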

</p>
</details>

<br/>


@@ -147,6 +171,19 @@ python3 main.py -s --threads

<br/>

## Activating and deactivating products

When you add a new product, it is activated for scraping by default. If you no longer want a product to be scraped, you can deactivate it with the following command:
```
python3 main.py --deactivate --id <id>
```

You can activate a product again with the following command:
```
python3 main.py --activate --id <id>
```
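
Both flags act on the product ids passed with ```--id```. Since the underlying function ```update_products_is_active_with_product_codes``` takes a list of product codes (see the add_product.py changes further down), several ids can presumably be activated or deactivated in one call:
```
python3 main.py --deactivate --id <id> <id>
```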

<br/>

## Delete data <a name="delete-data"></a>

@@ -171,14 +208,13 @@ Then just add products as described [here](#add-products).

<br/>

If you just want to reset your data for every product, deleting all datapoints inside every product, then run this command:
If you just want to delete all datapoints for every product, then run this command:
```
python3 main.py --reset --all
```
This deletes the data inside each product, such as id, url and all datapoints.


You can also just reset some products or all products in some categories:
You can also just delete datapoints for some products:
```
python3 main.py --reset --id <id>
```
@@ -274,8 +310,11 @@ This will print all the products in the following format:
CATEGORY
> PRODUCT NAME
- WEBSITE NAME - PRODUCT ID
- ✓ WEBSITE NAME - PRODUCT ID
```

The check mark (✓) shows that the product is activated.

<br/>


36 changes: 22 additions & 14 deletions main.py
@@ -1,4 +1,3 @@
from typing import List
import threading
import logging.config
import logging
@@ -13,7 +12,7 @@ def main() -> None:
args = scraper.argparse_setup()

if args.clean_data:
scraper.clean_records_data()
scraper.clean_datapoints()

if args.visualize:
scraper.visualize_data(args.all, args.category, args.id, args.name, args.up_to_date, args.compare)
@@ -24,6 +23,12 @@ def main() -> None:
if args.add:
scraper.add_products(args.category, args.url)

if args.activate:
scraper.update_products_is_active_with_product_codes(args.id, True)

if args.deactivate:
scraper.update_products_is_active_with_product_codes(args.id, False)

if args.search:
scraper.search(args.search)

@@ -34,7 +39,7 @@ def main() -> None:
scrape()

if args.latest_datapoint:
scraper.print_latest_datapoints(args.name, args.id)
scraper.print_latest_datapoints(args.name, args.id, args.category)

if args.print_all_products:
scraper.print_all_products()
@@ -47,18 +52,17 @@ def scrape() -> None:
print("Scraping...")

request_delay = scraper.Config.get_request_delay()
products_df = scraper.Filemanager.get_products_data()
active_products = scraper.db.get_all_products(select_only_active=True)

# Create instances of class "Scraper"
products = [scraper.Scraper(category, url) for category, url in zip(products_df["category"], products_df["url"])]
products = scraper.Format.db_products_to_scrapers(active_products)

with alive_progress.alive_bar(len(products), title="Scraping") as bar:
# Scrape and save scraped data for each product (sequentially)
for product in products:
bar.text = f"-> {product.url}"
time.sleep(request_delay)
product.scrape_info()
product.save_info()
scraper.add_product.add_new_datapoint_with_scraper(product)
bar()


@@ -67,18 +71,21 @@ def scrape_with_threads() -> None:

request_delay = scraper.Config.get_request_delay()

products_df = scraper.Filemanager.get_products_data()
domain_grouped_products_df = scraper.get_products_df_grouped_by_domains(products_df)
grouped_products = scraper.get_products_grouped_by_domain(domain_grouped_products_df)
grouped_db_products = scraper.db.get_all_products_grouped_by_domains(select_only_active=True)
grouped_products: list[list[scraper.Scraper]] = []

for db_products in grouped_db_products:
products = scraper.Format.db_products_to_scrapers(db_products)
grouped_products.append(products)

grouped_scraper_threads: List[List[threading.Thread]] = []
grouped_scraper_threads: list[list[threading.Thread]] = []

# Create scraper threads and group by domain
for products in grouped_products.values():
for products in grouped_products:
scraper_threads = [threading.Thread(target=product.scrape_info) for product in products]
grouped_scraper_threads.append(scraper_threads)

products_flatten = [product for products in grouped_products.values() for product in products]
products_flatten = [product for products in grouped_products for product in products]

with alive_progress.alive_bar(len(products_flatten), title="Scraping with threads") as progress_bar:
# Create master threads to manage scraper threads sequentially for each domain
@@ -97,10 +104,11 @@ def scrape_with_threads() -> None:

# Save scraped data for each product (sequentially)
for product in products_flatten:
product.save_info()
scraper.add_product.add_new_datapoint_with_scraper(product)


if __name__ == "__main__":
scraper.db.create_db_and_tables()
logging.config.fileConfig(
fname=scraper.Filemanager.logging_ini_path,
defaults={"logfilename": scraper.Filemanager.logfile_path},
6 changes: 6 additions & 0 deletions pyproject.toml
@@ -1,2 +1,8 @@
[tool.black]
line-length = 127

[tool.ruff]
line-length = 127

[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["E402"]
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,8 +1,8 @@
requests>=2.24.0
beautifulsoup4>=4.9.1
plotly>=4.12.0
pandas>=1.1.3
pytest>=7.1.2
pytest-mock>=3.8.2
alive-progress>=2.4.1
flake8>=6.0.0
sqlmodel>=0.0.8
7 changes: 4 additions & 3 deletions scraper/__init__.py
@@ -1,14 +1,15 @@
from .scrape import Scraper, start_threads_sequentially
from .arguments import argparse_setup
from .add_product import add_products
from .add_product import add_products, update_products_is_active_with_product_codes
from .filemanager import Filemanager, Config
from .visualize import visualize_data
from .clean_data import clean_records_data
from .clean_data import clean_datapoints
from .delete_data import delete
from .reset_data import reset
from .search_data import search
from .print_products import print_latest_datapoints, print_all_products
from .misc import get_products_df_grouped_by_domains, get_products_grouped_by_domain
from .format import Format
import scraper.database as db


__author__ = "Crinibus"
106 changes: 55 additions & 51 deletions scraper/add_product.py
@@ -1,13 +1,15 @@
from typing import List
import logging
from datetime import datetime

import scraper.database as db
from scraper.exceptions import WebsiteNotSupported, URLMissingSchema
from scraper.format import Format
from scraper.scrape import Scraper
from scraper.filemanager import Filemanager
from scraper.domains import get_website_name, SUPPORTED_DOMAINS
from scraper.constants import URL_SCHEMES


def add_products(categories: List[str], urls: List[str]) -> None:
def add_products(categories: list[str], urls: list[str]) -> None:
for category, url in zip(categories, urls):
try:
add_product(category, url)
@@ -31,77 +33,79 @@ def add_product(category: str, url: str) -> None:
logger.info(f"Adding product with category '{category}' and url '{url}'")

new_product = Scraper(category, url)
new_product.scrape_info()
new_product_info = new_product.scrape_info()

product_in_db = db.get_product_by_product_code(new_product_info.id)

if not check_if_product_exists(new_product):
save_product(new_product)
if product_in_db is None:
add_new_product_to_db(new_product)
add_new_datapoint_with_scraper(new_product)
return

logger.info("Product with the same product code already exists in database")

if product_in_db.is_active:
print("Product with the same product code already exists in database and is active")
return

user_input = input(
"A product with the same name and from the same website already exist in your data, "
"do you want to override this product? (y/n) > "
"A product with the same product id already exist in the database but is not active, "
"do you want to activate it? (y/n) > "
)

if user_input.lower() in ("y", "yes"):
print("Overriding product...")
save_product(new_product)
print("Activating product...")
set_existing_product_is_active(product_in_db, True)
logger.info("Product has been activated")
else:
print("Product was not added nor overrided")
logger.info("Adding product cancelled")


def check_if_product_exists(product: Scraper) -> bool:
data = Filemanager.get_record_data()
print("Product has not been activated")
logger.info("Product not activated")

category = product.category
product_name = product.product_info.name
website_name = product.website_handler.website_name

try:
data[category][product_name][website_name]
except KeyError:
return False
def add_new_product_to_db(product: Scraper) -> None:
product_to_db = Format.scraper_to_db_product(product, True)
db.add(product_to_db)

return True

def add_new_datapoint_to_db(product_code: str, price: float, currency: str, date: str | None = None):
"""Parameter 'date' defaults to the date of today in the format: YYYY-MM-DD"""
if date is None:
date = datetime.today().strftime("%Y-%m-%d")

def save_product(product: Scraper) -> None:
add_product_to_records(product)

if not check_if_product_exists_csv(product):
Filemanager.add_product_to_csv(product.category, product.url, product.website_handler.get_short_url())

product.save_info()

new_datapoint = db.DataPoint(
product_code=product_code,
date=date,
price=price,
currency=currency,
)

def add_product_to_records(product: Scraper) -> None:
data = Filemanager.get_record_data()
db.add(new_datapoint)

category = product.category
product_name = product.product_info.name
website_name = product.website_handler.website_name

empty_product_dict = {website_name: {"info": {}, "datapoints": []}}
def add_new_datapoint_with_scraper(product: Scraper, date: str | None = None) -> None:
if not product.product_info or not product.product_info.valid:
print(f"Product info is not valid - category: '{product.category}' - url: {product.url}")
return

if not data.get(category):
data.update({category: {product_name: empty_product_dict}})
product_code = product.product_info.id
price = product.product_info.price
currency = product.product_info.currency

if data[category].get(product_name):
data[category][product_name].update(empty_product_dict)
else:
data[category].update({product_name: empty_product_dict})
add_new_datapoint_to_db(product_code, price, currency, date)

Filemanager.save_record_data(data)

def update_products_is_active_with_product_codes(product_codes: list[str], is_active: bool) -> None:
action = "Activating" if is_active else "Deactivating"

def check_if_product_exists_csv(product: Scraper) -> bool:
products_df = Filemanager.get_products_data()
for product_code in product_codes:
print(f"{action} {product_code}")
product = db.get_product_by_product_code(product_code)
set_existing_product_is_active(product, is_active)

for category, url in zip(products_df["category"], products_df["url"]):
if product.category.lower() == category.lower() and product.url == url:
return True

return False
def set_existing_product_is_active(product: db.Product, is_active: bool) -> None:
product.is_active = is_active
db.add(product)


def is_missing_url_schema(url: str) -> bool:
14 changes: 12 additions & 2 deletions scraper/arguments.py
@@ -33,6 +33,10 @@ def argparse_setup() -> argparse.Namespace:

parser.add_argument("-u", "--url", help="the url to the product", type=str, nargs="*", action="extend")

parser.add_argument("--activate", help="activate a product to be scraped", action="store_true")

parser.add_argument("--deactivate", help="deactivate a product to not be scraped", action="store_true")

parser.add_argument(
"-v",
"--visualize",
@@ -140,6 +144,12 @@ def validate_arguments(parser: argparse.ArgumentParser) -> argparse.Namespace:
if args.add and args.visualize:
parser.error("Cannot use --add and --visualize at the same time")

if args.activate and args.deactivate:
parser.error("Cannot use --activate and --deactivate at the same time")

if (args.activate or args.deactivate) and not args.id:
parser.error("When using --activate or --deactivate, then --id is required")

if args.delete:
if args.all and any([args.category, args.name, args.id]):
parser.error("When using --delete and --all, then using --category, --name or --id does nothing")
@@ -163,7 +173,7 @@ def validate_arguments(parser: argparse.ArgumentParser) -> argparse.Namespace:
)

if args.latest_datapoint:
if not args.name and not args.id:
parser.error("When using --latest-datapoint, then --name or --id is required")
if not any([args.name, args.id, args.category]):
parser.error("When using --latest-datapoint, then --name, --id or --category is required")

return args