-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Leo/bsi b soup #94
Leo/bsi b soup #94
Conversation
…inks to the certified products page.
…ems to work properly, now working on a good demo script
As of 9 June, the following is done:
As of 9 June, the following remains to be done:
Feedback from 1. July@KeleranV see the subtasks here:
|
…one, last step is to integrate it in the current class architecture
…here is no need to search infinitely for subpages, the handler will have one level of sub-handler so it shouldn't be hard to navigate through all of the handler.
…oop, leading to a recursive/infinite process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some intermediate review. Once done with your work, be sure to get rid of non-descriptive comments and make the code (at least somewhat) PEP-8 compliant.
sec_certs/bsi_parse.py
Outdated
|
||
def process(root_url: str): | ||
browser = BsiBrowser(root_url) | ||
def process(base_url: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels that this whole method should be part of BsiBrowser.parse()
method.
sec_certs/bsi_parse.py
Outdated
def process(root_url: str): | ||
browser = BsiBrowser(root_url) | ||
def process(base_url: str): | ||
browser = BsiBrowser(base_url) | ||
browser.parse() | ||
tmp_list = [] | ||
for handler in browser.handler_list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole for loop can be written as a list comprehension and you can call tmp_list.extend([the_comprehension])
sec_certs/bsi_parse.py
Outdated
|
||
|
||
# To end the processing, a last class will be used with final links | ||
# to retrieve simple data written on the page | ||
|
||
|
||
class Bsitmp(ComplexSerializableType): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name of class is not really explanatory. Should be named something like BSICertificate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, you probably won't be calling BSICertificate(link)
but some alternative constructor BSICertificate.from_url(url)
sec_certs/bsi_parse.py
Outdated
for arch_handler in handler.handler_list: | ||
for arch_url in arch_handler.link_list: | ||
arch_temp = Bsitmp(arch_url) | ||
tmp_list.append(arch_temp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, use comprehension and .extend()
notation
sec_certs/bsi_parse.py
Outdated
def process(root_url: str): | ||
browser = BsiBrowser(root_url) | ||
def process(base_url: str): | ||
browser = BsiBrowser(base_url) | ||
browser.parse() | ||
tmp_list = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could then be instance variable
sec_certs/bsi_parse.py
Outdated
[BsiHandler("https://www.bsi.bund.de/" + a['href']) | ||
for a in self.soup.find_all('a', href=True, recursive=True, title=re.compile('Archive'))] | ||
|
||
if len(self.handler_list) != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self.handler_list:
sec_certs/bsi_parse.py
Outdated
test_url = "https://www.bsi.bund.de/SharedDocs/Zertifikate_CC/CC/Digitale_Signatur_Kartenlesegeraete/1046.html" \ | ||
";jsessionid=AF4A9C0B7D27992808B8EA8C426454BE.internet461?nn=513452 " | ||
|
||
root_url = "https://www.bsi.bund.de/EN/Topics/Certification/certified_products/digital_signature" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
declare with typing.Final
type and underscored uppercase to denote constants. Furthemore, these can be moved to BSIParser
class afaik.
sec_certs/bsi_parse.py
Outdated
valid_until: str | ||
soup: BeautifulSoup | ||
id: str | ||
pdf_links: list[str] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this won't work. You need to use List[str]
from typing
module.
sec_certs/bsi_parse.py
Outdated
self.handler_list = \ | ||
[BsiHandler("https://www.bsi.bund.de/" + a['href']) | ||
for a in self.soup.find_all('a', href=True, recursive=True) | ||
if 'c-navigation-teaser' in str(a.get('class'))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can search directly by link class with find_all(class_='c-navigation-teaser')
…e use the same logic as bsi_parse.py, but adapt it to the layout - attempt to use a dictionary in bsi_parse.py, last fix in the constructors to match the requirement of the serialization. - issue encountered : - the websites seems to block some requests, after putting some delay between each request, it seems to be working better - the last field used as a key in the dictionary wasn't retrieved, testing a new one
…g this kind of key : 'BSI - smart cards and similar devices - BSI-DSZ-CC-0954-2015' todo: do the same thing on ANSSI
…ictionary, with the key being the reference of the product
This is now done in #247. |
No description provided.