Scraping the Web
The Naive Method of scraping the web relies on a static tag and static attributes (key and value pairs).
Using BeautifulSoup4 we can scrape the text encapsulated by the following HTML tag, where s is the parsed BeautifulSoup object.
<div class="location"> Some text in here... </div>
s.find('div', attrs={'class': 'location'}).text.strip()
Here the static tag is 'div', and the static attribute key and value pair is 'class': 'location'.
The advantage of the Naive Method is that it is very accurate as long as the page structure stays the same. The disadvantage of the Naive Method is that it must be continually maintained, because the web page's HTML format may change over time.
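The maintenance cost can be illustrated with a small sketch. It assumes the page renames its class from `location` to `job-location`; both HTML snippets and the helper name `find_text` are hypothetical. Because `find` returns `None` when nothing matches, the helper guards against that case instead of raising an `AttributeError`.

```python
from bs4 import BeautifulSoup


def find_text(soup, tag, attrs, default=''):
    """Return the stripped text of the first matching tag, or default if none matches."""
    node = soup.find(tag, attrs=attrs)
    return node.text.strip() if node is not None else default


# Hypothetical page before and after a markup change
old_page = BeautifulSoup('<div class="location"> Berlin </div>', 'html.parser')
new_page = BeautifulSoup('<div class="job-location"> Berlin </div>', 'html.parser')

selector = {'class': 'location'}
old_result = find_text(old_page, 'div', selector)  # static selector still matches
new_result = find_text(new_page, 'div', selector)  # static selector silently misses

print(repr(old_result), repr(new_result))
```

After the rename, the static selector returns the default value until someone updates it by hand.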
The Cosine Similarity Method allows us to match a tag or attribute approximately, rather than requiring an exact, static name.
Using scikit-learn we can construct a cosine similarity matrix.
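To build intuition for what a similarity score measures, the sketch below computes cosine similarity over character 2-grams in plain Python. It is a simplified stand-in, not the scikit-learn pipeline: it uses raw counts instead of TF-IDF weights and pads the whole string rather than each word. The helper names `char_bigrams` and `cosine` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text):
    """Count the 2-character n-grams of the lowercased text, padded with one space per side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


reference = char_bigrams("title")
for candidate in ["job-title", "jobtitle", "nav-bar"]:
    # Attribute values that share many bigrams with "title" score higher
    print(candidate, round(cosine(char_bigrams(candidate), reference), 3))
```

Values such as `job-title` share most of their bigrams with `title` and score well above zero, while an unrelated value such as `nav-bar` shares none.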
- Extract all tags into a list using the find_all method from BeautifulSoup4.
tags = [a.attrs for a in s.find_all(self.html_tags)]
- Determine the tags and key phrases to search for in the attribute key and value pairs.
html_tags = ['a', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'span', 'time']
key_words = ['title', 'company', 'location']
# map each key word to the attribute spellings that should count as a match
key_phrases = dict(zip(key_words,
                       [['title', 'jobtitle', 'job-title'],
                        ['company', 'jobcompany', 'job-company',
                         'employer', 'jobemployer', 'job-employer'],
                        ['location', 'joblocation', 'job-location']]))
- Construct the vectorizer and similarity matrix using scikit-learn. We must first flatten the list of dictionaries, because an attribute value may be of type list.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
values = [tag.values() for tag in tags]
keys = [tag.keys() for tag in tags]
flat_values = []
flat_keys = []
for i, value in enumerate(values):
    for j, subvalue in enumerate(value):
        if isinstance(subvalue, list):
            # multi-valued attributes (e.g. class) are lists: add one entry per element
            for element in subvalue:
                flat_values.append(element)
                flat_keys.append([*keys[i]][j])
        else:
            flat_values.append(subvalue)
            flat_keys.append([*keys[i]][j])
# init vectorizer with ngram of (2, 2) to filter out single character matches
vectorizer = TfidfVectorizer(strip_accents='unicode', lowercase=True,
                             analyzer='char_wb', ngram_range=(2, 2))
# fit vectorizer to entire corpus
vectorizer.fit(flat_values + flat_keys)
# set reference tfidf for cosine similarity later
key_word = 'title'  # for example
words = key_phrases[key_word]
reference = vectorizer.transform(words)
# calculate cosine similarity between reference and current tags
# (one row per flattened value or key, one column per reference phrase)
similarities = cosine_similarity(
    vectorizer.transform(flat_values + flat_keys), reference)
- Search the similarity matrix for the highest similarity and scrape the key and value pair.
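# get text with highest similarity
# np_max is assumed to be numpy.max; max_similarity is the minimum score a match
# must reach and is assumed to be defined elsewhere
ritem = (None, None, None)
rsim = 0
for i, (sim, item) in enumerate(zip(similarities, zip(flat_values + flat_keys, flat_keys + flat_values))):
    max_sim = np_max(sim)
    if max_sim >= max_similarity and max_sim > rsim:
        if i < len(flat_values):
            # the attribute value matched: item is (value, key)
            ritem = (False, *reversed(item))
        else:
            # the attribute key matched: item is (key, value)
            ritem = (True, *item)
        rsim = max_sim
is_key, key, value = ritem
if key and value:
    if is_key:
        # the key itself matched, so the attribute value holds the data
        job[key_word] = value
    else:
        # the value matched, so the data is the text of the tag carrying it
        job[key_word] = s.find(self.html_tags, attrs={key: value}).text.strip()
else:
    job[key_word] = ''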
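The flattening and selection steps above can also be written as two small helpers, which makes the logic easier to test in isolation. This is a sketch under stated assumptions rather than the project's implementation: the names `flatten_attrs` and `best_match` are hypothetical, and the scores in the usage example are made up rather than produced by the vectorizer.

```python
def flatten_attrs(tags):
    """Flatten a list of attribute dicts into parallel lists of keys and values."""
    flat_keys, flat_values = [], []
    for attrs in tags:
        for key, value in attrs.items():
            # Multi-valued attributes such as class arrive as lists
            elements = value if isinstance(value, list) else [value]
            for element in elements:
                flat_keys.append(key)
                flat_values.append(element)
    return flat_keys, flat_values


def best_match(scores, candidates, threshold):
    """Return the candidate with the highest score at or above threshold, else None."""
    best, best_score = None, 0.0
    for score, candidate in zip(scores, candidates):
        if score >= threshold and score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical attribute dicts, shaped like the output of tag.attrs
tags = [{'class': ['job-title', 'bold'], 'id': 'main'}, {'href': '/jobs/1'}]
flat_keys, flat_values = flatten_attrs(tags)

# Made-up similarity scores, one per flattened value
scores = [0.65, 0.0, 0.1, 0.05]
match = best_match(scores, list(zip(flat_keys, flat_values)), threshold=0.3)
print(match)
```

Returning `None` when no score reaches the threshold mirrors the empty-string fallback in the original loop.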
We can display the similarities visually using Matplotlib.
plt.imshow(np.array(similarities)[0:3,:], cmap=plt.cm.BuGn, interpolation='none', aspect='auto')
plt.colorbar()
plt.show()
The advantage of the Cosine Similarity Method is that it does not rely on any static structure, only on the data being encapsulated in tags with meaningful attributes. The disadvantage of the Cosine Similarity Method is that it fails when the data is not encapsulated in tags with meaningful attributes.