I am figuring out how the digraph.pickle file is created because I need to build my own for my research on a subset of Wikipedia. So I downloaded a small subset of Wikipedia and extracted the plain text using WikiExtractor.
I am getting this error while running the following command:
python extract_wiki.py preprocess_wikipedia [Wikipedia/folder]
File "DEER/extract_wiki.py", line 1073, in
cal_freq = CalFreq(path_pattern_count_file)
File "DEER/extract_wiki.py", line 389, in init
c, log_max_cnt = load_pattern_freq(path_freq_file)
File "DEER/extract_wiki.py", line 382, in load_pattern_freq
One problem is that "save_pair_files" are empty! In the following code, the "len(ents) <= 1:" is always less than or equal to the 1, so data is always empty. Therefore, the path_pattern.pickle and the sub_path_pattern.pickle contain just counter()!
print('cal_cooccur_similarity')
for f_id, save_cooccur__file in enumerate(tqdm.tqdm(save_cooccur__files)):
    with open(save_cooccur__file) as f_in:
        cooccurs = f_in.read().split('\n')
    print("cooccurs is: ", cooccurs)
    data = []
    for line in cooccurs:
        ents = line.split('\t')
        certain_len = len(ents)
        if len(ents) <= 1:          # <-- this branch is always taken
            data.append('')
        else:
            temp_data = []
            # valid_entities = []
            matrix = []
            for ent in ents:
                try:
                    vec = w2vec.get_entity_vector(ent)
                except:
                    vec = np.zeros(100, dtype=np.float32)
                matrix.append(vec)
            # Collect pairs between certain entities
            matrix = np.array(matrix)
            result = cosine_similarity(matrix, matrix)
            for i in range(certain_len):
                for j in range(i+1, certain_len):
                    tup = (float(result[i, j]), ents[i], ents[j])
                    temp_data.append(str(tup))
            data.append('\t'.join(temp_data))
    my_write(save_pair_files[f_id], data)
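A minimal way to double-check this on the intermediate files (just a sketch; the glob pattern below is a placeholder for wherever the save_cooccur_ files actually live):

```python
import glob

# Sketch: report how many tab-separated entities each cooccur line actually has.
# The path pattern is a placeholder; point it at the real save_cooccur_ files.
for path in glob.glob('wiki_output/*_cooccur'):
    with open(path) as f_in:
        sizes = [len(line.split('\t')) for line in f_in.read().split('\n') if line]
    print(path, 'max entities per line:', max(sizes, default=0))
```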
Can you please help me figure out where the problem is?
@esmailza Hi, sorry for the late response. I think there is a possible reason why len(ents) always equals 1. Does the Wikipedia file you use contain hyperlinks in the passages? The code uses the hyperlinks in the passages to identify the entities. An example passage used in our work is shown as follows:
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism
Anarchism is a <a href="political%20philosophy">political philosophy</a> and <a href="Political%20movement">movement</a> that is sceptical of <a href="authority">authority</a> and rejects all involuntary, coercive forms of <a href="hierarchy">hierarchy</a>. Anarchism calls for the abolition of the <a href="State%20%28polity%29">state</a>, which it holds to be undesirable, unnecessary, and harmful. It is usually described alongside <a href="libertarian%20Marxism">libertarian Marxism</a> as the <a href="libertarian">libertarian</a> wing (<a href="libertarian%20socialism">libertarian socialism</a>) of the <a href="socialist%20movement">socialist movement</a> and as having a historical association with <a href="anti-capitalism">anti-capitalism</a> and <a href="socialism">socialism</a>.
Could you please check if this is the issue? Thanks!
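For illustration, this is roughly how entities can be recovered from such hyperlinks (a sketch only; the regex and helper name below are illustrative, not the actual implementation in extract_wiki.py):

```python
import re
from urllib.parse import unquote

# Hypothetical helper: collect linked entities from one passage.
# DEER's real extraction logic may differ; this only illustrates why a
# plain-text dump with no <a href="..."> anchors yields at most one entity per line.
def linked_entities(passage: str):
    # WikiExtractor (when links are preserved) emits anchors like
    # <a href="political%20philosophy">political philosophy</a>
    return [unquote(href) for href in re.findall(r'<a href="([^"]+)">', passage)]

passage = ('Anarchism is a <a href="political%20philosophy">political philosophy</a> '
           'and <a href="Political%20movement">movement</a> ...')
print(linked_entities(passage))
# ['political philosophy', 'Political movement']
```

If WikiExtractor was run without its option to preserve links, the passages contain no anchors at all, so each line ends up with at most one entity, which matches the symptom above.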