wiki_ij_pr.dat files are empty #2

esmailza · 2023-05-01T21:31:31Z

I am trying to understand how the digraph.pickle file is created, because I need to build my own for research on a subset of Wikipedia. I downloaded a small subset of Wikipedia and extracted the plain text using WikiExtractor.
I get the following error when running this command:

python extract_wiki.py preprocess_wikipedia [Wikipedia/folder]

  File "DEER/extract_wiki.py", line 1073, in <module>
    cal_freq = CalFreq(path_pattern_count_file)
  File "DEER/extract_wiki.py", line 389, in __init__
    c, log_max_cnt = load_pattern_freq(path_freq_file)
  File "DEER/extract_wiki.py", line 382, in load_pattern_freq

One problem is that the files listed in "save_pair_files" are empty. In the following code, len(ents) is always less than or equal to 1, so the condition "len(ents) <= 1" is always true and data only ever contains empty strings. As a result, path_pattern.pickle and sub_path_pattern.pickle contain just an empty Counter().

    print('cal_cooccur_similarity')
    
    for f_id, save_cooccur__file in enumerate(tqdm.tqdm(save_cooccur__files)):
        with open(save_cooccur__file) as f_in:
            cooccurs = f_in.read().split('\n')
            print("cooccurs is: ", cooccurs)
        data = []
        for line in cooccurs:
            ents = line.split('\t')
            certain_len = len(ents)
            # This branch is taken for every line in my run.
            if len(ents) <= 1:
                data.append('')
            else:
                temp_data = []
                # valid_entities = []
                matrix = []
                for ent in ents:
                    try:
                        vec = w2vec.get_entity_vector(ent)
                    except:
                        vec = np.zeros(100, dtype=np.float32)
                    matrix.append(vec)
                # Collect pairs between certain entities
                matrix = np.array(matrix)
                result = cosine_similarity(matrix, matrix)
                for i in range(certain_len):
                    for j in range(i+1, certain_len):
                        tup = (float(result[i, j]), ents[i], ents[j])
                        temp_data.append(str(tup))
                data.append('\t'.join(temp_data))
        my_write(save_pair_files[f_id], data)
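
To confirm that the first branch really is taken for every line, the co-occurrence file can be inspected separately. This is only a minimal sketch: the helper name count_entity_lines and the sample contents are hypothetical and not part of extract_wiki.py. It applies the same split('\t') test as the loop above.

```python
def count_entity_lines(lines):
    """Return (lines with at most one entity, lines with two or more)."""
    # Same test as the loop above: a line needs at least two
    # tab-separated entities before any pair can be built from it.
    too_few = sum(1 for line in lines if len(line.split('\t')) <= 1)
    return too_few, len(lines) - too_few


# Hypothetical file contents, split the same way as in the loop above.
cooccurs = "A\tB\tC\n\nD\n".split('\n')
print(count_entity_lines(cooccurs))  # (3, 1)
```

If the second number is 0 for every co-occurrence file, no pairs can be written, which would match the empty output files.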

Could you please help me figure out where the problem is?

ZhuKerui · 2023-05-09T02:04:39Z

@esmailza Hi, sorry for the late response. I think there is a possible reason why len(ents) is never greater than 1. Does the Wikipedia file you use contain hyperlinks in the passages? The code uses the hyperlinks in the passages to identify the entities. An example passage used in our work is shown below:

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a <a href="political%20philosophy">political philosophy</a> and <a href="Political%20movement">movement</a> that is sceptical of <a href="authority">authority</a> and rejects all involuntary, coercive forms of <a href="hierarchy">hierarchy</a>. Anarchism calls for the abolition of the <a href="State%20%28polity%29">state</a>, which it holds to be undesirable, unnecessary, and harmful. It is usually described alongside <a href="libertarian%20Marxism">libertarian Marxism</a> as the <a href="libertarian">libertarian</a> wing (<a href="libertarian%20socialism">libertarian socialism</a>) of the <a href="socialist%20movement">socialist movement</a> and as having a historical association with <a href="anti-capitalism">anti-capitalism</a> and <a href="socialism">socialism</a>.

Could you please check if this is the issue? Thanks!
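
One quick way to check is to look for anchor tags in the extracted text. The sketch below is an assumption-laden illustration, not the parsing code from extract_wiki.py: the regular expression LINK_RE and the helper linked_entities are hypothetical names, and the pattern only covers links of the form shown in the example passage above.

```python
import re
from urllib.parse import unquote

# Matches <a href="TARGET">anchor text</a>, the link form in the example passage.
LINK_RE = re.compile(r'<a href="([^"]*)">(.*?)</a>')


def linked_entities(text):
    """Return the percent-decoded link targets found in a passage."""
    return [unquote(target) for target, _anchor in LINK_RE.findall(text)]


with_links = ('Anarchism is a <a href="political%20philosophy">political philosophy</a> '
              'and <a href="Political%20movement">movement</a>.')
plain_text = 'Anarchism is a political philosophy and movement.'

print(linked_entities(with_links))  # ['political philosophy', 'Political movement']
print(linked_entities(plain_text))  # []
```

If every passage returns an empty list, the text was probably extracted without links. WikiExtractor documents a --links option for preserving them; please verify the exact flag against the version you have installed.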

flyrobot27 · 2024-02-29T23:27:13Z

What if you add this to the README?
