wiki_ij_pr.dat files are empty #2

Open
esmailza opened this issue May 1, 2023 · 2 comments

esmailza commented May 1, 2023

I am trying to understand how the digraph.pickle file is created, because I need to build my own for my research on a subset of Wikipedia. I downloaded a small subset of Wikipedia and extracted the plain text using WikiExtractor.
I get the following error when running this command:

python extract_wiki.py preprocess_wikipedia [Wikipedia/folder]

File "DEER/extract_wiki.py", line 1073, in
cal_freq = CalFreq(path_pattern_count_file)
File "DEER/extract_wiki.py", line 389, in __init__
c, log_max_cnt = load_pattern_freq(path_freq_file)
File "DEER/extract_wiki.py", line 382, in load_pattern_freq

One problem is that the files listed in save_pair_files are empty. In the following code, len(ents) is always less than or equal to 1, so the "len(ents) <= 1" branch is always taken and data only ever receives empty strings. As a result, path_pattern.pickle and sub_path_pattern.pickle each contain just an empty Counter().

    print('cal_cooccur_similarity')
    
    for f_id, save_cooccur__file in enumerate(tqdm.tqdm(save_cooccur__files)):
        with open(save_cooccur__file) as f_in:
            cooccurs = f_in.read().split('\n')
            print("cooccurs is: ", cooccurs)
        data = []
        for line in cooccurs:
            ents = line.split('\t')
            certain_len = len(ents)
            if len(ents) <= 1:
                data.append('')
            else:
                temp_data = []
                # valid_entities = []
                matrix = []
                for ent in ents:
                    try:
                        vec = w2vec.get_entity_vector(ent)
                    except:
                        vec = np.zeros(100, dtype=np.float32)
                    matrix.append(vec)
                # Collect pairs between certain entities
                matrix = np.array(matrix)
                result = cosine_similarity(matrix, matrix)
                for i in range(certain_len):
                    for j in range(i+1, certain_len):
                        tup = (float(result[i, j]), ents[i], ents[j])
                        temp_data.append(str(tup))
                data.append('\t'.join(temp_data))
        my_write(save_pair_files[f_id], data)

Could you please help me figure out where the problem is?
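One way to confirm the symptom described above is to count how many tab-separated entries each line of a co-occurrence file contains. The following is a minimal sketch, not part of the repository: the helper name `entity_count_histogram` and the sample lines are hypothetical, and it only assumes that entities on a line are separated by tabs, as in the quoted loop.

```python
from collections import Counter


def entity_count_histogram(lines):
    """Map 'number of tab-separated entries' to 'number of lines'.

    Hypothetical helper for diagnosis only. Empty lines are counted
    under key 0; the quoted loop would see them as a single empty entry.
    """
    hist = Counter()
    for line in lines:
        if not line:
            hist[0] += 1
        else:
            hist[len(line.split('\t'))] += 1
    return hist


# Made-up sample lines, standing in for f_in.read().split('\n')
sample = ["Anarchism\tSocialism\tAuthority", "Anarchism", ""]
hist = entity_count_histogram(sample)
for n_entities, n_lines in sorted(hist.items()):
    print(n_entities, "entities:", n_lines, "line(s)")
```

If no key greater than 1 appears for a real co-occurrence file, no line in it holds two or more entities, so the pair files would come out empty, which suggests the problem lies earlier in the pipeline rather than in this loop.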

Collaborator

ZhuKerui commented May 9, 2023

@esmailza Hi, sorry for the late response. I think there is a possible reason why len(ents) equals 1. Does the Wikipedia file you use contain hyperlinks in the passages? The code uses hyperlinks in the passages to identify the entities. An example passage used in our work is shown below:

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a <a href="political%20philosophy">political philosophy</a> and <a href="Political%20movement">movement</a> that is sceptical of <a href="authority">authority</a> and rejects all involuntary, coercive forms of <a href="hierarchy">hierarchy</a>. Anarchism calls for the abolition of the <a href="State%20%28polity%29">state</a>, which it holds to be undesirable, unnecessary, and harmful. It is usually described alongside <a href="libertarian%20Marxism">libertarian Marxism</a> as the <a href="libertarian">libertarian</a> wing (<a href="libertarian%20socialism">libertarian socialism</a>) of the <a href="socialist%20movement">socialist movement</a> and as having a historical association with <a href="anti-capitalism">anti-capitalism</a> and <a href="socialism">socialism</a>.

Could you please check if this is the issue? Thanks!
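The check suggested above can be sketched roughly as follows. This is an illustrative script, not the repository's own parsing code: the names `ANCHOR_RE`, `extract_link_targets`, and `files_without_links` are hypothetical, and the regular expression only assumes anchors of the form shown in the example passage.

```python
import re
from pathlib import Path
from urllib.parse import unquote

# Assumes links look like <a href="...">text</a>, as in the example passage.
ANCHOR_RE = re.compile(r'<a href="([^"]*)">(.*?)</a>')


def extract_link_targets(text):
    """Return the URL-decoded href of every anchor found in text."""
    return [unquote(href) for href, _ in ANCHOR_RE.findall(text)]


def files_without_links(root):
    """Return the files under root that contain no anchor tags at all."""
    missing = []
    for path in sorted(Path(root).rglob('*')):
        if path.is_file():
            text = path.read_text(encoding='utf-8', errors='ignore')
            if not ANCHOR_RE.search(text):
                missing.append(path)
    return missing


passage = ('Anarchism is a <a href="political%20philosophy">political '
           'philosophy</a> and calls for the abolition of the '
           '<a href="State%20%28polity%29">state</a>.')
print(extract_link_targets(passage))
```

If `files_without_links` returns every file in the extraction folder, the text was probably extracted without hyperlinks. WikiExtractor has a `--links` option that preserves links in its output; whether that alone produces the exact format this repository expects is not confirmed in this thread.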

@flyrobot27

What if you added this to the README?
