identify languages #52

Open
matyaskopp opened this issue May 11, 2023 · 3 comments
@matyaskopp
Member

Identify languages of sentences

  • input: sentence-segmented text
  • output: add the attribute xml:lang to every sentence <s>
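A minimal sketch of the output side, assuming the languages are already available as a mapping from sentence id to language code. The function name `set_sentence_languages`, the sample ids, and the language codes are hypothetical and only illustrate setting xml:lang on each <s>:

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"
XML_LANG = f"{{{XML_NS}}}lang"


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def set_sentence_languages(root, lang_by_id):
    """Set xml:lang on every <s> whose xml:id is in lang_by_id.

    Returns the number of sentences that received a language.
    """
    tagged = 0
    for elem in root.iter():
        if not isinstance(elem.tag, str) or local_name(elem.tag) != "s":
            continue
        lang = lang_by_id.get(elem.get(XML_ID))
        if lang is not None:
            elem.set(XML_LANG, lang)
            tagged += 1
    return tagged


# Hypothetical sample; real ids and markup in the corpus may differ.
sample = (
    '<seg xml:id="u1.p1">'
    '<s xml:id="u1.p1.s1">Добрий день.</s>'
    '<s xml:id="u1.p1.s2">Спасибо.</s>'
    "</seg>"
)
root = ET.fromstring(sample)
set_sentence_languages(root, {"u1.p1.s1": "uk", "u1.p1.s2": "ru"})
print(ET.tostring(root, encoding="unicode"))
```

Sentences without an entry in the mapping are left untouched, so a partial annotation run does not overwrite anything.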
@matyaskopp
Member Author

matyaskopp commented Oct 25, 2023

workflow:

  1. segment seg into sentences (temporary element tmpSentence); for the id use concat(seg/@xml:id,'.sent',position())
  2. extract the text content and ids of tmpSentence to a tsv file: id text
  3. annotate texts with language to a tsv file: id text language
  4. load languages into the tmpSentence elements
  5. join adjacent tmpSentence elements with the same language into tmpLangSeg; give it the id concat(seg/@xml:id,'.lang',position())
  6. annotate tmpLangSeg with udpipe and nametag
  7. remove tmpLangSeg elements (preserve only their content), and set the seg language to the most common language in the paragraph (by number of words). note: join=right needs special treatment, as it will have to be removed in some cases
  8. backpropagate the <seg> language to the non-annotated version
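Steps 5 and 7 can be sketched on plain tuples, without the XML handling. This is only an illustration: the function names are hypothetical, words are counted by whitespace splitting (an assumption; the pipeline may count tokens differently), and the join=right case is not handled:

```python
from collections import Counter
from itertools import groupby


def build_lang_segments(seg_id, sentences):
    """Step 5: join adjacent sentences that share a language.

    sentences is a list of (sentence_id, text, language) tuples in
    document order. Returns a list of (lang_seg_id, language, sentences),
    with ids numbered from 1 like XPath position().
    """
    segments = []
    runs = groupby(sentences, key=lambda sent: sent[2])
    for position, (lang, run) in enumerate(runs, start=1):
        segments.append((f"{seg_id}.lang{position}", lang, list(run)))
    return segments


def majority_language(sentences):
    """Step 7: pick the language covering the most words in the seg."""
    words_per_lang = Counter()
    for _sent_id, text, lang in sentences:
        words_per_lang[lang] += len(text.split())
    if not words_per_lang:
        return None
    # On a tie, the language seen first in the seg wins.
    return words_per_lang.most_common(1)[0][0]


# Hypothetical sentences of one seg.
sentences = [
    ("p1.sent1", "Добрий день усім", "uk"),
    ("p1.sent2", "Дякую", "uk"),
    ("p1.sent3", "Спасибо", "ru"),
    ("p1.sent4", "Слава Україні", "uk"),
]
for lang_seg_id, lang, run in build_lang_segments("p1", sentences):
    print(lang_seg_id, lang, [sent_id for sent_id, _, _ in run])
print(majority_language(sentences))
```

Only adjacent sentences are merged, so a language that reappears after an interruption starts a new tmpLangSeg instead of being merged with the earlier one.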

@matyaskopp
Member Author

  3. annotate texts with language to a tsv file: id text language

@olgakanishcheva
I have written (with ChatGPT's help :-)) a simple language identification script that uses the lingua package:
https://github.com/ufal/ParlaMint-UA/blob/21864363f0b9e9622aa081bdd4d202e2c81ad00c/Scripts/lang-ident.py
It can be a starting point for you. The result on the testing data is here:
https://github.com/ufal/ParlaMint-UA/tree/data/DEVEL-tsv-sent-lang

I will use this script in my pipeline until you develop something more sophisticated.
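For orientation, a minimal sketch of the tsv annotation step (id text -> id text language). It is not the linked script. The function names are hypothetical, the candidate languages (Ukrainian and Russian) are an assumption, and "und" for undetermined text is an illustrative choice. The detector is passed in as a callable so the tsv logic can be tried without lingua installed:

```python
def annotate_lines(lines, detect):
    """Append a language column to 'id<TAB>text' lines.

    detect is any callable mapping text to a language code or None.
    Text for which no language is returned is marked 'und'.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        sent_id, text = line.split("\t", 1)
        lang = detect(text) or "und"
        yield f"{sent_id}\t{text}\t{lang}"


def build_lingua_detector():
    """Return a detect(text) callable backed by the lingua package.

    Needs 'pip install lingua-language-detector'. The language set
    below is an assumption, not taken from the linked script.
    """
    from lingua import Language, LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_languages(
        Language.UKRAINIAN, Language.RUSSIAN
    ).build()

    def detect(text):
        language = detector.detect_language_of(text)
        if language is None:
            return None
        return language.iso_code_639_1.name.lower()

    return detect
```

Typical use would be `annotate_lines(open("in.tsv", encoding="utf-8"), build_lingua_detector())`, writing each yielded line to the output file; the file name here is a placeholder.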

@olgakanishcheva
Collaborator

@matyaskopp Thank you!
