Building-multilingual-dialogue-dataset

A few useful scripts for building a dataset of multiple conversations in a specified language. The files in this repository are described below.

Link to the Dataset folder on Dropbox: https://www.dropbox.com/sh/9m3dhhsydyonksc/AAAieYzU0ptzFUF2qs6NxgnSa?dl=0

source urls vm.txt - contains the source URLs of the websites from which data can be loaded

spanish_dictionary.txt - the words of a Spanish dictionary in plain-text format

StatsDataSample_spa_final.txt - complete statistics for all the words to be analyzed, including those from the EU proceedings

getDataURL_kidsinco.py - script to download plays from the web and then process them

outputFile.py - the main script, which generates the XML file. Run it with care: the paths need to be modified, and the dataset on which the script runs also needs to be checked. In its current state it builds a corpus from the EU proceedings.

outputStats.py - script to output the main statistics of a file

outputStats_amar.py - a modified version of outputStats.py for the specific purpose of reporting the statistics of an existing corpus

freq_common_count.py - calculates the frequency of all the words in a dataset. The input must be a file like "StatsDataSample_spa_final.txt".
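
To illustrate the idea of word-frequency counting, here is a minimal sketch. It is not the code in freq_common_count.py: the function name `word_frequencies` and the sample sentence are hypothetical, and it assumes plain text as input, whereas the actual layout of "StatsDataSample_spa_final.txt" is not described here.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word occurs in a string.

    Hypothetical helper for illustration; assumes plain-text input.
    """
    # \w+ matches runs of word characters, including accented letters
    # such as "ó", because Python 3 str patterns are Unicode-aware.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


# Made-up sample sentence, not taken from the dataset.
sample = "Hola, ¿cómo estás? Hola, estoy bien."
freq = word_frequencies(sample)

# Print the words from most to least frequent.
for word, count in freq.most_common():
    print(word, count)
```

`Counter.most_common()` returns the words sorted by descending count, which is usually the order needed when looking for the most common words in a corpus.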

Corpus_spa_final.xml - the XML file required by the project
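
As a rough illustration of how dialogues might be written out as XML, the sketch below uses Python's standard `xml.etree.ElementTree` module. The element and attribute names (`corpus`, `dialogue`, `utterance`, `lang`, `id`, `speaker`) and the function `build_corpus_xml` are assumptions made for this example; the actual schema of Corpus_spa_final.xml and the logic of outputFile.py may differ.

```python
import xml.etree.ElementTree as ET


def build_corpus_xml(dialogues, language):
    """Build an XML tree from a list of dialogues.

    Each dialogue is a list of (speaker, text) pairs.
    All tag and attribute names are hypothetical.
    """
    root = ET.Element("corpus", attrib={"lang": language})
    for i, turns in enumerate(dialogues, start=1):
        dialogue = ET.SubElement(root, "dialogue", attrib={"id": str(i)})
        for speaker, text in turns:
            utterance = ET.SubElement(
                dialogue, "utterance", attrib={"speaker": speaker}
            )
            utterance.text = text
    return root


# Made-up sample dialogue, not taken from the dataset.
dialogues = [
    [("A", "Hola, ¿cómo estás?"), ("B", "Estoy bien, gracias.")],
]
root = build_corpus_xml(dialogues, "spa")

# encoding="unicode" returns a str rather than bytes.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

To write the tree to disk instead of printing it, `ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)` can be used, with `path` chosen by the caller.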

Amarkr1/Building-multilingual-dialogue-dataset
