This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Upcoming releases and support #12

Open
snakers4 opened this issue Sep 3, 2019 · 23 comments
Labels
help wanted Extra attention is needed

Comments

@snakers4
Owner

snakers4 commented Sep 3, 2019

We are planning cool new releases sometime in the future (with a twist you are not expecting), soon!

Also, you can now support our initiative directly via Open Collective.

@snakers4 added the "help wanted" label on Sep 3, 2019
@32r81b

32r81b commented Jan 14, 2020

Hi, @snakers4.
I wrote a small Python program that converts data from the unpacked datasets and prepares train, dev, and test files in the DeepSpeech format. The program also creates a transcript file for building a KenLM language model. Would this be of interest to you? May I push the script to your repository?
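
A minimal sketch of the kind of preparation described above, not the actual script. It assumes each sample is an (audio path, transcript) pair, shuffles the samples into train/dev/test splits, writes a split as a CSV with the `wav_filename`, `wav_filesize`, `transcript` columns that DeepSpeech expects, and dumps the transcripts one per line as a text corpus for KenLM. The function names, the 80/10/10 split, the fixed seed, and the lowercasing are assumptions made for illustration.

```python
import csv
import os
import random


def split_samples(samples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle samples deterministically and cut them into train/dev/test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


def write_deepspeech_csv(samples, out_path):
    """Write (wav_path, transcript) pairs as a DeepSpeech-style CSV."""
    # UTF-8 is set explicitly so that Cyrillic transcripts survive the round trip.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in samples:
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])


def write_lm_corpus(samples, out_path):
    """Write one transcript per line, the plain-text input KenLM reads."""
    with open(out_path, "w", encoding="utf-8") as f:
        for _, transcript in samples:
            f.write(transcript.strip().lower() + "\n")
```

Typical use would be to call `split_samples` once and then `write_deepspeech_csv` for each of the three splits, building the language-model corpus from the training split only.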

@snakers4
Owner Author

Please create a pull request

@snakers4
Owner Author

Current status update

Given the current figures:

  • ~US$30 per month in community support (3 backers via Open Collective)
  • ~US$300 in direct hosting fees for 16 days of February, i.e. US$500-600 per month
  • Only 3 users downloaded the torrent vs. at least 100 direct downloads this month; in total, ~300-400 direct downloads vs. ~10-15 torrent downloads

And given that some people downloading the dataset are clearly abusing our licence (i.e. obviously commercial companies claiming to use our dataset for "research" purposes), I have decided to temporarily suspend the direct downloads.

If you are an Open Collective backer and need a direct link, please ping me and I will send you a private link.

Further ideas:

  • We will migrate the whole dataset to Opus; most likely it will be shared ONLY via torrent, at least until we get around US$200-300 of support per month via Open Collective
  • Still undecided whether to include new domains / languages, given the lack of community support

P.S.

Speaking personally: if even 10% of the roughly 400 people who downloaded the dataset supported us with 10 dollars a month, the dataset would be available to everyone via a direct link. But the statistics above, combined with the attitude of some companies, lead me to certain conclusions.
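
The figures in this update can be checked with a few lines of arithmetic. The sketch below only re-derives numbers already quoted in the thread (US$300 for 16 days, US$30 per month of support, about 400 downloaders, 10% of them pledging US$10). The function names and the choice of 29 and 31 days as month lengths are assumptions for illustration.

```python
def monthly_cost(cost_usd, days_billed, days_in_month):
    """Extrapolate a partial-month bill to a full month at the same daily rate."""
    return cost_usd / days_billed * days_in_month


def pledged_support(downloaders, percent_backing, pledge_usd):
    """Monthly support if a given percentage of downloaders pledged a fixed sum."""
    return downloaders * percent_backing // 100 * pledge_usd


low = monthly_cost(300, 16, 29)      # February 2020 had 29 days
high = monthly_cost(300, 16, 31)     # a 31-day month at the same rate
coverage = 30 / low                  # share of hosting covered by current backers
hypothetical = pledged_support(400, 10, 10)

print(f"hosting: US${low:.2f}-{high:.2f} per month")
print(f"current support covers {coverage:.1%} of hosting")
print(f"10% of 400 downloaders at US$10 each: US${hypothetical} per month")
```

Under these assumptions the extrapolated hosting bill falls inside the quoted US$500-600 range, and the hypothetical pledges would exceed the US$200-300 target mentioned under "Further ideas".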

@Advencher

Hi, @32r81b .
I wrote a small Python program that converts data from the unpacked datasets and prepares train, dev, and test files in the DeepSpeech format. The program also creates a transcript file for building a KenLM language model. Would this be of interest to you? May I push the script to your repository?

Could you please share the script? (I wrote my own, but it runs slowly and doesn't read Russian characters.)
I would be very grateful.

@32r81b

32r81b commented Mar 29, 2020

Sorry, the file is still in development. Below is a link to a draft version.
If you want to process several large files, it is better to parallelize.
The script reads a CSV file with the exclusion list (public_exclude_file_v5.csv) and the dataset file (public_youtube700.csv).
It then drops the records from public_youtube700.csv that appear in public_exclude_file_v5.csv, reads the remaining files from disk, determines the audio duration, converts each file to the required format, and puts it in a separate folder. Very short and very long audio files are also filtered out. At the end, a text file is saved in the DeepSpeech training format.

https://github.com/32r81b/open_stt/blob/master/utils/0.1%20open_stt_prepeare%200.py
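
For readers who only want the shape of the steps listed above, here is a minimal sketch. It is not the linked draft. It assumes the exclusion CSV carries the audio path in its first column, that the audio is WAV readable by the standard `wave` module, and that 1 and 20 seconds are acceptable duration limits; all of these, and the function names, are hypothetical. The format conversion step is left out because the thread does not name the target format. A thread pool stands in for the suggested parallelization of the per-file work.

```python
import csv
import wave
from concurrent.futures import ThreadPoolExecutor


def load_excluded(exclude_csv_path):
    """Collect excluded audio paths (assumed to sit in the first CSV column)."""
    with open(exclude_csv_path, newline="", encoding="utf-8") as f:
        return {row[0] for row in csv.reader(f) if row}


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def select_files(wav_paths, excluded, min_sec=1.0, max_sec=20.0,
                 duration_fn=wav_duration, workers=8):
    """Drop excluded files, then keep those whose duration is within limits."""
    remaining = [p for p in wav_paths if p not in excluded]
    # Durations are measured in parallel; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        durations = list(pool.map(duration_fn, remaining))
    return [p for p, d in zip(remaining, durations) if min_sec <= d <= max_sec]
```

The `duration_fn` parameter exists only so the filter can be tried without audio on disk, for example by passing a dictionary lookup of known durations.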

@snakers4
Owner Author

Academic Torrents is down.
I wrote to their admin to find out what happened.

@snakers4
Owner Author

ru_open_stt_wav_v10.zip

The torrent file
Not sure how to manually add peers yet

@snakers4
Owner Author

Working on hosting the torrent elsewhere

@snakers4
Owner Author

https://rutracker.org/forum/viewtopic.php?t=5880804

Not approved yet
Not sure whether my client (qBittorrent) will properly support uploading via several trackers (I am a bit rusty on how torrents work under the hood).

@snakers4
Owner Author

Ppl, please seed


@snakers4
Owner Author

My upload speed is 20-30 MiB/s, so it definitely works

@snakers4
Owner Author

Also, a magnet link until the rutracker page gets approved:

magnet:?xt=urn:btih:A7929F1D8108A2A6BA2785F67D722423F088E6BA&tr=http%3A%2F%2Fbt3.t-ru.org%2Fann%3Fmagnet&dn=Russian%20Open%20Speech%20To%20Text%20(STT%2FASR)%20Dataset%20%5B100%2C%2016000000%5D
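
A small sketch of how the magnet link above decomposes, in case it helps with the multi-tracker question raised earlier. In a magnet URI, `xt` carries the BitTorrent info hash, `dn` the display name, and each `tr` parameter names one tracker, so further trackers can be listed by appending more `tr` parameters. The helper names are hypothetical, and the extra tracker in the usage note is a placeholder, not a real announce URL.

```python
from urllib.parse import parse_qs, quote

MAGNET = (
    "magnet:?xt=urn:btih:A7929F1D8108A2A6BA2785F67D722423F088E6BA"
    "&tr=http%3A%2F%2Fbt3.t-ru.org%2Fann%3Fmagnet"
    "&dn=Russian%20Open%20Speech%20To%20Text%20(STT%2FASR)%20Dataset"
    "%20%5B100%2C%2016000000%5D"
)


def parse_magnet(uri):
    """Split a magnet URI into info hash, tracker list and display name."""
    scheme, _, query = uri.partition("?")
    if scheme != "magnet:":
        raise ValueError("not a magnet URI")
    params = parse_qs(query)  # percent-decodes values, keeps repeated keys
    xt = params["xt"][0]
    prefix = "urn:btih:"
    if not xt.startswith(prefix):
        raise ValueError("no BitTorrent info hash in xt")
    return {
        "info_hash": xt[len(prefix):].lower(),
        "trackers": params.get("tr", []),
        "name": params.get("dn", [""])[0],
    }


def add_tracker(uri, tracker_url):
    """Return the URI with one more percent-encoded tr parameter appended."""
    return uri + "&tr=" + quote(tracker_url, safe="")
```

For example, `add_tracker(MAGNET, "udp://tracker.example.org:1337/announce")` yields a link that `parse_magnet` reports as having two trackers.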

@snakers4
Owner Author

Academic Torrents is back up,
so no worries

@Advencher

@32r81b where can I get the file public_exclude_file_v5.csv? It is not in the torrent.

@Advencher

@32r81b thanks for the answer

@snakers4
Owner Author

snakers4 commented May 4, 2020

@32r81b where can I get the file public_exclude_file_v5.csv? It is not in the torrent.

It is in the issues (tickets).

@snakers4
Owner Author

snakers4 commented May 4, 2020

@snakers4
Owner Author

snakers4 commented May 5, 2020

@snakers4
Owner Author

snakers4 commented May 9, 2020

A few announcements

  • The WAV torrent will be deprecated shortly; please switch to Opus
  • Opus reader helpers and build instructions are available
  • There was a surge in the use of some legacy links; all of them will be permanently disabled shortly

@snakers4
Owner Author

A few announcements

  • Academic Torrents moved to a new infrastructure
  • Please seed if you have downloaded the torrent
  • Microsoft is not sharing download statistics (or any statistics whatsoever) for their hosting, so please leave any form of feedback on their direct links

@snakers4
Owner Author

Managed to fix the seeding issues with the new server OS version

#34

@snakers4
Owner Author

snakers4 commented Jun 4, 2021

Update 2021-06-04

Added Zenodo direct link mirrors as well.

@snakers4
Owner Author

snakers4 commented Jun 4, 2021

Azure links were reported to be very slow
