Question about your ChFinAnn dataset and BERT model #6

Open
pjfeng opened this issue Dec 2, 2019 · 11 comments

Comments

@pjfeng

pjfeng commented Dec 2, 2019

I'm interested in how you created the ChFinAnn dataset and would like to know more details about the process. Also, when I run the BERT model, it doesn't work. Thank you.

@shun-zheng
Owner

We created the ChFinAnn dataset by distant supervision; the processing details are given in the main paper (Section 4) and the supplementary material (Section A.1).

I will refactor the BERT part later.

@pjfeng
Author

pjfeng commented Dec 5, 2019

Thank you. I want to create an annual report dataset of China A-share companies.

@shun-zheng
Owner

Sounds cool!
Currently, Doc2EDAG assumes that the input is a sequence of sentences, so it is suited to event-related documents that are expressed mainly in natural language.
An annual report, by contrast, contains a lot of semi-structured information that you would need to handle, such as tables and figures.
I would recommend Fonduer, a pioneering work on extraction from richly formatted documents.

@pjfeng
Author

pjfeng commented Dec 6, 2019

Thank you for your advice.

I work in a finance department, and we have access to Bloomberg, Wind, and Thomson Reuters. We want to use text mining and NLP to process financial news, and to do news event identification, risk identification, and quantitative factor analysis based on historical news data for individual stocks.

In this way, we can make predictions for a certain point in the future, which can be incorporated into a quantitative trading strategy. One example of such a product is Kensho in the USA, which built a good product, was backed by Goldman Sachs, and was later acquired by S&P Global.

Your work really impressed me. If possible, we could have a talk.

@xiaocuigit

> Sounds cool!
> Currently, Doc2EDAG assumes that the input is a sequence of sentences, so it is suited to event-related documents that are expressed mainly in natural language.
> An annual report, by contrast, contains a lot of semi-structured information that you would need to handle, such as tables and figures.
> I would recommend Fonduer, a pioneering work on extraction from richly formatted documents.

Hi,
Did you use Fonduer to build the knowledge base in your paper?

@shun-zheng
Owner

@pjfeng What you have mentioned is a very challenging topic, and many startups have also worked on it in recent years.
Discussions are welcome.
Thanks for your interest in our work.

@shun-zheng
Owner

@xiaocuigit Fonduer is about extracting inter-entity relations from richly formatted documents, while Doc2EDAG focuses on extracting various event records (each with multiple entities) from a text document.
At present, Doc2EDAG does not support richly formatted documents.
But we think that is a meaningful direction to explore.

@pjfeng
Author

pjfeng commented Dec 18, 2019

@dolphin-zs Thank you. I have talked to my professor, who works on quantitative trading strategies using NLP. He is very interested in your research. Do you have time to talk about NLP in the finance field?

@YuanEric88

@dolphin-zs Could you share more details on how to use the DS-based method to generate labeled data? I am currently working on event extraction for news data, but I am stuck for lack of a data source. I would like to implement the method shown in the paper to generate a news-domain dataset.

@KyrieIrving24

> @dolphin-zs Could you share more details on how to use the DS-based method to generate labeled data? I am currently working on event extraction for news data, but I am stuck for lack of a data source. I would like to implement the method shown in the paper to generate a news-domain dataset.

Have you finished building your dataset? I have recently been wanting to build one with DS as well.

@BEILOP

BEILOP commented Nov 17, 2021

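> @dolphin-zs Could you share more details on how to use the DS-based method to generate labeled data? I am currently working on event extraction for news data, but I am stuck for lack of a data source. I would like to implement the method shown in the paper to generate a news-domain dataset.

How are your results so far? I am also working on event extraction in the news domain and would be glad to exchange ideas.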


6 participants