Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve issues while running Automatic update of daily frequency data (from yahoo finance) for US region #1358

Merged
merged 10 commits into from
Dec 5, 2022

Conversation

HyeongminMoon
Copy link
Contributor

@HyeongminMoon HyeongminMoon commented Nov 15, 2022

Description

According to README.md, I tried automatic update with US region.

python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date> --region us

Current implements are based on CN index, so I entered some issues:

  1. There is no method name for US index normalize, YahooNormalizeUS1dExtend
    (AttributeError: module 'collector' has no attribute 'YahooNormalizeUS1dExtend' while running update_data_to_bin #1196)
    -> resolved at commit b2a76df
  2. While downloading qlib 1d data, exists_qlib_data always return False
    -> Not resolved. I resolved this by removing "PRN" from the qlib_data/us_index/instruments/all.txt, but I have no idea how can I integrate this in code. Anyway this is not critial issue(no error, just takes ~2min time additionally)
  3. While normalizing, some symbols cause error in pd.read_csv because of their names
    -> resolved at commit a9cb66b
  4. While parsing index, it always tries cn_index/collector
    -> resolved at commit 0be6b99

I am totally ready for accepting your guidance if there is. Thanks.

Motivation and Context

Motivated by #1196, making automatic update available for US region.
We will be able to use automatic update(with crontab!) for US region after adopting this request.

How Has This Been Tested?

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
  • Pass the test by running: python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date> --region us.

Screenshots of Test Results (if appropriate):

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

@you-n-g
Copy link
Collaborator

you-n-g commented Nov 27, 2022

Please merge master to fix the CI error. @HyeongminMoon

@SunsetWolf
Copy link
Collaborator

Regarding the second point of your description, my guess is that it is not possible to determine whether qlib_dir exists or not, which causes GetData to be executed every time update_to_bin is run, which doesn't affect anything, but takes some unnecessary time. Is my guess correct?

@HyeongminMoon
Copy link
Contributor Author

Regarding the second point of your description, my guess is that it is not possible to determine whether qlib_dir exists or not, which causes GetData to be executed every time update_to_bin is run, which doesn't affect anything, but takes some unnecessary time. Is my guess correct?

Correct.

@SunsetWolf
Copy link
Collaborator

Regarding the second point of your description, my guess is that it is not possible to determine whether qlib_dir exists or not, which causes GetData to be executed every time update_to_bin is run, which doesn't affect anything, but takes some unnecessary time. Is my guess correct?

Correct.

If that's the case, I think changing the not in this judgment statement might solve the problem.

@HyeongminMoon
Copy link
Contributor Author

Regarding the second point of your description, my guess is that it is not possible to determine whether qlib_dir exists or not, which causes GetData to be executed every time update_to_bin is run, which doesn't affect anything, but takes some unnecessary time. Is my guess correct?

Correct.

If that's the case, I think changing the not in this judgment statement might solve the problem.

No, I think this is not from exists_qlib_data, but from a bug during the download sessions. I found there is a symbol name 'PRN' in the qlib_data/us_index/instruments/all.txt while PRN.csv does not exist in downloaded source directory.

@SunsetWolf
Copy link
Collaborator

Regarding the second point of your description, my guess is that it is not possible to determine whether qlib_dir exists or not, which causes GetData to be executed every time update_to_bin is run, which doesn't affect anything, but takes some unnecessary time. Is my guess correct?

Correct.

If that's the case, I think changing the not in this judgment statement might solve the problem.

No, I think this is not from exists_qlib_data, but from a bug during the download sessions. I found there is a symbol name 'PRN' in the qlib_data/us_index/instruments/all.txt while PRN.csv does not exist in downloaded source directory.

Since windows cannot create some special name folders, "prn" is one of them, in order to ensure that users of various systems can also use qlib, we add "_qlib_" prefix to these special names, so the "prn" seen in instruments/all.txt, and features/_qlib_prn/ are matched , because in exists_qlib_data, there is no remove prefix for the names in features, so returns False, we provide a method to remove the prefix, I think this method can be used in the exists_qlib_data code, to solve the problem.

@HyeongminMoon
Copy link
Contributor Author

Regarding the second point of your description, my guess is that it is not possible to determine whether qlib_dir exists or not, which causes GetData to be executed every time update_to_bin is run, which doesn't affect anything, but takes some unnecessary time. Is my guess correct?

Correct.

If that's the case, I think changing the not in this judgment statement might solve the problem.

No, I think this is not from exists_qlib_data, but from a bug during the download sessions. I found there is a symbol name 'PRN' in the qlib_data/us_index/instruments/all.txt while PRN.csv does not exist in downloaded source directory.

Since windows cannot create some special name folders, "prn" is one of them, in order to ensure that users of various systems can also use qlib, we add "qlib" prefix to these special names, so the "prn" seen in instruments/all.txt, and features/_qlib_prn/ are matched , because in exists_qlib_data, there is no remove prefix for the names in features, so returns False, we provide a method to remove the prefix, I think this method can be used in the exists_qlib_data code, to solve the problem.

I'll try this too. Thanks.

@HyeongminMoon
Copy link
Contributor Author

Now all issues resolved, I checked the script once again too.

@@ -289,7 +289,18 @@ def __init__(

def _executor(self, file_path: Path):
file_path = Path(file_path)
df = pd.read_csv(file_path)

default_na = pd._libs.parsers.STR_NA_VALUES
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could add more context and comments here.
It is hard to understand the code without an explanation of motivation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review! I added explanation comments:)

@you-n-g
Copy link
Collaborator

you-n-g commented Dec 5, 2022

It looks perfect now.
Thanks!

@you-n-g you-n-g merged commit 9d8a8c6 into microsoft:main Dec 5, 2022
@you-n-g you-n-g added the bug Something isn't working label Dec 9, 2022
qianyun210603 pushed a commit to qianyun210603/qlib that referenced this pull request Mar 23, 2023
… (from yahoo finance) for US region (microsoft#1358)

* Update YahooNormalizeUS1dExtend(microsoft#1196)

* Prevent pandas read_csv errors while running update_data_to_bin for US region

* Fix parse_index error while running update_data_to_bin for US region

* prevent pandas.read_csv error on specific symbol names

* Reordering parameters for better rendering

* removes prefix during feature_dir existence checking

* add explanation comments
qianyun210603 pushed a commit to qianyun210603/qlib that referenced this pull request Mar 23, 2023
… (from yahoo finance) for US region (microsoft#1358)

* Update YahooNormalizeUS1dExtend(microsoft#1196)

* Prevent pandas read_csv errors while running update_data_to_bin for US region

* Fix parse_index error while running update_data_to_bin for US region

* prevent pandas.read_csv error on specific symbol names

* Reordering parameters for better rendering

* removes prefix during feature_dir existence checking

* add explanation comments
qianyun210603 pushed a commit to qianyun210603/qlib that referenced this pull request Mar 23, 2023
… (from yahoo finance) for US region (microsoft#1358)

* Update YahooNormalizeUS1dExtend(microsoft#1196)

* Prevent pandas read_csv errors while running update_data_to_bin for US region

* Fix parse_index error while running update_data_to_bin for US region

* prevent pandas.read_csv error on specific symbol names

* Reordering parameters for better rendering

* removes prefix during feature_dir existence checking

* add explanation comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants