Description
🐛 Bug Description
The Yahoo data collector (scripts/data_collector/yahoo/collector.py) has multiple critical issues that prevent successful data collection and normalization. These include date parsing errors, deprecated
pandas method warnings, type conversion failures, timezone handling problems, and performance issues.
To Reproduce
Steps to reproduce the behavior:
- Run the Yahoo data collector command:
  python scripts/data_collector/yahoo/collector.py -m 64 \
    update_data_to_bin \
    --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data \
    --trading_date 2000-08-07 \
    --end_date 2025-08-13
- Observe multiple errors occurring during execution:
  - Date parsing ValueError at position 1183
  - FutureWarning about deprecated fillna method
  - TypeError for string/float division
  - AttributeError for tz_localize on Index objects
  - Process fails or runs extremely slowly (2-4 files/second for 5000+ files)
Expected Behavior
The data collector should:
- Successfully parse all date formats without errors
- Run without pandas deprecation warnings
- Handle data type conversions properly
- Process files efficiently
- Complete the full data collection and normalization pipeline
Screenshot
ValueError: unconverted data remains when parsing with format "%Y-%m-%d": " 00:00:00", at position 1183. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually.
FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
TypeError: unsupported operand type(s) for /: 'str' and 'float'
AttributeError: 'Index' object has no attribute 'tz_localize'
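The date parsing and timezone errors above can be reproduced outside the collector. The following is a minimal sketch, not the collector's actual code: the sample strings and variable names are made up for illustration, and `format="mixed"` assumes pandas 2.0 or later.

```python
import pandas as pd

# Hypothetical sample mimicking the mixed date strings described in this issue.
raw = pd.Series(["2000-08-07", "2000-08-08 00:00:00", "2000-08-09"])

# A fixed format rejects the entry that carries a time component.
try:
    pd.to_datetime(raw, format="%Y-%m-%d")
    strict_failed = False
except ValueError:
    strict_failed = True

# format="mixed" (pandas >= 2.0) infers the format for each element individually.
parsed = pd.to_datetime(raw, format="mixed")

# A plain Index of strings has no tz_localize; only a DatetimeIndex does.
idx = pd.Index(["2000-08-07 00:00:00+08:00", "2000-08-08 00:00:00+08:00"])
has_tz_localize = hasattr(idx, "tz_localize")

# Parse with utc=True to get a tz-aware DatetimeIndex, then drop the timezone.
dt_idx = pd.DatetimeIndex(pd.to_datetime(idx, utc=True))
naive_idx = dt_idx.tz_localize(None)
```

Note that `tz_localize(None)` on a UTC-aware index keeps the UTC wall time, so whether that is the right normalization for the collector depends on what timezone the downstream code expects.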
Environment
- Qlib version: main branch
- Python version: 3.10
- OS: macOS (Darwin 24.3.0)
- Commit number: 1b42650
Additional Notes
Issues Identified:
- Date Parsing Error (base.py:308, collector.py:395)
  - pd.to_datetime() fails on mixed date formats ("YYYY-MM-DD" vs "YYYY-MM-DD HH:MM:SS")
  - Fix: Add format='mixed' parameter
- FutureWarning (collector.py:374, collector.py:462)
  - fillna(method="ffill") is deprecated
  - Fix: Replace with .ffill()
- Type Conversion Error (collector.py:506)
  - String data in numeric columns causes division errors
  - Fix: Add pd.to_numeric(df[_col], errors='coerce') before calculations
- Timezone Handling (collector.py:396)
  - Mixed timezone warning and AttributeError on Index objects
  - Fix: Add utc=True and proper DatetimeIndex conversion
- Performance Issues
  - ProcessPoolExecutor inefficient for I/O-bound tasks
  - Redundant CSV reads for column detection
  - Fix: Use ThreadPoolExecutor and optimize CSV reading
- Missing Recovery Option
  - No way to skip download after interruption
  - Fix: Add --skip_download parameter
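For the FutureWarning and type conversion items, the proposed fixes might look roughly like the sketch below. The frame, the column names (`close`, `factor`, `adj_close`) and the values are hypothetical and only stand in for whatever the collector processes at those lines.

```python
import pandas as pd

# Hypothetical frame: a price column read as strings, with a gap and a bad cell.
df = pd.DataFrame(
    {"close": ["10.0", None, "12.5", "n/a"], "factor": [2.0, 2.0, 2.5, 2.5]}
)

# Coerce to numeric first; unparseable values become NaN instead of
# raising a TypeError later during division.
df["close"] = pd.to_numeric(df["close"], errors="coerce")

# Forward-fill with .ffill() rather than the deprecated fillna(method="ffill").
df["close"] = df["close"].ffill()

# Division now operates on floats.
df["adj_close"] = df["close"] / df["factor"]
```

One caveat with `errors="coerce"`: it silently turns bad cells into NaN, so it may be worth logging how many values were coerced per file.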
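For the performance item, here is a rough sketch of the suggested direction: read each CSV exactly once (header and rows in the same pass) and fan the reads out over a thread pool. The helper names `load_csv` and `load_all` and the sample files are assumptions for illustration, not functions from the collector.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_csv(path: Path) -> tuple[list[str], int]:
    """Read one file in a single pass, returning its header and row count."""
    with path.open(newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)  # column detection without a second read
        n_rows = sum(1 for _ in reader)
    return header, n_rows


def load_all(paths: list[Path], max_workers: int = 8) -> list[tuple[list[str], int]]:
    """Load many files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_csv, paths))


# Small demo on temporary files standing in for per-symbol CSVs.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        p = Path(tmp) / f"sym{i}.csv"
        p.write_text("date,close\n2000-08-07,1.0\n2000-08-08,1.1\n")
        paths.append(p)
    results = load_all(paths)
```

Threads mainly help while a task is waiting on disk or network. If most of the time per file goes into CPU-bound parsing or normalization, the gain may be smaller, so it is worth measuring both executors on a sample of the real data.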