Multiple Issues in Yahoo Data Collector Causing Failures #1981

@bigwhite37

🐛 Bug Description

The Yahoo data collector (scripts/data_collector/yahoo/collector.py) has several critical issues that prevent successful data collection and normalization: date parsing errors, pandas deprecation
warnings, type conversion failures, timezone handling problems, and slow file processing.

To Reproduce

Steps to reproduce the behavior:

  1. Run the Yahoo data collector command:
  python scripts/data_collector/yahoo/collector.py -m 64 \
    update_data_to_bin \
    --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data \
    --trading_date 2000-08-07 \
    --end_date 2025-08-13
  2. Observe multiple errors occurring during execution:
    • Date parsing ValueError at position 1183
    • FutureWarning about deprecated fillna method
    • TypeError for string/float division
    • AttributeError for tz_localize on Index objects
  3. Process fails or runs extremely slowly (2-4 files/second for 5000+ files)

Expected Behavior

The data collector should:

  • Successfully parse all date formats without errors

  • Run without pandas deprecation warnings

  • Handle data type conversions properly

  • Process files efficiently

  • Complete the full data collection and normalization pipeline

Screenshot

    ValueError: unconverted data remains when parsing with format "%Y-%m-%d": " 00:00:00", at position 1183. You might want to try:

    • passing format if your strings have a consistent format;
    • passing format='ISO8601' if your strings are all ISO8601 but not necessarily in exactly the same format;
    • passing format='mixed', and the format will be inferred for each element individually.
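
The ValueError above can be illustrated in isolation. The following is a minimal sketch, assuming pandas 2.0 or later (where format='mixed' is available); the sample date strings are made up for illustration and are not taken from the collector's data.

```python
import pandas as pd

# Hypothetical sample: date-only and date-time strings mixed in one column.
dates = pd.Series(["2025-08-11", "2025-08-12 00:00:00", "2025-08-13"])

# Without a format hint, pandas 2.x infers the format from the first element
# and may reject later elements that carry a time component.
try:
    pd.to_datetime(dates)
except ValueError as exc:
    print(f"default parsing failed: {exc}")

# format="mixed" infers the format for each element individually.
parsed = pd.to_datetime(dates, format="mixed")
print(parsed)
```

Per-element inference is slower than parsing with a single fixed format, so it may be worth applying only where mixed formats are actually expected.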

    FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.

    TypeError: unsupported operand type(s) for /: 'str' and 'float'

    AttributeError: 'Index' object has no attribute 'tz_localize'
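
The remaining three messages can be sketched the same way. This is an illustrative example, not the collector's actual code: the column names (close, factor) and the sample values are assumptions chosen to trigger each problem.

```python
import pandas as pd

# Hypothetical frame: a numeric column that arrived as strings, with a gap.
df = pd.DataFrame({"close": ["10.0", None, "bad"], "factor": [1.0, 2.0, 2.0]})

# TypeError: coerce strings to floats first; unparseable values become NaN.
df["close"] = pd.to_numeric(df["close"], errors="coerce")

# FutureWarning: use .ffill() instead of fillna(method="ffill").
df["close"] = df["close"].ffill()

# Division now operates on floats instead of str / float.
adjusted = df["close"] / df["factor"]
print(adjusted.tolist())

# AttributeError: with mixed UTC offsets, parsing without utc=True can yield
# a plain object Index, which has no tz_localize. utc=True gives a
# DatetimeIndex, so the timezone can then be dropped.
raw_index = pd.Index(["2025-08-11 00:00:00+08:00", "2025-08-12 00:00:00+00:00"])
idx = pd.to_datetime(raw_index, utc=True)
naive = idx.tz_localize(None)  # naive timestamps in UTC wall time
print(naive)
```

Note that dropping the timezone after utc=True leaves UTC wall times; whether to convert back to the exchange's local time first is a design choice this sketch does not make.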

Environment

  • Qlib version: main branch

  • Python version: 3.10

  • OS: macOS (Darwin 24.3.0)

  • Commit number: 1b42650

Additional Notes

Issues Identified:

  1. Date Parsing Error (base.py:308, collector.py:395)
    • pd.to_datetime() fails on mixed date formats ("YYYY-MM-DD" vs "YYYY-MM-DD HH:MM:SS")
    • Fix: Add format='mixed' parameter
  2. FutureWarning (collector.py:374, collector.py:462)
    • fillna(method="ffill") is deprecated
    • Fix: Replace with .ffill()
  3. Type Conversion Error (collector.py:506)
    • String data in numeric columns causes division errors
    • Fix: Add pd.to_numeric(df[_col], errors='coerce') before calculations
  4. Timezone Handling (collector.py:396)
    • Mixed timezone warning and AttributeError on Index objects
    • Fix: Add utc=True and proper DatetimeIndex conversion
  5. Performance Issues
    • ProcessPoolExecutor inefficient for I/O-bound tasks
    • Redundant CSV reads for column detection
    • Fix: Use ThreadPoolExecutor and optimize CSV reading
  6. Missing Recovery Option
    • No way to skip download after interruption
    • Fix: Add --skip_download parameter
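
The performance suggestion in item 5 could look roughly like the sketch below. It assumes the per-file work is dominated by I/O, which should be measured before switching executors. The helper names (load_csv, load_all) and the sample file names are hypothetical and do not exist in the collector.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd


def load_csv(path: Path) -> tuple[str, pd.DataFrame]:
    # Read each file once; column detection can use df.columns afterwards
    # instead of a separate read of the header.
    df = pd.read_csv(path)
    return path.stem, df


def load_all(csv_dir: Path, max_workers: int = 8) -> dict[str, pd.DataFrame]:
    paths = sorted(csv_dir.glob("*.csv"))
    # Threads avoid process start-up and pickling costs for I/O-bound reads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(load_csv, paths))


# Usage with two made-up symbol files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for symbol in ("sh600000", "sz000001"):
        sample = pd.DataFrame({"date": ["2025-08-13"], "close": [10.0]})
        sample.to_csv(tmp_dir / f"{symbol}.csv", index=False)
    frames = load_all(tmp_dir)

print(sorted(frames))
```

If normalization turns out to be CPU-bound rather than I/O-bound, a process pool may still be the better fit for that stage.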

Labels: bug
