Keyerror for getattr(stage, classification) #14

Closed
hidde1977 opened this issue Aug 28, 2023 · 13 comments

@hidde1977

The line data = getattr(stage, classification)() raises an error that seems to come from the pcs code:

       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "C:\Users%username%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\stage_scraper.py", line 298, in results
table = join_tables(table, table_parser.table, "rider_url")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users%username%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\utils.py", line 162, in join_tables
table.append({**table2_dict[row[join_key]], **row})
~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: 'rider/laurens-de-plus'

@themm1
Owner

themm1 commented Aug 28, 2023

It's fixed; however, I had to remove riders who didn't finish the stage from TTT results, because it's not possible to get additional info about them (age, nationality...) from the GC, and almost all of their stage-specific fields would be unknown (time, points, bonus...). I hope that won't be a problem, since you can still parse the startlist or the previous stage and see which riders are missing after the TTT. The new version will soon be available on PyPI.
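
A minimal sketch of the failure and of the kind of fix described above. This is not the actual procyclingstats code: join_tables_inner is a hypothetical stand-in, and the rows are made up. The traceback shows the join looking up each row's rider_url in a dict built from the other table, which raises KeyError when a rider is absent from it; skipping unmatched rows avoids that.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def join_tables_inner(table1: List[Row], table2: List[Row], join_key: str) -> List[Row]:
    """Join two lists of dicts on join_key, dropping rows without a match."""
    table2_by_key = {row[join_key]: row for row in table2}
    joined = []
    for row in table1:
        match = table2_by_key.get(row[join_key])
        if match is None:
            # No counterpart in table2 (e.g. a rider who did not finish): skip.
            continue
        joined.append({**match, **row})
    return joined


# Made-up rows for illustration only.
stage_rows = [
    {"rider_url": "rider/a", "rank": 1},
    {"rider_url": "rider/b", "rank": None},
]
gc_rows = [{"rider_url": "rider/a", "age": 27}]

joined = join_tables_inner(stage_rows, gc_rows, "rider_url")
print(joined)
```

The trade-off is the one mentioned in the comment: unmatched riders silently disappear from the joined table, so a caller who needs them has to compare against the startlist.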

@themm1 themm1 closed this as completed Aug 28, 2023
@hidde1977
Author

hidde1977 commented Aug 28, 2023 via email

@hidde1977
Author

Mmm, I still get the Laurens de Plus(ki) KeyError:

Traceback (most recent call last):
File "c:\Users%%\OneDrive\coding\py\wielerpoule\import_results_from_pcs_vuelta_23.py", line 156, in <module>
data = getattr(stage, classification)()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users%%AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\stage_scraper.py", line 298, in results
table = join_tables(table, table_parser.table, "rider_url")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users%%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\utils.py", line 162, in join_tables
table.append({**table2_dict[row[join_key]], **row})
~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: 'rider/laurens-de-plus'

@hidde1977
Author

Or should I wait for an update in pip?

@themm1
Owner

themm1 commented Aug 28, 2023

Well, I don't know the exact URL where you get the error, but for this URL: https://www.procyclingstats.com/race/vuelta-a-espana/2023/stage-1 everything works fine in the latest PyPI version. (I published version 0.1.7 on PyPI today, so make sure that you have upgraded.)

@hidde1977
Author

Thx.

I updated; now the error no longer seems related to the pcs code. I guess it has to do with S01 being a TTT: the code tries to get columns that are expected in "normal" stage results but are unavailable here. The stage type (TTT, etc.) is not in the dataframe of the stages, right? Any suggestion for how to filter this out with an if/else?
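
One way to guard against columns that are missing for some stage types, sketched in plain Python so it runs without pandas. select_columns and missing_columns are hypothetical helpers and the sample row is invented; the idea is to check which of the wanted fields are actually present and to fill the rest with None rather than fail.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def missing_columns(rows: List[Row], wanted: Sequence[str]) -> List[str]:
    """Return the wanted fields that no row provides."""
    present = set()
    for row in rows:
        present.update(row.keys())
    return [name for name in wanted if name not in present]


def select_columns(rows: List[Row], wanted: Sequence[str]) -> List[Row]:
    """Keep only the wanted fields; fields absent from a row become None."""
    return [{name: row.get(name) for name in wanted} for row in rows]


# Invented row that lacks rider_number.
sample_rows = [
    {"rank": 1, "rider_name": "Example Rider", "team_name": "Example Team", "status": "DF"}
]
wanted = ["rank", "rider_number", "rider_name", "team_name", "status"]

missing = missing_columns(sample_rows, wanted)
selected = select_columns(sample_rows, wanted)
print(missing)
print(selected)
```

With a pandas DataFrame, df.reindex(columns=wanted) has a similar effect, filling absent columns with NaN.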

@themm1
Owner

themm1 commented Aug 28, 2023

It should return the same table as a normal stage, but now I see that the rider_number field is missing in TTT results. I will try to fix this soon. All other normal results fields are present in the TTT results table.

@hidde1977
Author

Thx. I am unsure whether it is the TTT thing; riders are not in these results anyway, although I try to create the columns rider_name and rider_number in my data. Anyway, I get this traceback, also when I skip scraping stage 1 (the relevant part of the code is copied below):

(NB: my code generally generates CSVs, to be merged into an xlsx in the end, with all relevant results: the startlist, all stage results, and all classifications after each stage.)

Traceback (most recent call last):
File "c:\Users%%\OneDrive\coding\py\wielerpoule\import_results_from_pcs_vuelta_23.py", line 156, in <module>
data = getattr(stage, classification)()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users%%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\stage_scraper.py", line 315, in results
table_parser.parse(fields)
File "C:\Users%%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\table_parser.py", line 112, in parse
self._make_times_absolute()
File "C:\Users%%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\table_parser.py", line 398, in _make_times_absolute
row[time_field] = add_times(first_time, row['time'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users%%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\utils.py", line 107, in add_times
tdelta2 = time_to_timedelta(format_time(time2))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users%%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\utils.py", line 76, in time_to_timedelta
[hours, minutes, seconds] = [int(value) for value in time.split(":")]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users%%\AppData\Local\Programs\Python\Python311\Lib\site-packages\procyclingstats\utils.py", line 76, in <listcomp>
[hours, minutes, seconds] = [int(value) for value in time.split(":")]
^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '0-1'

Code:

# Loop 1: create CSVs for the results and classifications of each stage

for stage_number in stages:
    race = f'race/{race_name}/{race_year}/overview' 
    stage_url = f'race/{race_name}/{race_year}/stage-{stage_number}'
    startlist_url = f'race/{race_name}/{race_year}/startlist'
    try:
        stage = Stage(stage_url)
    except ValueError:
        print(f"No data found for stage {stage_number}. Exiting the loop.")
        break
    try:
        startlist = RaceStartlist(startlist_url)
        try:
            startlist_data = startlist.startlist('rider_name', 'rider_url', 'team_name', 'team_url', 'nationality', 'rider_number')
            startlist_df = pd.DataFrame(startlist_data, columns=['Rider', 'URL', 'Team', 'Team URL', 'Nationality', 'BIB'])
            # print(f"Startlist data: {startlist_data}") # debug to print full list
            # input("Press enter to continue")
        except ValueError:
            print("ValueError encountered while retrieving startlist.")
            print("Loop 1 - no startlist available yet - creating empty stage results file.")
            startlist_data = []
            print(f"Startlist data: {startlist_data}")
        for classification in classifications:
            try:
                # Call the method associated with the classification
                data = getattr(stage, classification)()
                #print(data)
                #input("press enter")
                if data:
                    # Convert list of dictionaries to DataFrame
                    df = pd.DataFrame(data)
                    # Different classifications have different columns
                    if classification == 'results':
                        necessary_columns = ['rank', 'rider_number', 'rider_name', 'team_name', 'status']
                        new_columns = ['Rnk', 'BIB', 'Rider', 'Team', 'Status']
                    elif classification == 'teams':
                        necessary_columns = ['rank', 'team_name']  
                        new_columns = ['Rnk', 'Team']
                    else:  # 'gc', 'points', 'kom', 'youth'
                        necessary_columns = ['rank', 'rider_number', 'rider_name', 'team_name']
                        new_columns = ['Rnk', 'BIB', 'Rider', 'Team']
                    # Ensure the necessary columns exist before subsetting
                    if set(necessary_columns).issubset(df.columns):
                        print(
                            f"Subsetting DataFrame for {classification} to include only {necessary_columns}. Available columns are: {df.columns.tolist()}")
                        df = df[necessary_columns]
                        df.columns = new_columns
                    else:
                        print(
                            f"Expected columns {necessary_columns} not found in DataFrame for {classification}. Available columns are: {df.columns.tolist()}")
                    filename = f'{stage_number}.csv' if classification == 'results' else f'Stage_S{stage_number}_{classification}.csv'
                else:
                    # In case 'data' is empty
                    if classification != 'teams':
                        df = pd.DataFrame(columns=['Rnk', 'BIB', 'Rider', 'Team', 'Status'])
                    else:
                        df = pd.DataFrame(columns=['Rnk', 'Team'])
                    filename = f'Stage_S{stage_number}_{classification}.csv'
                # print(f'Final DataFrame for {classification}:\n{df}')
                print(f'Writing data to {filename}')
                df.to_csv(f'C:\\Users\\hidde.reitsma\\OneDrive\\Wielerpoultjes\\{race_year}\\{race_name}\\codefiles\\{filename}', index=False)
                now = datetime.datetime.now()
                print(f'{now} - file {filename} updated')
            except ExpectedParsingError:
                df = pd.DataFrame()
                print(f"No data for stage {stage_number} classification {classification}. Available columns are: {df.columns.tolist()}")
                if classification == 'results':
                    df = pd.DataFrame(columns=['Rnk', 'BIB', 'Rider', 'Team', 'Status'])
                    filename = f'{stage_number}.csv'
                elif classification == 'teams':
                    df = pd.DataFrame(columns=['Rnk', 'Team'])
                    filename = f'Stage_S{stage_number}_{classification}.csv'
                else:
                    df = pd.DataFrame(columns=['Rnk', 'BIB', 'Rider', 'Team'])
                    filename = f'Stage_S{stage_number}_{classification}.csv'
                df.to_csv(f'C:\\Users\\hidde.reitsma\\OneDrive\\Wielerpoultjes\\{race_year}\\{race_name}\\codefiles\\{filename}', index=False)
                print(f'File {filename} updated with empty dataframe due to parsing error')
    except AttributeError:
        print("AttributeError encountered while retrieving startlist.")
        traceback.print_exc()  # This will print details about the error
        print("Loop 1 - no startlist available yet - creating empty stage results file.")
        startlist_data = []
        print(f"Startlist data: {startlist_data}")
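
The loop above only catches ExpectedParsingError around the classification call, so a ValueError raised during parsing, like the one in the traceback, stops the whole run. A minimal sketch of catching it per classification follows. fetch_classification and FakeStage are hypothetical names used so the example runs without network access; with the real library the object would be a Stage instance.

```python
class FakeStage:
    """Stand-in for a scraper object; not part of procyclingstats."""

    def results(self):
        # Simulates the parsing failure from the traceback.
        raise ValueError("invalid literal for int() with base 10: '0-1'")

    def gc(self):
        return [{"rank": 1, "rider_name": "Example Rider"}]


def fetch_classification(stage, classification):
    """Call stage.<classification>() and return [] if parsing raises ValueError."""
    method = getattr(stage, classification)
    try:
        return method()
    except ValueError as exc:
        print(f"Skipping {classification}: {exc}")
        return []


fake_stage = FakeStage()
results_data = fetch_classification(fake_stage, "results")
gc_data = fetch_classification(fake_stage, "gc")
print(results_data, gc_data)
```

Returning an empty list lets the existing "data is empty" branch write an empty CSV for that classification while the remaining classifications are still processed.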

@hidde1977
Author

One more thing: I added a debug print line. With data = getattr(stage, classification), print(f"Data for {classification}: {data}") gives:

Data for results: <bound method Stage.results of Stage(url='https://www.procyclingstats.com/race/vuelta-a-espana/2023/stage-1')>

so the wrong data for results seems to be scraped, right?
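
The debug output above is a bound method rather than scraped data, because the debug line uses getattr(stage, classification) without the trailing parentheses, so the method is looked up but never called. A small sketch with a hypothetical Demo class shows the difference:

```python
class Demo:
    """Hypothetical class standing in for a scraper object."""

    def results(self):
        return [{"rank": 1}]


demo = Demo()
method = getattr(demo, "results")    # bound method, not yet called
data = getattr(demo, "results")()    # calling it returns the data

print(callable(method))
print(data)
```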

@themm1
Owner

themm1 commented Aug 29, 2023

The error you mentioned is caused by this line on the page you are parsing results from: https://www.procyclingstats.com/race/vuelta-a-espana/2023/stage-2 I can't do much about it, because the problem is in the PCS results page. I think they will fix the times soon.
image

@hidde1977
Author

Ah! Yes, that's it.

@hidde1977
Author

@themm1
Owner

themm1 commented Aug 29, 2023

Well, I don't really understand what "-1:0-1:0" means. But I think it's odd that they list the times from 9k in the finish results, even if those are what count for the GC. It would make more sense to just use the real finish-line times in the stage results. If the results don't change in a few days, I will have to try to deal with that time notation, however.
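
For reference, splitting "-1:0-1:0" on ":" yields "-1", "0-1" and "0", which is consistent with the int() failure on '0-1' in the traceback. A minimal sketch of validating a time string before converting it follows. parse_time_or_none is a hypothetical helper, not a procyclingstats function, and it assumes times are written as H:MM:SS.

```python
import re
from datetime import timedelta
from typing import Optional

# Assumed format: hours, minutes and seconds as plain digits separated by colons.
TIME_PATTERN = re.compile(r"(\d+):(\d{1,2}):(\d{1,2})")


def parse_time_or_none(text: str) -> Optional[timedelta]:
    """Return a timedelta for H:MM:SS input, or None if the text is malformed."""
    match = TIME_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


parts = "-1:0-1:0".split(":")
good = parse_time_or_none("0:14:35")
bad = parse_time_or_none("-1:0-1:0")
print(parts, good, bad)
```

Returning None leaves it to the caller to decide whether to skip the row, keep the raw string, or report the value.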
