Utilise read_fwf() from pandas #2
Merged
Quality of life addition for the end user. There is already a dictionary to map the Z value to symbol and the creation of the reverse is fairly simple so might as well create it.
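As a rough illustration of that reverse lookup, the sketch below inverts a small Z-to-symbol dictionary. The names `z_to_symbol` and `symbol_to_z` are hypothetical and the mapping is only a subset; the names used in the repo may differ.

```python
# Hypothetical subset of a Z -> symbol mapping; the real table covers every element.
z_to_symbol = {0: "n", 1: "H", 2: "He", 3: "Li", 4: "Be"}

# Invert the mapping. This is safe because every symbol appears exactly once.
symbol_to_z = {symbol: z for z, symbol in z_to_symbol.items()}

print(symbol_to_z["He"])
```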
The format of the file is given in the header of the file. It's wrong, but while finding that out I found that we can set up various lists and dictionaries to then use read_fwf and parse the file in one go. We still need to make a few adjustments once the parse is done, but they are fairly minor.
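A minimal sketch of the idea, assuming per-column start and end dictionaries that are zipped into the `colspecs` argument of `pandas.read_fwf`. The column names, positions and sample rows below are invented for illustration and do not match the real mass table layout.

```python
import io

import pandas as pd

# Invented fixed-width rows, not real mass table data.
raw = (
    "  1   0 n       10.5\n"
    "  1   1 H       20.5\n"
    "  4   2 He      30.5\n"
)

# Hypothetical start/end positions per column (half-open intervals, 0-indexed).
START = {"A": 0, "Z": 3, "Symbol": 8, "MassExcess": 10}
END = {"A": 3, "Z": 7, "Symbol": 10, "MassExcess": 20}

df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=list(zip(START.values(), END.values())),
    names=list(START),
    header=None,
)

# A small post-parse adjustment, similar in spirit to the clean up mentioned above.
df["N"] = df["A"] - df["Z"]
print(df)
```

Keeping the positions in dictionaries means a year with a different layout only needs a different pair of dictionaries, not a different parsing routine.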
I misread the condition so had the default and 2020 cases the wrong way round.
Now that we don't read the file line by line, the tests needed to be updated to account for this. I also learnt about the pandas.testing module so started to use that. There was a bug in the AME mass parsing: I assumed the atomic mass always started with the same A as the isotope. This is not true, so the parsing was updated to read the value from the file. The atomic mass error was also too large but is now scaled appropriately.
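For reference, a whole-frame comparison with `pandas.testing.assert_frame_equal` might look like the sketch below. The helper `add_neutron_number` is a hypothetical stand-in for a post-parse step, not a function from this repo.

```python
import pandas as pd
import pandas.testing as pdt


def add_neutron_number(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical post-parse adjustment: derive N from A and Z."""
    out = df.copy()
    out["N"] = out["A"] - out["Z"]
    return out


parsed = add_neutron_number(pd.DataFrame({"A": [1, 4], "Z": [1, 2]}))
expected = pd.DataFrame({"A": [1, 4], "Z": [1, 2], "N": [0, 2]})

# Raises AssertionError with a readable diff if values, dtypes or columns differ.
pdt.assert_frame_equal(parsed, expected)
```

Comparing the full dataframe fits the one-shot parse better than the old per-line assertions, since there is no longer a single line result to check.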
The format is different to the other years so needed a new set of START and END markers as well as a few edge-case alterations to the dataframe after the initial parse.
We can't parse all of these yet, but adding to the repo ready for when we can.
Also needed to tweak the column start and end points for a few of the columns in the mass table. The reaction files appear to be the same as later years.
Two things in one here: the name of the test function was wrong as we no longer do a per-line read, and for maintainability the tests are split into per-year functions.
If there is anything different about a year's column positions we now give it its own case branch, so there is no need for the nesting.
There is a lot going on with the line format and the format within a column so for the moment, it's quite a rough parse and we drop a few columns so there is scope for future improvements.
The class is now purely for storage so Parse is no longer a good name as it is both overly generic and incorrect.
We index on the year and use it during merging so definitely need it at this stage.
Not sure how this passed the tests when we were focusing on the parsing of this file. I must have broken it later on, but the fix was fairly simple so I have not investigated further.
This is mainly adding in the new years and splitting the AME and NUBASE parsing as they no longer have matching years. This also meant we have to merge in a slightly different way and ensure we don't lose any data unique to one set. I have also removed the validation of the year. We currently have full control over the years passed to the functions so should not get any errors. The fact that AME and NUBASE now have different years means we would have also had to add additional functionality so the decision was made to delete. We can add back in if required.
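A minimal sketch of a merge that keeps rows unique to either set, assuming an outer join on A and Z. The column names and values are invented, and the actual merge in the repo may use different keys (for example, the year).

```python
import pandas as pd

# Invented rows standing in for the separately parsed AME and NUBASE tables.
ame = pd.DataFrame({"A": [1, 4], "Z": [1, 2], "AMEMassExcess": [1.0, 2.0]})
nubase = pd.DataFrame({"A": [4, 12], "Z": [2, 6], "HalfLife": ["stable", "stable"]})

# how="outer" keeps isotopes that appear in only one of the two tables,
# filling the columns from the missing side with NaN.
merged = ame.merge(nubase, on=["A", "Z"], how="outer")
print(merged)
```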
Added references to the new years of data that we can now parse and fleshed out a bit more to demonstrate some basic usage.
ubuntu-latest points to 24.04. I will look into whether a PPA makes 3.14 available but, for the moment, I'm happy to just remove testing against this version of Python.
Continuing the work done in #1, we now make use of pandas' inbuilt `read_fwf()` to read the data as it is in fixed width format.

Turns out the format is not necessarily consistent in the files so we can't use the `widths` parameter and have to stick with setting up our own start and end columns, but it does remove the need to use the original method of reading line by line, slicing, then converting to the correct data type.

We still need to do some clean up after the initial read, but that's relatively simple so I think it's an improvement.
Data has been added from 1983, 1993 and 1995/1997 (see README updates).