Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: implement AthenaDataset._is_valid #3424

Merged
merged 1 commit into from
Aug 29, 2021

Conversation

neutrinoceros
Copy link
Member

@neutrinoceros neutrinoceros commented Jul 9, 2021

PR Summary

This patch adresse (bug does not fix) #3001 (update: now it does !)
I ran into a random CI failure again because of this brittle validator, so I hope that this patch makes it even less likely and doesn't break anything.
Long term, this validator must be refined into a more characteristic heuristic. update: done

@neutrinoceros neutrinoceros changed the title fix: simplify AthenaDataset._is_valid method simplify AthenaDataset._is_valid method Jul 9, 2021
@neutrinoceros neutrinoceros added bug tests: running tests Issues with the test setup labels Jul 9, 2021
except Exception:
pass
return False
return "vtk" in os.path.basename(filename)
Copy link
Member

@Xarthisius Xarthisius Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm 99% sure that it's not even in the filename, but it's supposed to be .vtk extension. So either:

filename.endswith(".vtk")

or

os.path.splitext(filename)[-1] == ".vtk"

or

pathlib.Path(filename).suffix == ".vtk"

(so many choices!) would be even more strict and appropriate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I went your your first suggestion in my first iteration, then changed my mind in fear that it was not conservative enough, but I'm glad that you made me change my mind again, !

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forced pushed !

@matthewturk
Copy link
Member

matthewturk commented Jul 9, 2021 via email

@neutrinoceros
Copy link
Member Author

Yup. Do you know someone in the community who use Athena and might be able to help ?

@Xarthisius
Copy link
Member

Xarthisius commented Jul 12, 2021

I think we can just follow the vtk's manual:

The legacy VTK file formats consist of five basic parts.

1. The first part is the file version and identifier. This part contains the single line: # vtk DataFile Version x.x.
This line must be exactly as shown with the exception of the version number x.x, which will vary with different
releases of VTK. (Note: the current version number is 3.0. Version 1.0 and 2.0 files are compatible with version 3.0
files.)

2. The second part is the header. The header consists of a character string terminated by end-of-line character \n. The
header is 256 characters maximum. The header can be used to describe the data and include any other pertinent
information.

3. The next part is the file format. The file format describes the type of file, either ASCII or binary. On this line the single word ASCII or BINARY must appear.

4. The fourth part is the dataset structure. The geometry part describes the geometry and topology of the dataset. This
part begins with a line containing the keyword DATASET followed by a keyword describing the type of dataset.
Then, depending upon the type of dataset, other keyword/data combinations define the actual data.

5. The final part describes the dataset attributes. This part begins with the keywords POINT_DATA or CELL_DATA, followed by an integer number specifying the number of points or cells, respectively. (It doesn’t matter whether
POINT_DATA or CELL_DATA comes first.) Other keyword/data combinations then define the actual dataset attribute
values (i.e., scalars, vectors, tensors, normals, texture coordinates, or field data).

I don't think there's anything special that Athena does, so there's always going to be an ambiguity. Looking at what frontend does, it expects 1. from above snippet and 2. in a form of string containing: time= ..., level= ..., domain= .... If we check for that we should be good.

So roughly:

  1. check if extension is ".vtk"
  2. read the first line and verify that it contains "vtk DataFile Version"
  3. read the 2nd line and verify that it contains "time", "level", and "domain".

@matthewturk
Copy link
Member

I think that would work for me.

@matthewturk
Copy link
Member

Although I'd recommend avoiding using readline in case the file turns out to be binary.

@neutrinoceros
Copy link
Member Author

neutrinoceros commented Jul 13, 2021

Cool, thanks a lot. I'll implement your suggestion.

@neutrinoceros neutrinoceros marked this pull request as draft July 13, 2021 15:00
@neutrinoceros neutrinoceros marked this pull request as ready for review July 13, 2021 17:34
@neutrinoceros
Copy link
Member Author

@Xarthisius I implemented a somewhat stricter version of your suggestion using regexes.

Although I'd recommend avoiding using readline in case the file turns out to be binary.

@matt thanks for the headsup. I ended up using the binary implementation of readline, but guarded against runaway memory leakage by setting an upper limit to the number of bytes being read (arbitrarily limited to 256, which I find is quite generous for known Athena datasets we have, so I hope it won't be a problem).

For reference, I tested my implementation by loading every Athena datasets we have available:

import yt
for name in ["MHDSloshing","ShockCloud", "MHDBlast", "RamPressureStripping", "AM06", "KeplerianDisk"]:
    print(yt.load_sample(name))

⚠️ this little script eats a couple GiB of hard drive, I could afford it because I already have every sample datasets stored locally :)

@neutrinoceros
Copy link
Member Author

Oh and now this fixes #3001

@neutrinoceros neutrinoceros linked an issue Jul 13, 2021 that may be closed by this pull request
@neutrinoceros neutrinoceros changed the title simplify AthenaDataset._is_valid method implement AthenaDataset._is_valid Jul 13, 2021
Copy link
Member

@matthewturk matthewturk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

if not re.match(b"# vtk DataFile Version \\d\\.\\d\n", fh.readline(256)):
return False
if not re.match(
b"(CONSERVED|PRIMITIVE) vars at time= .*, level= \\d, domain= \\d\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to dig into the code and relevant datasets more, but I recall that there are some Athena datasets which are just one variable only--so they may not have CONSERVED or PRIMITIVE in the header. It may take me a day or two to figure out if this is the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know ! it may be useful to have such a dataset in the sample collection. I wrote using only the ones we already have as validation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked, you remembered correctly:
https://princetonuniversity.github.io/Athena-Cversion/AthenaDocsUGbtk

From the source code (see line 129), it looks like in the case of single-variables, "CONSERVED vars" is replaced by "Really cool Athena data". I could reproduce this joke and hardcode it in yt too, but it seems safer to assume it could change any time in the future (no matter how unlikely), so I'm going to loosen the strictness of this regexp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me.

@jzuhone jzuhone merged commit 3df7165 into yt-project:main Aug 29, 2021
@neutrinoceros
Copy link
Member Author

Wait @jzuhone , I think there was a misunderstanding here. You merged before I could push the fix you required !
I'm going to open a followup hotfix.

@matthewturk
Copy link
Member

@meeseeksdev backport to yt-4.0.x

meeseeksmachine pushed a commit to meeseeksmachine/yt that referenced this pull request Oct 15, 2021
neutrinoceros added a commit that referenced this pull request Oct 16, 2021
…4-on-yt-4.0.x

Backport PR #3424 on branch yt-4.0.x (implement AthenaDataset._is_valid)
@neutrinoceros neutrinoceros changed the title implement AthenaDataset._is_valid BUG: implement AthenaDataset._is_valid Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug tests: running tests Issues with the test setup
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AthenaDataset._is_valid() is extremely brittle
4 participants