Create PathProvider that uses numtracker #991
Comments
@DiamondJoseph Where are we going to get visit information from? Does it still have to be hard-coded for the time being? |
@callumforrester I believe we thought that the visit should be passed as part of the request to blueapi, or as an extra parameter when logging in that is cached and passed with every request? After all, one could collect data to a commissioning visit, then switch to a live visit once the beamline is configured. |
In which case we start here? DiamondLightSource/blueapi#552 |
I believe that ticket is about passing the metadata into the documents, not about how it is passed into the run/pod. (e.g. instrument could be a configuration value that gets mounted as an env var: it doesn't need to be passed into each run command; the scan number is obtained from numtracker, and shouldn't (?) be overridden). I think the ticket you're looking for is #452 |
Although #452 probably belongs in blueapi |
To get this straight... The current problem is that we have a messy decorator on our plans that triggers the global The current idea is to write a
sequenceDiagram
Client->>RunEngine: run plan
RunEngine->>Scan Number Source: next number?
Scan Number Source->>RunEngine: <number>
RunEngine->>Detector: prepare
Detector->>PathProvider: path?
PathProvider->>RunEngine: number?
RunEngine->>PathProvider: <number>
PathProvider->>Detector: <path(<number>)>
RunEngine->>Client: run start(<number>)
For added context we are thinking that the scan number source will be https://github.com/DiamondLightSource/numtracker
We would create and inject this provider in blueapi (where the
Take a simple use case with a single plan, running a single scan (run) using a single detector. This would call the scan number source once, get a new number and produce an HDF5 file with that number, and a

    def very_simple_plan(det: StandardDetector) -> MsgGenerator[None]:
        yield from count(det, num=5)

    # Output files
    # |-> ixx-1234-det.h5
    # |-- ixx-1234.nxs

Downstream we would process documents associated with each
Bluesky supports multiple runs per plan, for example:

    def simple_plan(det: StandardDetector) -> MsgGenerator[None]:
        yield from count(det, num=5)
        yield from count(det, num=10)

    # Output files
    # |-> ixx-1235-det.h5
    # |-- ixx-1235.nxs
    #
    # |-> ixx-1236-det.h5
    # |-- ixx-1236.nxs

Each
The other use case we need to support is multiple runs linking to the same file, primarily for detector performance reasons. For example:

    def plan(det: StandardDetector) -> MsgGenerator[None]:
        yield from stage(det)
        # Actually causes the HDF5 AD plugin to open the file for writing
        yield from prepare(det, ...)
        for num_frames in [5, 8, 2]:
            yield from bps.open_run()
            # One run per entry, taking num_frames single shots
            for _ in range(num_frames):
                yield from bps.one_shot(det)
            yield from bps.close_run()
        yield from unstage(det)

    # Output files
    # |-> ixx-1237-det.h5
    # |-- ixx-1237.nxs
    # |-- ixx-1238.nxs
    # |-- ixx-1239.nxs

This leaves us with default behaviour that is largely customizable.
Tagging @DominicOram and @olliesilvester for thoughts about how well this would support MX use cases. |
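The three file layouts listed in the "Output files" comments above can be summarised in a small sketch. It only reproduces the naming pattern from those comments; the function `output_files` and its `share_detector_file` flag are hypothetical names for illustration and are not part of bluesky, ophyd-async or numtracker.

```python
def output_files(
    beamline: str,
    scan_numbers: list[int],
    detector: str,
    share_detector_file: bool = False,
) -> tuple[list[str], list[str]]:
    """Return (hdf5_files, nexus_files) for the given run numbers.

    Hypothetical helper: one NeXus file per run, and either one HDF5 file
    per run or a single HDF5 file named after the first run.
    """
    if not scan_numbers:
        raise ValueError("at least one scan number is required")
    nexus_files = [f"{beamline}-{n}.nxs" for n in scan_numbers]
    if share_detector_file:
        # Detector stays staged across runs, so all runs link to one file
        hdf5_files = [f"{beamline}-{scan_numbers[0]}-{detector}.h5"]
    else:
        hdf5_files = [f"{beamline}-{n}-{detector}.h5" for n in scan_numbers]
    return hdf5_files, nexus_files


if __name__ == "__main__":
    print(output_files("ixx", [1237, 1238, 1239], "det", share_detector_file=True))
```

The shared-file case assumes the HDF5 file takes the number of the first run, which is what the third listing above shows.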
This works for some of our use cases but we will also need the ability for one run to produce one hdf file but multiple nexus files. This is because we have one hardware fly scan that we want to split into two nexus files. We do this currently by doing something like:

    def plan(det: StandardDetector) -> MsgGenerator[None]:
        yield from stage(det)
        # Actually causes the HDF5 AD plugin to open the file for writing
        yield from prepare(det, ...)
        yield from bps.open_run(md={"nxs_indexes": [0, 100]})
        yield from bps.kickoff(det)
        yield from bps.complete(det)
        yield from bps.close_run()
        yield from unstage(det)

    # Output files
    # |-> sample_name_1_000001.h5
    # |-> sample_name_1_000002.h5
    # |-- sample_name_1.nxs
    # |-- sample_name_2.nxs

I think this is covered by
There is also the added complication, which you haven't covered above, that the Eiger will actually spit out multiple h5 files as it will only put max 1000 frames in each. I don't think this makes a difference to your assumptions but worth mentioning. |
Good point, yes we can effectively decouple the hdf5 files from the nexus files using your suggested approach, so I think everything still works. |
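To make the Eiger point concrete, here is a minimal sketch of how many HDF5 files a collection would produce if each file holds at most 1000 frames. The function `odin_data_files` is hypothetical, and the 1-based, six-digit suffix is an assumption read off the `sample_name_1_000001.h5` listing above rather than a documented Odin guarantee.

```python
import math


def odin_data_files(prefix: str, total_frames: int, frames_per_file: int = 1000) -> list[str]:
    """List the HDF5 file names expected for a collection (hypothetical helper).

    Assumes files are numbered from 1 with a six-digit, zero-padded suffix.
    """
    if total_frames < 0:
        raise ValueError("total_frames must not be negative")
    if frames_per_file <= 0:
        raise ValueError("frames_per_file must be positive")
    n_files = math.ceil(total_frames / frames_per_file)
    return [f"{prefix}_{i:06d}.h5" for i in range(1, n_files + 1)]


if __name__ == "__main__":
    for name in odin_data_files("sample_name_1", 1500):
        print(name)
```

Because the NeXus files only link to frames, the number of HDF5 files can vary without changing how the NeXus files are named.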
Slight update to the diagram that makes use of numtracker concrete:
sequenceDiagram
Client->>RunEngine: run plan
RunEngine->>Numtracker: next number?
Numtracker->>RunEngine: <number>
RunEngine->>Detector: prepare
Detector->>PathProvider: path?
PathProvider->>RunEngine: number?
RunEngine->>PathProvider: <number>
PathProvider->>Numtracker: file path(<number>, <detector>)?
Numtracker->>PathProvider: <file path>
PathProvider->>Detector: <file path>
RunEngine->>Client: run start(<number>)
@tpoliaw, sanity check? :) |
If I understand the flowchart correctly, I don't think this will work with the current service API. You have two calls to numtracker ( With the current setup, I think the plan needs to know up front the detectors that will be used in the collection when it requests scan information. |
So previously that wasn't possible because the plan itself didn't know at that point. Following bluesky/ophyd-async#314 we may be able to access this information from the plan at the right time (although I'm not sure if we can access it from the |
Just checking that all 3 use cases from bluesky/bluesky#1849 will be covered by this... |
@coretl they are as far as we can see |
After discussion with @tpoliaw and @DiamondJoseph we think that a single call to numtracker per run is worth investigating. We would need to know the detectors (that might be) involved upfront, include them in our query, and cache the result in a
sequenceDiagram
Client->>RunEngine: run plan(visit)
RunEngine->>Detector1: stage
RunEngine->>Detector2: stage
RunEngine->>RunEngine: on_run_start
RunEngine->>PathProvider: scan_id_source(staged_devices, visit)
PathProvider->>Numtracker: new_scan(staged_devices, visit)
Numtracker->>PathProvider: {"scanNumber": <id>, "scanFile": /data/<beamline>-<id>, {"detector1": <path>/d1, "detector2": <path>/d2}}
PathProvider->>RunEngine: <id>, /data/<beamline>-<id>
RunEngine->>NexusWriter: run_start(<id>, /data/<beamline>-<id>)
RunEngine->>NexusWriter: descriptor(detector1, detector2)
RunEngine->>Detector1: prepare
Detector1->>PathProvider: get_path(detector1)
PathProvider->>Detector1: <path>/d1
RunEngine->>Detector2: prepare
Detector2->>PathProvider: get_path(detector2)
PathProvider->>Detector2: <path>/d2
RunEngine->>NexusWriter: stream_resource(detector1)
RunEngine->>NexusWriter: stream_resource(detector2)
RunEngine->>Detector1: kickoff
RunEngine->>Detector2: kickoff
RunEngine->>NexusWriter: stream_datum(detector1)
NexusWriter->>NexusWriter: create /data/<beamline>-<id>.nxs
RunEngine->>NexusWriter: stream_datum(detector2)
RunEngine->>NexusWriter: stream_datum(detector1)
RunEngine->>NexusWriter: stream_datum(detector2)
RunEngine->>Detector1: complete
RunEngine->>Detector2: complete
RunEngine->>NexusWriter: run_stop(<id>)
RunEngine->>Client: Done
NexusWriter->>NexusWriter: close /data/<beamline>-<id>.nxs
The upside of this is that numtracker is guaranteed to give you a unique scan ID for every call you make to it, so it is impossible to query numtracker about a preexisting scan and accidentally overwrite your data. The downside is the same restriction: it is impossible to query numtracker about a preexisting scan. We couldn't actually think of a use case for that; if anyone can, please post it here. This also means we have to do some caching and bookkeeping in our
There are a few things the diagram glosses over:
|
Steps to accomplish this:
<somehow pass the staged_devices to either the scan_id_source method or the NumTrackerPathProvider>
|
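The single-call-per-run idea from the diagram above could look roughly like the following sketch. Everything here is an assumption for illustration: `FakeNumtracker` is an in-memory stand-in rather than the real numtracker client, the `"detectors"` key and the `/data/<visit>/...` layout are invented to give the response a concrete shape, and `CachingPathProvider` is not the ophyd-async `PathProvider` interface.

```python
from itertools import count
from typing import Optional


class FakeNumtracker:
    """In-memory stand-in for numtracker: each call yields a new scan number."""

    def __init__(self, beamline: str, start: int = 1) -> None:
        self._beamline = beamline
        self._numbers = count(start)

    def new_scan(self, detectors: list[str], visit: str) -> dict:
        number = next(self._numbers)
        stem = f"/data/{visit}/{self._beamline}-{number}"  # assumed layout
        return {
            "scanNumber": number,
            "scanFile": stem,
            "detectors": {name: f"{stem}-{name}" for name in detectors},
        }


class CachingPathProvider:
    """Asks numtracker once per run and serves detector paths from the cache."""

    def __init__(self, numtracker: FakeNumtracker) -> None:
        self._numtracker = numtracker
        self._current: Optional[dict] = None

    def scan_id_source(self, staged_devices: list[str], visit: str) -> int:
        # Called once at run start; replaces the cache for the new run
        self._current = self._numtracker.new_scan(staged_devices, visit)
        return self._current["scanNumber"]

    def get_path(self, detector: str) -> str:
        if self._current is None:
            raise RuntimeError("no scan has been started yet")
        try:
            return self._current["detectors"][detector]
        except KeyError:
            raise KeyError(f"{detector!r} was not declared at run start") from None


if __name__ == "__main__":
    provider = CachingPathProvider(FakeNumtracker("ixx", start=1234))
    print(provider.scan_id_source(["detector1", "detector2"], "cm12345-1"))
    print(provider.get_path("detector1"))
```

The failure in `get_path` for an undeclared detector illustrates why the detectors need to be known up front with this approach.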
Following the plan above I have made a mockup of how we might expect to integrate all this, including comments explaining changes needed to dependencies and potential issues: https://gist.github.com/callumforrester/5e107e40034e198aa5d99beadc15c57a
It includes tests for all of @coretl's use cases.
Output when I run it:
|
Not saying I'm necessarily disagreeing with this but we are painting ourselves into a corner. If I expose a bluesky plan for changing energy why must I provide a visit to call this? I can't see how it will actually be used. Can we make it optional? Apologies, I should have added this earlier but we currently have the behavior of:
Both of these are only the current behavior; we can push back on the scientists about it, but it will cost us political goodwill. |
Do you have an example of how these are combined to build the path? How are your files currently named? |
For Hyperion we have:
In GDA there is a field where a user can specify something like I'm not defending the folder structure as it stands, it's a mess. But if we change it we break a hell of a lot of downstream processing and make it harder for users to switch from a GDA experiment to a bluesky experiment seamlessly. This is why we haven't changed it so far but we may try and sanitize it soon now we're mostly bluesky on i03. |
The way the current path templating is set up, the file for a detector is a function of If including sample information is needed in the filename (as well as just in the visit subdirectory) we could look at having arbitrary keys in the template. Would need to look into where the values are sourced from. As far as I know, bluesky doesn't have an equivalent to |
Yes, it's a case we can either support or support you writing your own
As far as painting ourselves into a corner goes: we can always make the visit an optional parameter. Then if someone doesn't pass it and just wants to move a motor, all good; if they don't pass it but they need to record data, blueapi will return an error. |
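The optional-visit rule described above can be sketched as a small check. This is not blueapi code: `VisitRequiredError`, `resolve_visit` and the `writes_data` flag are hypothetical, and how blueapi would actually know that a plan records data is left open.

```python
from typing import Optional


class VisitRequiredError(Exception):
    """Raised when a data-writing plan is submitted without a visit."""


def resolve_visit(visit: Optional[str], writes_data: bool) -> Optional[str]:
    """Allow visit-less requests unless the plan needs to record data."""
    if visit is None and writes_data:
        raise VisitRequiredError("a visit is required for plans that record data")
    return visit


if __name__ == "__main__":
    # Moving a motor without a visit is accepted
    print(resolve_visit(None, writes_data=False))
```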
Yes, it's just to avoid duplicates.
Yes, added by odin. Odin only saves 1000 frames per file so will create a number of
Currently we don't have experiments with multiple detectors at the same time but we will soon and it would aid in debugging to be able to name them separately.
They are linked in ispyb.
Yes, this would be good.
Can we source it out of event documents? Most of it will be in the run metadata but I can see use cases for having it in events, e.g. I read a value from hardware and I want to base my filename on that.
Great, we'd be happy to write our own
Yes please, thanks! |
I think that requiring someone to be on a current visit in order to move motors is a fair restriction. In what scenario should someone be able to control a beamline they're not associated with? I know epics still provides access to anyone on the beamline network but if we're taking that approach there shouldn't be any authz in bluesky/api. |
With the future assumption that logging into blueapi will prompt you with a number of visits to choose from and then your visit will be cached alongside your access token and passed into every request |
As discussed in person. I'm happy for specifying the visit with every call as long as GDA does this for me i.e. I can call |
We'll have to re-generate the Java Blueapi client in GDA when we make a change to the API, when that gets done we'll be sure to propagate the visit from GDA. |
Here is a new, slightly different version of the gist based on various discussions: https://gist.github.com/callumforrester/5e107e40034e198aa5d99beadc15c57a?permalink_comment_id=5429079#gistcomment-5429079
This is not a definite direction but an RFC after talking to @noemifrisina, @DominicOram, @abbiemery and @tpoliaw.
Philosophy
I think there are enough use cases for different file writing conventions that we may not want to centralise the templating logic in numtracker. This version of the gist keeps it in a Python plan decorator, acknowledging that every plan is special/custom since we don't have one way to do things. In this scenario numtracker provides a scan number and a visit directory (referred to in the gist as
Implementation
Decorating your plan effectively customises the way it tells its detectors to write files. The decorator has a hook that allows customising the paths per detector. The following is passed into the hook:
Every scan still has the following metadata, sourced from numtracker:
This version of the gist also implements @DominicOram's suggestion of making the visit an optional parameter, though that may need more discussion too.
Evaluation
The simultaneous pro and con of this approach is that it provides a big toolbox with many components that can be assembled in many ways, which means it is flexible enough to cover all use cases we have seen so far, but also easy to mess up. We have this problem currently with |
I'm hugely opposed to this implementation. Allowing the ability to override where the data goes leaves us open for a litany of data security issues, and requiring that a plan is decorated leaves us open to significant opportunities for things to go wrong.
I think we must enforce a PathProvider and FilenameProvider implementation, with some configuration (as currently expressed in the NumTracker Templating) to allow for common cases (e.g. detectors in subdirs). There is an oncoming train of data integrity concerns, and having to adjust to differently named files is the price of not risking much worse consequences. |
Does this rule out being able to name files after data read from the beamline e.g.
I can't speak for all detectors but the version of odin we use will stop you from doing this.
As I said above this will have political consequences. It will be viewed by our users as something that GDA can do and Bluesky cannot. If we're going to make this decision then we will need to canvass some representative scientists from all teams as to whether it's acceptable. |
I'm OK with NexGen having a different scheme for naming NeXus files, as I trust you to ensure that they are non-overlapping, but I do not think the decision on what h5 files are named should be accessible to users. I'd like to get input from LIMS and CyberSecurity teams first: find out what is feasible first. |
@DiamondJoseph users absolutely have freedom to name their files as they see fit: this is not a matter for cyber security.
For unattended data collection we have transformations from what they request via ISPyB, but that is still essentially "free".
Preventing data overwriting is an implementation problem, which is already handled by GDA with "run numbers". |
@graeme-winter It's the run numbers that we are trying to enforce use of here: if I can define my own PathProvider/FilenameProvider, then I can name my file however I want: I can name every single file ixx-0001.h5 if I choose to. |
@graeme-winter If GDA prevents overwriting with run numbers how do users have the freedom to name their files as they see fit? Isn't GDA enforcing a unique run number removing some of those freedoms by definition? The second approach I prototyped really does give the user the power to name their files whatever they want, including overriding the run number, because it's just a Python hook, which is why I'm collecting opinions. The middle ground here, proposed by @tpoliaw is to allow multiple naming schemes but make numtracker responsible for all of them, ensuring unique file naming but raising the barrier to entry if you want it to use your custom schemes. Important note: I am taking no side here! I am collecting requirements/opinions. They are all valued so thank you. |
User defines folder and prefix, GDA appends run numbers to ensure uniqueness.
To be clear, data is also constrained to be within the visit.
We are also not allowing the files to be called f̸̪͚́̋̕ó̵̤̭̅ö̸̧́̽̚b̸̠̯̀͊̕ͅă̴̜̭̭̚r̵̲͚̎_̵̖́1̶̛͙̄͋.̶̺̈́̈́n̸͚̰̠̾̑͘x̸̬̤͈́̚s̷͚̲̯̀
But equally we are not insisting that the files are called 2025-02-10-15-03-01-scan1-eiger.nxs |
The proposed templating requirements for a h5 file are: DetectorTemplate { And for the NeXus file: ScanTemplate { |
@graeme-winter makes sense, I think you're headed in the direction of @tpoliaw's middle ground then: Users submit their own file names, numtracker's job is to append a scan number to ensure they are unique. Odin refusing to overwrite file names is also a good data protection measure, don't know if AD does the same, or can be made to. This is definitely an option: it gives us parity with GDA, customisation (which always makes users happy) and keeps some degree of security. @DiamondJoseph's point of view also deserves a fair hearing though, if cybersecurity want to tell us that users shouldn't be allowed to name files what they like then that's their call. Again, this is just me trying to push decisions to people who are actually paid to make decisions. |
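A minimal sketch of that middle ground follows: the user picks a folder and prefix, a scan number is appended for uniqueness, and the result is confined to the visit directory. The function `build_data_path`, the `_` separator and the example visit path are assumptions for illustration, not numtracker's or GDA's actual behaviour.

```python
from pathlib import PurePosixPath


def build_data_path(
    visit_dir: str,
    user_subdir: str,
    user_prefix: str,
    scan_number: int,
    suffix: str = ".nxs",
) -> PurePosixPath:
    """Combine user naming with an enforced scan number inside the visit."""
    subdir = PurePosixPath(user_subdir)
    # Reject anything that could climb out of the visit directory
    if subdir.is_absolute() or ".." in subdir.parts:
        raise ValueError(f"subdirectory {user_subdir!r} escapes the visit")
    if not user_prefix or "/" in user_prefix or user_prefix in (".", ".."):
        raise ValueError(f"invalid file prefix {user_prefix!r}")
    # The scan number is always appended, so names cannot collide
    return PurePosixPath(visit_dir) / subdir / f"{user_prefix}_{scan_number}{suffix}"


if __name__ == "__main__":
    print(build_data_path("/data/cm12345-1", "screening", "lysozyme", 42))
```

As noted elsewhere in the thread, a check like this only guards the path built through this code; it does not stop a plan from setting a detector's file path directly.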
I finally got around to reading all this - I'm in favour of this way |
Adding some thoughts as requested by @DiamondJoseph. I'm coming into this with limited understanding, so only have some high level thoughts. IIUC this isn't a proposal to allow users to name data files whatever they wish; there will be guardrails including restricting data writing within the visit directory, and using numtracker. All of this sounds reasonable. What I'm not sure about is Joseph's point about:
and whether it's addressed by the subsequent responses from Graeme and Callum, or other access restrictions in place? If not, I would share the concern that this is a rather large hole from a security perspective. The other argument for the use of a system-controlled standardised path/file naming scheme is ease of search and retrieval of data, though perhaps this is covered by metadata? |
@cheryntand thanks for your thoughts. To a degree, this will only be security through obscurity. There is nothing to stop you from running:

    yield from bps.abs_set(detector.hdf.file_path, "someone/elses/visit/directory")

This is not a flaw in bluesky but a consequence of the detector control systems being too permissive. I don't know if that's an addressable issue (question for controls) but it's beyond the scope of this ticket. That is the only way to truly address @DiamondJoseph's security concern. What we can do is make
I believe the discussion here is: If we can restrict users to a particular visit directory, should we let them name their files what they like? Should we enforce a unique scan number in the files? Should we give them no control at all and name a file exactly according to some standard?
Also, complete aside but I just thought of a great compromise (because like all good compromises it will annoy everyone): Have a very strict file naming convention and a convenient script that makes symlinks based on users' preferences. |
I have just seen this
In this scenario, what is determining the number to use? Is there something iterating through every file in the directory to find the next unused number for each scan?
Otherwise, based on the comments here, I've added proof of concept changes to numtracker to allow custom placeholders and additional templates. Together I think these would allow for all the use cases here (other than reinitialising the count in each subdirectory above). Any thoughts on the proposed changes are welcome. This would allow users to specify a custom name (via a |
Yes, Line 541 in 611af93
We may be able to live with not having this. |
From discussion with Callum today, how about the following:

    # dodal.common.beamlines.beamline_utils
    # When running dodal in a terminal this is the default; it can be overridden, so you can develop without external infrastructure
    PATH_PROVIDER = StaticPathProvider("/tmp/", filename_provider=UUIDFileNameProvider())

    def set_path_provider(): ...

    def get_path_provider(): ...

    # dodal.beamlines.ixx
    # For beamlines that want to name their own files
    filename_templates = {"axis_sample": "{device_name}-{sample_name}-{dataCollectionId}"}

    @filename_factory
    def filename_provider_factory(md: dict[str, Any]) -> Callable[[str], str]:
        # e.g. define N templates in your dodal module
        # Fill in everything except device_name, which is supplied per detector later
        filled_template = filename_templates[md["template_name"]].format_map({**md, "device_name": "{device_name}"})
        return lambda device_name: filled_template.format(device_name=device_name)

    @device_factory()
    def det():
        return Detector(path_provider=get_path_provider())

    # ixx-bluesky.plans
    def my_plan(detectors: list[Readable], motor: Movable, sample_name: str):
        @run_decorator(md={"template_name": "axis_sample", "sample_name": sample_name})
        def inner_plan():
            ...
        yield from inner_plan()

    # blueapi.core.context
    def with_dodal_module(self, module: ModuleType, **kwargs) -> None:
        # default FileNameProvider that uses Diamond standard naming convention `ixx-NNNN`
        filename_factory = get_filename_factory(module) or DefaultFilenameProvider(config().filename_config)
        # Safely using visit for all files written through blueapi but with customisable names
        dodal.common.beamlines.beamline_utils.set_path_provider(VisitPathProvider(filename_factory))
        devices = make_all_devices()
        ...

We override the scan_id_number hook on the run engine and provide it in every Run's metadata. We pre-process all plans as they are run and pass all of the metadata to the filename_factory when a new StartDocument is created, to get the names of the h5 files; if any metadata isn't present, it throws an exception. This is opt-in, and defaults are provided that are guaranteed to be unique. PathProvider remains unable to be accidentally set, but can still be maliciously tampered with.

    # ixx-bluesky.plans
    # Annotates the function with the template
    @filename_template("{device_name}-{sample_name}-{dataCollectionId}")
    def my_plan():
        ...

    # blueapi.worker.task_worker
    def _submit_trackable_task(self, trackable_task: TrackableTask) -> None:
        ...
        filename_provider.set_template(task.plan.__annotations__.get("filename_template", "{device_name}-{dataCollectionId}"))
        ...

This would let the template be defined per plan and alongside the plan, but the call to filename_provider.template(md: dict[str, Any]) isn't obvious. We still get to enforce a PathProvider. We still override the scan_id_number hook on the run engine and pre-process to provide a filename_provider with its Run's metadata. |
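The "throws an exception if any metadata isn't present" behaviour described above can be sketched with the standard library alone. `make_filename_provider` is a hypothetical helper, not a dodal or blueapi function; it checks the template's fields against the run metadata up front and defers only `device_name` to the per-detector call.

```python
import string
from typing import Any, Callable


def make_filename_provider(template: str, md: dict[str, Any]) -> Callable[[str], str]:
    """Validate a filename template against run metadata, then bind it."""
    # Collect the placeholder names used in the template
    fields = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = fields - set(md) - {"device_name"}
    if missing:
        raise KeyError(f"metadata missing for template fields: {sorted(missing)}")

    def provider(device_name: str) -> str:
        # device_name is the only value supplied per detector
        return template.format(**{**md, "device_name": device_name})

    return provider


if __name__ == "__main__":
    provider = make_filename_provider(
        "{device_name}-{sample_name}-{dataCollectionId}",
        {"sample_name": "lysozyme", "dataCollectionId": 1234},
    )
    print(provider("eiger"))
```

Failing at the point where the start document is created, rather than when the detector is prepared, keeps the error close to the plan that supplied the incomplete metadata.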
How would this work for plans that need to create files with different templates? For instance the screening/centring/snapshot/data use case above. |
Aaaaaaah and now we need to be able to modify the PathProvider too :( I was hoping to just get to remove as much custom as possible and just have

    def my_plan(detector: StandardDetector, oav: StandardDetector):
        @run_decorator(md={"template": "{'snapshots/' if device_name == 'oav' else ''}{device_name}-{scan_number}"})
        def inner():
            ...
        yield from inner()
|
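One caveat with a template like the one above: `str.format` does not evaluate expressions, so a conditional inside the braces would need f-string-style evaluation of untrusted text. A lookup table gives the same per-device subdirectory without that. This is only a sketch; `DEVICE_SUBDIRS` and `relative_data_path` are hypothetical names.

```python
from pathlib import PurePosixPath

# Hypothetical per-device subdirectories; devices not listed write to the top level
DEVICE_SUBDIRS = {"oav": "snapshots"}


def relative_data_path(device_name: str, scan_number: int) -> PurePosixPath:
    """Return the path of a device's file relative to the visit directory."""
    subdir = DEVICE_SUBDIRS.get(device_name, "")
    return PurePosixPath(subdir) / f"{device_name}-{scan_number}"


if __name__ == "__main__":
    print(relative_data_path("oav", 12))
    print(relative_data_path("eiger", 12))
```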
We now have an instance of numtracker deployed at https://numtracker.diamond.ac.uk. We should write a PathProvider iterating on StaticVisitPathProvider to make use of it.
Acceptance Criteria