Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create PathProvider that uses numtracker #991

Open
callumforrester opened this issue Jan 15, 2025 · 44 comments · May be fixed by #1085
Open

Create PathProvider that uses numtracker #991

callumforrester opened this issue Jan 15, 2025 · 44 comments · May be fixed by #1085
Labels
enhancement New feature or request

Comments

@callumforrester
Copy link
Contributor

callumforrester commented Jan 15, 2025

We now have an instance of numtracker deployed at https://numtracker.diamond.ac.uk. We should write a PathProvider iterating on StaticVisitPathProvider to make use of it.

Acceptance Criteria

  • TBD
@callumforrester callumforrester added the enhancement New feature or request label Jan 15, 2025
@callumforrester
Copy link
Contributor Author

@DiamondJoseph Where are we going to get visit information from? Does it still have to be hard-coded for the time being?

@DiamondJoseph
Copy link
Contributor

@callumforrester I believe we thought that visit should be passed as part of the request to blueapi- or an extra param when logging in that is cached and passed with every request? Since could collect data to commisioning visit then switch to live visit after beamline configured?

@callumforrester
Copy link
Contributor Author

In which case we start here? DiamondLightSource/blueapi#552

@DiamondJoseph
Copy link
Contributor

I believe that ticket is passing the metadata into the documents, not how it is being passed into the run/pod. (e.g. instrument could be a configuration value that gets mounted as an env var: it doesn't need to be passed into each run command, scan number is gotten from numtracker, and shouldn't (?) be overridden).

I think the ticket you're looking for is #452

@DiamondJoseph
Copy link
Contributor

Although #452 probably belongs in blueapi

@callumforrester
Copy link
Contributor Author

To get this straight...

The current problem is that we have a messy decorator on our plans that triggers the global PathProvider defined in dodal to update itself. This breaks if you are not using ophyd-async devices (see DiamondLightSource/blueapi#784) and is less portable.

The current idea is to write a PathProvider that hooks into the RunEngine's start hook meaning that whenever a run starts, the scan counter increments. See below:

sequenceDiagram
    Client->>RunEngine: run plan
    RunEngine->>Scan Number Source: next number?
    Scan Number Source->>RunEngine: <number>
    RunEngine->>Detector: prepare
    Detector->>PathProvider: path?
    PathProvider->>RunEngine: number?
    RunEngine->>PathProvider: <number>
    PathProvider->>Detector: <path(<number>)>
    RunEngine->>Client: run start(<number>)
Loading

For added context we are thinking that the scan number source will be https://github.com/DiamondLightSource/numtracker

We would create and inject this provider in blueapi (where the RunEngine is), dodal modules would default to a static path provider that writes to /tmp, for example, and blueapi overrides it.

Take a simple use case with a single plan, running a single scan (run) using a single detector. This would call the scan number source once, get a new number and produce an HDF5 file with that number, and a RunStart document with that same number.

def very_simple_plan(det: StandardDetector) -> MsgGenerator[None]:
    yield from count(det, num=5)

# Output files
# |-> ixx-1234-det.h5
# |-- ixx-1234.nxs

Downstream we would process documents associated with each RunStart into a single NeXus file so this scenario would still result in a single NeXus file with the same scan number (1234).

Bluesky supports multiple runs per plan, for example:

def simple_plan(det: StandardDetector) -> MsgGenerator[None]:
    yield from count(det, num=5)
    yield from count(det, num=10)

# Output files
# |-> ixx-1235-det.h5
# |-- ixx-1235.nxs
#
# |-> ixx-1236-det.h5
# |-- ixx-1236.nxs

Each count plan separately prepares and unstages the detectors, meaning that a new file and scan number is produced for each run.

The other use case we need to support is multiple runs linking to the same file, primarily for detector performance reasons. For example

def plan(det: StandardDetector) -> MsgGenerator[None]:
    yield from stage(det)

    # Actually causes the HDF5 AD plugin to open the file for writing
    yield from prepare(det, ...)
    for num_frames in [5, 8, 2]:
        yield from bps.open_run()
        for i in num_frames:
            yield from bps.one_shot(det)
        yield from bps.close_run()
    yield from unstage(det) 

# Output files
# |-> ixx-1237-det.h5
# |-- ixx-1237.nxs
# |-- ixx-1238.nxs
# |-- ixx-1239.nxs

This leaves us with default behaviour that is largely customizable.

  • Every run has a unique ID in accordance with DLS scan numbers if using blueapi.
  • Every run results in a NeXus file, can change by configuring the downstream NeXus writer service or writing your own.
  • The HDF5 file name is derived from only one of the run numbers when multiple runs reference the same file. Can be changed by writing your own PathProvider that names things accordingly.
  • For developing/debugging devices directly in a Python shell, dodal has a default global PathProvider that writes to a sensible default location (e.g. /tmp). We would need to document this clearly. Can be changed by setting the global singleton prior to instantiating devices.

Tagging @DominicOram and @olliesilvester for thoughts about how well this would support MX use cases.

@DominicOram
Copy link
Contributor

This works for some of our usecases but we will also need the ability for one run to produce one hdf file but multiple nexus files. This is because we have one hardware fly scan that we want to split into two nexus files. We do this currently by doing something like:

def plan(det: StandardDetector) -> MsgGenerator[None]:
    yield from stage(det)

    # Actually causes the HDF5 AD plugin to open the file for writing
    yield from prepare(det, ...)
    yield from bps.open_run(md={"nxs_indexes": [0, 100]})
    yield from bps.kickoff(det)
    yield from bps.kickoff(complete)
    yield from bps.close_run()
    yield from unstage(det) 

# Output files
# |-> sample_name_1_000001.h5
# |-> sample_name_1_000002.h5
# |-- sample_name_1.nxs
# |-- sample_name_2.nxs

I think this is covered by Every run results in a NeXus file, can change by configuring the downstream NeXus writer service or writing your own.

There is also the added complication, which you haven't covered above that the Eiger will actually spit out multiple h5 files as it will only put max 1000 frames in each. I don't think this makes a difference to your assumptions but worth mentioning.

@callumforrester
Copy link
Contributor Author

Good point, yes we can effectively decouple the hdf5 files from the nexus files using your suggested approach, so I think everything still works.

@callumforrester
Copy link
Contributor Author

Slight update to the diagram that makes use of numtracker concrete:

sequenceDiagram
    Client->>RunEngine: run plan
    RunEngine->>Numtracker: next number?
    Numtracker->>RunEngine: <number>
    RunEngine->>Detector: prepare
    Detector->>PathProvider: path?
    PathProvider->>RunEngine: number?
    RunEngine->>PathProvider: <number>
    PathProvider->>Numtracker: file path(<number>, <detector>)?
    Numtracker->>PathProvider: <file path>
    PathProvider->>Detector: <file path>
    RunEngine->>Client: run start(<number>)
Loading

@tpoliaw, sanity check? :)

@tpoliaw
Copy link
Contributor

tpoliaw commented Jan 24, 2025

If I understand the flowchart correctly, I don't think this will work with the current service API.

You have two calls to numtracker (next number? and file path(<number>, <detector>)?) where it currently only expects a single call per scan (for GDA's definition of 'scan' - per run?).

With the current setup, I think the plan needs to know up front the detectors that will be used in the collection when it requests scan information.

@callumforrester
Copy link
Contributor Author

With the current setup, I think the plan needs to know up front the detectors that will be used in the collection when it requests scan information.

So previously that wasn't possible because the plan itself didn't know at that point. Following bluesky/ophyd-async#314 we may be able to access this information from the plan at the right time (although I'm not sure if we can access it from the RunEngine. I'm uncertain if that's what we want or if we are specifically trying to keep the decoupling between the plan and the detectors.

@coretl
Copy link
Collaborator

coretl commented Feb 3, 2025

Just checking that all 3 use cases from bluesky/bluesky#1849 will be covered by this...

@callumforrester
Copy link
Contributor Author

@coretl they are as far as we can see

@callumforrester
Copy link
Contributor Author

callumforrester commented Feb 4, 2025

After discussion with @tpoliaw and @DiamondJoseph we think that a single call to numtracker per run is worth investigating. We would need to know the detectors (that might be) involved upfront, include them in our query, and cache the result in a PathProvider. I tried to draw a sequence diagram that describes all of this, previously I have abridged them because they quickly get huge but that keeps causing confusion, so this is everything that I think is going on:

sequenceDiagram
    Client->>RunEngine: run plan(visit)
    RunEngine->>Detector1: stage
    RunEngine->>Detector2: stage
    RunEngine->>RunEngine: on_run_start
    RunEngine->>PathProvider: scan_id_source(staged_devices, visit)
    PathProvider->>Numtracker: new_scan(staged_devices, visit)
    Numtracker->>PathProvider: {"scanNumber": <id>, "scanFile": /data/<beamline>-<id>, {"detector1": <path>/d1, "detector2": <path>/d2}}
    PathProvider->>RunEngine: <id>, /data/<beamline>-<id>
    RunEngine->>NexusWriter: run_start(<id>, /data/<beamline>-<id>)
    RunEngine->>NexusWriter: descriptor(detector1, detector2)
    RunEngine->>Detector1: prepare
    Detector1->>PathProvider: get_path(detector1)
    PathProvider->>Detector1: <path>/d1
    RunEngine->>Detector2: prepare
    Detector2->>PathProvider: get_path(detector2)
    PathProvider->>Detector2: <path>/d2
    RunEngine->>NexusWriter: stream_resource(detector1)
    RunEngine->>NexusWriter: stream_resource(detector2)
    RunEngine->>Detector1: kickoff
    RunEngine->>Detector2: kickoff
    RunEngine->>NexusWriter: stream_datum(detector1)
    NexusWriter->>NexusWriter: create /data/<beamline>-<id>.nxs
    RunEngine->>NexusWriter: stream_datum(detector2)
    RunEngine->>NexusWriter: stream_datum(detector1)
    RunEngine->>NexusWriter: stream_datum(detector2)
    RunEngine->>Detector1: complete
    RunEngine->>Detector2: complete
    RunEngine->>NexusWriter: run_stop(<id>)
    RunEngine->>Client: Done
    NexusWriter->>NexusWriter: close /data/<beamline>-<id>.nxs
Loading

The upside of this is that numtracker is guaranteed to give you a unique scan ID for every call you make to it. It is impossible to query numtracker about a preexisting scan and accidentally overwrite your data. The downside is that it is impossible to query numtracker about a preexisting scan, but we couldn't actually think of a use case for that, if anyone can, please post it here. This also means we have to do some caching and bookkeeping in our PathProvider rather than defer that functionality to numtracker. On the other hand making a single call to numtracker per run is less fragile than per detector.

There are a few things the diagram glosses over:

  1. There is no way of getting the list of staged devices to the PathProvider, we would have to implement one, probably we should just include it in RE.md, which is passed to scan_id_source
  2. The list would include devices other than detectors, which we would also pass to numtracker and then ignore
  3. We would need to mutate the metadata dictionary passed into scan_id_source to include the file path, so it can be injected into the start document. Not a major showstopper but ugly.
  4. The PathProvider does its own bookeeping (maintaining a cache of the detector -> path mapping), as does the NexusWriter (appending .nxs to the scan file name). Neither of them have to do much bookeeping so this may be acceptable

@callumforrester
Copy link
Contributor Author

callumforrester commented Feb 5, 2025

Steps to accomplish this:

  • Add a property to the RunEngine to get access to staged devices
  • Make scan_id_source on the RunEngine an async function
  • Make Blueapi require passing the current visit as part of the request to run a plan, which is stored as part of a Task, and passed into the run_engine.md as part of do_task

<somehow pass the staged_devices to the either the scan_id_source method, or to the NumTrackerPathProvider>

  • Define a new NumTrackerPathProvider in blueapi, with an update function which has the signature of scan_id_source: when called this makes a call to the NumTracker service to get a new data collection id and paths for all of the staged devices in the visit that is part of the metadata, storing this as a mapping from device_name to Path that can be fetched from the PathProvider when called with the name of the device.

  • Add optional configuration to blueapi that when set will allow an instance of the NumTrackerPathProvider to be instantiated and to call dodal.set_path_provider prior to importing any dodal beamline modules.

  • Define a default PathProvider in dodal- this can use the existing StaticPathProvider, UUIDFileNameProvider in ophyd-async (names to be checked) to write data to /tmp/.h5 when running dodal from the command line (e.g. RE(count(i22.saxs()))

  • Remove the PathProviders that are constructed in dodal beamline modules

  • Joseph wrote this. 😎

@callumforrester
Copy link
Contributor Author

callumforrester commented Feb 5, 2025

Following the plan above I have made a mockup of how we might expect to integrate all this, including comments explaining changes needed to dependencies and potential issues: https://gist.github.com/callumforrester/5e107e40034e198aa5d99beadc15c57a

It includes tests for all of @coretl's use cases

Output when I run it:

================================
Running plan one_run_one_file
New run opened:
{'scan_file': '/data/cm-123456/0',
 'scan_id': 0,
 'uid': '03822b78-7955-42c3-a9e4-cf26faefec41'}
d1 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d1-0'), filename='d1-0.h5', create_dir_depth=0)
d2 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d2-0'), filename='d2-0.h5', create_dir_depth=0)
Plan complete: runs: ('03822b78-7955-42c3-a9e4-cf26faefec41',)

================================
Running plan different_runs_different_files
New run opened:
{'scan_file': '/data/cm-123456/1',
 'scan_id': 1,
 'uid': '7e82a10f-7ace-4465-a97e-6f55d40a1d49'}
d1 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d1-1'), filename='d1-1.h5', create_dir_depth=0)
New run opened:
{'scan_file': '/data/cm-123456/2',
 'scan_id': 2,
 'uid': '59512a6e-dd32-464f-b63e-6f254436291b'}
d2 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d2-2'), filename='d2-2.h5', create_dir_depth=0)
d3 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d3-2'), filename='d3-2.h5', create_dir_depth=0)
New run opened:
{'scan_file': '/data/cm-123456/3',
 'scan_id': 3,
 'uid': '7eb9dbef-1137-4aca-980b-eaadd732d2ea'}
d1 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d1-3'), filename='d1-3.h5', create_dir_depth=0)
d2 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d2-3'), filename='d2-3.h5', create_dir_depth=0)
d3 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d3-3'), filename='d3-3.h5', create_dir_depth=0)
Plan complete: runs: ('7e82a10f-7ace-4465-a97e-6f55d40a1d49', '59512a6e-dd32-464f-b63e-6f254436291b', '7eb9dbef-1137-4aca-980b-eaadd732d2ea')

================================
Running plan different_runs_one_file
New run opened:
{'scan_file': '/data/cm-123456/4',
 'scan_id': 4,
 'uid': 'cc54a548-a36e-497e-aad1-ac7ceb048f26'}
d1 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d1-4'), filename='d1-4.h5', create_dir_depth=0)
d2 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d2-4'), filename='d2-4.h5', create_dir_depth=0)
d3 --> PathInfo(directory_path=PosixPath('/data/cm-123456/d3-4'), filename='d3-4.h5', create_dir_depth=0)
New run opened:
{'scan_file': '/data/cm-123456/5',
 'scan_id': 5,
 'uid': '2236c21f-0068-44dc-9948-c92b7b16b4cd'}
New run opened:
{'scan_file': '/data/cm-123456/6',
 'scan_id': 6,
 'uid': '8c0ee2d9-48bf-49d8-a458-bfe23837c49b'}
Plan complete: runs: ('cc54a548-a36e-497e-aad1-ac7ceb048f26', '2236c21f-0068-44dc-9948-c92b7b16b4cd', '8c0ee2d9-48bf-49d8-a458-bfe23837c49b')

@DominicOram
Copy link
Contributor

Make Blueapi require passing the current visit as part of the request to run a plan, which is stored as part of a Task, and passed into the run_engine.md as part of do_task

Not saying I'm necessarily disagreeing with this but we are painting ourselves into a corner. If I expose a bluesky plan for changing energy why must I provide a visit to call this? I can't see how it will actually be used? Can we make it optional?

Apologies, I should have added this earlier but we currently have the behavior of:

  1. The location of where files are stored is a function of visit, sample name, experiment type and run number. The above seems to assume it's only a function of visit and run number. In fact, in the general GDA case users will provide a file pattern based on lots of different metadata. I assume we can sort this by implementing our own PathProvider?
  2. We currently actually don't have a global run number. Run number to us is just incremented per directory so we could store sample_1.hdf in the xraycentring folder then next collection will be sample_1.hdf in the collection folder. Is this use case covered?

Both of these are only the current behavior, we can push back on to the scientists on it but it will cost us political good will.

@tpoliaw
Copy link
Contributor

tpoliaw commented Feb 5, 2025

The location of where files are stored is a function of visit, sample name, experiment type and run number

Do you have an example of how these are combined to build the path? How are your files currently named?

@DominicOram
Copy link
Contributor

For Hyperion we have:

  • /dls/i03/data/2025/mx23694-127/xraycentring/auto/TestThaumatin/thau_1/thau_1_1_000001.h5. Which is the xray centring done for visit mx23694-127, protein named TestThaumatin and sample name thau_1.
  • /dls/i03/data/2025/mx23694-127/auto/TestThaumatin/thau_1/thau_1_1_000001.h5. Which is the actual collection done on this
  • /dls/i03/data/2025/mx23694-127/auto/screen/TestThaumatin/thau_1/thau_1_1_000001.h5. Which is a screening collection done on it
  • Maybe not relevant but we then also have snapshots for each of these, which are the optical camera images used for diagnostics. These are stored in a snapshots subfolder of all the above with the format 154645_oav_snapshot_270.png where 154645 is a timestamp and 270 is the omega angle the snapshot was taken at.

In GDA there is a field where a user can specify something like {sample_name}_bob_{protein_name}123 which will cause files to be written with the name thau_1_bob_TestThaumatin123_000001.h5

I'm not defending the folder structure as it stands, it's a mess. But if we change it we break a hell of a lot of downstream processing and make it harder for users to switch from a GDA experiment to a bluesky experiment seamlessly. This is why we haven't changed it so far but we may try and sanitize it soon now we're mostly bluesky on i03.

@tpoliaw
Copy link
Contributor

tpoliaw commented Feb 5, 2025

  • Is the second 1 in thau_1_1_000001 the scan number? So something like thau_1_23452_000001 would be possible with the same naming scheme?
  • Is the 000001 the suffix added by odin or is that something you're setting?
  • Does the template used vary between detectors? Do you ever have multiple detectors in the same scan/collection that need different templates?
  • Are the OAV images linked to specific scans at all other than by comparing the times they were collected?

The way the current path templating is set up, the file for a detector is a function of scan_number, subdirectory, detector and beamline where scan_number is required to ensure uniqueness of files between scans.

If including sample information is needed in the filename (as well as just in the visit subdirectory) we could look at having arbitrary keys in the template. Would need to look into where the values are sourced from. As far as I know, bluesky doesn't have an equivalent to LocalProperties, does it?

@callumforrester
Copy link
Contributor Author

Yes, it's a case we can either support or support you writing your own PathProvider for.

As far as painting outselves into a corner goes: We can always make the visit an optional parameter. Then if someone doesn't pass it and just wants to move a motor, all good, if they don't pass it but they need to record data, blueapi will return an error.

@DominicOram
Copy link
Contributor

Is the second 1 in thau_1_1_000001 the scan number? So something like thau_1_23452_000001 would be possible with the same naming scheme?

Yes, it's just to avoid duplicates. thau_1_23452_000001 is probably fine, though I suspect we will get some grumbling from the scientists.

Is the 000001 the suffix added by odin or is that something you're setting?

Yes, added by odin. Odin only saves 1000 frames per file so will create a number of 00000X files.

Does the template used vary between detectors? Do you ever have multiple detectors in the same scan/collection that need different templates?

Currently we don't have experiments with multiple detectors at the same time but we will soon and it would aid in debugging to be able to name them separately.

Are the OAV images linked to specific scans at all other than by comparing the times they were collected?

They are linked in ispyb.

If including sample information is needed in the filename (as well as just in the visit subdirectory) we could look at having arbitrary keys in the template.

Yes, this would be good.

Would need to look into where the values are sourced from.

Can we source it out of event documents? Most of it will be in the run metadata but I can see usecases for having it in events e.g. I read a value from hardware and I want to base my filename on that.

Yes, it's a case we can either support or support you writing your own PathProvider for.

Great, we'd be happy to write our own PathProvider. Just as long as we can hook it into the event stream.

We can always make the visit an optional parameter. Then if someone doesn't pass it and just wants to move a motor, all good, if they don't pass it but they need to record data, blueapi will return an error.

Yes please, thanks!

@tpoliaw
Copy link
Contributor

tpoliaw commented Feb 6, 2025

If I expose a bluesky plan for changing energy why must I provide a visit to call this?

I think that requiring someone to be on a current visit in order to move motors is a fair restriction. In what scenario should someone be able to control a beamline they're not associated with? I know epics still provides access to anyone on the beamline network but if we're taking that approach there shouldn't be any authz in bluesky/api.

@DiamondJoseph
Copy link
Contributor

With the future assumption that logging into blueapi will prompt you with a number of visits to choose from and then your visit will be cached alongside your access token and passed into every request

@DominicOram
Copy link
Contributor

With the future assumption that logging into blueapi will prompt you with a number of visits to choose from and then your visit will be cached alongside your access token and passed into every request

As discussed in person. I'm happy for specifying the visit with every call as long as GDA does this for me i.e. I can call run_plan() without specifying a visit and this just works.

@DiamondJoseph
Copy link
Contributor

We'll have to re-generate the Java Blueapi client in GDA when we make a change to the API, when that gets done we'll be sure to propagate the visit from GDA.

@callumforrester
Copy link
Contributor Author

callumforrester commented Feb 7, 2025

Here is a new, slightly different version of the gist based on various discussions: https://gist.github.com/callumforrester/5e107e40034e198aa5d99beadc15c57a?permalink_comment_id=5429079#gistcomment-5429079

This is not a definite direction but an RFC after talking to @noemifrisina, @DominicOram, @abbiemery and @tpoliaw

Philosophy

I think there are enough use cases for different file writing conventions that we may not want to centralise the templating logic in numtracker. This version of the gist keeps it in a Python plan decorator, acknowledging that every plan is special/custom since we don't have one way to do things.

In this scenario numtracker provides a scan number and a visit directory (referred to in the gist as data_session_directory).

Implementation

Decorating your plan effectively customises the way it tells its detectors to write files. The decorator has a hook that allows customising the paths per detector. The following is passed into the hook:

  • The RunStart document, including all its metadata (as an aside, we may also be able to pass in the descriptor if people wanted to have different file names for different streams)
  • A set of devices that are currently staged, that should include all detectors involved

Every scan still has the following metadata, sourced from numtracker:

  • scan_id: Unique scan number
  • data_session: What we at Diamond call either a visit or instrumentSession (we are making efforts to standardise on the latter).
  • data_session_directory: Where to find the data for the given data session

This version of the gist also implements @DominicOram's suggestion of making the visit an optional parameter, though that may need more discussion too.

Evaluation

The simultaneous pro and con of this approach is that it provides a big toolbox with many components that can be assembled in many ways, which means it is flexible enough to cover all use cases we have seen so far, but also easy to mess up. We have this problem currently with attach_data_session_metadata_decorator, but this gist makes some improvements on the error handling.

@DiamondJoseph
Copy link
Contributor

I'm hugely opposed to this implementation. Allowing the ability to override where the data goes leaves us open for a litany of data security issues, and requiring that a plan is decorated leaves us open to significant opportunities for things to go wrong.

  • If a plan isn't decorated we risk overriding existing data on disk. This could be done accidentally or maliciously, in contravention of our open data policy.
  • If we allow overriding a naming convention, then we again allow overriding data on disk.
  • If we allow constructing your own output Path then we immediately provide a backdoor around authorizing your ability to write to a visit, again allowing for data overwriting: accidentally or maliciously, and not just within the visit but over every directory that the detectors have mounted.

I think we Must enforce a PathProvider and FilenameProvider implementation, with some configuration (as currently expressed in the NumTracker Templating) to allow for common cases (e.g. detectors in subdirs). There is an oncoming train of data integrity that having to adjust to differently named files is the price of not risking much worse consequences.

@DominicOram
Copy link
Contributor

The following is passed into the hook:

Does this rule out being able to name files after data read from the beamline e.g. {sample_name}-{measured_temperature}?

risk overriding existing data on disk

I can't speak for all detectors but the version of odin we use will stop you from doing this.

There is an oncoming train of data integrity that having to adjust to differently named files is the price of not risking much worse consequences.

As I said above this will have political consequences. It will be viewed by our users as something that GDA can do and Bluesky cannot. If we're going to make this decision then we will need to canvas some representative scientists from all teams as to whether it's acceptable.

@DiamondJoseph
Copy link
Contributor

I'm OK with NexGen having a different scheme for naming NeXus files, as I trust you to ensure that they are non-overlapping, but I do not think the decision on what h5 files are named should be accessible to users. I'd like to get input from LIMS and CyberSecurity teams first: find out what is feasible first.

@graeme-winter
Copy link
Collaborator

@DiamondJoseph users absolutely have freedom to name their files as they see fit: this is not a matter for cyber security

For unattended data collection we have transformations from what they request via ISPyB but that is still essentially "free"

Preventing data overwriting is an implementation problem, which is already done by GDA with "run numbers"

@DiamondJoseph
Copy link
Contributor

@graeme-winter It's the run numbers that we are trying to enforce use of here: if I can define my own PathProvider/FilenameProvider, then I can name my file however I want: I can name every single file ixx-0001.h5 if I choose to.

@callumforrester
Copy link
Contributor Author

@graeme-winter If GDA prevents overwriting with run numbers how do users have the freedom to name their files as they see fit? Isn't GDA enforcing a unique run number removing some of those freedoms by definition?

The second approach I prototyped really does give the user the power to name their files whatever they want, including overriding the run number, because it's just a Python hook, which is why I'm collecting opinions.

The middle ground here, proposed by @tpoliaw is to allow multiple naming schemes but make numtracker responsible for all of them, ensuring unique file naming but raising the barrier to entry if you want it to use your custom schemes.

Important note: I am taking no side here! I am collecting requirements/opinions. They are all valued so thank you.

@graeme-winter
Copy link
Collaborator

@graeme-winter If GDA prevents overwriting with run numbers how do users have the freedom to name their files as they see fit? Isn't GDA enforcing a unique run number removing some of those freedoms by definition?

User defines folder and prefix, GDA appends run numbers to ensure uniqueness

To be clear, data also constrained to be within the visit

We are also not allowing the files to be called

f̸̪͚́̋̕ó̵̤̭̅ö̸̧́̽̚b̸̠̯̀͊̕ͅă̴̜̭̭̚r̵̲͚̎_̵̖́1̶̛͙̄͋.̶̺̈́̈́n̸͚̰̠̾̑͘x̸̬̤͈́̚s̷͚̲̯̀

But equally we are not insisting that the files are called 2025-02-10-15-03-01-scan1-eiger.nxs

@DiamondJoseph
Copy link
Contributor

DiamondJoseph commented Feb 10, 2025

The proposed templating requirements for a h5 file are:

DetectorTemplate {
"A template describing the location within a visit directory where ",
"the data for a given detector should be written",
"It should contain placeholders for {detector} and {scan_number} ",
"to ensure paths are unique between scans and for multiple ",
"detectors."
}

And for the NeXus file:

ScanTemplate {
"A template describing the location within a visit directory where the root scan file should be written. ",
"It should be a relative path and contain a placeholder for {scan_number} to ensure files are unique."
}

@callumforrester
Copy link
Contributor Author

@graeme-winter makes sense, I think you're headed in the direction of @tpoliaw's middle ground then: Users submit their own file names, numtracker's job is to append a scan number to ensure they are unique. Odin refusing to overwrite file names is also a good data protection measure, don't know if AD does the same, or can be made to.

This is definitely an option: it gives us parity with GDA, customisation (which always makes users happy) and keeps some degree of security.

@DiamondJoseph's point of view also deserves a fair hearing though, if cybersecurity want to tell us that users shouldn't be allowed to name files what they like then that's their call.

Again, this is just me trying to push decisions to people who are actually paid to make decisions.

@olliesilvester
Copy link
Collaborator

Users submit their own file names, numtracker's job is to append a scan number to ensure they are unique.

I finally got around to reading all this - I'm in favour of this way

@cheryntand
Copy link

Adding some thoughts as requested by @DiamondJoseph. I'm coming into this with limited understanding, so only have some high level thoughts.

IIUC this isn't a proposal to allow users to name data files whatever they wish; there will be guardrails including restricting data writing within the visit directory, and using numtracker. All of this sounds reasonable.

What I'm not sure about is Joseph's point about:

If we allow constructing your own output Path then we immediately provide a backdoor around authorizing your ability to write to a visit, again allowing for data overwriting: accidentally or maliciously, and not just within the visit but over every directory that the detectors have mounted

and whether it's addressed by the subsequent responses from Graeme and Callum, or other access restrictions in place? If not, I would share the concern that this is a rather large hole from a security perspective.

The other argument for the use of a system-controlled standardised path/file naming scheme is ease of search and retrieval of data, though perhaps this is covered by metadata?

@callumforrester
Copy link
Contributor Author

@cheryntand thanks for your thoughts.

To a degree, this will only be security through obscurity. There is nothing to stop you from running:

yield from bps.abs(detector.hdf.file_path, "someone/elses/visit/directory")

This is not a flaw in bluesky but a consequence of the detector control systems being too permissive. I don't know if that's an addressable issue (question for controls) but it's beyond the scope of this ticket. That is the only way to truly address @DiamondJoseph's security concern.

What we can do is make PathProvider only allow a restricted set of filenames/paths, supported by numtracker that ensures uniqueness. But again that's the standard path to take. It is possible using this to restrict users to their own visit directory, as long as they are using PathProvider.

I believe the discussion here is: If we can restrict users to a particular visit directory should be let them name their files what they like? Should we enforce a unique scan number in the files? Should we give them no control at all and name a file exactly according to some standard?

Also, complete aside but I just thought of a great compromise (because like all good compromises it will annoy everyone): Have a very strict file naming convention and a convenient script that makes symlinks based on users' preferences.

@tpoliaw
Copy link
Contributor

tpoliaw commented Feb 13, 2025

I have just seen this

We currently actually don't have a global run number. Run number to us is just incremented per directory so we could store sample_1.hdf in the xraycentring folder then next collection will be sample_1.hdf in the collection folder. Is this use case covered?

In this scenario, what is determining the number to use? Is there something iterating through every file in the directory to find the next unused number for each scan?


Otherwise, based on the comments here, I've added proof of concept changes to numtracker to allow custom placeholders and additional templates. Together I think these would allow for all the use cases here (other than the reinitialising the count in each subdirectory above) - any thoughts on the proposed changes are welcome.

This would allow users to specify a custom name (via a {filename} placeholder or similar) while still ensuring files are unique and in the correct directory (assuming plans go through the service and don't talk to epics devices directly).

@DominicOram
Copy link
Contributor

In this scenario, what is determining the number to use? Is there something iterating through every file in the directory to find the next unused number for each scan?

Yes,

def get_run_number(directory: str, prefix: str = "") -> int:
, which I don't particularly like but it was the quickest way to get us moving.

other than the reinitialising the count in each subdirectory above

We may be able to live with not having this.

@DiamondJoseph
Copy link
Contributor

From discussion with Callum today, how about the following:

# dodal.common.beamlines.beamline_utils

# When running dodal in terminal, this is the default, can be overriden: develop without external infrastructure
PATH_PROVIDER = StaticPathProvider("/tmp/", filename_provider=UUIDFileNameProvider())

def set_path_provider():
def get_path_provider():

#dodal.beamlines.ixx

# For beamlines that want to name their own files
filename_templates = {"axis_sample": "{device_name}-{sample_name}-{dataCollectionId}"}


@filename_factory
def filename_provider_factory(md: dict[str, Any]) -> Callable[[str], str]:
    # e.g. define N templates in your dodal module
    filled_template = filename_templates[md["template_name"]].format_map({**md, "device_name": "{device_name}"})
    return lambda device_name: filled_template.format(device_name)

@device_factory()
def det():
    return Detector(path_provider=get_path_provider()

# ixx-bluesky.plans

def my_plan(detectors: list[Readable], motor: Movable, sample_name: str):
    @run_decorator(md={"template_name": "axis_sample", "sample_name": sample_name})
    def inner_plan():
        ...
    yield from inner_plan()

# blueapi.core.context
def with_dodal_module(self, module: ModuleType, **kwargs) -> None:
    # default FileNameProvider that uses Diamond standard naming convention `ixx-NNNN`
    filename_factory = get_filename_factory(module) or DefaultFilenameProvider(config().filename_config)
    # Safely using visit for all files written through blueapi but with customisable names
    dodal.common.beamlines.beamline_utils.set_path_provider(VisitPathProvider(filename_factory)
    devices = make_all_devices()
    ...
   

We override the scan_id_number hook on the run engine and provider it in every Run's metadata. We pre-process all plans as they are run and pass all of the metadata to the filename_factory when a new StartDocument is created, to get the names of the h5 files- if any metadata isn't present, throws an exception. This is opt in and defaults are provided that are guaranteed to be unique. PathProvider remains unable to be accidentally set, but can still be maliciously tampered with.

# ixx-bluesky.plans

# Annotates the function with the template
@filename_template("{device_name}-{sample_name}-{dataCollectionId}")
def my_plan():
    ...

# blueapi.worker.task_worker

def _submit_trackable_task(self, trackable_task: TrackableTask) -> None:
    ...
    filename_provider.set_template(task.plan.__annotations__.get("filename_template", "{device_name}-{dataCollectionId}")
    ...

This would let the template be defined per plan and alongside the plan, but the call to filename_provider.template(md: dict[str, Any]) isn't obvious. We still get being able to enforce a PathProvider. We still override the scan_id_number hook on the run engine and pre-process to provide a filename_provider its Run's metadata.

@tpoliaw
Copy link
Contributor

tpoliaw commented Feb 17, 2025

# Annotates the function with the template
@filename_template("{device_name}-{sample_name}-{dataCollectionId}")
def my_plan():
    ...

How would this work for plans that need to create files with different templates? For instance the screening/centring/snapshot/data use case above.

@DiamondJoseph
Copy link
Contributor

Aaaaaaah and now we need to be able to modify the PathProvider too :(

I was hoping to just get to remove as much custom as possible and just have

def my_plan(detector: StandardDetector, oav: StandardDetector):
    @run_decorator(md={"template": "{'snapshots/' if device_name == 'oav' else ''}{device_name}-{scan_number}")
    def inner():
        ...
    yield from inner()

@abbiemery abbiemery linked a pull request Mar 3, 2025 that will close this issue
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants