Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add File API proposal design document #3101

Merged
merged 3 commits into from
Jun 26, 2023
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
192 changes: 192 additions & 0 deletions docs/design/019-file-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,192 @@
# Introduce a File API module for k6

| | |
| :------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **author** | @oleiade |
| **status** | 🔧 proposal |
| **revisions** | [previous](https://github.com/grafana/k6/pull/3101/commits/0669d16e76791241b75a2622729327880cd814e2), [initial](https://github.com/grafana/k6/pull/3101/commits/0669d16e76791241b75a2622729327880cd814e2) |
| **Proof of concept** | [branch](https://github.com/grafana/k6/tree/poc/experimental/fs-module/js/modules/k6/experimental/fs) |
| **references** | [[#2977](https://github.com/grafana/k6/issues/2977) [[#2974](https://github.com/grafana/k6/issues/2974) |

## Problem definition

The current version of k6 lets users load file content via [the `open` function](https://k6.io/docs/javascript-api/init-context/open/), which is only accessible in the init context. However, the `open` function diverges from its counterparts in other languages and the Linux stack as it reads the whole file into memory rather than opening it for further interaction. This process leads to considerable memory consumption when loading large binary files (as the content ends up copied in each VUs) or when the `SharedArray` cannot be used.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

In line with [our commitment to optimize large file handling in k6](https://github.com/grafana/k6/issues/2974), we propose introducing a new `fs` module. This module is intended to offer an intuitive and user-friendly API for file interactions within k6 scripts. We'll also provide some ideas for efficient file handling to minimize memory consumption during k6 execution.

### Context
Currently, files cannot be opened from within a function executed by a VU, only in the init context.

This is due to k6's design for distributed execution, particularly in the cloud. K6 runs the init context once, gathers resources, including files, and sends them to other instances where VU code runs.

### Assumptions
- The solution should be native to k6, with the code being in the k6 repository and maintained by the k6 OSS team.
- The new functionality's API should be intuitive for users familiar with filesystem APIs from other languages and technologies.
- The module should introduce only essential APIs: read-only operations.
- The proposed API should not affect file handling within HTTP requests.

### Requirements
- The solution should allow file interaction during k6 script execution.
- The solution must use asynchronous operations.
- The module should operate seamlessly in local and cloud setups.
- The module should strive to optimize memory usage when handling files.
imiric marked this conversation as resolved.
Show resolved Hide resolved

### Scope

#### In scope
- The proposed solution will offer an alternative to the existing `open` function.
- The solution will provide a File concept that supports read-only operations: reading file chunks into a buffer, reading the entire file content, and file seeking.
- The proposal could suggest changes to improve overall memory usage in k6 scripts.

#### Out of scope
- This proposal doesn't aim to solve memory consumption issues when using file content in HTTP requests, although minor improvements might be suggested.
- Changes to how k6 gathers resources, particularly files, are out of scope (ideas are proposed in the "Future Improvements" section).
- Despite the proposed API's potential to support them, write operations are beyond this proposal's scope.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

## Proposed solution

We suggest implementing a minimalist, experimental file system (`fs`) module based on [Deno's fs module](https://deno.land/api@v1.32.3?s=Deno.open). The new module will allow users to interact with files, separating text and binary files. The module will provide an `open` function that returns a file handle/view for performing read operations.

The initial API will mostly be asynchronous, except for the `open` functionality which will be synchronous due to the current lack of support for `await` operations within the init context.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

The API will have the following characteristics:
- Load file content exactly once into memory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well .. the underlying implementation already loads it once and if it over 100kb it likely loads a couple of more times. This is blocked on removing afero.

But hopefully this will be a lot better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

Copy link
Contributor

@codebien codebien Jun 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So removing afero is part of the implementation? Should we have a rude but working PoC with it? We should also link the issue here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't argue removing afero is part of the implementation.

afero adds +1 on the times we load it in memory and there is +1-2x if it is fairly big file.

This second memory though is temporarily and will be reused. You can see a more detailed explanation here

So we will have at least 2 copies of the file in memory, which is far from ideal, but definitely a lot better than having a copy per VU.

The whole proposal currently while having some values does not make it a lot more useful to open big files, as the moment you have to use them you still need to start copy. So dropping the 2 copies of this part is likely negligible to getting streams and not copying on the "other side" where this data will be used by something.

I am not against removing afero entirely and would like to prioritise it again, but I don't think this is our biggest problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also keep removing afero out of scope for this design.

Thanks for the heads-up on afero indeed, I didn't have the full context in mind regarding how it handles memory itself 🙇🏻 I'll try to rephrase to be more accurate.

- As opening files is restricted to the init context, each VU receives a copy of each unique file handle, pointing to a unique memory area to avoid copying the whole buffer for each VU.
- Ensure each file handle has unique offsets allowing each VU to work independently.

A working [proof of concept](https://github.com/grafana/k6/tree/poc/experimental/fs-module/js/modules/k6/experimental/fs) of the new API is available on GitHub.

### Proposed API:

```typescript
/*
* open opens a file and returns an instance of a
* `File`.
*/
open(path: string): Promise<File>

/*
* File is an abstraction to interact with
* files which exposes read-only operations.
*/
interface File {
oleiade marked this conversation as resolved.
Show resolved Hide resolved
/*
* read reads the file into an array buffer.
* resolves to either the number of bytes read during the operation
* or `null` if there was nothing to read.
*/
read(p ArrayBuffer | TypedArray | DataView): Promise<number>

/*
* readAll reads the whole content of the file and
* returns a promise that will resolve to its content
* as an `ArrayBuffer`.
*/
readAll(): Promise<ArrayBuffer>
oleiade marked this conversation as resolved.
Show resolved Hide resolved

/*
* Seek to the given `offset` under mode given by `whence`.
* The call resolves to the new position within the resource
* (bytes from the start).
*/
seek(offset: number, whence: SeekMode): Promise<number>

/*
* Resolves to a `FileInfo` describing the file.
*/
stat(): Promise<FileInfo>

/*
* close closes the file.
*/
close(): Promise<void>
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No close ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether we need one or not depends a lot on the underlying implementation, so I was on the verge. As a safety net, and for the API to fill intuitive indeed, it's probably better to have one indeed. Adding it 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think File.close() would be confusing with the current implementation, considering open() returns a slice with the file data.

So what would close() actually do? It certainly wouldn't close the file, as the comment here states. If we have the concept of a FileHandle, then close() could make it unusable from that point on? Not sure what the purpose of that would be, since FileHandles shouldn't have any expensive resources to free.

I get that a counterpart to open() would be intuitive, but since we're not dealing with traditional file handles here, a close() doesn't make sense to me.

Maybe instead of a top-level open() function, a File constructor would be more intuitive? E.g.:

let f = new File('/path');
f.read() ...

This would make it clear that File handles are cheap and disposable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding close I had left it out of the proposal and POC implementation as they were not necessary with the way I approached solving the problem on the implementation side. But this discussion is a good opportunity to mention that, yet, I didn't want to assume what the precise underlying implementation would be. However, I had thought of one potential use of close: the file registry I used in the POC to keep files content in memory, and provide a pointer to the data to VUs, could keep a reference count for each file->vu number, and once a file is not referenced anymore (because all VUs closed it), it could explicitly drop it. But I'm not entirely convinced that's really useful in practice.

Regarding using a constructor as opposed to open, I'm not opposed to it, but I would personally prefer to stick to an API that ressembles what you find in OSes and other languages. On another note that maybe proves interesting somehow: In my initial implementation of the POC, the File struct was actually called FileView to reflect that it's a view of the file content, cannot be modified, and is a "cheap and disposable" handle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think File.close() would be confusing with the current implementation

implementation is the key word here.

And this is an API proposal. If after making this better we need to break the API because now close will need to be called in order for it to work that will be bad.

I would expect that any file you open - you will be able to close. Whether that has a significant impact is a good question.

If there was a different way to tell us that "I no longer need this file" I will be fine. But at this point this is not possible AFAIK. There are some resource management tc39 proposals that I am not going to discuss as they are just proposal and will require that we have a close like method to tell us that a file is no longer needed. So ... not really a solution.

I am okay with this not having close for now - I do expect we will likely need to iterate on all the changes around the epic "support big files in k6" quite a lot. But wanted to lay my reasoning on wanting a close.

Maybe instead of a top-level open() function, a File constructor would be more intuitive? E.g.:

While I am not against it in principal, I don't see why that will be better..

If anything the current API is closer to what is in deno and in other places and consequently might be reusable across platforms. Or more likely easier to support when someone wants to reuse a library. So ... 👎 on my side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that open() and close() only make sense if the file itself is opened and closed. But since we'll only open the file and read all its data once, and each subsequent call to open() will return a view of the data, calling close() on this view would be confusing if people are expecting the file to be closed.

This behavior will not change in a different implementation, since it's the core of what we want to accomplish. So the naming doesn't quite align with the behavior, and repurposing the name used in other frameworks to forcefully align it wouldn't make it more intuitive--quite the contrary. As a user, I'd rather have the API clearly reflect the behavior, than have to remap my existing assumptions of what open() and close() do.

I suggested using a constructor instead of open() since it doesn't imply that a close() would be needed. But I'm not sold on it either, and we could consider alternatives like FileView.

Anyway, this is not a blocker from my side. So if you strongly feel we should keep it as is, I'm fine with it.


/*
* FileInfo provides information about a file.
*/
interface FileInfo {
/*
* the filename of the file
*/
name: string

/*
* the size of the file, in bytes.
*/
size: number
}
```

**N.B**: although text and binary files are distinguished by the top-level `readTextFile` and `readFile` operations, the `File` doesn't offer that distinction at the moment. This is based on the assumption we could somewhat easily add `TextDecoder` support to k6 (see comments of [#2291](https://github.com/grafana/k6/issues/2291) and [#2440](https://github.com/grafana/k6/issues/2440)). If this assumption was to be invalidated, we could adopt the same API format and have two different read-operation variants on the File, or even expose two different kinds of files `TextFileHandle` and `BinaryFileHandle` for instance.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

## Example usage

```javascript
import { openSync, SeekMode } from 'k6/experimental/fs';

export const options = {
scenarios: {
default: {
executor: 'constant-vus',
vus: 100,
duration: '1m',
},
},
};

const file = openSync("./data.csv");

export default async function () {
let resultString = ""

let buffer = new Uint8Array(10);
let n = await file.read(buffer);
resultString += ab2str(buffer)

// Read the same data again
n = await file.read(buffer);
resultString += ab2str(buffer)

// Read the same data again
n = await file.read(buffer);
resultString += ab2str(buffer)

await file.seek(0, SeekMode.Start);

console.log(`[vu ${__VU}] resultString: ${resultString}`);
}

export default function teardown() {
file.close();
}

function ab2str(buf) {
return String.fromCharCode.apply(null, new Uint16Array(buf));
}
```

### Implementation details

The proposed API is made feasible through the following implementation aspects:

- A file's content is loaded into memory only once:
- When a file is opened, its content is buffered at the module's root in a dedicated registry, returning a handle with a pointer to that buffer.
- Each VU receives a copy of the file handle, enabling them to interact with files using the unique memory area linked to the handle, instead of each receiving a full copy of the buffer.
- As each invocation of `open*` receives a unique file handle linked to the same memory area, they each have unique offsets. This setup allows each VU to process file data independently without conflict or race conditions.

### Possible future improvements

#### Introduce stream support

The Deno API's `FsFile` our proposal is inspired by exposes a `readable` read-only property which is a [Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Streams_API) `ReadableStream` allowing to stream the content of the file. We have an open issue tracking the implementation of the Streams API in k6 [#2978](https://github.com/grafana/k6/issues/2978).

#### Enable file opening within VU context

See [#3020](https://github.com/grafana/k6/issues/3020)

This is currently unachievable because we must anticipate which files will be opened. While some quick fixes might appear feasible (e.g., parsing the AST before execution to identify files), they quickly fall apart: What if the filename resides in a variable? A plausible solution would involve requiring users to declare necessary resources (files/folders) ahead of time. This approach would ensure these resources are captured and included in the archive for future VU code access.

## Conclusion

We believe the [proof of concept](https://github.com/grafana/k6/tree/poc/experimental/fs-module/js/modules/k6/experimental/fs) developed with this proposal illustrates the feasibility and benefits of developing such an API.