Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add File API proposal design document #3101

Merged
merged 3 commits into from
Jun 26, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
226 changes: 226 additions & 0 deletions docs/design/019-file-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,226 @@
# Introduce a File API module for k6

| | |
| ---------- | ------------ |
| **author** | @oleiade |
| **status** | 🔧 proposal |
| **Proof of concept** | [branch](https://github.com/grafana/k6/tree/poc/experimental/fs-module/js/modules/k6/experimental/fs) |
| **references** | [[#2977](https://github.com/grafana/k6/issues/2977) [[#2974](https://github.com/grafana/k6/issues/2974) |

## Problem definition

The current version of k6 lets users load file content via [the `open` function](https://k6.io/docs/javascript-api/init-context/open/), which is only accessible in the init context. However, the `open` function diverges from its counterparts in other languages and the Linux stack as it reads the whole file into memory rather than opening it for further interaction. This process leads to considerable memory consumption when loading large binary files (as the content ends up copied in each VUs) or when the `SharedArray` cannot be used.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

In line with [our commitment to optimize large file handling in k6](https://github.com/grafana/k6/issues/2974), we propose introducing a new `fs` module. This module is intended to offer an intuitive and user-friendly API for file interactions within k6 scripts. We'll also provide some ideas for efficient file handling to minimize memory consumption during k6 execution.

### Context
Currently, files cannot be opened from within a function executed by a VU, only in the init context.

This is due to k6's design for distributed execution, particularly in the cloud. K6 runs the init context once, gathers resources, including files, and sends them to other instances where VU code runs.

### Assumptions
- The solution should be native to k6, with the code being in the k6 repository and maintained by the k6 OSS team.
- The new functionality's API should be intuitive for users familiar with filesystem APIs from other languages and technologies.
- The module should introduce only essential APIs: read-only operations.
- The proposed API should not affect file handling within HTTP requests.

### Requirements
- The solution should allow file interaction during k6 script execution.
- The solution should resort to asynchronous operations as much as possible.
oleiade marked this conversation as resolved.
Show resolved Hide resolved
- The module should operate seamlessly in local and cloud setups.
- The module should strive to optimize memory usage when handling files.

### Scope

#### In scope
- The proposed solution will offer an alternative to the existing `open` function.
- The solution will provide a File concept that supports read-only operations: reading file chunks into a buffer, reading the entire file content, and file seeking.
- The proposal could suggest changes to improve overall memory usage in k6 scripts.

#### Out of scope
- This proposal doesn't aim to solve memory consumption issues when using file content in HTTP requests, although minor improvements might be suggested.
- Changes to how k6 gathers resources, particularly files, are out of scope (ideas are proposed in the "Future Improvements" section).
- Despite the proposed API's potential to support them, write operations are beyond this proposal's scope.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

## Proposed solution

We suggest implementing a minimalist, experimental file system (`fs`) module based on [Deno's fs module](https://deno.land/api@v1.32.3?s=Deno.open). The new module will allow users to interact with files, separating text and binary files. The module will provide an `open` function that returns a file handle/view for performing read operations.

The initial API will mostly be asynchronous, except for the `open` functionality which will be synchronous due to the current lack of support for `await` operations within the init context.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

The API will have the following characteristics:
- Load file content exactly once into memory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well .. the underlying implementation already loads it once and if it over 100kb it likely loads a couple of more times. This is blocked on removing afero.

But hopefully this will be a lot better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

Copy link
Contributor

@codebien codebien Jun 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So removing afero is part of the implementation? Should we have a rude but working PoC with it? We should also link the issue here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't argue removing afero is part of the implementation.

afero adds +1 on the times we load it in memory and there is +1-2x if it is fairly big file.

This second memory though is temporarily and will be reused. You can see a more detailed explanation here

So we will have at least 2 copies of the file in memory, which is far from ideal, but definitely a lot better than having a copy per VU.

The whole proposal currently while having some values does not make it a lot more useful to open big files, as the moment you have to use them you still need to start copy. So dropping the 2 copies of this part is likely negligible to getting streams and not copying on the "other side" where this data will be used by something.

I am not against removing afero entirely and would like to prioritise it again, but I don't think this is our biggest problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also keep removing afero out of scope for this design.

Thanks for the heads-up on afero indeed, I didn't have the full context in mind regarding how it handles memory itself 🙇🏻 I'll try to rephrase to be more accurate.

- Provide each VU a copy of the file handle pointing to a unique memory area to avoid copying the whole buffer for each VU.
- Ensure each file handle has unique offsets allowing each VU to work independently.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

A working [proof of concept](https://github.com/grafana/k6/tree/poc/experimental/fs-module/js/modules/k6/experimental/fs) of the new API is available on GitHub.

### Proposed API:

```typescript
/*
* readFile reads the whole content of a file and
* returns its content as an `ArrayBuffer`.
*
* It effectively is a drop-in replacement for the
* current `open(filename, 'b')` API.
*/
readFileSync(filename: string): ArrayBuffer

/*
* readFile reads the whole content of a file and
* returns its content as a `string`.
*
* It effectively is a drop-in replacement for the
* current `open(filename, 'r')` API.
*/
readTextFileSync(filename: string): string
oleiade marked this conversation as resolved.
Show resolved Hide resolved

/*
* openSync opens a file and returns an instance of a
* `File`.
*/
openSync(path: string) File

/*
* File is an abstraction to interact with
* files which exposes read-only operations.
*/
interface File {
oleiade marked this conversation as resolved.
Show resolved Hide resolved
/*
* read reads the file into an array buffer.
* resolves to either the number of bytes read during the operation
* or `null` if there was nothing to read.
*/
read(p ArrayBuffer | TypedArray | DataView): Promise<number>

/*
* readAll reads the whole content of the file and
* returns a promise that will resolve to its content
* as an `ArrayBuffer`.
*/
readAll(): Promise<ArrayBuffer>
oleiade marked this conversation as resolved.
Show resolved Hide resolved

/*
* Seek to the given `offset` under mode given by `whence`.
* The call resolves to the new position within the resource
* (bytes from the start).
*/
seek(offset: number, whence: SeekMode): Promise<number>

/*
* Resolves to a `FileInfo` describing the file.
*/
stat(): Promise<FileInfo>
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No close ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether we need one or not depends a lot on the underlying implementation, so I was on the verge. As a safety net, and for the API to fill intuitive indeed, it's probably better to have one indeed. Adding it 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think File.close() would be confusing with the current implementation, considering open() returns a slice with the file data.

So what would close() actually do? It certainly wouldn't close the file, as the comment here states. If we have the concept of a FileHandle, then close() could make it unusable from that point on? Not sure what the purpose of that would be, since FileHandles shouldn't have any expensive resources to free.

I get that a counterpart to open() would be intuitive, but since we're not dealing with traditional file handles here, a close() doesn't make sense to me.

Maybe instead of a top-level open() function, a File constructor would be more intuitive? E.g.:

let f = new File('/path');
f.read() ...

This would make it clear that File handles are cheap and disposable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding close I had left it out of the proposal and POC implementation as they were not necessary with the way I approached solving the problem on the implementation side. But this discussion is a good opportunity to mention that, yet, I didn't want to assume what the precise underlying implementation would be. However, I had thought of one potential use of close: the file registry I used in the POC to keep files content in memory, and provide a pointer to the data to VUs, could keep a reference count for each file->vu number, and once a file is not referenced anymore (because all VUs closed it), it could explicitly drop it. But I'm not entirely convinced that's really useful in practice.

Regarding using a constructor as opposed to open, I'm not opposed to it, but I would personally prefer to stick to an API that ressembles what you find in OSes and other languages. On another note that maybe proves interesting somehow: In my initial implementation of the POC, the File struct was actually called FileView to reflect that it's a view of the file content, cannot be modified, and is a "cheap and disposable" handle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think File.close() would be confusing with the current implementation

implementation is the key word here.

And this is an API proposal. If after making this better we need to break the API because now close will need to be called in order for it to work that will be bad.

I would expect that any file you open - you will be able to close. Whether that has a significant impact is a good question.

If there was a different way to tell us that "I no longer need this file" I will be fine. But at this point this is not possible AFAIK. There are some resource management tc39 proposals that I am not going to discuss as they are just proposal and will require that we have a close like method to tell us that a file is no longer needed. So ... not really a solution.

I am okay with this not having close for now - I do expect we will likely need to iterate on all the changes around the epic "support big files in k6" quite a lot. But wanted to lay my reasoning on wanting a close.

Maybe instead of a top-level open() function, a File constructor would be more intuitive? E.g.:

While I am not against it in principal, I don't see why that will be better..

If anything the current API is closer to what is in deno and in other places and consequently might be reusable across platforms. Or more likely easier to support when someone wants to reuse a library. So ... 👎 on my side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that open() and close() only make sense if the file itself is opened and closed. But since we'll only open the file and read all its data once, and each subsequent call to open() will return a view of the data, calling close() on this view would be confusing if people are expecting the file to be closed.

This behavior will not change in a different implementation, since it's the core of what we want to accomplish. So the naming doesn't quite align with the behavior, and repurposing the name used in other frameworks to forcefully align it wouldn't make it more intuitive--quite the contrary. As a user, I'd rather have the API clearly reflect the behavior, than have to remap my existing assumptions of what open() and close() do.

I suggested using a constructor instead of open() since it doesn't imply that a close() would be needed. But I'm not sold on it either, and we could consider alternatives like FileView.

Anyway, this is not a blocker from my side. So if you strongly feel we should keep it as is, I'm fine with it.


/*
* FileInfo provides information about a file.
*/
interface FileInfo {
/*
* the filename of the file
*/
name: string

/*
* the size of the file, in bytes.
*/
size: number
}
```

**N.B**: although text and binary files are distinguished by the top-level `readTextFile` and `readFile` operations, the `File` doesn't offer that distinction at the moment. This is based on the assumption we could somewhat easily add `TextDecoder` support to k6. If this assumption was to be invalidated, we could adopt the same API format and have two different read-operation variants on the File, or even expose two different kinds of files `TextFileHandle` and `BinaryFileHandle` for instance.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

## Example usage

```javascript
import { openSync, SeekMode } from 'k6/experimental/fs';

export const options = {
scenarios: {
default: {
executor: 'constant-vus',
vus: 100,
duration: '1m',
},
},
};

const file = openSync("./data.csv");

export default async function () {
let resultString = ""

let buffer = new Uint8Array(10);
let n = await file.read(buffer);
resultString += ab2str(buffer)

// Read the same data again
n = await file.read(buffer);
resultString += ab2str(buffer)

// Read the same data again
n = await file.read(buffer);
resultString += ab2str(buffer)

await file.seek(0, SeekMode.Start);

console.log(`[vu ${__VU}] resultString: ${resultString}`);
}

function ab2str(buf) {
return String.fromCharCode.apply(null, new Uint16Array(buf));
}
```

### Potential additions

We have intentionally excluded certain elements from the proposed API. Specifically, the asynchronous `open` function is currently unusable in the init context as it does not support the `await` keyword.

Additionally, while Deno provides synchronous counterparts for its entire API, we may also want to consider doing the same. The primary argument for asynchronous code is to facilitate non-blocking IO.

```typescript
/*
* Asynchronous counterparts to currently synchronous
* proposed APIs.
*/
open(path: string): Promise<File>
readFile(filename: string): Promise<ArrayBuffer>
readTextFile(filename: string): Promise<string>
oleiade marked this conversation as resolved.
Show resolved Hide resolved

/*
* Synchronous counterparts to the already
* proposed APIs.
*/
interface File {
readSync(p ArrayBuffer | TypedArray | DataView): number
readAllSync(): ArrayBuffer
seekSync(offset: number, whence: number): number
statSync(): FileInfo
}
```
oleiade marked this conversation as resolved.
Show resolved Hide resolved

### Implementation details

The proposed API is made feasible through the following implementation aspects:

- A file's content is loaded into memory only once:
- When a file is opened, its content is buffered at the module's root in a dedicated registry, returning a handle with a pointer to that buffer.
- Each VU receives a copy of the file handle, enabling them to interact with files using the unique memory area linked to the handle, instead of each receiving a full copy of the buffer.
- As each VU receives a unique file handle linked to the same memory area, they each have unique offsets. This setup allows each VU to process file data independently without conflict or race conditions.
oleiade marked this conversation as resolved.
Show resolved Hide resolved
- If we plan to introduce write operations to this API in the future, a synchronization mechanism would be required to ensure adherence to the multiple reader/single writer architecture constraints.
oleiade marked this conversation as resolved.
Show resolved Hide resolved

### Possible future improvements

#### Introduce stream support

The Deno API's `FsFile` our proposal is inspired by exposes a `readable` read-only property which is a Streams API `ReadableStream` allowing to stream the content of the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am actually much more for this to have been the part that was worked on.

The current implementation let you open files and then read them in a fairly optimized way.

The problems are that ... using them in practice removes all the benefits almost entirely.

Binary files will 100% just need to be copied and even if we do some magic, the moment a user needs to use formdata or any other kind of formatting the request we are again with a very small gains that likely will not benefit anyone.
I don't think reading binary files little by little and doing something with this result before continuing was ever a requested feature.

This is mostly the same for text files, but here at least there is some chance that something can just read them and parse them on the fly. But that seems like we will need to do custom solutions to stuff that like papaparse, while supporting streams would just fix that ... and for any other API that just takes streams.

I am somewhat hopeful that given the API that won't be as hard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was also one of the big take aways of this work for me too. I built a dumb streaming CSV file parser using the proposed API, and while it's already more convinient, and efficient to do so in JavaScript, it quickly led to complicated code, and strong limitations that wouldn't have occured with the streams API.

Implementing the Streams API as described in #2978 should come next, once this is implemented indeed, and I already plan to start putting together a POC for it for the next release 👍

#### Enable file opening within VU context

This is currently unachievable because we must anticipate which files will be opened. While some quick fixes might appear feasible (e.g., parsing the AST before execution to identify files), they quickly fall apart: What if the filename resides in a variable? A plausible solution would involve requiring users to declare necessary resources (files/folders) ahead of time. This approach would ensure these resources are captured and included in the archive for future VU code access.

oleiade marked this conversation as resolved.
Show resolved Hide resolved
## Conclusion

We believe the [proof of concept](https://github.com/grafana/k6/tree/poc/experimental/fs-module/js/modules/k6/experimental/fs) developed along this proposal illustrates the feasability and benefits of developing such an API.