Multidimensional arrays #1839

Closed
287 changes: 287 additions & 0 deletions proposals/p1839.md
@@ -0,0 +1,287 @@
# Multidimensional arrays
Contributor

As a piece of high-level feedback, I'd encourage you to look for ways to make this proposal smaller. A lot of the features you're proposing here seem like they could be follow-up proposals once the main multidimensional array feature is in place. Some of these features seem like they could be controversial, or at least require substantial discussion, and I don't want the main proposal to get bogged down.


<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

[Pull request](https://github.com/carbon-language/carbon-lang/pull/1839)

<!-- toc -->

## Table of contents

- [Problem](#problem)
- [Background](#background)
- [Proposal](#proposal)
- [Details](#details)
- [Rationale](#rationale)
- [Alternatives considered](#alternatives-considered)

<!-- tocstop -->

## Problem

Multidimensional arrays are widely used in numerical methods, machine
intelligence, and data science. They are one feature that makes modern Fortran
more attractive than C++ when choosing a compiled language:
currently, C++ lacks support for multidimensional arrays. Having Carbon implement
them would give it a major boost in the eyes of the scientific community.

Nested arrays may look like a good alternative to multidimensional arrays, but
they may perform poorly because their storage can be split across memory.

## Background

A multidimensional array is an array with two or more dimensions that is
contiguous in memory.

A multidimensional array may be stored in memory in
[row- or column- major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order).
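
As an illustration of the two storage orders, the following Python sketch computes the flat offset of an element in each layout. The function names are hypothetical and chosen only for this example; they are not part of the proposal.

```python
def row_major_offset(index, shape):
    """Flat offset of `index` when the last dimension varies fastest."""
    offset = 0
    for i, extent in zip(index, shape):
        offset = offset * extent + i
    return offset


def column_major_offset(index, shape):
    """Flat offset of `index` when the first dimension varies fastest."""
    offset = 0
    for i, extent in zip(reversed(index), reversed(shape)):
        offset = offset * extent + i
    return offset


shape = (3, 4)
print(row_major_offset((1, 2), shape))     # 1 * 4 + 2 = 6
print(column_major_offset((1, 2), shape))  # 1 + 2 * 3 = 7
```

The same index maps to different offsets in the two layouts, which is why a design has to fix one order (or make it selectable) before iteration and reshaping can be specified.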

## Proposal

We should add support for multidimensional arrays in Carbon via a syntax extension
Contributor

As with #1787 (see this comment), we've put this proposal in a procedurally awkward position, because we haven't yet adopted a proposal for one-dimensional arrays. That being the case, I'm not sure what the best way forward is, but it might make sense to defer this proposal until we've adopted a one-dimensional array proposal that this can build on.

Author

Sorry for the late reply, and thank you for your review! As far as I can see, #1787 is only about array initialization syntax. Should I create a new proposal for 1D arrays?

Contributor

Should I create a new proposal for 1D arrays?

I think either you or @asoffer should; see the discussion in the #process channel on Discord.

to make code cleaner and simpler to read and write.

```carbon
var a: [f64; 3, 4];
var values: i64 = 0;
for(a_i: auto in a[:,...]) {
  for(a_ij: auto in a_i[:,...]) {
    a_ij = values++;
  }
}
```
or
```carbon
var a: [f64; 3, 4];
var values: i64 = 0;
for(i: auto in (0:2)) {
  for(j: auto in (0:3)) {
    a[i,j] = values++;
  }
}
```

## Details

### Definition

#### Automatic allocation

Arrays can be automatically allocated:
```carbon
var x: [i32; :, :];
```
To avoid undefined behavior, `x` initially has shape `(0, 0)`.
Comment on lines +73 to +76
Contributor

Does : mean "deduce this dimension from the initializer", or does it mean "this dimension can vary at run-time"? I'd strongly prefer the first meaning, but if that's the case, this code should be a compile error, just like var x: auto; is a compile error.

Author

It means that the dimensions may vary at run-time. The simplest example is a string (C# example):

string s = "Hello";
s += ", world!";

In Fortran, this is often used to allocate arrays once their sizes become known:

type(atom), allocatable :: atoms(:)
integer :: Natoms
Natoms = get_atoms_len()
allocate(atoms(Natoms))

In Carbon, simplified (sorry for the Fortran style):

var n_atoms: i64;
var atoms: [atom; :];
n_atoms = get_atoms_len();
allocate(atoms, shape = ( n_atoms ));

It can be rewritten as:

var n_atoms: i64 = get_atoms_len();
var atoms: [atom; n_atoms];

But I would prefer to keep such constructs, since they may be widely used for class fields.

Contributor @geoffromer, Aug 6, 2022

That seems like it would make the type of var x: [i32; 3, 2] = ((0, 1, 2), (3, 4, 5)); very different from the type of var y: [i32; :, :] = ((0, 1, 2), (3, 4, 5));.

  • shape(x) would presumably be a compile-time constant, but shape(y) could not be.
  • x could store all its elements directly in the object (like a std::array does), but y has to be implemented with a pointer to heap-allocated memory.
  • The generated code for x[2, 1] can consist of a single memory access, but I believe the generated code for y[2, 1] requires at least two memory accesses, even if you don't count the indirection through the pointer mentioned above.
  • x provides pointer stability, whereas y does not: &x[0] will never change, but &y[0] could have different values over the course of y's lifetime.

Those differences are so basic that I think it would be misleading to use such similar syntax to represent them both.

Author

From my point of view, var x: [i32; 3, 2] = ((0, 1, 2), (3, 4, 5)); and var y: [i32; :, :] = ((0, 1, 2), (3, 4, 5)); are the same. In y, : is used to avoid stating the dimensions' lengths explicitly. So,

  • shape(x) and shape(y) would both presumably be compile-time constants.
  • x and y could both store all their elements directly in the object (like a std::array does).
  • x and y both provide pointer stability.

I did not get why accessing for y is as y[2][1], not y[2,1]. It is not a nested array, so accessing will be via single memory access. For nested array, you are right about two memory accesses.

Let me separate var y: [i32; :, :] = ((0, 1, 2), (3, 4, 5)); and var z: [i32; :, :];. In the case of y, : is syntactic sugar for omitting the dimensions' lengths. For z, : means that this array can be allocated and reallocated. For the z array, what you wrote about y is true, except for point (3).

Contributor

We should probably move this discussion to your 1D array proposal, but to answer your questions:

I did not get why accessing for y is as y[2][1], not y[2,1].

Sorry, that was a typo. I've fixed it now.

It is not a nested array, so accessing will be via single memory access. For nested array, you are right about two memory accesses.

Even if it's not nested I believe it needs two memory accesses. Let's assume the 2D array is implemented as a 1D array in row-major order. That means the element at row i and column j is located at index i * N + j in the underlying array, where N is the number of columns. So in order to access that element, I need to know the number of columns. If the number of columns can vary at run-time, that means I need to load the number of columns from memory before I can access the element. On the other hand, if the number of columns is fixed at compile time, I can avoid doing that load.
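
The difference described here can be sketched in Python, purely as an illustration. `DynamicArray2D`, `COLS`, and `get_fixed` are hypothetical names invented for this sketch; Python itself does not distinguish compile-time from run-time constants, so the comments mark where each load would occur in a compiled language.

```python
class DynamicArray2D:
    """Hypothetical 2D array whose shape is only known at run time."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols  # stored alongside the data, so it must be read back
        self.data = [0] * (rows * cols)

    def get(self, i, j):
        # Row-major access: first load the column count, then the element.
        return self.data[i * self.cols + j]


COLS = 3  # fixed shape: the stride is a constant known ahead of time


def get_fixed(data, i, j):
    # Only the element itself has to be loaded; the stride is a constant.
    return data[i * COLS + j]


dynamic = DynamicArray2D(2, 3)
dynamic.data = list(range(6))
print(dynamic.get(1, 2))               # element at row 1, column 2
print(get_fixed(list(range(6)), 1, 2)) # same element, constant stride
```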


An array may be defined:
1. via **assignment**:
```carbon
var x: [i32; :, :];
var y: [i32; :, :] = ((0, 1, 2), (3, 4, 5));
x = y;
```
2. via **memory allocation**:
Contributor

I'd definitely recommend leaving this part out of the current proposal. We don't have a design for dynamic allocation of one-dimensional arrays, or even single objects, and it's impossible to evaluate this part of the proposal in isolation from that design.

```carbon
var x: [i32; :, :];
allocate(x, /*shape=*/(3, 2));
```

If `allocate` is called on an array that is already allocated, a runtime error occurs.

#### Automatic deallocation
Contributor

Is there any reason this should work differently for arrays than for single objects? If not, I think we can leave this section out.


Automatically allocated arrays are destroyed at the end of their scope. For example,
if such an array belongs to a class object, it is destroyed together with that object.

Deallocation can also be requested manually:
```carbon
var x: [i32; :, :];
allocate(x, /*shape=*/(3, 2));
deallocate(x);
```
Calling `deallocate` on an array that is not allocated leads to a runtime error.
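
The allocation rules above can be modeled with a small Python sketch. `AllocatableArray` is a hypothetical class written only to illustrate the described behavior (zero shape before allocation, runtime errors on double allocation and on deallocating an unallocated array); it is not a proposed API.

```python
class AllocatableArray:
    """Hypothetical model of an array declared as `[T; :, :]`."""

    def __init__(self, rank):
        self.shape = (0,) * rank  # unallocated arrays report an all-zero shape
        self.data = []
        self._allocated = False

    def allocated(self):
        return self._allocated

    def allocate(self, shape):
        if self._allocated:
            raise RuntimeError("array is already allocated")
        if len(shape) != len(self.shape):
            raise ValueError("shape has the wrong number of dimensions")
        total = 1
        for extent in shape:
            total *= extent
        self.shape = tuple(shape)
        self.data = [0] * total
        self._allocated = True

    def deallocate(self):
        if not self._allocated:
            raise RuntimeError("array is not allocated")
        self.shape = (0,) * len(self.shape)
        self.data = []
        self._allocated = False


x = AllocatableArray(rank=2)
print(x.allocated())  # False
x.allocate((3, 2))
print(x.allocated())  # True
x.deallocate()
print(x.allocated())  # False
```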

### Operators

Arrays may be modified in scalar and vector ways.

1. Scalar way:
```carbon
var x: [i32; 3, 2] = ((0, 1, 2), (3, 4, 5));
var y: auto = -x;
// corresponding elements are added pairwise
var z: auto = x + y; // z = ((0,0,0), (0,0,0))
```
If the shapes are inconsistent, a runtime error occurs.

2. Vector way:
```carbon
var x: [i32; 3, 2] = ((0, 1, 2), (3, 4, 5));
// multiply each element by 2
x *= 2; // x = ((0, 2, 4), (6, 8, 10));
```
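
As a rough model of these two kinds of operation, the Python sketch below implements pairwise addition with a shape check and multiplication by a scalar on nested lists. The helper names `elementwise` and `scale` are assumptions made for this illustration.

```python
def elementwise(op, x, y):
    """Apply `op` to corresponding elements of two equally shaped nested lists."""
    if isinstance(x, list) != isinstance(y, list):
        raise ValueError("inconsistent shapes")
    if not isinstance(x, list):
        return op(x, y)
    if len(x) != len(y):
        raise ValueError("inconsistent shapes")
    return [elementwise(op, a, b) for a, b in zip(x, y)]


def scale(x, factor):
    """Multiply every element of a nested list by a scalar."""
    if isinstance(x, list):
        return [scale(a, factor) for a in x]
    return x * factor


x = [[0, 1, 2], [3, 4, 5]]
y = scale(x, -1)
z = elementwise(lambda a, b: a + b, x, y)
print(z)            # [[0, 0, 0], [0, 0, 0]]
print(scale(x, 2))  # [[0, 2, 4], [6, 8, 10]]
```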

### Iterators (?)
Contributor

The use cases for this seem likely to be rare enough that named functions would be clearer, and would avoid the need for a new core-language syntax. In particular, I'm concerned about the ... syntax conflicting with variadics.


In multidimensional arrays, it may be useful to have _iterators_ (row-major
order):
```carbon
var a: [f64; 2, 3, 4];
var it: auto = a[:, ...];
for(i: auto in it) { ... }
```
In this example, `i` takes the values `a[0,:,:]` and `a[1,:,:]` in turn.
An iterator may also run over the last dimension (column-major order):
```carbon
var a: [f64; 4, 3, 2];
var it: auto = a[..., :];
for(i: auto in it) { ... }
```
In this example, `i` takes the values `a[:,:,0]` and `a[:,:,1]` in turn.
In both cases, `i` is a two-dimensional array.

`...` stands for all remaining dimensions.
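
To make the two iteration orders concrete, here is a Python sketch over nested lists. `iterate_first`, `iterate_last`, and `take_last` are hypothetical helpers; they only approximate what `a[:, ...]` and `a[..., :]` are described as doing.

```python
def iterate_first(a):
    """Yield a[0, ...], a[1, ...], ... (slices along the first dimension)."""
    for sub in a:
        yield sub


def take_last(node, k):
    """Return the slice node[..., k] of a rectangular nested list."""
    if isinstance(node[0], list):
        return [take_last(child, k) for child in node]
    return node[k]


def iterate_last(a):
    """Yield a[..., 0], a[..., 1], ... (slices along the last dimension)."""
    innermost = a
    while isinstance(innermost[0], list):
        innermost = innermost[0]
    for k in range(len(innermost)):
        yield take_last(a, k)


# 2 x 3 x 4 array with a[i][j][k] == 12 * i + 4 * j + k
a = [[[12 * i + 4 * j + k for k in range(4)] for j in range(3)] for i in range(2)]
print(len(list(iterate_first(a))))  # 2 slices, each 3 x 4
print(len(list(iterate_last(a))))   # 4 slices, each 2 x 3
```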

### Functions

Usually, arrays are passed to functions as is. Below is a function that adds two arrays:
```carbon
fn sum[T:! Type](x: T, y: T) -> T {
  return x + y;
}
```

A function returning a 1D array:
```carbon
fn arr1D[T:! Type](x: T, y: T) -> [T; :] {
  return (x, y);
}
```
or a 2D array:
```carbon
fn arr2D[T:! Type](x: T, y: T) -> [T; :,:] {
  return ((x, x), (y, y));
}
```

Dimensions may be specified explicitly:
```carbon
fn arr1D[T:! Type](x: T, y: T) -> [T; 2] {
  return (x, y);
}
```

Lowering dimensions:
```carbon
fn unarr[T:! Type](x: [T; :], y: [T; :]) -> T {
  return sum(x + y);
}
```

#### Elemental functions
Contributor

I don't really understand the background and rationale for this feature. For example, what problems does it solve? Can those problems be solved with libraries instead of language features? Is there precedent for this feature in other languages?

This might be a good piece to split out into a separate proposal.


These functions are applied to each element in turn.

```carbon
el fn inc[T:! Type](x: T) -> T {
  return x + 1;
}
fn Main() -> i32 {
  var x: [i32; 3] = (0:2);
  var y: i32 = 3;
  x = inc(x); // similar to x = x + 1;
  y = inc(y);
  return 0;
}
```
This is useful when the function is more complicated than an increment.

Using _iterators_, an elemental function may be applied over sub-dimensions:
```carbon
el fn conv[T:! Type](x: T) -> T {
  return sum(x);
}
fn Main() -> i32 {
  var x: [i32; 3, 4] = reshape((0:11),/*shape=*/(3, 4));
  var y: auto = conv(x[:,...]); // y = (6, 22, 38)
  var z: auto = conv(x[...,:]); // z = (12, 15, 18, 21)
  return 0;
}
```
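
The values in the comments above can be checked with a short Python sketch. It assumes that `reshape` fills the array in row-major order, which is what the stated results imply; `apply_elemental`, `rows`, and `columns` are hypothetical helpers standing in for `el fn`, `x[:,...]`, and `x[...,:]`.

```python
def apply_elemental(fn, items):
    """Apply `fn` to each item in turn, as an elemental function would."""
    return [fn(item) for item in items]


def rows(x):
    """Slices along the first dimension of a 2D nested list."""
    return [list(row) for row in x]


def columns(x):
    """Slices along the last dimension of a 2D nested list."""
    return [list(column) for column in zip(*x)]


values = list(range(12))
x = [values[r * 4:(r + 1) * 4] for r in range(3)]  # 3 x 4, row-major fill

y = apply_elemental(sum, rows(x))
z = apply_elemental(sum, columns(x))
print(y)  # [6, 22, 38]
print(z)  # [12, 15, 18, 21]
```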

### Standard library
Contributor

This seems like another good piece to postpone to a future proposal. I'm not sure if we even have the right people on the project to properly review a full library design for multidimensional arrays, so it might be better to focus for now on the core-language functionality.


#### allocate
Allocates an array:
```carbon
var x: [i32; :, :];
allocate(x, /*shape=*/(3, 2));
```
#### deallocate
Deallocates an array:
```carbon
var x: [i32; :, :];
allocate(x, /*shape=*/(3, 2));
deallocate(x);
```
#### allocated
Returns whether the array is currently allocated:
```carbon
var x: [i32; :, :];
allocated(x); // False
allocate(x, /*shape=*/(3, 2));
allocated(x); // True
deallocate(x);
allocated(x); // False
```
#### shape
Returns the shape of an array:
```carbon
var s: [i32; 2] = shape(x); // s = (3, 2)
```
#### size
Returns the total number of elements in an array:
```carbon
var l: i32 = size(x); // l = 6
```
With the optional argument `dim`, returns the size along the given dimension (indexed from 1):
```carbon
var l1: i32 = size(x, /*dim=*/1); // l1 = 3
var l2: i32 = size(x, /*dim=*/2); // l2 = 2
```
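The relationship between `shape` and `size` can be sketched in Python as follows. This is only an illustration: the hypothetical `size` function here takes the shape tuple directly rather than an array.

```python
from math import prod


def size(shape, dim=None):
    """Total element count, or the extent along a 1-based dimension `dim`."""
    if dim is None:
        return prod(shape)
    if not 1 <= dim <= len(shape):
        raise IndexError("dimension out of range")
    return shape[dim - 1]


shape = (3, 2)
print(size(shape))         # 6
print(size(shape, dim=1))  # 3
print(size(shape, dim=2))  # 2
```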
#### reshape
Reshapes an array:
```carbon
var x: [i32; 3, 2] = reshape(/*array=*/(0, 1, 2, 3, 4, 5), /*shape=*/(3, 2));
```
#### transpose
Transposes an array (the form without an additional argument is only for 2D arrays):
```carbon
var x: [i32; 3, 2] = ((0, 1, 2), (3, 4, 5));
var y: auto = transpose(x); // y = ((0, 3), (1, 4), (2, 5))
var z: auto = shape(y); // z = (2, 3)
```
The additional argument `dims` selects the two dimensions to swap:
```carbon
var x: [i32; 2, 2, 2] = ( ((0, 1), (2, 3)), ((4, 5), (6, 7)) );
var y: auto = transpose(x, /*dims=*/(1, 3));
// y = ( ((0, 4), (2, 6)), ((1, 5), (3, 7)) )
```
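Both transpose results above can be reproduced with a Python sketch over nested lists. The `shape` and `transpose` functions here are hypothetical re-implementations for illustration. Note that the nested lists mirror the nesting of the literals in the text, and that Python's `shape` lists extents from the outermost level inward, which appears to be the reverse of the order used in the proposal's type annotations.

```python
def shape(x):
    """Extents of a rectangular nested list, outermost level first."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


def transpose(x, dims=(1, 2)):
    """Swap two axes of a nested list; `dims` is 1-based, as in the text."""
    a, b = dims[0] - 1, dims[1] - 1
    new_shape = list(shape(x))
    new_shape[a], new_shape[b] = new_shape[b], new_shape[a]

    def element(index):
        # Look up the source element with the two chosen indices swapped.
        source = list(index)
        source[a], source[b] = source[b], source[a]
        value = x
        for i in source:
            value = value[i]
        return value

    def build(prefix):
        if len(prefix) == len(new_shape):
            return element(prefix)
        return [build(prefix + (i,)) for i in range(new_shape[len(prefix)])]

    return build(())


print(transpose([[0, 1, 2], [3, 4, 5]]))
# [[0, 3], [1, 4], [2, 5]]
print(transpose([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], dims=(1, 3)))
# [[[0, 4], [2, 6]], [[1, 5], [3, 7]]]
```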
#### sum
Sums all values in an array:
```carbon
var x: [i32; 2, 2, 2] = ( ((0, 1), (2, 3)), ((4, 5), (6, 7)) );
var y: auto = sum(x); // y = 28
```

## Rationale

This proposal should make it simpler to write high-performance computing (HPC) code,
most of which is performance-critical software. Unfortunately, C++ code is not
affected.

## Alternatives considered

This proposal is strongly influenced by Fortran.