
feat(inverse_transform): enable fit and transform with horizontal_matrix #139

Merged: 2 commits into main on Oct 16, 2024

Conversation

@raimbaultL (Contributor) commented on Oct 16, 2024

After running experiments, we conclude that allowing avatarization with a horizontal matrix (more variables than records) is not a problem.

@mguillaudeux (Contributor) commented:

Summary of Dimensionality Limitation in PCA, MCA, and FAMD

The maximum number of dimensions in methods like PCA (ACP), MCA (ACM), and FAMD (AFDM) is constrained by the rank of the data matrix. This rank is equal to the minimum of the number of rows (n, individuals) and columns (p, variables).

- In PCA, the covariance matrix's rank cannot exceed the smaller of n and p, meaning the number of principal components is limited by this minimum.
- In MCA, which works with qualitative data, the number of factor axes is similarly limited by the degrees of freedom in the matrix.
- In FAMD, which handles mixed data, the same logic applies: the number of dimensions is restricted by the matrix rank, reflecting the minimum between the number of individuals and variables.

Example: For a dataset with 100 individuals and 50 variables, the maximum number of dimensions after projection will be 50, as there are only 50 variables to define the variability.

This constraint is a fundamental mathematical property of matrix rank, which determines the number of independent linear combinations available from the data.

Technically, further components could be added, but they would explain zero additional variance, as they would necessarily be orthogonal to the existing ones.

Please note that 100% of the total variance is still captured by this limited number of components.
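
To make the bound concrete, here is a minimal numpy sketch (illustrative only, not code from this PR) showing that a 100 x 50 data matrix yields at most 50 components:

import numpy as np

rng = np.random.default_rng(0)

# 100 individuals, 50 variables: the rank is bounded by min(n, p) = 50.
X = rng.normal(size=(100, 50))
X -= X.mean(axis=0)  # center the data, as PCA does

# The SVD of the centered data gives the principal axes; at most
# min(n, p) singular values can be non-zero.
singular_values = np.linalg.svd(X, compute_uv=False)
print(singular_values.shape)                     # (50,)
print(np.linalg.matrix_rank(X) <= min(X.shape))  # True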

@mguillaudeux (Contributor) commented:

Regarding the previous explanation: adding use_approximate_inverse and raising an error when p > n is pointless, as the resulting matrix will be of size (n, min(n, p)) no matter what.
However, a warning should still be considered if the user purposely sets the nf argument of the fit function above min(n, p).
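
For illustration, such a warning could look like the sketch below (check_nf is a hypothetical helper, not code from this PR):

import warnings

def check_nf(nf: int, n: int, p: int) -> int:
    """Warn and clamp when nf exceeds min(n, p), instead of raising an error."""
    max_nf = min(n, p)
    if nf > max_nf:
        warnings.warn(
            f"nf={nf} is greater than min(n, p)={max_nf}; "
            f"only {max_nf} components carry variance.",
            stacklevel=2,
        )
        return max_nf
    return nf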

) -> None:
    """Verify that the coordinates form a square matrix even if the input is horizontal."""
    coord, __ = fit_transform(df_to_fit_transform)
    assert coord.shape[0] == coord.shape[0]

Glad you added this test.

@mguillaudeux (Contributor) left a comment:

LGTM!
I added some theoretical arguments in the Conversation tab of the PR.
I also wonder what we should do when the user purposely specifies nf > min(n, p) in the fit function: error, warning, or nothing?
Personally, I would like to avoid raising too many errors when not necessary.

Also, maybe we could update the docstrings for the return values of the transform and fit_transform functions, explaining that the coords DataFrame will always be of size (n, min(n, p)), or (n, nf) if nf is specified and nf < min(n, p).
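
A hypothetical wording for the returns section (names and signature are placeholders, to be adapted to the real functions):

import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Project the records onto the fitted axes.

    Returns
    -------
    pd.DataFrame
        Coordinates of shape (n, min(n, p)), or (n, nf) when nf is
        specified and nf < min(n, p).
    """
    ...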

@albanfelix (Contributor) left a comment:

LGTM, except for the mistake in the test.

) -> None:
    """Verify that the coordinates form a square matrix even if the input is horizontal."""
    coord, __ = fit_transform(df_to_fit_transform)
    assert coord.shape[0] == coord.shape[0]
Suggested change:
- assert coord.shape[0] == coord.shape[0]
+ assert coord.shape[0] == coord.shape[1]

@albanfelix (Contributor) commented on Oct 16, 2024:

Replying to the dimensionality summary above:

> This rank is equal to the minimum of the number of rows (n, individuals) and columns (p, variables).

The rank is NOT equal to the minimum dimension; it is only bounded by the minimum dimension. If rank(matrix) = min(n, p), the matrix is said to be of full rank.
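
A quick numpy illustration of the distinction:

import numpy as np

# The second row is twice the first, so the matrix is rank-deficient.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
print(min(X.shape))              # 2, the upper bound on the rank
print(np.linalg.matrix_rank(X))  # 1: bounded by, not equal to, min(n, p)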

@jpetot (Contributor) left a comment:

The changes look good to me. I wonder what the approximate inverse is, or how the inverse transformation works when the data frame is horizontal.

if not use_approximate_inverse and n_records < n_dimensions:
    raise InvalidParameterException(
        f"n_dimensions ({n_dimensions}) is greater than n_records ({n_records})."
    )
# Get back scaled_values from coord with inverse matrix operation

I don't understand what the approximate inverse is if this is the only place where we use it.

@raimbaultL (Author) replied:

The inverse transform can never receive a horizontal matrix, since the coordinates are at worst square. We use numpy.linalg.pinv (https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html) to compute the inverse, because numpy.linalg.inv (https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html) only works on square matrices.
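
Here is a minimal numpy sketch of the reasoning (assumed shapes, not the library's actual code):

import numpy as np

rng = np.random.default_rng(0)
n, p, nf = 10, 7, 5                  # hypothetical sizes, with nf < min(n, p)
scaled = rng.normal(size=(n, p))

# Fit-like step: take the first nf right-singular vectors as axes (p x nf).
V = np.linalg.svd(scaled, full_matrices=False)[2].T[:, :nf]
coord = scaled @ V                   # coordinates, shape (n, nf): never horizontal

# V is not square, so np.linalg.inv(V) would raise LinAlgError; the
# Moore-Penrose pseudo-inverse gives a least-squares reconstruction instead.
recovered = coord @ np.linalg.pinv(V)  # shape (n, p)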

@raimbaultL merged commit 844e95e into main on Oct 16, 2024
2 checks passed