Feat(#56) blog about caching #58

Closed
283 changes: 283 additions & 0 deletions _posts/2024/2024-02-06-about-caching-in-eo.md
@@ -0,0 +1,283 @@
---
layout: post
date: 2024-02-06
title: "Build cache in EO and other build systems"
author: Alekseeva Yana
---


## Introduction
In [EO](https://github.com/objectionary/eo), caching is used to speed up program compilation.
Recently we found a caching
[bug](https://github.com/objectionary/eo/issues/2790) between goals in `eo-maven-plugin`
for EO version `0.34.0`. The bug occurred because the old verification method
Member

@Yanich96 It's better to say: "The bug occurred because the old verification method used compilation time and caching time to search for a cached file"

used compilation time and caching time to search for a cached file.
This is not the most reliable verification method,
because caching time does not have to be equal to compilation time.
We came to the conclusion that we need caching with a reliable verification method.
Furthermore, this verification method should refrain from reading the file content.

The goal is to implement effective caching in EO.
To achieve this goal, we will briefly look at how well-known build systems (such as ccache, Maven, and Gradle) cache their results,
in order to gain a deeper understanding of the caching concepts employed within them.

<!--more-->
Member

@Yanich96 "More"?


## Caching in Other Build Systems

### ccache/sccache
Member

@Yanich96 Is it a build system or what? Where is the link? Short description?

In compiled programming languages, building a project with many source code files takes a long time.
This time is spent on loading libraries and on preparing, optimizing, and checking the code, and so on.
Let's look at the build scheme using C++ as an example [Picture 1](/images/defaultCPhase.svg):

<p align="center">
<img src="/images/defaultCPhase.svg">
</p>

1) First, the preprocessor retrieves the source code files,
Member

@Yanich96 You only say that "preprocessor" only retrieves the source code files. And then... magic...:

The result is a single file .cpp with human-readable code that the compiler will get.

Moreover, you don't need "compiler will get"

which consist of both `.cpp` source files and `.h` header files.
It expands the `#include` directives and macros, so the result is a single `.cpp` file with human-readable code that the compiler will get.
2) The compiler receives the `.cpp` file from the preprocessor and compiles it into an object file, `.obj`.
At the compilation stage, the parser checks whether the code matches the rules of the programming language.
Member

@Yanich96 Did you mean "parser" instead of "parsing"?

Author

@volodya-lombrozo yes, thanks

At the end, the compiler optimizes the resulting machine code and produces an object file.
Member

@Yanich96 You already mentioned it:

The compiler receives the file .cpp from the preprocessor and compiles it into an object file - .obj.

To speed up compilation, different files of the same project might be compiled in parallel.
3) Then, the [Linker](https://en.wikipedia.org/wiki/Linker_(computing)) combines object files
into an executable `.exe` file.


To speed up the build of compiled languages, compiler caches such as [ccache](https://ccache.dev)
and [sccache](https://github.com/mozilla/sccache) are used.
`ccache` hashes the code at certain stages of the build.
When a file is compiled, its hash is calculated.
If that hash is already present in the registry of compiled files, the file is not compiled again.
Instead, the previously compiled binary file is reused.
This approach can significantly accelerate the build of some packages, in some cases by a factor of 5-10.
The [`ccache` hash](https://ccache.dev/manual/4.8.2.html#_common_hashed_information) is
based on:
* the file contents
* the current directory of the file
* the name of the compiler
* the compiler’s size and modification time
* extensions used by the compiler.

Moreover, `ccache` has two hashing modes:
1) `Direct mode` - the hash is generated based on the source code only.
When using this mode, the user must ensure that the external libraries used in the project have not changed.
Otherwise, the project might fail to build.
2) `Preprocessor mode` - the hash is generated based on the `.cpp` file produced by the preprocessor step.
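
To make the hashing idea concrete, here is a minimal Python sketch of a compiler cache keyed by a hash of the inputs listed above. It is not how `ccache` is implemented: the names `cache_key` and `CompileCache` are hypothetical, SHA-256 is an assumption chosen for simplicity, and the "registry" is just an in-memory dictionary.

```python
import hashlib


def cache_key(source: str, compiler_name: str, compiler_size: int,
              compiler_mtime: int, cwd: str) -> str:
    """Combine the hashed inputs into one hex digest (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    for part in (source, compiler_name, str(compiler_size),
                 str(compiler_mtime), cwd):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") never collides with ("a", "bc").
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


class CompileCache:
    """Hypothetical in-memory stand-in for the registry of compiled files."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn  # the expensive step we want to avoid
        self.store = {}               # cache key -> compiled result
        self.misses = 0               # how many times we really compiled

    def compile(self, source: str, **compiler_info):
        key = cache_key(source, **compiler_info)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.compile_fn(source)
        return self.store[key]
```

Calling `compile` twice with the same source and the same compiler information runs `compile_fn` only once; changing either the source or, say, the compiler's modification time produces a new key and triggers a real compilation.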


`sccache` is similar in purpose to `ccache` but provides more functionality.
`sccache` can store cached files not only locally but also in cloud storage.
It also supports a wider range of languages, while `ccache` focuses on caching C and C++ compilation.

Member

@Yanich96 Maybe we need to write a short summary 1-2 sentences about this type of caching?

Author

@volodya-lombrozo The principle of caching in sccache is the same as in ccache (using Direct and Preprocessor modes), the only difference is in the places where the data is stored.

Member

@Yanich96 I mean ccache and sccache altogether. What is the difference with other types of caching? Why did you choose these tools?

Author

@volodya-lombrozo I wrote above that I looked at well-known used build systems. Isn't this enough?


`ccache` cannot work with individual compilation tasks (e.g. `Maven goal` or `Gradle task`).
However, the hashing approach and the concept of non-local data storage could potentially
be incorporated during the development of the EO caching mechanism.


### Gradle
[Gradle](https://gradle.org) builds projects using a
[task graph](https://docs.gradle.org/current/userguide/build_lifecycle.html) that allows for synchronous execution
of certain tasks. A task represents a unit of work in a `Gradle` project.
To speed up project builds, `Gradle` employs an
[incremental build](https://docs.gradle.org/current/userguide/incremental_build.html#sec:how_does_it_work).
For an incremental build to work, the tasks used to build the project must have specified
Member

@Yanich96 Could you please simplify this sentence and use simple active voice?

Author

@volodya-lombrozo "The tasks that build the project must have input and output files for an incremental build to work." - is it ok?

Member

@Yanich96 The second sentence clearly explains the idea you are trying to convey here. I would suggest combining these two sentences into a single one, or just removing this sentence. What do you think?

input and output files.
Member

@Yanich96 "To enable an incremental build, the tasks that build the project must specify their input and output files."

The following snippet declares the input and output files of a Gradle task:
```
task myTask {
    inputs.file 'src/main/java/MyTask.somebody'             // Specify the input file
    outputs.file 'build/classes/java/main/MyTask.somebody'  // Specify the output file

    doLast {
        // Task actions go here
        // This code will only be executed if the inputs or outputs have changed
    }
}
```


To understand how `Incremental build` works, consider the following steps:
Member

@Yanich96 Something strange is happening here with punctuation. Did you put this sentences in this order intentionally?

Author

@volodya-lombrozo If I replace "To understand how Incremental build works, consider the following steps" with "How Incremental build works", will it be ok?

Member

@volodya-lombrozo May 15, 2024

@Yanich96 Is it possible to remove this sentence?

`Incremental build` uses a hash to detect changes in the inputs and the outputs.
The single hash contains the paths and the contents of all the input files or output files.
Member

@Yanich96 "contains"? Maybe "uses"?

Author

@volodya-lombrozo In the Gradle documentation: "This fingerprint contains the paths of input files and a hash of the contents of each file."

1) Before executing a task, `Gradle` takes a hash of the input files and saves it.
The hash is considered valid if the last modification time and the size of the source files have not changed.
2) Then `Gradle` executes the task and saves a hash of the output files.
3) Then, when `Gradle` starts a project build again, it generates a new hash for the same files.
If the new hash is valid, that is, it matches the saved one, `Gradle` can safely skip this task.
Otherwise, the task runs its action again and rewrites the outputs.
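
The three steps above can be sketched in Python as follows. This is only an illustration of the idea, not Gradle's implementation: `fingerprint` and `IncrementalTask` are hypothetical names, SHA-256 is an assumption, and the saved state lives in memory rather than on disk.

```python
import hashlib
from pathlib import Path


def fingerprint(files):
    """Hash the paths and the contents of the given files into one digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(f) for f in files):
        digest.update(str(path).encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


class IncrementalTask:
    """Hypothetical task that skips its action when nothing has changed."""

    def __init__(self, inputs, outputs, action):
        self.inputs = list(inputs)
        self.outputs = list(outputs)
        self.action = action
        self.saved = None  # (input fingerprint, output fingerprint) of the last run

    def up_to_date(self):
        if self.saved is None:
            return False
        if not all(Path(o).exists() for o in self.outputs):
            return False
        return self.saved == (fingerprint(self.inputs), fingerprint(self.outputs))

    def run(self):
        """Execute the action unless inputs and outputs are unchanged."""
        if self.up_to_date():
            return "UP-TO-DATE"
        input_print = fingerprint(self.inputs)                 # step 1
        self.action()
        self.saved = (input_print, fingerprint(self.outputs))  # step 2
        return "EXECUTED"
```

A second call to `run()` with untouched files returns `"UP-TO-DATE"` without invoking the action; editing an input or deleting an output makes the next call execute it again.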


In addition to `Incremental build`, `Gradle` also stores the hashes and outputs of previous builds, enabling quick project builds,
for example when switching from one git branch to another. This feature is known as
the [Build Cache](https://docs.gradle.org/current/userguide/build_cache.html).


`Gradle Incremental build` can manage separate compilation tasks based on inputs and outputs.
The EO compiler, in turn, consists of units of work in `Maven` (the last section contains a detailed description).
Member

@Yanich96 You can add a link to the Maven section.

Steps of the EO compiler can have input and output files.
Member

@Yanich96 Why did you write this two sentences about eo?

Building upon the concept of `Gradle Incremental Build`, we can use its principles to develop the EO caching mechanism.


### Maven
[Maven](https://maven.apache.org) automates and manages Java-project builds.
`Maven` is based on the concept of
[Maven LifeCycles](https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html),
which include default, clean, and site lifecycles.
Each lifecycle consists of `phases` and these `phases` consist of sets of `goals`.
One `phase` can consist of several `goals`.

In Maven, there are default phases and goals for building any project:

<p align="center">
<img src="/images/defaultPhaseMaven.svg">
</p>

In Maven, the `phases` are inherently interconnected within the build lifecycle.
A `phase` represents a specific task, and the execution order of `phases` is determined by the default Maven
lifecycle bindings. Each `phase` functions as a series of individual tasks known as `goals`.
Member

@Yanich96 You have already described phases and goals above. Could you please remove this redundancy and repetition?

Author

@volodya-lombrozo I will remove 130-131 lines:

Each lifecycle consists of phases and these phases consist of sets of goals.
One phase can consist of several goals.

Some `goals` are tied to the Maven lifecycle, as shown in [Picture 2](/images/defaultPhaseMaven.svg).
It's also possible to add a new `goal` to a desired phase by modifying the `pom.xml` file.
Maven also supports `goals` that are not bound to any build phase
and can be executed outside the build lifecycle, directly from the command line.
The `goals` are executed in the following order:
Member

@Yanich96 I'm not sure we should explain maven lifecycle so deep. I would just leave necessary information. For the rest you might provide a link to the documentation.

1) The `goals` tied to the Maven lifecycle are executed first.
2) The `goals` added to the `pom.xml` file are executed second.
3) The `goals` that are not tied to `phases` can be executed last.

Member

@Yanich96 Do you need this empty line?

Author

@volodya-lombrozo I guess so, because lines 122-128 are about Maven and lines 131-150 are about caching mechanisms in Maven.


`Maven` can utilize caching mechanisms through the `takari-lifecycle-plugin` and `maven-build-cache-extension`:

* The [takari-lifecycle-plugin](http://takari.io/book/40-lifecycle.html) is an alternative to the default Maven lifecycle
(building JAR files). Its distinctive feature is a single universal plugin with functionality equivalent
to the plugins of the standard lifecycle, but with significantly fewer dependencies. This plugin leverages
[The Takari Incremental API](https://github.com/takari/io.takari.incrementalbuild),
which introduces the concept of `builders`. These `builders` are user-provided public non-abstract
top-level classes that implement specific build actions as methods annotated with `@Builder`.
Member

@Yanich96 Do we really need to know this low-level details about takari?

Member

@Yanich96 The question remains

They can produce various types of outputs, including generated/output files on the filesystem,
build messages, and project model mutations. For each `@Builder`-annotated method, a Maven mojo,
which represents a Maven `goal`, is generated.
When a `builder` runs again on the same set of inputs, it produces the same outputs and saves them to the specified directory.
Any change in the inputs results in the removal of the outputs.

Member

@Yanich96 Do you need this empty line?


* The [maven-build-cache-extension](https://maven.apache.org/extensions/maven-build-cache-extension/)
is intended for large Maven projects that have a significant number of small `modules`.
A `module` is a subproject within a larger project.
Each `module` has its own `pom.xml` file, and there is an aggregator `pom.xml` that consolidates all the `modules`.
This extension computes a hash of the `module` inputs and stores the outputs in the cache.
For unchanged `modules`, the outputs are restored from the cache.
Otherwise, the cache seamlessly delegates the build work to the standard Maven core,
without interfering with the build execution logic.

Member

@Yanich96 What is the conclusion? Why did you mention Maven? Is this caching similar to Gradle? To ccache? What is the difference?


Let's clarify upfront that the Maven Build Cache Extension is not suitable for caching EO compilation stages,
as it is designed for caching at the module level within a project and not for individual tasks.


Special attention should be given to the Takari Incremental API.
This API can be applied to cache EO compilation stages as it operates with `goals`.
It does not use hashing algorithms, which can slow down project builds; instead, it checks the last modification time of the input files,
Member

@Yanich96 How is that possible? Does it really cache anything if it doesn't use hashing?

Author

@volodya-lombrozo The Takari checks the last modification time of the input files. It doesn't create a hash.

Or did I not understand the question?

Member

@Yanich96 It would be good to mention it here:

The Takari checks the last modification time of the input files. It doesn't create a hash.

and it does not have separate cache directories.
Each `builder` has its own directories for the input and output data related to its work.
The operational principle of the Takari Incremental API is similar to that of caching in EO.
Member

@Yanich96 We don't know about caching in EO yet.

Member

@Yanich96 Still an issue.


### EO build cache

EO uses `Maven` to build projects.
For this purpose, the `eo-maven-plugin` contains the essential goals for working with EO code.
As previously mentioned, Maven builds a project by following a specific order of phases.
Below is a diagram illustrating the main phases and their corresponding goals for EO:

<p align="center">
<img src="/images/EO.svg">
</p>

In [Picture 3](/images/EO.svg) the goals of the `eo-maven-plugin` are highlighted in green.


However, the actual work with EO code takes place in `AssembleMojo`.
`AssembleMojo` is a goal that consists of other goals that work with the EO file, as shown in
[Picture 4](/images/AssembleMojo.svg).


<p align="center">
<img src="/images/AssembleMojo.svg">
</p>

Each goal within `AssembleMojo` is a distinct compilation step for EO code.
These tasks happen one after the other, and each task relies on the output of the one before it.
Each task has directories for input and output data, as well as a directory for storing cached data.
Using the program name, each task can receive and store data.

Member

@Yanich96 Why do you need two consecutive empty lines here? If you need some logical division, use headings and clear sections.

Member

@Yanich96 Same question.


The previous caching mechanism in EO made use of distinct interfaces, specifically `Footprint` and `Optimization`.
These caching interfaces shared similar logic, but with minor differences.
For instance, `Footprint` verifies the EO version of the compiler, whereas the remaining checks are identical.
Additionally, the conditions for searching data in the cache had errors.
Due to this issue, the program behaved incorrectly, because saving the goal's result to the cache is not instantaneous.
After conducting an in-depth analysis of the project's incorrect operation,
several disadvantages of the previous EO caching mechanism were brought to light:
* Incorrect search conditions for data in the cache.
* The verification method requires reading the file content, which results in inefficiencies.
* The presence of multiple caching mechanisms creates challenges in identifying and rectifying caching errors.
* Employing multiple caching mechanisms for similar entities is a suboptimal practice,
leading to redundancy and complicating the caching infrastructure.


In tackling caching challenges within EO, we conducted a thorough evaluation of current caching systems.
Member

@Yanich96 There is lot of "water" in this paragraph. Could you simplify this text please?

Most existing caching systems are not suitable for the EO project.
However, one candidate emerged as a potential solution for caching EO compilation stages: the Takari Incremental API.
The Takari Incremental API exhibits key similarities with the EO caching system,
notably in its utilization of inputs and outputs directories, absence of a hash for data storage and retrieval,
and compatibility with Maven goals.
However, it diverges from the EO caching approach in one significant aspect – the absence of a distinct cache directory.

We can try to use this API or implement our own caching approach, correcting the disadvantages found.
The envisioned approach involves the creation of a singular class responsible
for storing and retrieving data from the cache.
The logic for checking the relevance of cached data is presented below:
1) We create an EO program named "example".
The intermediate files produced while compiling this program have the same name but a different format
Member

@Yanich96 Why do we need to have different format?

(e.g. `example.eo`, `example.xml`).
2) When the EO compiler compiles this program, it saves the intermediate files of the compilation steps into the cache.
Each compilation step has its own cache directory.
3) When the EO compiler starts a project build again, it checks whether there is a file named "example"
in the cache of each step. If such a file exists,
then it is enough to check that the last modification time of this file at the current step
is later than at the previous step. If this condition is true,
then the finished file can be retrieved from the cache.
Below is a diagram illustrating the EO compilation steps that have a cache directory in EO version `0.34.0`:

<p align="center">
<img src="/images/SavingInCacheEO.svg">
</p>

4) If the EO program file [Picture 5](/images/RewritingInCacheEO1.svg)
or an intermediate file [Picture 6](/images/RewritingInCacheEO2.svg) has changed,
then the previously cached files become invalid.
In this case, the compilation step runs again and rewrites its outputs.
Member

@Yanich96 "rewrite outputs"? Do you mean cache here?


<p align="center">
<img src="/images/RewritingInCacheEO1.svg">
</p>


<p align="center">
Member

@Yanich96, I agree that we need to remove the redundancy in the code. For example, we should combine the Footprint and Optimization methods and fix the hash comparison mechanism. As for checking the "previous" step, I completely disagree. First of all, some steps might be skipped, and more importantly, doing this significantly increases coupling between the phases, which is a significant architectural flaw.

Member

@Yanich96 This comment is still relevant.

<img src="/images/RewritingInCacheEO2.svg">
</p>
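
The relevance check described in steps 1-4 can be sketched in Python for a single compilation step. This is a rough illustration under stated assumptions, not the code of `eo-maven-plugin`: `cached_is_fresh`, `run_step`, and the `transform` callback are hypothetical names, and the step is reduced to "read one file, write one file".

```python
import shutil
from pathlib import Path


def cached_is_fresh(cached: Path, source: Path) -> bool:
    """Return True if `cached` exists and was modified later than `source`."""
    return cached.exists() and cached.stat().st_mtime_ns > source.stat().st_mtime_ns


def run_step(source: Path, target: Path, cache_dir: Path, transform) -> str:
    """Run one compilation step, reusing the cached result when it is fresh.

    `transform` is a hypothetical function mapping input bytes to output bytes.
    Returns "cache" or "compiled" so the caller can see what happened.
    """
    cached = cache_dir / target.name
    target.parent.mkdir(parents=True, exist_ok=True)
    if cached_is_fresh(cached, source):
        # copy2 preserves the modification time, so restoring does not make
        # this output look newer than the cache of the next step.
        shutil.copy2(cached, target)
        return "cache"
    target.write_bytes(transform(source.read_bytes()))
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(target, cached)
    return "compiled"
```

The check never reads file contents, only timestamps. Note that the strict "later than" comparison depends on the timestamp resolution of the filesystem: two files written within the same tick compare as equal, and the step is then recompiled rather than restored.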












49 changes: 49 additions & 0 deletions images/AssembleMojo.svg