Skip to content

Conversation

jvns
Copy link

@jvns jvns commented Oct 3, 2025

Changes in v2:

The biggest change is to remove all mentions of the .git directory, and explain references in a way that doesn't refer to "directories" at all, and instead talks about the "hierarchy" (from Kristoffer and Patrick's reviews).

Also:

  • objects: Mention that an object ID is called an "object name", and update the glossary to include the term "object ID" (from Junio's review)
  • objects: Replace "SHA-1 hash" with "cryptographic hash" which is more accurate (from Patrick's review)
  • blobs: Made the explanation of git gc a little higher level and took some ideas from Patrick's suggested wording (from Patrick's and Kroftoffer's reviews)
  • commits: Mention that tag objects and commits can optionally have other fields. I didn't mention the GPG signature specifically, but don't have any objections to adding it. (from Patrick and Junio's reviews)
  • commits: Remove one of the mentions of git gc, since it perhaps opens up too much of a rabbit hole: "how does git gc decide which commits to clean up?". (from Kristoffer's review)
  • tag objects: Add an example of how a tag object is represented (from user feedback on the draft)
  • index: Use the term "file mode" instead of "permissions", and list all allowed file modes (from Patrick's review)
  • index: Use "stage number" instead of "number" for index entries (from Patrick's review)
  • reflogs: Remove "any ref can be logged", it raises some questions of "how do you tell Git to log a ref that it isn't normally logging?" and my guess is that it's uncommon to ask Git to log more refs. I don't think it's a "lie" to omit this but I can bring it back if folks disagree. (from Patrick's review)
  • reflogs: Fix an error I noticed in the explanation of reflogs: tags aren't logged by default and remote-tracking branches are, according to man git-config
  • branches and tags: Be clearer about how branches are usually updated (by committing), and make it a little more obvious that only branches can be checked out. This is a bit tricky because using the word "check out" introduces a rabbit hole that I want to avoid (what does "check out" mean?). I've dealt this by just talking about the "current branch" (HEAD) since that is defined here, and making it more explicit that HEAD must either be a branch or a commit, there's no "HEAD is a tag" option. (from Patrick's review)
  • tags: Explain the differences between annotated and lightweight tags (this is the main piece of user feedback I've gotten on the draft so far)
  • Various style/typo changes ("2 or more", linkgit:git-gc[1], removed extra asterisks, added empty SYNOPSIS, "commits -> tags" typo fix, add to meson build)

non-changes:

  • I still haven't mentioned things that aren't part of the "data model", like revision params and configuration. I think there could be a place for them but I haven't found it yet.
  • tag objects: I noticed that there's a "tag" header field in tag objects (like tag v1.0.0) but I didn't mention it yet because I couldn't figure out what the purpose of that field is (I thought the tag name was stored in the reference, why is it duplicated in the tag object?)

Changes in v3:

I asked for feedback from Git users on Mastodon and got 220 pieces of feedback from 48 different users. People seemed very excited to read about Git's data model. Usually I judge explanations by what folks report learning from them. Here people reported learning:

  • how branches are stored (that a branch is "a name for a commit")
  • how objects work
  • that Git has separate "author" and "committer" fields
  • that amending a commit does not change it
  • that a tree is "just a directory" (not something more complicated), and how trees are stored
  • that Git repos can contain symlinks
  • that Git saves modes separately from the OS.
  • how the stage number works
  • that when you git add a file, Git will create an object
  • that third-party tools can create their own refs.
  • that the reflog stores the history of branches (not just HEAD), and what reflogs are for

Also (of course) there were quite a few points of confusion! The main 4 pieces of feedback were

  1. The index section doesn't explain what the word "staged" means, and one person says that it makes it sounds like only files that you "git add"ed are in the index. Rewrite the explanation to avoid using the word "staged" to define the index and instead define the word "staging".
  2. Explain the difference between "annotated tags" and "lightweight tags" (done)
  3. Add examples for tag objects and reflogs (done)
  4. Mention a little more about where things are stored in the .git directory, which I'd removed in v2. This seems most important for .git/refs, so I added a hopefully accurate note about how refs are stored by default, with a comment about one of the major implications. I did not discuss where objects or the index are stored, because I don't think the implementation details of how objects are stored are as important, and there are better tools for viewing the "raw" state of objects and the index (with git cat-file -p or git ls-files --staged).

Here's every other change I made in response to the feedback, as well as a few comments that I did not address.

intro:

  • Give a 1-sentence intro to "reflog"

objects:

  • people really like having git ls-files --stage as a way to view the index, so add git cat-file -p as well in a note

commits:

  • 2 people asked "Are commits stored as a diff?". Say that diffs are calculated at runtime, this is very important.
  • The order the fields are given in don't match the order in the example. Make them match.
  • "All the files in the commit, stored as a tree" is throwing a few people off. Be clearer that it's the tree ID of the base directory.
  • Several people asked "What's the difference between an author and committer? I added an example using git cherry-pick that I'm not 100% happy with (what if the reader doesn't know what cherry-pick does?). There might be a better example to give here.
  • In the note about commits being amended: one person suggested saying "creates a new commit with the same parent" to make it clearer what the relationship between the new and old commit are. I liked that idea so I did it.

trees:

  • file modes. 2 people want to know more about "The file mode, for example 100644". Also 2 people are curious about what relationship these have to Unix permissions. Say that they're inspired by Unix permissions, and move the list of possible file modes up to make the relationship clearer
  • On "so git-gc(1) periodically compresses objects to save disk space", there are a few follow up comments wondering about more, which makes me think the comment about compression is actually a distraction. Say something simpler instead, ("Git only needs to store new versions of files which were changed in that commit"), from Junio's suggestion
  • Re "commit (a Git submodule)": 2 people say it's not clear how trees relate to submodules. Say that it refers to a commit in a different repository.
  • One person says they're not sure if the "object ID" is a hash. Link it to the definition of "object ID".

tag objects:

  • Requests for an example, added one.
  • Requests to explain the difference between "lightweight" and "annotated" tags, added it.

tags:

  • one person thinks "It’s expected that a tag will never change after you create it." is too strong (since of course you can change it with git tag -f). Say instead that tags are "usually" not changed.

HEAD:

  • Several people are asking for more detail about detached HEAD state. There's actually quite a lot to talk about here (what it means, how it happens, what it implies, and how you might adjust your workflow to avoid it by using git switch). I don't think we can get into all of that here, so refer to the DETACHED HEAD section of git-checkout instead. I'm not totally happy with the current version of that section but that seems like the most practical solution right now.

remote-tracking branches:

  • discuss refs/remotes/<remote>/HEAD.

the index:

  • "permissions" should be "file mode" (like with trees). Changed.
  • "filename" should be "file path". Changed.
  • the stage number can only be 0, 1, 2, or 3, since it's 2 bits. Also maybe say that the numbers have specific meanings. Said it can only be 0/1/2/3 but did not give the specific meanings.

reflogs

  • Request for an example. Added one.
  • It's not clear if there's one reflog per branch/tag/HEAD, or if there's one universal reflog. Make this clearer.
  • Mention the role of the reflog in retrieving "lost" commits or undoing bad rebases.

Not fixed:

  • intro: A couple of people say that it's confusing that tags are both "an object" and "a reference". Handled this by just explaining the difference between an annotated and a lightweight tag further down. I'd like to make this clearer in the intro but not sure if there's a way to do it.
  • commits and tag objects: one person asks if there's a reference for the other "optional fields", like "encoding" and "gpgsig". I couldn't find one, so left this as is.
  • HEAD: A couple of people ask if there are any other symbolic references other than HEAD, or if they can make their own symbolic references. I don't know the answer to this.
  • HEAD: the HEAD: HEAD thing looks weird, it made more sense when it was HEAD: .git/HEAD. Will think about this.
  • reflogs: One person asks: if reflogs only store local changes, why does it track the user who made the change? Is that for remote operations like fetches and pulls? Or for cases where more than one user is using the same repo on a system? I don't know the answer to this.
  • reflogs: How can you see the full data in the reflog? git reflog show doesn't list the user who made the change. git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso seems to work but it's really a mouthful, not sure it's useful to include all that.
  • index: Is it worth mentioning that the index can be locked? I don't have an opinion about this.
  • other: One person asks what a "working tree" is. It made me wonder if "the current working directory" has a place in Git's data model. My feeling is "no" but I could be convinced otherwise.
  • overall: "How can Git be so fast? If I switch branches, how does it figure out what to add, remove or replace?". I don't think this is the right place for that discussion but it would
  • there are some docs CI errors I haven't figured out yet (IDREF attribute linkend references an unknown ID "tree")

cc: "Kristoffer Haugsbakk" kristofferhaugsbakk@fastmail.com
cc: "D. Ben Knoble" ben.knoble@gmail.com
cc: Patrick Steinhardt ps@pks.im

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 3, 2025

There are issues in commit 31993be:
doc: Add a explanation of Git's data model
Prefixed commit message must be in lower case
Commit not signed off

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 3, 2025

There are issues in commit c3ff12a:
doc: Add a explanation of Git's data model
Prefixed commit message must be in lower case

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 3, 2025

There are issues in commit bfcc916:
doc: Add a explanation of Git's data model
Prefixed commit message must be in lower case

@jvns jvns force-pushed the gitdatamodel branch 4 times, most recently from f7eadcf to fcbd21b Compare October 3, 2025 17:30
@jvns
Copy link
Author

jvns commented Oct 3, 2025

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 3, 2025

Submitted as pull.1981.git.1759512876284.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1981/jvns/gitdatamodel-v1

To fetch this version to local tag pr-1981/jvns/gitdatamodel-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1981/jvns/gitdatamodel-v1

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 3, 2025

On the Git mailing list, "Kristoffer Haugsbakk" wrote (reply to this):

On Fri, Oct 3, 2025, at 19:34, Julia Evans via GitGitGadget wrote:
> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.
>
> However, it's hard to find a clear explanation of these terms and how
> they relate to each other in the documentation. The closest candidates
> currently are:
>
> 1. `gitglossary`. This makes a good effort, but it's an alphabetically
>     ordered dictionary and a dictionary is not a good way to learn
>     concepts. You have to jump around too much and it's not possible to
>     present the concepts in the order that they should be explained.
> 2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
>    This is a nice document to have, but it's not necessary to learn how
>    `update-index` works to understand Git's data model, and we should
>    not be requiring users to learn how to use the "plumbing" commands
>    if they want to learn what the term "index" or "object" means.
> 3. `gitrepository-layout`. This is a great resource, but it includes a
>    lot of information about configuration and internal implementation
>    details which are not related to the data model. It also does
>    not explain how commits work.
>
> The result of this is that Git users (even users who have been using
> Git for 15+ years) struggle to read the documentation because they don't
> know what the core terms mean, and it's not possible to add links
> to help them learn more.
>
> Add an explanation of Git's data model. Some choices I've made in
> deciding what "core data model" means:
>
> 1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
>    if those are intended to be user facing or if they're more like
>    internal implementation details.
> 2. Don't talk about submodules other than by mentioning how they
>    relate to trees. This is because Git has a lot of special features,
>    and explaining how they all work exhaustively could quickly go
>    down a rabbit hole which would make this document less useful for
>    understanding Git's core behaviour.
> 3. Don't discuss the structure of a commit message
>    (first line, trailers, GPG signatures, etc).
>    Perhaps this should change.
>
> Some other choices I've made:
>
> 1. Mention packed refs only in a note.

I don’t think it’s worth mentioning this at all.  More on that later.

> 2. Don't mention that the full name of the branch `main` is
>    technically `refs/heads/main`. This should likely change but I
>    haven't worked out how to do it in a clear way yet.

I think this is worth getting into.  This is a pretty
user-facing concept.

> 3. Mostly avoid referring to the `.git` directory, because the exact
>    details of how things are stored change over time.
>    This should perhaps change from "mostly" to "entirely"
>    but I haven't worked out how to do that in a clear way yet.

I think that’s good.  I mean, I think us users don’t need that level of
detail and shouldn’t be “inspired” to muck with the internals.  If that
makes sense.  (See later)

>
> Signed-off-by: Julia Evans <julia@jvns.ca>
> ---
>     doc: Add a explanation of Git's data model
>[snip]
> diff --git a/Documentation/Makefile b/Documentation/Makefile
>[snip]
> diff --git a/Documentation/gitdatamodel.adoc
> b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..4b2cb167dc
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,226 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object" "reference" or "index".

I haven’t gone hunting through the docs to see if this is covered
elsewhere.  But the thrust of all the things here definitely feel to me
like something that should be presented and documented in such a way.

> +
> +Git's core operations use 4 kinds of data:

Maybe small numerals should be spelled as words in running text?

> +
> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
> +2. <<references,References>>: branches, tags,
> +   remote-tracking branches, etc
> +3. <<index,The index>>, also known as the staging area
> +4. <<reflogs,Reflogs>>

Reflogs is certainly auxiliary ref data. What makes it qualify as
one-of-the-four?  I am open to it being both, to be clear.

> +
> +[[objects]]
> +OBJECTS
> +-------
> +
> +Commits, trees, blobs, and tag objects are all stored in Git's object
> database.
> +Every object has:
> +
> +1. an *ID*, which is the SHA-1 hash of its contents.
> +  It's fast to look up a Git object using its ID.
> +  The ID is usually represented in hexadecimal, like
> +  `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
> +2. a *type*. There are 4 types of objects:
> +   <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
> +   and <<tag-object,tag objects>>.
> +3. *contents*. The structure of the contents depends on the type.
> +
> +Once an object is created, it can never be changed.
> +Here are the 4 types of objects:

As a curious Git user this seems correct.

> +
> +[[commit]]
> +commits::
> +    A commit contains:
> ++
> +1. Its *parent commit ID(s)*. The first commit in a repository has 0
> parents,

Maybe this is a subjective style thing but is it necessary to use “(s)”
when the context makes clear that it could be zero to many?

    Its *parent commit IDs. ...

> +  regular commits have 1 parent, merge commits have 2+ parents

s/2+/two or more/ ?

Same point as the “numeral” one above.

> +2. A *commit message*
> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
> +4. An *author* and the time the commit was authored
> +5. A *committer* and the time the commit was committed
> ++
> +Here's how an example commit is stored:
> ++
> +----
> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
> +author Maya <maya@example.com> 1759173425 -0400
> +committer Maya <maya@example.com> 1759173425 -0400
> +
> +Add README
> +----
> ++
> +Like all other objects, commits can never be changed after they're
> created.
> +For example, "amending" a commit with `git commit --amend` creates a
> new commit.

> +The old commit will eventually be deleted by `git gc`.

Maybe this could be moved to a part about what happens (eventually) to
unreachable objects?

Mentioning `git gc` and how things will get deleted raises
questions naturally. Like why would they be deleted? Okay
that’s clear: the previous commit will be replaced by the
amended one. Then when it is not reachable by anything
(even the reflog) it will get garbage collected.

It all follows. But is the reader necessarily mature enough
in their understanding to make the inference?

This is a long-winded way of saying: if you’re gonna discuss
`git gc` you might need to go into all of these concepts.

> +
> +[[tree]]
> +trees::
> +    A tree is how Git represents a directory. It lists, for each item
> in
> +    the tree:
> ++
> +1. The *permissions*, for example `100644`
> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> +  or <<commit,`commit`>> (a Git submodule)
> +3. The *object ID*
> +4. The *filename*
> ++
> +For example, this is how a tree containing one directory (`src`) and
> one file
> +(`README.md`) is stored:
> ++
> +----
> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
> +----
> ++
> +*NOTE:* The permissions are in the same format as UNIX permissions, but
> +the only allowed permissions for files (blobs) are 644 and 755.
> +

Makes sense.

> +[[blob]]
> +blobs::
> +    A blob is how Git represents a file. A blob object contains the
> +    file's contents.
> ++
> +Storing a new blob for every new version of a file can get big, so
> +`git gc` periodically compresses objects for efficiency in
> `.git/objects/pack`.

This gets into mentioning implementation files(?) like you mentioned in
the commit message.

1. That it’s a packfile and where it is might be too much detail for
   this doc
2. I vaguely recall documents discussing what happens to “storing every
   version” discussing deltas instead of packs? Again, I am not a Git
   developer though.

> +
> +[[tag-object]]
> +tag objects::
> +    Tag objects (also known as "annotated tags") contain:
> ++
> +1. The *tagger* and tag date
> +2. A *tag message*, similar to a commit message
> +3. The *ID* of the object (often a commit) that they reference

s/often/typically/ ?

I know it can get tedious to caveat the 99% cases with things that are
technically possible.  Maybe if it gets “bad enough” there could be a
part that explains/distinguishes the high-level/porcelain Git use and
what is technically possible: you make a `git tag -a`, which is on a
commit... except if you accidentally run it on top of an existing
tag. Then even the porcelain won’t protect you from making a 
tag-on-tag. (But it will issue a warning I guess.) Hmm. Now I don’t know.

> +
> +[[references]]
> +REFERENCES
> +----------
> +
> +References are a way to give a name to a commit.
> +It's easier to remember "the changes I'm working on are on the `turtle`
> +branch" than "the changes are in commit bb69721404348e".
> +Git often uses "ref" as shorthand for "reference".

Good.

> +
> +References that you create are stored in the `.git/refs` directory,
> +and Git has a few special internal references like `HEAD` that are
> stored
> +in the base `.git` directory.

Implementation file details.

You also mention `.git/refs/heads/<name>` below.  But refs aren’t stored
as files if you are using the *reftable* backend.  And that backend will
become the default for new repositories in Git 3.0, I think.

How does reftable work?  I don’t know.  But I don’t think we need to
know after reading this doc. :)

To be clear: how files are stored might not matter here.

> +
> +References can either be:
> +
> +1. References to an object ID, usually a <<commit,commit>> ID
> +2. References to another reference. This is called a "symbolic
> reference".

You seem to have used `**` when introducing terms:

    This is a *symbolic reference*

>[snip ref stuff]
> +
> +[[HEAD]]
> +HEAD: `.git/HEAD`::
> +    `HEAD` is where Git stores your current <<branch,branch>>.
> +    `HEAD` is normally a symbolic reference to your current branch, for
> +    example `ref: refs/heads/main` if your current branch is `main`.
> +    `HEAD` can also be a direct reference to a commit ID,
> +    that's called "detached HEAD state".
> +
> +[[remote-tracking-branch]]
> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> +    A remote-tracking branch is a name for a commit ID.
> +    It's how Git stores the last-known state of a branch in a remote
> +    repository. `git fetch` updates remote-tracking branches. When
> +    `git status` says "you're up to date with origin/main", it's looking at
> +    this.

Looks good.

> +
> +[[other-refs]]
> +Other references::
> +    Git tools may create references in any subdirectory of `.git/refs`.
> +    For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> +    and linkgit:git-notes[1] all create their own references
> +    in `.git/refs/stash`, `.git/refs/bisect`, etc.
> +    Third-party Git tools may also create their own references.
> ++
> +Git may also create references in the base `.git` directory
> +other than `HEAD`, like `ORIG_HEAD`.
> +

> +*NOTE:* As an optimization, references may be stored as packed
> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].

I don’t know if this is relevant for both ref backends. And does it
matter?

> +
> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains the current
> staged
> +version of every file in your Git repository. When you commit, the
> files
> +in the index are used as the files in the next commit.
> +
> +Unlike a tree, the index is a flat list of files.
> +Each index entry has 4 fields:
> +
> +1. The *permissions*
> +2. The *<<blob,blob>> ID* of the file
> +3. The *filename*
> +4. The *number*. This is normally 0, but if there's a merge conflict
> +   there can be multiple versions (with numbers 0, 1, 2, ..)
> +   of the same filename in the index.
> +
> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and
> <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2
> files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores the history of branch, tag, and HEAD refs in a reflog
> +(you should read "reflog" as "ref log"). Not every ref is logged by

You’ve heard of the re-flog too?

> +default, but any ref can be logged.
> +
> +Each reflog entry has:
> +
> +1. *Before/after *commit IDs*
> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp*
> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.

Makes sense.

> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite

I appreciate that this is the first version and you might have plans
after this one. But I wonder if this doc could use a fair number of
`gitlink` to branch out to all the other parts. Like git-reflog(1),
gitglossary(7).

Thanks for starting on a whole new doc. That must take quite
some effort.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 3, 2025

User "Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +MAN7_TXT += gitdatamodel.adoc
>  MAN7_TXT += gitdiffcore.adoc
> ...
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------

The above causes doc-lint to barf.

https://github.com/git/git/actions/runs/18265502271/job/51999236907#step:4:655

gitdatamodel.adoc:226: has no required 'SYNOPSIS' section!
    LINT MAN SEC giteveryday.adoc
make[1]: *** [Makefile:498: .build/lint-docs/man-section-order/gitdatamodel.ok] Error 1


You can check locally with "make check-docs" without waiting for my
integration cycle to push to GitHub CI.

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

This patch series was integrated into seen via git@56f8416.

@gitgitgadget gitgitgadget bot added the seen label Oct 6, 2025
@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

On the Git mailing list, "Julia Evans" wrote (reply to this):

> The above causes doc-lint to barf.
>
> https://github.com/git/git/actions/runs/18265502271/job/51999236907#step:4:655
>
> gitdatamodel.adoc:226: has no required 'SYNOPSIS' section!
>     LINT MAN SEC giteveryday.adoc
> make[1]: *** [Makefile:498: 
> .build/lint-docs/man-section-order/gitdatamodel.ok] Error 1
>
>
> You can check locally with "make check-docs" without waiting for my
> integration cycle to push to GitHub CI.


Thanks, will fix.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

On the Git mailing list, "Julia Evans" wrote (reply to this):

Thanks for the review!

>> 2. Don't mention that the full name of the branch `main` is
>>    technically `refs/heads/main`. This should likely change but I
>>    haven't worked out how to do it in a clear way yet.
>
> I think this is worth getting into.  This is a pretty
> user-facing concept.

I think I'll see if I can figure out a way to mention this and at the
same time remove most of the rest of the references to the `.git`
directory when explaining references (which you talked about
further down), including packed refs.

>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> +   remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> Reflogs is certainly auxiliary ref data. What makes it qualify as
> one-of-the-four?  I am open to it being both, to be clear.

The reason I like to talk about reflogs is that it gives you a
way to "undo" Git operations that can be really useful. 
And any Git command that updates refs can updates that
ref's reflog.

Understanding how reflogs work helps to understand what the
limitations of using reflogs to undo mistakes is: for example
the index is not a ref, so you can't use the reflog to undo
changes to the index.

>> +2. A *commit message*
>> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
>> +4. An *author* and the time the commit was authored
>> +5. A *committer* and the time the commit was committed
>> ++
>> +Here's how an example commit is stored:
>> ++
>> +----
>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>> ++
>> +Like all other objects, commits can never be changed after they're
>> created.
>> +For example, "amending" a commit with `git commit --amend` creates a
>> new commit.
>
>> +The old commit will eventually be deleted by `git gc`.
>
> Maybe this could be moved to a part about what happens (eventually) to
> unreachable objects?
>
> Mentioning `git gc` and how things will get deleted raises
> questions naturally. Like why would they be deleted? Okay
> that’s clear: the previous commit will be replaced by the
> amended one. Then when it is not reachable by anything
> (even the reflog) it will get garbage collected.
>
> It all follows. But is the reader necessarily mature enough
> in their understanding to make the inference?
>
> This is a long-winded way of saying: if you’re gonna discuss
> `git gc` you might need to go into all of these concepts.

If folks here think this is a reasonable document to add to
Git I'll try get some beta readers to read this, see which parts
folks find confusing, and address those, keeping the `git gc`
stuff in mind.

Similarly for the style comments.

>> +blobs::
>> +    A blob is how Git represents a file. A blob object contains the
>> +    file's contents.
>> ++
>> +Storing a new blob for every new version of a file can get big, so
>> +`git gc` periodically compresses objects for efficiency in
>> `.git/objects/pack`.
>
> This gets into mentioning implementation files(?) like you mentioned in
> the commit message.

That's true! The reason I think this is important to mention is that I find
that people often "reject" information that they find implausible, even
if it comes from a credible source. ("that can't be true! I must be
not understanding correctly. Oh well, I'll just ignore that!")

I sometimes hear from users that "commits can't be snapshots", because
it would take up too much disk space to store every version of
every commit. So I find that sometimes explaining a little bit about the
implementation can make the information more memorable.

Certainly I'm not able to remember details that don't make sense
with my mental model of how computers work and I don't expect other
people to either, so I think it's important to give an explanation that
handles the biggest "objections".

> 1. That it’s a packfile and where it is might be too much detail for
>    this doc
> 2. I vaguely recall documents discussing what happens to “storing every
>    version” discussing deltas instead of packs? Again, I am not a Git
>    developer though.

I could be wrong about the details here, I'm not a Git developer either.
From https://git-scm.com/book/en/v2/Git-Internals-Packfiles
it looks like packfiles are implemented using deltas.

>> +
>> +References can either be:
>> +
>> +1. References to an object ID, usually a <<commit,commit>> ID
>> +2. References to another reference. This is called a "symbolic
>> reference".
>
> You seem to have used `**` when introducing terms:
>
>     This is a *symbolic reference*

Thanks, will take a look at that.

>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores the history of branch, tag, and HEAD refs in a reflog
>> +(you should read "reflog" as "ref log"). Not every ref is logged by
>
> You’ve heard of the re-flog too?

haha exactly, I just want folks to understand why it's called that :)

> I appreciate that this is the first version and you might have plans
> after this one. But I wonder if this doc could use a fair number of
> `gitlink` to branch out to all the other parts. Like git-reflog(1),
> gitglossary(7).

That's reasonable. Do you often use the "See also" section of
man pages? I've never looked at them so I'm always curious about
how people are actually using them in practice.

I also need to think about what else could link *to* this, because
without attention to discoverability probably nobody will find it.
My main idea so far is actually to add it to
https://git-scm.com/learn
but I wanted to send it here instead of adding it to the website
directly because I thought it could benefit from a more detailed
review.

> Thanks for starting on a whole new doc. That must take quite
> some effort.

All the work on documentation takes a lot of effort, in some
ways it's easier to write something new than to edit something
existing :)

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

On the Git mailing list, "D. Ben Knoble" wrote (reply to this):

On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
>
> Thanks for the review!
>
> >> 2. Don't mention that the full name of the branch `main` is
> >>    technically `refs/heads/main`. This should likely change but I
> >>    haven't worked out how to do it in a clear way yet.
> >
> > I think this is worth getting into.  This is a pretty
> > user-facing concept.
>
> I think I'll see if I can figure out a way to mention this and at the
> same time remove most of the rest of the references to the `.git`
> directory when explaining references (which you talked about
> further down), including packed refs.

A colleague will be explaining reflog for an audience tomorrow, and
decided to briefly explain refs, too—which tells me this is
much-needed.

For refs themselves, perhaps "git for-each-ref" is a reasonable place
to start? Since it tells you the refs you have and how to spell them
explicitly regardless of how they are stored?

-- 
D. Ben Knoble

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

User "D. Ben Knoble" <ben.knoble@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

On the Git mailing list, "Julia Evans" wrote (reply to this):

On Mon, Oct 6, 2025, at 5:44 PM, D. Ben Knoble wrote:
> On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
>>
>> Thanks for the review!
>>
>> >> 2. Don't mention that the full name of the branch `main` is
>> >>    technically `refs/heads/main`. This should likely change but I
>> >>    haven't worked out how to do it in a clear way yet.
>> >
>> > I think this is worth getting into.  This is a pretty
>> > user-facing concept.
>>
>> I think I'll see if I can figure out a way to mention this and at the
>> same time remove most of the rest of the references to the `.git`
>> directory when explaining references (which you talked about
>> further down), including packed refs.
>
> A colleague will be explaining reflog for an audience tomorrow, and
> decided to briefly explain refs, too—which tells me this is
> much-needed.
>
> For refs themselves, perhaps "git for-each-ref" is a reasonable place
> to start? Since it tells you the refs you have and how to spell them
> explicitly regardless of how they are stored?

Interesting, do you use git for-each-ref? 
What do you use it for?

> -- 
> D. Ben Knoble

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

On the Git mailing list, "D. Ben Knoble" wrote (reply to this):

On Mon, Oct 6, 2025 at 5:47 PM Julia Evans <julia@jvns.ca> wrote:
>
>
>
> On Mon, Oct 6, 2025, at 5:44 PM, D. Ben Knoble wrote:
> > On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
> >>
> >> Thanks for the review!
> >>
> >> >> 2. Don't mention that the full name of the branch `main` is
> >> >>    technically `refs/heads/main`. This should likely change but I
> >> >>    haven't worked out how to do it in a clear way yet.
> >> >
> >> > I think this is worth getting into.  This is a pretty
> >> > user-facing concept.
> >>
> >> I think I'll see if I can figure out a way to mention this and at the
> >> same time remove most of the rest of the references to the `.git`
> >> directory when explaining references (which you talked about
> >> further down), including packed refs.
> >
> > A colleague will be explaining reflog for an audience tomorrow, and
> > decided to briefly explain refs, too—which tells me this is
> > much-needed.
> >
> > For refs themselves, perhaps "git for-each-ref" is a reasonable place
> > to start? Since it tells you the refs you have and how to spell them
> > explicitly regardless of how they are stored?
>
> Interesting, do you use git for-each-ref?
> What do you use it for?

Ah, yes, but primarily for scripting.

What I should have clarified is that "the tool (I know of) to
interrogate the refs you currently have is git-for-each-ref" (like how
git-ls-remote is the tool to interrogate a remote's refs). It avoids
the issues with assuming "tree .git/refs" or similar will capture the
actual data.

-- 
D. Ben Knoble

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2025

This patch series was integrated into seen via git@0f619ba.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2025

On the Git mailing list, "Kristoffer Haugsbakk" wrote (reply to this):

On Mon, Oct 6, 2025, at 05:32, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +MAN7_TXT += gitdatamodel.adoc
>>  MAN7_TXT += gitdiffcore.adoc
>> ...
>> +gitdatamodel(7)
>> +===============
>> +
>> +NAME
>> +----
>> +gitdatamodel - Git's core data model
>> +
>> +DESCRIPTION
>> +-----------
>
> The above causes doc-lint to barf.
>[snip]
> You can check locally with "make check-docs" without waiting for my
> integration cycle to push to GitHub CI.

I think you meant `make lint-docs` for both of these.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2025

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..4b2cb167dc
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,226 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object" "reference" or "index".

There's a missing comma after "object".

> +
> +Git's core operations use 4 kinds of data:
> +
> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
> +2. <<references,References>>: branches, tags,
> +   remote-tracking branches, etc
> +3. <<index,The index>>, also known as the staging area
> +4. <<reflogs,Reflogs>>

This list makes sense to me. There's of course more data structures in
Git, but all the other data structures shouldn't really matter to users
at all as they are mostly caches or internal details of the on-disk
format.

There's potentially one exception though, namely the Git configuration.
I'd claim that Git "uses" the Git configuration similarly to how it uses
the others, but I get why it's not explicitly mentioned here.

> +[[objects]]
> +OBJECTS
> +-------
> +
> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
> +Every object has:
> +
> +1. an *ID*, which is the SHA-1 hash of its contents.

I think this needs to be adapted to not single out SHA-1 as the only
hashing algorithm. We already support SHA-256, so we should definitely
say that the algorithm can be swapped. Maybe something like:

  An *object ID*, which is the cryptographic hash of its contents. By
  default, Git uses SHA-1 as object hash, but alternative hashes like
  SHA-256 are supported.

> +  It's fast to look up a Git object using its ID.
> +  The ID is usually represented in hexadecimal, like
> +  `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
> +2. a *type*. There are 4 types of objects:
> +   <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
> +   and <<tag-object,tag objects>>.
> +3. *contents*. The structure of the contents depends on the type.

Nit: every object also has an object size. Not sure though whether it's
fine to imply that with "contents".

> +Once an object is created, it can never be changed.
> +Here are the 4 types of objects:
> +
> +[[commit]]
> +commits::
> +    A commit contains:
> ++
> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
> +  regular commits have 1 parent, merge commits have 2+ parents

I'd say "at least two parents" instead of "2+ parents".

> +2. A *commit message*
> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
> +4. An *author* and the time the commit was authored
> +5. A *committer* and the time the commit was committed
> ++
> +Here's how an example commit is stored:
> ++
> +----
> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
> +author Maya <maya@example.com> 1759173425 -0400
> +committer Maya <maya@example.com> 1759173425 -0400
> +
> +Add README
> +----

In practice, commits can have other headers that are ignored by Git. But
that's certainly not part of Git's core data model, so I don't think we
should mention that here.

> +Like all other objects, commits can never be changed after they're created.
> +For example, "amending" a commit with `git commit --amend` creates a new commit.
> +The old commit will eventually be deleted by `git gc`.

If we mention git-gc(1) I think it would make sense to use
`linkgit:git-gc[1]` instead to provide a link to its man page.

> +[[tree]]
> +trees::
> +    A tree is how Git represents a directory. It lists, for each item in
> +    the tree:
> ++
> +1. The *permissions*, for example `100644`

I think we should rather call these "mode bits". These bits are
permissions indeed when you have a blob, but for subtrees, symlinks and
submodules they aren't.

> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> +  or <<commit,`commit`>> (a Git submodule)

There's also symlinks.

> +3. The *object ID*
> +4. The *filename*
> ++
> +For example, this is how a tree containing one directory (`src`) and one file
> +(`README.md`) is stored:
> ++
> +----
> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
> +----
> ++
> +*NOTE:* The permissions are in the same format as UNIX permissions, but
> +the only allowed permissions for files (blobs) are 644 and 755.
> +
> +[[blob]]
> +blobs::
> +    A blob is how Git represents a file. A blob object contains the
> +    file's contents.
> ++
> +Storing a new blob for every new version of a file can get big, so
> +`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.

I would claim that it's not necessary to mention object compression.
This should be a low-level detail that users don't ever have to worry
about. Furthermore, packing objects isn't only relevant in the context
of blobs: trees for example also tend to compress very well as there
typically is only small incremental updates to trees.

> +[[tag-object]]
> +tag objects::
> +    Tag objects (also known as "annotated tags") contain:
> ++
> +1. The *tagger* and tag date
> +2. A *tag message*, similar to a commit message
> +3. The *ID* of the object (often a commit) that they reference

They can also be signed, if we want to mention that.

> +[[references]]
> +REFERENCES
> +----------
> +
> +References are a way to give a name to a commit.
> +It's easier to remember "the changes I'm working on are on the `turtle`
> +branch" than "the changes are in commit bb69721404348e".
> +Git often uses "ref" as shorthand for "reference".
> +
> +References that you create are stored in the `.git/refs` directory,
> +and Git has a few special internal references like `HEAD` that are stored
> +in the base `.git` directory.

This isn't true anymore with the introduction of the reftable backend,
which is slated to become the default backend. I'd argue that this is
another implementation detail that the user shouldn't have to worry
about.

> +References can either be:
> +
> +1. References to an object ID, usually a <<commit,commit>> ID
> +2. References to another reference. This is called a "symbolic reference".
> +
> +Git handles references differently based on which subdirectory of
> +`.git/refs` they're stored in.

So instead of saying "subdirectory", I'd rather say "reference
hierarchy".

In general, I think we should explain that references are layed out
in a hierarchy. This is somewhat obvious with the "files" backend, as we
use directories there. But as we move on to the "reftable" backend this
may become less obvious over time.

> +Here are the main types:
> +
> +[[branch]]
> +branches: `.git/refs/heads/<name>`::

Here and in the other cases we should then strip the `.git/` prefix.

> +    A branch is a name for a commit ID.
> +    That commit is the latest commit on the branch.
> +    Branches are stored in the `.git/refs/heads/` directory.
> ++
> +To get the history of commits on a branch, Git will start at the commit
> +ID the branch references, and then look at the commit's parent(s),
> +the parent's parent, etc.
> +
> +[[tag]]
> +tags: `.git/refs/tags/<name>`::
> +    A tag is a name for a commit ID, tag object ID, or other object ID.
> +    Tags are stored in the `refs/tags/` directory.
> ++
> +Even though branches and commits are both "a name for a commit ID", Git
> +treats them very differently.
> +Branches are expected to be regularly updated as you work on the branch,
> +but it's expected that a tag will never change after you create it.

This sounds a bit like the user itself needs to update the branch. How
about this instead:

    Even though branches and commits are both "a name for a commit ID", Git
    treats them very differently:

        - Branches can be checked out directly. If so, creating a new
          commit will automatically update the checked-out branch to
          point to the new commit.

        - Tags cannot be checked out directly and don't move when
          creating a new commit. Instead, one can only check out the
          commit that a branch points to. This is called "detached
          HEAD", and the effect is that a new commit will not update 

> +[[HEAD]]
> +HEAD: `.git/HEAD`::
> +    `HEAD` is where Git stores your current <<branch,branch>>.
> +    `HEAD` is normally a symbolic reference to your current branch, for
> +    example `ref: refs/heads/main` if your current branch is `main`.
> +    `HEAD` can also be a direct reference to a commit ID,
> +    that's called "detached HEAD state".
> +
> +[[remote-tracking-branch]]
> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> +    A remote-tracking branch is a name for a commit ID.
> +    It's how Git stores the last-known state of a branch in a remote
> +    repository. `git fetch` updates remote-tracking branches. When
> +    `git status` says "you're up to date with origin/main", it's looking at
> +    this.

This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
reference that indicates the default branch on the remote side.

> +[[other-refs]]
> +Other references::
> +    Git tools may create references in any subdirectory of `.git/refs`.
> +    For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> +    and linkgit:git-notes[1] all create their own references
> +    in `.git/refs/stash`, `.git/refs/bisect`, etc.
> +    Third-party Git tools may also create their own references.
> ++
> +Git may also create references in the base `.git` directory
> +other than `HEAD`, like `ORIG_HEAD`.

Let's mention that such references are typically spelt all-uppercase
with underscores between. You shouldn't ever create a reference that is
for example called ".git/foo".

We enforce this restriction inconsistently, only, but I don't think that
should keep us from spelling out the common rule.

> +*NOTE:* As an optimization, references may be stored as packed
> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].

I'd drop this note. It's an internal implementation detail and only true
for the "files" backend. The "reftable" backend stores references quite
differently and doesn't really "pack" references.

> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains the current staged

Honestly, I always forget which of these two nouns we are supposed to
use nowadays. I think consensus was to use "index" and avoid using
"staging area"? Not sure though, but I think we should only mention
one of these.

> +version of every file in your Git repository. When you commit, the files
> +in the index are used as the files in the next commit.
> +
> +Unlike a tree, the index is a flat list of files.
> +Each index entry has 4 fields:
> +
> +1. The *permissions*
> +2. The *<<blob,blob>> ID* of the file
> +3. The *filename*
> +4. The *number*. This is normally 0, but if there's a merge conflict

I think we don't call this "number", but "stage".

> +   there can be multiple versions (with numbers 0, 1, 2, ..)
> +   of the same filename in the index.
> +
> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2 files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores the history of branch, tag, and HEAD refs in a reflog
> +(you should read "reflog" as "ref log"). Not every ref is logged by
> +default, but any ref can be logged.

If we mention this here, do we maybe want to mention how the user can
decide which references are logged?

> +Each reflog entry has:
> +
> +1. *Before/after *commit IDs*

This will probably misformat as we have three asterisks here, not two.

> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp*

Suggestion: "*Timestamp* when that change has been made".

> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.

We may want ot mention that you can reference reflog entries via
`refs/heads/<branch>@{<reflog-nr>}`.

In general, one thing that I think would be important to highlight in
this document is revisions. Most of the commands tend to not accept
references, but revisions instead, which are a lot more flexible. They
use our do-what-I-mean mechanism to resolve, but also allow the user to
specify commits relative to one another. It's probably sufficient though
to mention them briefly and then redirect to girevisions(7).

Thanks for working on this!

Patrick

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2025

User Patrick Steinhardt <ps@pks.im> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> writes:

> On Mon, Oct 6, 2025, at 05:32, Junio C Hamano wrote:
>> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> +MAN7_TXT += gitdatamodel.adoc
>>>  MAN7_TXT += gitdiffcore.adoc
>>> ...
>>> +gitdatamodel(7)
>>> +===============
>>> +
>>> +NAME
>>> +----
>>> +gitdatamodel - Git's core data model
>>> +
>>> +DESCRIPTION
>>> +-----------
>>
>> The above causes doc-lint to barf.
>>[snip]
>> You can check locally with "make check-docs" without waiting for my
>> integration cycle to push to GitHub CI.
>
> I think you meant `make lint-docs` for both of these.

The former is a typo for "causes lint-docs to barf", but I did mean
"make check-docs" as the recipe for local checking.

You could also do "make -C Documentation lint-docs", but that is a
lot more to type ;-).

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

Patrick Steinhardt <ps@pks.im> writes:

>> +Git's core operations use 4 kinds of data:
>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> +   remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> This list makes sense to me. There's of course more data structures in
> Git, but all the other data structures shouldn't really matter to users
> at all as they are mostly caches or internal details of the on-disk
> format.
>
> There's potentially one exception though, namely the Git configuration.
> I'd claim that Git "uses" the Git configuration similarly to how it uses
> the others, but I get why it's not explicitly mentioned here.

The core operations do not use Git configuration any more than they
use what is specified by the command line arguments.

>> +[[objects]]
>> +OBJECTS
>> +-------
>> +
>> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
>> +Every object has:
>> +
>> +1. an *ID*, which is the SHA-1 hash of its contents.
>
> I think this needs to be adapted to not single out SHA-1 as the only
> hashing algorithm. We already support SHA-256, so we should definitely
> say that the algorithm can be swapped. Maybe something like:

Good point.  Also officially they are called "object name".

>   An *object ID*, which is the cryptographic hash of its contents. By
>   default, Git uses SHA-1 as object hash, but alternative hashes like
>   SHA-256 are supported.

I'd avoid "object name is the result of hashing X" which historically
was a source of question: "why does 'sha1sum README.md' give different
hash from 'git add README.md && git ls-files -s README.md'?"

It is an irrelevant implementation detail (and you'd eventually end
up having to say "X is <type> SP <length> NUL <contents>").

    An object name, which is derived cryptographically from its
    type, size and contents.  All versions of Git can use SHA-1 hash
    function, but more recent versions of Git can also use SHA-256
    hash function.

>> +commits::
>> +    A commit contains:
>> ++
>> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> +  regular commits have 1 parent, merge commits have 2+ parents
>
> I'd say "at least two parents" instead of "2+ parents".

Yup, that reads much better.

>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>
> In practice, commits can have other headers that are ignored by Git. But
> that's certainly not part of Git's core data model, so I don't think we
> should mention that here.

Third-party software can add truly garbage ones that do not have any
meaning, and Git tolerates by ignoring them.  But there are others
that Git does pay attention to, like encoding, gpgsig, etc., which
may worth mention (in the form that "these four are what you typically
see, but there may be others" without even naming any).

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2025

On the Git mailing list, "D. Ben Knoble" wrote (reply to this):

On Tue, Oct 7, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
[snip]
> > +    A branch is a name for a commit ID.
> > +    That commit is the latest commit on the branch.
> > +    Branches are stored in the `.git/refs/heads/` directory.
> > ++
> > +To get the history of commits on a branch, Git will start at the commit
> > +ID the branch references, and then look at the commit's parent(s),
> > +the parent's parent, etc.
> > +
> > +[[tag]]
> > +tags: `.git/refs/tags/<name>`::
> > +    A tag is a name for a commit ID, tag object ID, or other object ID.
> > +    Tags are stored in the `refs/tags/` directory.
> > ++
> > +Even though branches and commits are both "a name for a commit ID", Git
> > +treats them very differently.
> > +Branches are expected to be regularly updated as you work on the branch,
> > +but it's expected that a tag will never change after you create it.
>
> This sounds a bit like the user itself needs to update the branch. How
> about this instead:
>
>     Even though branches and commits are both "a name for a commit ID", Git
>     treats them very differently:
>
>         - Branches can be checked out directly. If so, creating a new
>           commit will automatically update the checked-out branch to
>           point to the new commit.
>
>         - Tags cannot be checked out directly and don't move when
>           creating a new commit. Instead, one can only check out the
>           commit that a branch points to. This is called "detached
>           HEAD", and the effect is that a new commit will not update

missing "the tag." ?

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2025

On the Git mailing list, "Julia Evans" wrote (reply to this):

On Tue, Oct 7, 2025, at 10:32 AM, Patrick Steinhardt wrote:
> On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
>> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
>> new file mode 100644
>> index 0000000000..4b2cb167dc
>> --- /dev/null
>> +++ b/Documentation/gitdatamodel.adoc
>> @@ -0,0 +1,226 @@
>> +gitdatamodel(7)
>> +===============
>> +
>> +NAME
>> +----
>> +gitdatamodel - Git's core data model
>> +
>> +DESCRIPTION
>> +-----------
>> +
>> +It's not necessary to understand Git's data model to use Git, but it's
>> +very helpful when reading Git's documentation so that you know what it
>> +means when the documentation says "object" "reference" or "index".
>
> There's a missing comma after "object".

Will fix.

>> +
>> +Git's core operations use 4 kinds of data:
>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> +   remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> This list makes sense to me. There's of course more data structures in
> Git, but all the other data structures shouldn't really matter to users
> at all as they are mostly caches or internal details of the on-disk
> format.
>
> There's potentially one exception though, namely the Git configuration.
> I'd claim that Git "uses" the Git configuration similarly to how it uses
> the others, but I get why it's not explicitly mentioned here.
>
>> +[[objects]]
>> +OBJECTS
>> +-------
>> +
>> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
>> +Every object has:
>> +
>> +1. an *ID*, which is the SHA-1 hash of its contents.
>
> I think this needs to be adapted to not single out SHA-1 as the only
> hashing algorithm. We already support SHA-256, so we should definitely
> say that the algorithm can be swapped. Maybe something like:
>
>   An *object ID*, which is the cryptographic hash of its contents. By
>   default, Git uses SHA-1 as object hash, but alternative hashes like
>   SHA-256 are supported.

Makes sense. I might just say "cryptographic hash of its type and contents"
and leave it that. I'm not sure it's worth getting into details
of the exact hash function.

>> +  It's fast to look up a Git object using its ID.
>> +  The ID is usually represented in hexadecimal, like
>> +  `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
>> +2. a *type*. There are 4 types of objects:
>> +   <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
>> +   and <<tag-object,tag objects>>.
>> +3. *contents*. The structure of the contents depends on the type.
>
> Nit: every object also has an object size. Not sure though whether it's
> fine to imply that with "contents".

I think it is.

>> +Once an object is created, it can never be changed.
>> +Here are the 4 types of objects:
>> +
>> +[[commit]]
>> +commits::
>> +    A commit contains:
>> ++
>> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> +  regular commits have 1 parent, merge commits have 2+ parents
>
> I'd say "at least two parents" instead of "2+ parents".
>
>> +2. A *commit message*
>> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
>> +4. An *author* and the time the commit was authored
>> +5. A *committer* and the time the commit was committed
>> ++
>> +Here's how an example commit is stored:
>> ++
>> +----
>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>
> In practice, commits can have other headers that are ignored by Git. But
> that's certainly not part of Git's core data model, so I don't think we
> should mention that here.
>
>> +Like all other objects, commits can never be changed after they're created.
>> +For example, "amending" a commit with `git commit --amend` creates a new commit.
>> +The old commit will eventually be deleted by `git gc`.
>
> If we mention git-gc(1) I think it would make sense to use
> `linkgit:git-gc[1]` instead to provide a link to its man page.

Agreed.

>> +[[tree]]
>> +trees::
>> +    A tree is how Git represents a directory. It lists, for each item in
>> +    the tree:
>> ++
>> +1. The *permissions*, for example `100644`
>
> I think we should rather call these "mode bits". These bits are
> permissions indeed when you have a blob, but for subtrees, symlinks and
> submodules they aren't.

I think it's a bit strange to call them mode bits since I thought they were stored
as ASCII strings and it's basically an enum of 5 options, but I see your point.
I think "file mode" will work and that's used elsewhere.

I wonder if it would make sense to list all of the possible file modes if
this isn't documented anywhere else, my impression is that it's a short
list and that it's unlikely to change much in the future.

And listing them all might make it more clear that Git's file modes don't
have much in common with Unix file modes.
I looked for where this is documented and it looks like the only place is
in `man git-fast-import` . That man page says that there are just 5 options
(040000, 160000, 100644, 100755, 120000)

>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>> +  or <<commit,`commit`>> (a Git submodule)
>
> There's also symlinks.

I created a test symlink and it looks like symlinks are stored as type "blob".
I might say which type corresponds to which file mode,
though I'm not sure what type corresponds to the "gitlink" mode (commit?).

I think these are the 5 modes and what they mean / what type they
should have. Not sure about the gitlink mode though.

  - `100644`: regular file (with type `blob`)
  - `100755`: executable file (with type `blob`)
  - `120000`: symbolic link (with type `blob`)
  - `040000`: directory (with type `tree`)
  - `160000`: gitlink, for use with submodules (with type `commit`)

>> +3. The *object ID*
>> +4. The *filename*
>> ++
>> +For example, this is how a tree containing one directory (`src`) and one file
>> +(`README.md`) is stored:
>> ++
>> +----
>> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
>> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
>> +----
>> ++
>> +*NOTE:* The permissions are in the same format as UNIX permissions, but
>> +the only allowed permissions for files (blobs) are 644 and 755.
>> +
>> +[[blob]]
>> +blobs::
>> +    A blob is how Git represents a file. A blob object contains the
>> +    file's contents.
>> ++
>> +Storing a new blob for every new version of a file can get big, so
>> +`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.
>
> I would claim that it's not necessary to mention object compression.
> This should be a low-level detail that users don't ever have to worry
> about. Furthermore, packing objects isn't only relevant in the context
> of blobs: trees for example also tend to compress very well as there
> typically is only small incremental updates to trees.

I discussed why I think this important in another reply,
https://lore.kernel.org/all/51e0a55c-1f1d-4cae-9459-8c2b9220e52d@app.fastmail.com/,
will paste what I said here. I'll think about this more though.

paste follows:

That's true! The reason I think this is important to mention is that I find
that people often "reject" information that they find implausible, even
if it comes from a credible source. ("that can't be true! I must be
not understanding correctly. Oh well, I'll just ignore that!")

I sometimes hear from users that "commits can't be snapshots", because
it would take up too much disk space to store every version of
every commit. So I find that sometimes explaining a little bit about the
implementation can make the information more memorable.

Certainly I'm not able to remember details that don't make sense
with my mental model of how computers work and I don't expect other
people to either, so I think it's important to give an explanation that
handles the biggest "objections".

>> +[[tag-object]]
>> +tag objects::
>> +    Tag objects (also known as "annotated tags") contain:
>> ++
>> +1. The *tagger* and tag date
>> +2. A *tag message*, similar to a commit message
>> +3. The *ID* of the object (often a commit) that they reference
>
> They can also be signed, if we want to mention that.

I guess that's true for commit objects too. Not sure whether to
mention it either, can add it if others think it's important.

>> +[[references]]
>> +REFERENCES
>> +----------
>> +
>> +References are a way to give a name to a commit.
>> +It's easier to remember "the changes I'm working on are on the `turtle`
>> +branch" than "the changes are in commit bb69721404348e".
>> +Git often uses "ref" as shorthand for "reference".
>> +
>> +References that you create are stored in the `.git/refs` directory,
>> +and Git has a few special internal references like `HEAD` that are stored
>> +in the base `.git` directory.
>
> This isn't true anymore with the introduction of the reftable backend,
> which is slated to become the default backend. I'd argue that this is
> another implementation detail that the user shouldn't have to worry
> about.

Makes sense, will fix. (as well as other references to the .git prefix and
"subdirectories").

>> +References can either be:
>> +
>> +1. References to an object ID, usually a <<commit,commit>> ID
>> +2. References to another reference. This is called a "symbolic reference".
>> +
>> +Git handles references differently based on which subdirectory of
>> +`.git/refs` they're stored in.
>
> So instead of saying "subdirectory", I'd rather say "reference
> hierarchy".
>
> In general, I think we should explain that references are layed out
> in a hierarchy. This is somewhat obvious with the "files" backend, as we
> use directories there. But as we move on to the "reftable" backend this
> may become less obvious over time.

That makes sense.

>> +[[tag]]
>> +tags: `.git/refs/tags/<name>`::
>> +    A tag is a name for a commit ID, tag object ID, or other object ID.
>> +    Tags are stored in the `refs/tags/` directory.
>> ++
>> +Even though branches and commits are both "a name for a commit ID", Git
>> +treats them very differently.
>> +Branches are expected to be regularly updated as you work on the branch,
>> +but it's expected that a tag will never change after you create it.
>
> This sounds a bit like the user itself needs to update the branch. How
> about this instead:
>
>     Even though branches and commits are both "a name for a commit ID", Git
>     treats them very differently:
>
>         - Branches can be checked out directly. If so, creating a new
>           commit will automatically update the checked-out branch to
>           point to the new commit.
>
>         - Tags cannot be checked out directly and don't move when
>           creating a new commit. Instead, one can only check out the
>           commit that a branch points to. This is called "detached
>           HEAD", and the effect is that a new commit will not update 

I think mentioning that branches can be checked out and that tags can't
is a good idea.

>> +[[HEAD]]
>> +HEAD: `.git/HEAD`::
>> +    `HEAD` is where Git stores your current <<branch,branch>>.
>> +    `HEAD` is normally a symbolic reference to your current branch, for
>> +    example `ref: refs/heads/main` if your current branch is `main`.
>> +    `HEAD` can also be a direct reference to a commit ID,
>> +    that's called "detached HEAD state".
>> +
>> +[[remote-tracking-branch]]
>> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
>> +    A remote-tracking branch is a name for a commit ID.
>> +    It's how Git stores the last-known state of a branch in a remote
>> +    repository. `git fetch` updates remote-tracking branches. When
>> +    `git status` says "you're up to date with origin/main", it's looking at
>> +    this.
>
> This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
> reference that indicates the default branch on the remote side.

Is "refs/remotes/<remote>/HEAD" a remote-tracking branch?
I've never thought about that reference and I'm not sure what to call it.

>> +[[other-refs]]
>> +Other references::
>> +    Git tools may create references in any subdirectory of `.git/refs`.
>> +    For example, linkgit:git-stash[1], linkgit:git-bisect[1],
>> +    and linkgit:git-notes[1] all create their own references
>> +    in `.git/refs/stash`, `.git/refs/bisect`, etc.
>> +    Third-party Git tools may also create their own references.
>> ++
>> +Git may also create references in the base `.git` directory
>> +other than `HEAD`, like `ORIG_HEAD`.
>
> Let's mention that such references are typically spelt all-uppercase
> with underscores between. You shouldn't ever create a reference that is
> for example called ".git/foo".
>
> We enforce this restriction inconsistently, only, but I don't think that
> should keep us from spelling out the common rule.

That makes sense. I'm also not sure whether third-party
Git tools are "supposed" to create references outside of "refs/",
or whether that's common. 

>> +*NOTE:* As an optimization, references may be stored as packed
>> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
>
> I'd drop this note. It's an internal implementation detail and only true
> for the "files" backend. The "reftable" backend stores references quite
> differently and doesn't really "pack" references.
>
>> +[[index]]
>> +THE INDEX
>> +---------
>> +
>> +The index, also known as the "staging area", contains the current staged
>
> Honestly, I always forget which of these two nouns we are supposed to
> use nowadays. I think consensus was to use "index" and avoid using
> "staging area"? Not sure though, but I think we should only mention
> one of these.
>
>> +version of every file in your Git repository. When you commit, the files
>> +in the index are used as the files in the next commit.
>> +
>> +Unlike a tree, the index is a flat list of files.
>> +Each index entry has 4 fields:
>> +
>> +1. The *permissions*
>> +2. The *<<blob,blob>> ID* of the file
>> +3. The *filename*
>> +4. The *number*. This is normally 0, but if there's a merge conflict
>
> I think we don't call this "number", but "stage".

Thanks, I see that it's sometimes called "stage number" which is a little
easier to search for so I'll call it that.

>> +   there can be multiple versions (with numbers 0, 1, 2, ..)
>> +   of the same filename in the index.
>> +
>> +It's extremely uncommon to look at the index directly: normally you'd
>> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
>> +But you can use `git ls-files --stage` to see the index.
>> +Here's the output of `git ls-files --stage` in a repository with 2 files:
>> +
>> +----
>> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
>> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
>> +----
>> +
>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores the history of branch, tag, and HEAD refs in a reflog
>> +(you should read "reflog" as "ref log"). Not every ref is logged by
>> +default, but any ref can be logged.
>
> If we mention this here, do we maybe want to mention how the user can
> decide which references are logged?

Do you mean by using the setting `core.logAllRefUpdates`?

>> +Each reflog entry has:
>> +
>> +1. *Before/after *commit IDs*
>
> This will probably misformat as we have three asterisks here, not two.
>
>> +2. *User* who made the change, for example `Maya <maya@example.com>`
>> +3. *Timestamp*
>
> Suggestion: "*Timestamp* when that change has been made".

Makes sense.

>> +4. *Log message*, for example `pull: Fast-forward`
>> +
>> +Reflogs only log changes made in your local repository.
>> +They are not shared with remotes.
>
> We may want ot mention that you can reference reflog entries via
> `refs/heads/<branch>@{<reflog-nr>}`.
>
> In general, one thing that I think would be important to highlight in
> this document is revisions. Most of the commands tend to not accept
> references, but revisions instead, which are a lot more flexible. They
> use our do-what-I-mean mechanism to resolve, but also allow the user to
> specify commits relative to one another. It's probably sufficient though
> to mention them briefly and then redirect to girevisions(7).

Will think about this, I'm not sure how to best incorporate that.
Maybe under the commits section.

> Thanks for working on this!

Thanks for the review!

- Julia

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 14, 2025

This patch series was integrated into seen via git@93d3629.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 14, 2025

There was a status update in the "Cooking" section about the branch je/doc-data-model on the Git mailing list:

Add a new manual that describes the data model.
source: <pull.1981.v2.git.1759931621272.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2025

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Tue, Oct 14, 2025 at 09:12:26PM +0000, Julia Evans via GitGitGadget wrote:
[snip]
> +[[commit]]
> +commits::
> +    A commit contains these required fields
> +    (though there are other optional fields):
> ++
> +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of
> +   the commit's base directory.
> +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
> +  regular commits have 1 parent, merge commits have 2 or more parents
> +3. An *author* and the time the commit was authored
> +4. A *committer* and the time the commit was committed.
> +   If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit,
> +   then they will be the author and you'll be the committer.
> +5. A *commit message*
> ++
> +Here's how an example commit is stored:
> ++
> +----
> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
> +author Maya <maya@example.com> 1759173425 -0400
> +committer Maya <maya@example.com> 1759173425 -0400
> +
> +Add README
> +----
> ++
> +Like all other objects, commits can never be changed after they're created.
> +For example, "amending" a commit with `git commit --amend` creates a new
> +commit with the same parent.

Let's say "parents" instead of "parent" here so that it also works for
root and merge commits.

[snip]
> +[[other-refs]]
> +Other references::
> +    Git tools may create references anywhere under `refs/`.
> +    For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> +    and linkgit:git-notes[1] all create their own references
> +    in `refs/stash`, `refs/bisect`, etc.
> +    Third-party Git tools may also create their own references.
> ++
> +Git may also create references other than `HEAD` at the base of the
> +hierarchy, like `ORIG_HEAD`.
> ++
> +NOTE: By default, Git references are stored as files in the `.git` directory.
> +For example, the branch `main` is stored in `.git/refs/heads/main`.
> +This means that you can't have branches named both `maya` and `maya/some-task`,
> +because there can't be a file and a directory with the same name.

Hm. I think mentioning this can help, but it may also creates questions
when someone has a "main" branch but is unable find it in
".git/refs/heads/main" because it has either been packed, or because the
repository uses reftables.

I don't really know what to do about this. I think the most sensible
thing would be to introduce two man pages gitformat-reffiles(5) and
gitformat-reftables(5) that we can reference here for further reading.

[snip]
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores a history called a "reflog" for every branch, remote-tracking

I think it's a bit unclear what "history" means here. Maybe:

    Git stores a "reflog" for every branch, remote-tracking branch and
    "HEAD" that contains the annotated history of all updates for a
    particular reference. This means...

Other than those handful of comments I'm happy with the current version,
thanks!

Patrick

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

Patrick Steinhardt <ps@pks.im> writes:

>> +Like all other objects, commits can never be changed after they're created.
>> +For example, "amending" a commit with `git commit --amend` creates a new
>> +commit with the same parent.
>
> Let's say "parents" instead of "parent" here so that it also works for
> root and merge commits.

I just found it amusing that parents can be 0 ;-)

>> +NOTE: By default, Git references are stored as files in the `.git` directory.
>> +For example, the branch `main` is stored in `.git/refs/heads/main`.
>> +This means that you can't have branches named both `maya` and `maya/some-task`,
>> +because there can't be a file and a directory with the same name.
>
> Hm. I think mentioning this can help, but it may also creates questions
> when someone has a "main" branch but is unable find it in
> ".git/refs/heads/main" because it has either been packed, or because the
> repository uses reftables.

I had the same thought.  The only thing we want to stress here is
that the names of refs _behave_ like filesystem entities.  So how
about saying just

    Note: when you have a branch with <name>, you cannot have any
    branch whose name begins with "<name>/".

and stop at it?  It may look like an arbitrary limitation, and once
in a distant future ref-files gets retired, it will become one (as
there is no inherent reason why reftable backend must retain it; it
only enforces the same limitation to ensure that the names it stores
interoperate with another clone that uses ref-files backend).  At
the data-model level (which is the theme of this document), it is
just as immaterial as refnames may be case insensitive on some
systems.

Mentioning the limitation may be good, but the data model document
is not the right place to explain where this limitation comes from
(i.e. to be compatible with and expressible in ref-files backend).
We do not say "you may not be able to have 'maya' branch and 'mAYa'
branch at the same time on some systems", either ;-).

>> +Git stores a history called a "reflog" for every branch, remote-tracking
>
> I think it's a bit unclear what "history" means here. Maybe:

"records of updates", perhaps?

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2025

On the Git mailing list, "Julia Evans" wrote (reply to this):

On Wed, Oct 15, 2025, at 11:34 AM, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
>>> +Like all other objects, commits can never be changed after they're created.
>>> +For example, "amending" a commit with `git commit --amend` creates a new
>>> +commit with the same parent.
>>
>> Let's say "parents" instead of "parent" here so that it also works for
>> root and merge commits.
>
> I just found it amusing that parents can be 0 ;-)

Will change to parent(s).

>>> +NOTE: By default, Git references are stored as files in the `.git` directory.
>>> +For example, the branch `main` is stored in `.git/refs/heads/main`.
>>> +This means that you can't have branches named both `maya` and `maya/some-task`,
>>> +because there can't be a file and a directory with the same name.
>>
>> Hm. I think mentioning this can help, but it may also creates questions
>> when someone has a "main" branch but is unable find it in
>> ".git/refs/heads/main" because it has either been packed, or because the
>> repository uses reftables.
>
> I had the same thought.  The only thing we want to stress here is
> that the names of refs _behave_ like filesystem entities.  So how
> about saying just
>
>     Note: when you have a branch with <name>, you cannot have any
>     branch whose name begins with "<name>/".
>
> and stop at it?  It may look like an arbitrary limitation, and once
> in a distant future ref-files gets retired, it will become one (as
> there is no inherent reason why reftable backend must retain it; it
> only enforces the same limitation to ensure that the names it stores
> interoperate with another clone that uses ref-files backend).  At
> the data-model level (which is the theme of this document), it is
> just as immaterial as refnames may be case insensitive on some
> systems.
>
> Mentioning the limitation may be good, but the data model document
> is not the right place to explain where this limitation comes from
> (i.e. to be compatible with and expressible in ref-files backend).

I'm still not clear on why you think we shouldn't mention that how
references behave depends on which filesystem you're using.

Is it because the fact that how references behave depends on which FS
you're using is considered a "bug", Git is working on eventually fixing
that bug via the reftable backend, and we don't want to document
"bugs" as an expected part of the data model?

I do think it's important to tell users where the data model has "weak points"
where the abstraction leaks through to the implementation, pretending that
abstractions are stronger than they are leads to unnecessary confusion.

> We do not say "you may not be able to have 'maya' branch and 'mAYa'
> branch at the same time on some systems", either ;-).

Speaking of case-insensitive filesystems, I wonder if we should add a
short note about the rules for filenames in Git. I ran into an issue
recently where I had a filename with a colon in it, and my
collaborator (who was using Windows) could not check out the branch
because of that, and I saw another similar issue recently where one
collaborator was using a case-insensitive filesystem and the other wasn't.

My guess is that Git does not enforce any rules about filenames (?),
and it's up to the user to make sure that the filenames in the repository
will work well for everyone collaborating on the repository.

>>> +Git stores a history called a "reflog" for every branch, remote-tracking
>>
>> I think it's a bit unclear what "history" means here. Maybe:
>
> "records of updates", perhaps?

Agreed. Perhaps this instead:

Every time a branch, remote-tracking branch, or HEAD is updated, Git
updates a log called a "reflog" for that <<reference,reference>>.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2025

This patch series was integrated into seen via git@5711c45.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +[[commit]]
> +commits::
> +    A commit contains these required fields
> +    (though there are other optional fields):
> ++
> +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of
> +   the commit's base directory.

"all the files' exact contents at the time of the commit" is what we
mean here, and once readers know what a tree is, the above sentence
would be understood as such, but "All the files" felt somewhat
fuzzy.  I wonder if presenting objects in bottom-up fashion makes it
easier to see?  Learn that a blob records exact content of a file,
then learn that a tree records the set of paths with exact contents
stored at these paths, and after that, learn that a commit records a
tree, hence a snapshot of the whole set of contents.  I dunno...

> +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
> +  regular commits have 1 parent, merge commits have 2 or more parents
> +3. An *author* and the time the commit was authored
> +4. A *committer* and the time the commit was committed.
> +   If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit,
> +   then they will be the author and you'll be the committer.

It felt a bit odd to single-out cherry-pick here.

I think the important thing to become aware of for the readers at
this point is that the author and committer can be different people,
and it does not matter how one commits somebody else's patch at the
mechanical level.

Perhaps replace "If you cherry-pick..." with something like "note: a
change authored by a person at some point in time can be committed
by another person at a different time, and these fields are to
record both persons' contributions separately", perhaps, if we
really want to say more.

> +Git does not store the diff for a commit: when you ask Git for a
> +diff it calculates it on the fly.

I think this is an attempt to demystify "are we really storing
snapshot for each commit?" thing, but then "when you ask Git to show
the commit, it calculates the diff from its parent on the fly" might
achieve that better, perhaps?

> +[[tree]]
> +trees::
> +    A tree is how Git represents a directory. It lists, for each item in
> +    the tree:
> ++
> +[[file-mode]]
> +1. The *file mode*, for example `100644`. The format is inspired by Unix
> +   permissions, but Git's modes are much more limited. Git only supports these file modes:
> ++
> +  - `100644`: regular file (with type `blob`)
> +  - `100755`: executable file (with type `blob`)
> +  - `120000`: symbolic link (with type `blob`)
> +  - `040000`: directory (with type `tree`)
> +  - `160000`: gitlink, for use with submodules (with type `commit`)

It is not really "supporting" file modes.  Rather, Git only records
5 kinds of entities associated with each path in a tree object, and
uses numbers taht remotely resemble POSIX file modes to represent
these 5 kinds.

Perhaps "supports" -> "uses"?

> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> +  or <<commit,`commit`>> (a Git submodule, which is a
> +  commit from a different Git repository)
> +3. The <<object-id,*object ID*>>
> +4. The *filename*

Here it may be worth noting that this "filename" is a single
pathname component (roughly, what you would see in non-recursive
"ls").  In other words, it may be a directory name.

I wonder if we need to say "<blob> (a file, or a symbolic link)"?

> +[[blob]]
> +blobs::
> +    A blob is how Git represents a file. A blob object contains the
> +    file's contents.

"represents a file" hints as if the thing may know its name, but
that is not the case (its name is given only by surrounding tree).

"A blob is how Git represents uninterpreted series of bytes, and
most commonly used to store file's contents." or something, perhaps?

> +When you make a new commit, Git only needs to store new versions of
> +files which were changed in that commit. This means that commits
> +can use relatively little disk space even in a very large repository.

That invites the "aren't we storing a delta after all, then?"
confusion.

"Git only needs to newly store new versions of files and
directories.  Files and directories that were not modified by the
commit are shared with its parent commit".

> +NOTE: All of the examples in this section were generated with
> +`git cat-file -p <object-id>`, which shows the contents of a Git object.

Was this necessary to say this?  Blobs, Commits, and Tags are
textual, so "-p" does very minimum thing, but Trees are binary
garbage, so "-p" output is heavily massaged version of the contents.

> +[[branch]]
> +branches: `refs/heads/<name>`::
> +    A branch is a name for a commit ID.

Well a commit ID is an alternative way to refer to a commit object
*name*, so it is a bit strange to say "a name for a commit ID".

Perhaps "A branch ref stores a commit ID." is better?

> +[[tag]]
> +tags: `refs/tags/<name>`::
> +    A tag is a name for a commit ID, tag object ID, or other object ID.

Likewise.  "A tag ref stores any kind of object ID, but commonly
they are commit objects or tag objects"

> +    Tags that reference a tag object ID are called "annotated tags",
> +    because the tag object contains a tag message.
> +    Tags that reference a commit, blob, or tree ID are
> +    called "lightweight tags".
> ++
> +Even though branches and tags are both "a name for a commit ID", Git
> +treats them very differently.
> +Branches are expected to change over time: when you make a commit, Git
> +will update your <<HEAD,current branch>> to reference the new changes.

This sentence talks about branch moving because it advances with
more commits.  Did we want to say "HEAD" here before we explain what
it is?  "HEAD" can move for another reason (i.e. branch switching)
and using "HEAD" in the context of talking about growing history
might invite confusion.  I dunno.

> +Tags are usually not changed after they're created.

> +[[HEAD]]
> +HEAD: `HEAD`::
> +    `HEAD` is where Git stores your current <<branch,branch>>.

Hmm...

> +    `HEAD` can either be:
> +    1. A symbolic reference to your current branch, for example `ref:
> +       refs/heads/main` if your current branch is `main`.
> +    2. A direct reference to a commit ID. This is called "detached HEAD
> +	   state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more.

These two are very reasonable.  But "your current <<branch>>" refers
only to #1.

    `HEAD` refers to the commit your current work is based on, and
    it is the commit that will become the first parent of the commit
    once your current work is concluded.  It can either be ...

perhaps.

> +[[remote-tracking-branch]]
> +remote tracking branches: `refs/remotes/<remote>/<branch>`::

Please always write "remote-tracking" with a hyphen (see glossary).

> +    A remote-tracking branch is a name for a commit ID.

Either "A remote-tracking branch stores a commit object name" or "A
remote-tracking branch points at a commit object", followed by "in
order to keep track of the last-nown state of ..." in a single
sentence.

> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains a list of every
> +file in the repository and its contents. When you commit, the files in
> +the index are used as the files in the next commit.

It is hard to define what "every file in the repository" really is.
Files that you removed last week do not count.  Files added in your
wip branch elsewhere are obviously not yet in the index when you are
working on your primary branch.

> +You can add files to the index or update the version in the index with
> +linkgit:git-add[1]. Adding a file to the index or updating its version
> +is called "staging" the file for commit.

It may be worth to clarify by saying "staging the contents of the
file" (you can edit the file further after you "git add") that you
are taking a snapshot at the time you ran "git add", instead of
giving a general instruction to "keey an eye on this file" to Git
(if it were, then the next "git commit" would behave more like "git
add -u && git commit").

> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores a history called a "reflog" for every branch, remote-tracking
> +branch, and HEAD. This means that if you make a mistake and "lose" a
> +commit, you can generally recover the commit ID by running
> +`git reflog <reference>`.
> +
> +Each reflog entry has:
> +
> +1. Before/after *commit IDs*
> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp* when the change was made
> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.

Technically it is correct that before/after are recorded, but there
is no way for the end-user to interact with them.  "git reflog"
walking these entries will only give you a single commit object.
The username is also recorded, but I do not think of a way to view
the information, let alone using it for querying.

Especially when the reftable backend is in use, you cannot even read
the raw representation like you can do with files backend (where
something like "cat .git/logs/HEAD" would let you peek into the
details).  I am not sure if we want to go into this detail.

Perhaps drop everything after "Each reflog entry has:"?

> +For example, here's how the reflog for `HEAD` in a repository with 2
> +commits is stored:
> +
> +----
> +0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example.com> 1759173408 -0400      commit (initial): Initial commit
> +4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example.com> 1759173425 -0400      commit: Add README
> +----
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite
> diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc
> index e423e4765b..20ba121314 100644
> --- a/Documentation/glossary-content.adoc
> +++ b/Documentation/glossary-content.adoc
> @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a
>  	identified by its <<def_object_name,object name>>. The objects usually
>  	live in `$GIT_DIR/objects/`.
>  
> -[[def_object_identifier]]object identifier (oid)::
> -	Synonym for <<def_object_name,object name>>.
> +[[def_object_identifier]]object identifier, object ID, oid::
> +	Synonyms for <<def_object_name,object name>>.
>  
>  [[def_object_name]]object name::
>  	The unique identifier of an <<def_object,object>>.  The
> diff --git a/Documentation/meson.build b/Documentation/meson.build
> index e34965c5b0..ace0573e82 100644
> --- a/Documentation/meson.build
> +++ b/Documentation/meson.build
> @@ -192,6 +192,7 @@ manpages = {
>    'gitcore-tutorial.adoc' : 7,
>    'gitcredentials.adoc' : 7,
>    'gitcvs-migration.adoc' : 7,
> +  'gitdatamodel.adoc' : 7,
>    'gitdiffcore.adoc' : 7,
>    'giteveryday.adoc' : 7,
>    'gitfaq.adoc' : 7,
>
> base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Julia Evans" <julia@jvns.ca> writes:

> I'm still not clear on why you think we shouldn't mention that how
> references behave depends on which filesystem you're using.

Simply because the main purpose of this document is to give a
data-model.  A case insensitive filesystem limiting the set of names
you can use depending on what other names are in use is a quality of
implementation issue, which I view as a mere distraction when we are
giving overview at the conceptual level.

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2025

This patch series was integrated into seen via git@e85334b.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 16, 2025

On the Git mailing list, "Julia Evans" wrote (reply to this):

On Wed, Oct 15, 2025, at 4:42 PM, Junio C Hamano wrote:
> "Julia Evans" <julia@jvns.ca> writes:
>
>> I'm still not clear on why you think we shouldn't mention that how
>> references behave depends on which filesystem you're using.
>
> Simply because the main purpose of this document is to give a
> data-model.  A case insensitive filesystem limiting the set of names
> you can use depending on what other names are in use is a quality of
> implementation issue, which I view as a mere distraction when we are
> giving overview at the conceptual level.

Okay, I'll delete the note.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 16, 2025

On the Git mailing list, "Julia Evans" wrote (reply to this):

On Wed, Oct 15, 2025, at 3:58 PM, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +[[commit]]
>> +commits::
>> +    A commit contains these required fields
>> +    (though there are other optional fields):
>> ++
>> +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of
>> +   the commit's base directory.
>
> "all the files' exact contents at the time of the commit" is what we
> mean here, and once readers know what a tree is, the above sentence
> would be understood as such, but "All the files" felt somewhat
> fuzzy.  I wonder if presenting objects in bottom-up fashion makes it
> easier to see?  Learn that a blob records exact content of a file,
> then learn that a tree records the set of paths with exact contents
> stored at these paths, and after that, learn that a commit records a
> tree, hence a snapshot of the whole set of contents.  I dunno...

Will try "The contents of all the *files* in the commit..." to make it a little
more explicit that it's a snapshot.

>> +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> +  regular commits have 1 parent, merge commits have 2 or more parents
>> +3. An *author* and the time the commit was authored
>> +4. A *committer* and the time the commit was committed.
>> +   If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit,
>> +   then they will be the author and you'll be the committer.
>
> It felt a bit odd to single-out cherry-pick here.
>
> I think the important thing to become aware of for the readers at
> this point is that the author and committer can be different people,
> and it does not matter how one commits somebody else's patch at the
> mechanical level.
>
> Perhaps replace "If you cherry-pick..." with something like "note: a
> change authored by a person at some point in time can be committed
> by another person at a different time, and these fields are to
> record both persons' contributions separately", perhaps, if we
> really want to say more.

I'll just delete the comment about cherry-pick.
I think it's already obvious (from the fact that are two different fields)
that the author and committer can be different (and happen at
different times), and if we don't want to explain why that might
happen there's no need to say more.

>> +Git does not store the diff for a commit: when you ask Git for a
>> +diff it calculates it on the fly.
>
> I think this is an attempt to demystify "are we really storing
> snapshot for each commit?" thing, but then "when you ask Git to show
> the commit, it calculates the diff from its parent on the fly" might
> achieve that better, perhaps?

Sure, can change it to that.

>> +[[tree]]
>> +trees::
>> +    A tree is how Git represents a directory. It lists, for each item in
>> +    the tree:
>> ++
>> +[[file-mode]]
>> +1. The *file mode*, for example `100644`. The format is inspired by Unix
>> +   permissions, but Git's modes are much more limited. Git only supports these file modes:
>> ++
>> +  - `100644`: regular file (with type `blob`)
>> +  - `100755`: executable file (with type `blob`)
>> +  - `120000`: symbolic link (with type `blob`)
>> +  - `040000`: directory (with type `tree`)
>> +  - `160000`: gitlink, for use with submodules (with type `commit`)
>
> It is not really "supporting" file modes.  Rather, Git only records
> 5 kinds of entities associated with each path in a tree object, and
> uses numbers taht remotely resemble POSIX file modes to represent
> these 5 kinds.
>
> Perhaps "supports" -> "uses"?

"Uses" sounds good to me.

>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>> +  or <<commit,`commit`>> (a Git submodule, which is a
>> +  commit from a different Git repository)
>> +3. The <<object-id,*object ID*>>
>> +4. The *filename*
>
> Here it may be worth noting that this "filename" is a single
> pathname component (roughly, what you would see in non-recursive
> "ls").  In other words, it may be a directory name.
>
> I wonder if we need to say "<blob> (a file, or a symbolic link)"?

I'm inclined to leave this alone because arguably a symbolic link is
a file but I don't feel strongly about this.

>> +[[blob]]
>> +blobs::
>> +    A blob is how Git represents a file. A blob object contains the
>> +    file's contents.
>
> "represents a file" hints as if the thing may know its name, but
> that is not the case (its name is given only by surrounding tree).
>
> "A blob is how Git represents uninterpreted series of bytes, and
> most commonly used to store file's contents." or something, perhaps?

I'll say "A blob is how Git represents a file's contents", unless Git has
another use for blobs that I don't know about (I think it's not
that much of a stretch to say that a symbolic link is a special kind
of file where the "contents" are the the link destination).

I think it's always clearer to be more specific when possible, if there's only
one purpose for blobs it's unnecessary (and IMO a bit misleading, because
it makes the reader wonder if there are other purposes that they should
know about) to say that blobs can be used to store any arbitrary bytes for
any purpose.

If there is another purpose I think we should give an example.

>> +When you make a new commit, Git only needs to store new versions of
>> +files which were changed in that commit. This means that commits
>> +can use relatively little disk space even in a very large repository.
>
> That invites the "aren't we storing a delta after all, then?"
> confusion.
>
> "Git only needs to newly store new versions of files and
> directories.  Files and directories that were not modified by the
> commit are shared with its parent commit".

I agree it makes it sound a little bit like we're storing a delta.
Will think about how to phrase this differently.

>> +NOTE: All of the examples in this section were generated with
>> +`git cat-file -p <object-id>`, which shows the contents of a Git object.
>
> Was this necessary to say this?  Blobs, Commits, and Tags are
> textual, so "-p" does very minimum thing, but Trees are binary
> garbage, so "-p" output is heavily massaged version of the contents.

Ah, I didn't know how trees were stored, thanks. 
I can remove "which shows the contents of a Git object", people
can read the man page for `git cat-file` if they want details.

>> +[[branch]]
>> +branches: `refs/heads/<name>`::
>> +    A branch is a name for a commit ID.
>
> Well a commit ID is an alternative way to refer to a commit object
> *name*, so it is a bit strange to say "a name for a commit ID".
>
> Perhaps "A branch ref stores a commit ID." is better?

I think I'll leave this alone, none of the many test readers reported
being confused by it.

>> +[[tag]]
>> +tags: `refs/tags/<name>`::
>> +    A tag is a name for a commit ID, tag object ID, or other object ID.
>
> Likewise.  "A tag ref stores any kind of object ID, but commonly
> they are commit objects or tag objects"
>
>> +    Tags that reference a tag object ID are called "annotated tags",
>> +    because the tag object contains a tag message.
>> +    Tags that reference a commit, blob, or tree ID are
>> +    called "lightweight tags".
>> ++
>> +Even though branches and tags are both "a name for a commit ID", Git
>> +treats them very differently.
>> +Branches are expected to change over time: when you make a commit, Git
>> +will update your <<HEAD,current branch>> to reference the new changes.
>
> This sentence talks about branch moving because it advances with
> more commits.  Did we want to say "HEAD" here before we explain what
> it is?  "HEAD" can move for another reason (i.e. branch switching)
> and using "HEAD" in the context of talking about growing history
> might invite confusion.  I dunno.

The text says "current branch", it just cross-references the "HEAD" section in the
HTML version if someone wants to read about what is meant by "current branch".

>> +Tags are usually not changed after they're created.
>
>> +[[HEAD]]
>> +HEAD: `HEAD`::
>> +    `HEAD` is where Git stores your current <<branch,branch>>.
>
> Hmm...
>
>> +    `HEAD` can either be:
>> +    1. A symbolic reference to your current branch, for example `ref:
>> +       refs/heads/main` if your current branch is `main`.
>> +    2. A direct reference to a commit ID. This is called "detached HEAD
>> +	   state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more.
>
> These two are very reasonable.  But "your current <<branch>>" refers
> only to #1.
>
>     `HEAD` refers to the commit your current work is based on, and
>     it is the commit that will become the first parent of the commit
>     once your current work is concluded.  It can either be ...
>
> perhaps.

I like the idea of mentioning that HEAD will be the parent commit
of any commit that you make. Will think about how to incorporate
that, and about how to resolve " `HEAD` is where Git stores your
current <<branch,branch>>." being not exactly true.

>> +[[remote-tracking-branch]]
>> +remote tracking branches: `refs/remotes/<remote>/<branch>`::
>
> Please always write "remote-tracking" with a hyphen (see glossary).

Will fix.

>> +    A remote-tracking branch is a name for a commit ID.
>
> Either "A remote-tracking branch stores a commit object name" or "A
> remote-tracking branch points at a commit object", followed by "in
> order to keep track of the last-nown state of ..." in a single
> sentence.

I see that you don't like the "name for a commit ID" phrasing :)
Maybe there's another way to say it, though again none of the test
readers said they were confused by this or disagreed with the phrasing.

>> +[[index]]
>> +THE INDEX
>> +---------
>> +
>> +The index, also known as the "staging area", contains a list of every
>> +file in the repository and its contents. When you commit, the files in
>> +the index are used as the files in the next commit.
>
> It is hard to define what "every file in the repository" really is.
> Files that you removed last week do not count.  Files added in your
> wip branch elsewhere are obviously not yet in the index when you are
> working on your primary branch.

Agreed, I'm not so happy with "every file in the repository" either.
My intent was to make it clear that it's not "just the files you `git add`ed".
I'll think about a different phrasing that communicates the same thing.
Perhaps mentioning how it relates to the HEAD commit would help.

>> +You can add files to the index or update the version in the index with
>> +linkgit:git-add[1]. Adding a file to the index or updating its version
>> +is called "staging" the file for commit.
>
> It may be worth to clarify by saying "staging the contents of the
> file" (you can edit the file further after you "git add") that you
> are taking a snapshot at the time you ran "git add", instead of
> giving a general instruction to "keey an eye on this file" to Git
> (if it were, then the next "git commit" would behave more like "git
> add -u && git commit").

Maybe, will think about this too.

>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores a history called a "reflog" for every branch, remote-tracking
>> +branch, and HEAD. This means that if you make a mistake and "lose" a
>> +commit, you can generally recover the commit ID by running
>> +`git reflog <reference>`.
>> +
>> +Each reflog entry has:
>> +
>> +1. Before/after *commit IDs*
>> +2. *User* who made the change, for example `Maya <maya@example.com>`
>> +3. *Timestamp* when the change was made
>> +4. *Log message*, for example `pull: Fast-forward`
>> +
>> +Reflogs only log changes made in your local repository.
>> +They are not shared with remotes.
>
> Technically it is correct that before/after are recorded, but there
> is no way for the end-user to interact with them.  "git reflog"
> walking these entries will only give you a single commit object.
> The username is also recorded, but I do not think of a way to view
> the information, let alone using it for querying.

You can view the username with git reflog --format="%gn <%ge>".
(according to `man git-log`). I don't see a way to view the old commit ID.

Perhaps we should include the username but not the old commit ID then.
I'm not sure.

> Especially when the reftable backend is in use, you cannot even read
> the raw representation like you can do with files backend (where
> something like "cat .git/logs/HEAD" would let you peek into the
> details).  I am not sure if we want to go into this detail.
>
> Perhaps drop everything after "Each reflog entry has:"?

Perhaps we could give a stripped down list, like

1. The new *commit ID* the reference points to
2. *Timestamp* when the change was made
3. *Log message*, for example `pull: Fast-forward`

And then instead of giving the contents of `.git/logs/HEAD`
(which as you say includes some fields that there's no way
for the user to interact with), instead we could just show the
output of `git reflog main`, like this:

    You can view the reflog for `git reflog`, for example here's the reflog
    for a `main` branch which has changed twice:

    $ git reflog main --date=iso --no-decorate
    750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README
    4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit

I added `--no-decorate`  there because the decorations are a distraction
when talking about the data model.

This version omits the username which is a little weird (it is possible to
access the username) but mentioning the username is a little weird
too because it raises some questions that are hard to answer about
what that field is for, and you have to pass an obscure format string
to view it. Not sure what's best here.

>> +For example, here's how the reflog for `HEAD` in a repository with 2
>> +commits is stored:
>> +
>> +----
>> +0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example.com> 1759173408 -0400      commit (initial): Initial commit
>> +4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example.com> 1759173425 -0400      commit: Add README
>> +----

Thanks for the review.
- Julia

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 16, 2025

On the Git mailing list, "Kristoffer Haugsbakk" wrote (reply to this):

> [PATCH v3] doc: add a explanation of Git's data model

s/a explanation/an explanation/

On Tue, Oct 14, 2025, at 23:12, Julia Evans via GitGitGadget wrote:
> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.
>[snip]

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 16, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Julia Evans" <julia@jvns.ca> writes:

>>> +[[tree]]
>>> +trees::
>>> +    A tree is how Git represents a directory. It lists, for each item in
>>> +    the tree:
>>> ++
>>> +[[file-mode]]
>>> +1. The *file mode*, for example `100644`. The format is inspired by Unix
>>> +   permissions, but Git's modes are much more limited. Git only supports these file modes:
>>> ++
>>> +  - `100644`: regular file (with type `blob`)
>>> +  - `100755`: executable file (with type `blob`)
>>> +  - `120000`: symbolic link (with type `blob`)
>>> +  - `040000`: directory (with type `tree`)
>>> +  - `160000`: gitlink, for use with submodules (with type `commit`)
>>
>> It is not really "supporting" file modes.  Rather, Git only records
>> 5 kinds of entities associated with each path in a tree object, and
>> uses numbers taht remotely resemble POSIX file modes to represent
>> these 5 kinds.
>>
>> Perhaps "supports" -> "uses"?
>
> "Uses" sounds good to me.

Also "much more limited" is misleading.  We only represent 5 kinds
of things, so we use only 5 mode-bits-looking numbers.

>>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>>> +  or <<commit,`commit`>> (a Git submodule, which is a
>>> +  commit from a different Git repository)
>>> +3. The <<object-id,*object ID*>>
>>> +4. The *filename*
>>
>> Here it may be worth noting that this "filename" is a single
>> pathname component (roughly, what you would see in non-recursive
>> "ls").  In other words, it may be a directory name.

Comments?

>>> +[[blob]]
>>> +blobs::
>>> +    A blob is how Git represents a file. A blob object contains the
>>> +    file's contents.
>>
>> "represents a file" hints as if the thing may know its name, but
>> that is not the case (its name is given only by surrounding tree).
>>
>> "A blob is how Git represents uninterpreted series of bytes, and
>> most commonly used to store file's contents." or something, perhaps?
>
> I'll say "A blob is how Git represents a file's contents", unless Git has
> another use for blobs that I don't know about (I think it's not
> that much of a stretch to say that a symbolic link is a special kind
> of file where the "contents" are the the link destination).

A few configuration variables like mailmap.blob name a blob object,
for which _only_ its contents, i.e., the sequence of bytes, matter
and where they originally were stored does not matter.

But we are falling into the area of tautology, as any sequence of
bytes can be stored in a file so they can be called "contents of a
file".  But the point is that these bytes do not have to be stored
to become a blob (think: "git cat-file -t blob -w --stdin").

> I think it's always clearer to be more specific when possible, if there's only
> one purpose for blobs it's unnecessary (and IMO a bit misleading, because
> it makes the reader wonder if there are other purposes that they should
> know about) to say that blobs can be used to store any arbitrary bytes for
> any purpose.

I do not think describing other use cases is unnecessary.  Even if
we limit ourselves to discuss a single purpose for blob, i.e. to
represent the contents of a file, we should stress that blob is to
store _only_ contents, and not other aspects of the file (e.g., in
what paths with what mode), and that is where my reaction to "how
Git reprsents a file" comes from.

>>> +[[branch]]
>>> +branches: `refs/heads/<name>`::
>>> +    A branch is a name for a commit ID.
>>
>> Well a commit ID is an alternative way to refer to a commit object
>> *name*, so it is a bit strange to say "a name for a commit ID".
>>
>> Perhaps "A branch ref stores a commit ID." is better?
>
> I think I'll leave this alone, none of the many test readers reported
> being confused by it.

Would a confused person report that they are confused? ;-)

> I see that you don't like the "name for a commit ID" phrasing :)
> Maybe there's another way to say it, though again none of the test
> readers said they were confused by this or disagreed with the phrasing.

Yes, I get that given "refs/heads/main", you want to say "main" is
one of the ways to have repo_get_oid() to yield the commit object,
and you are using "name" in that sense, but it is more like a ref
can be used to name an object.  It is *not* the name of the object,
because the object can have other names, and more importantly, it
(i.e., to give a name for an object) is not the only thing that a
ref can do.  And that is why I do not like that phrasing, combined
with the target of giving that name is spelled "a commit ID".  The
commit ID is already another way to name the thing the refname can
be also used to name: a commit object.  A commit object and a commit
object name are different things.  The latter is a name that can
refer to the former.  And a ref can be used just like the latter to
refer to the former (i.e. "commit object").

By the way, I do like the way many of your responses are "will think
about it more", not "I'll take your version".

Very much appreciated.

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 16, 2025

On the Git mailing list, "Julia Evans" wrote (reply to this):

On Thu, Oct 16, 2025, at 12:54 PM, Junio C Hamano wrote:
> "Julia Evans" <julia@jvns.ca> writes:
>
>>>> +[[tree]]
>>>> +trees::
>>>> +    A tree is how Git represents a directory. It lists, for each item in
>>>> +    the tree:
>>>> ++
>>>> +[[file-mode]]
>>>> +1. The *file mode*, for example `100644`. The format is inspired by Unix
>>>> +   permissions, but Git's modes are much more limited. Git only supports these file modes:
>>>> ++
>>>> +  - `100644`: regular file (with type `blob`)
>>>> +  - `100755`: executable file (with type `blob`)
>>>> +  - `120000`: symbolic link (with type `blob`)
>>>> +  - `040000`: directory (with type `tree`)
>>>> +  - `160000`: gitlink, for use with submodules (with type `commit`)
>>>
>>> It is not really "supporting" file modes.  Rather, Git only records
>>> 5 kinds of entities associated with each path in a tree object, and
>>> uses numbers taht remotely resemble POSIX file modes to represent
>>> these 5 kinds.
>>>
>>> Perhaps "supports" -> "uses"?
>>
>> "Uses" sounds good to me.
>
> Also "much more limited" is misleading.  We only represent 5 kinds
> of things, so we use only 5 mode-bits-looking numbers.

What does it mislead the reader to think? My goal is to communicate that
if you want to tell Git to remember that a file's Unix permissions were
700, that's not possible.

>>>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>>>> +  or <<commit,`commit`>> (a Git submodule, which is a
>>>> +  commit from a different Git repository)
>>>> +3. The <<object-id,*object ID*>>
>>>> +4. The *filename*
>>>
>>> Here it may be worth noting that this "filename" is a single
>>> pathname component (roughly, what you would see in non-recursive
>>> "ls").  In other words, it may be a directory name.
>
> Comments?

Oops, missed this in my first pass.

I looked at them man pages for a couple of commands ("mv", "cp")
and it looks like it's normal to refer to files and directories jointly
as "files", or refer to them as having a "file name". So I think it's okay
to call it a "file name" even if the "file" may be a directory.

>>>> +[[blob]]
>>>> +blobs::
>>>> +    A blob is how Git represents a file. A blob object contains the
>>>> +    file's contents.
>>>
>>> "represents a file" hints as if the thing may know its name, but
>>> that is not the case (its name is given only by surrounding tree).
>>>
>>> "A blob is how Git represents uninterpreted series of bytes, and
>>> most commonly used to store file's contents." or something, perhaps?
>>
>> I'll say "A blob is how Git represents a file's contents", unless Git has
>> another use for blobs that I don't know about (I think it's not
>> that much of a stretch to say that a symbolic link is a special kind
>> of file where the "contents" are the the link destination).
>
> A few configuration variables like mailmap.blob name a blob object,
> for which _only_ its contents, i.e., the sequence of bytes, matter
> and where they originally were stored does not matter.
>
> But we are falling into the area of tautology, as any sequence of
> bytes can be stored in a file so they can be called "contents of a
> file".  But the point is that these bytes do not have to be stored
> to become a blob (think: "git cat-file -t blob -w --stdin").

I'm trying to think through what the goal of explaining the nature of
a "blob" is.

To me describing blobs primarily as "bytes" makes it sound a bit like
"Git will treat this as opaque binary data, Git will not attempt to
interpret the contents of a blob in any way" (which is certainly true
for many blob storage systems!).

But it's not true that Git treats blobs as opaque binary data, unlike
other blob storage systems, Git has diff and merge algorithms to
interpret the contents of the file to some extent and try to do useful
things with them.

Another goal we could have is to be clear that there are no limits to
what kind of files you can store in Git: you can equally well store text
files and binary files.

>> I think it's always clearer to be more specific when possible, if there's only
>> one purpose for blobs it's unnecessary (and IMO a bit misleading, because
>> it makes the reader wonder if there are other purposes that they should
>> know about) to say that blobs can be used to store any arbitrary bytes for
>> any purpose.
>
> I do not think describing other use cases is unnecessary.  Even if
> we limit ourselves to discuss a single purpose for blob, i.e. to
> represent the contents of a file, we should stress that blob is to
> store _only_ contents, and not other aspects of the file (e.g., in
> what paths with what mode), and that is where my reaction to "how
> Git reprsents a file" comes from.

I think it does make sense to say the blob stores only the contents,
though IMO that's fairly clear already since we've already explained
where the other parts of the file are stored by the time we get to
explaining "blob".

>>>> +[[branch]]
>>>> +branches: `refs/heads/<name>`::
>>>> +    A branch is a name for a commit ID.
>>>
>>> Well a commit ID is an alternative way to refer to a commit object
>>> *name*, so it is a bit strange to say "a name for a commit ID".
>>>
>>> Perhaps "A branch ref stores a commit ID." is better?
>>
>> I think I'll leave this alone, none of the many test readers reported
>> being confused by it.
>
> Would a confused person report that they are confused? ;-)

Everyone leaving feedback gets a prompt something like this
asking them to categorize their feedback,
and "I'm confused" is one of the options.
https://jvns.ca/images/feedback-categories.png

I definitely got many "I'm confused" and "I have a question"
comments about other things that were confusing to readers.

>> I see that you don't like the "name for a commit ID" phrasing :)
>> Maybe there's another way to say it, though again none of the test
>> readers said they were confused by this or disagreed with the phrasing.
>
> Yes, I get that given "refs/heads/main", you want to say "main" is
> one of the ways to have repo_get_oid() to yield the commit object,
> and you are using "name" in that sense, but it is more like a ref
> can be used to name an object.  It is *not* the name of the object,
> because the object can have other names, and more importantly, it
> (i.e., to give a name for an object) is not the only thing that a
> ref can do.  

That's interesting,  what else can a ref do other than to give a name to
an object?

> And that is why I do not like that phrasing, combined
> with the target of giving that name is spelled "a commit ID".  The
> commit ID is already another way to name the thing the refname can
> be also used to name: a commit object.  A commit object and a commit
> object name are different things.  The latter is a name that can
> refer to the former.

I'm curious about why it's important to you to make this distinction
between a commit ID and a commit object. To me the commit ID and the
commit object come as a package, since the commit ID is calculated from
the commit object.

>  And a ref can be used just like the latter to
> refer to the former (i.e. "commit object").

> By the way, I do like the way many of your responses are "will think
> about it more", not "I'll take your version".
>
> Very much appreciated.

I'm glad to hear that! It's a fun puzzle to figure out how to express
things clearly and accurately and concisely.

- Julia

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 16, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Julia Evans" <julia@jvns.ca> writes:

>>>> It is not really "supporting" file modes.  Rather, Git only records
>>>> 5 kinds of entities associated with each path in a tree object, and
>>>> uses numbers taht remotely resemble POSIX file modes to represent
>>>> these 5 kinds.
>>>>
>>>> Perhaps "supports" -> "uses"?
>>>
>>> "Uses" sounds good to me.
>>
>> Also "much more limited" is misleading.  We only represent 5 kinds
>> of things, so we use only 5 mode-bits-looking numbers.
>
> What does it mislead the reader to think? My goal is to communicate that
> if you want to tell Git to remember that a file's Unix permissions were
> 700, that's not possible.

Yes, rewording "support" to "use" is one good way to do so.  But
"limited" implies that lifting the limitation would allow you to
store more.  That is the misguided thinking I want to avoid here.
There is no limitations to lift.  We only differentiate 5 kinds
hence we only use 5 permission-bit-looking numbers.  We do not
differenciate a file with permission 0600 from aother with 0644.

>>>> Here it may be worth noting that this "filename" is a single
>>>> pathname component (roughly, what you would see in non-recursive
>>>> "ls").  In other words, it may be a directory name.
>>
>> Comments?
>
> Oops, missed this in my first pass.
>
> I looked at them man pages for a couple of commands ("mv", "cp")
> and it looks like it's normal to refer to files and directories jointly
> as "files", or refer to them as having a "file name". So I think it's okay
> to call it a "file name" even if the "file" may be a directory.

Ah, not that part.  I was more interested in seeing how we express
"in these names, there won't be any slashes".

>>>>> +[[blob]]
>>>>> +blobs::

By the way, I kept forgetting to mention, but why are all of these
listed terms plural (not just object types but also "branches" and
"tags"?

> But it's not true that Git treats blobs as opaque binary data, unlike
> other blob storage systems, Git has diff and merge algorithms to
> interpret the contents of the file to some extent and try to do useful
> things with them.

Yes, but diff and merge happens way above the object layer, where
the question "what is blob" has a meaning.  And these "blobs are
recorded in a tree together with other blobs and trees recursively,
and the single top-level tree describes a snapshot of a single
state, which is recorded in a commit" data model descriptions is
exactly about the lower-level object layer.

> Another goal we could have is to be clear that there are no limits to
> what kind of files you can store in Git: you can equally well store text
> files and binary files.

That is a natural consequence of blobs being nothing more than
uninterpreted sequence of bytes.

>>> I see that you don't like the "name for a commit ID" phrasing :)
>>> Maybe there's another way to say it, though again none of the test
>>> readers said they were confused by this or disagreed with the phrasing.
>>
>> Yes, I get that given "refs/heads/main", you want to say "main" is
>> one of the ways to have repo_get_oid() to yield the commit object,
>> and you are using "name" in that sense, but it is more like a ref
>> can be used to name an object.  It is *not* the name of the object,
>> because the object can have other names, and more importantly, it
>> (i.e., to give a name for an object) is not the only thing that a
>> ref can do.  
>
> That's interesting,  what else can a ref do other than to give a name to
> an object?

For example, a ref is a key to reflog, so obvoiusly it is more than
just a single commit.  If you say "git checkout main" and "git
checkout main^{commit}", they refer to the same commit, but the
former is a sign that you want the next commit you make from that
state to grow that branch (and not any other branch you may have
that happen to be pointing at the same commit), while the other one
is not.

>> And that is why I do not like that phrasing, combined
>> with the target of giving that name is spelled "a commit ID".  The
>> commit ID is already another way to name the thing the refname can
>> be also used to name: a commit object.  A commit object and a commit
>> object name are different things.  The latter is a name that can
>> refer to the former.
>
> I'm curious about why it's important to you to make this distinction
> between a commit ID and a commit object. To me the commit ID and the
> commit object come as a package, since the commit ID is calculated from
> the commit object.

It may be the most natural name for the commit object, but that does
not mean the name is the object.  Let's not go phylosophical.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 16, 2025

This patch series was integrated into seen via git@446c8a7.

Git very often uses the terms "object", "reference", or "index" in its
documentation.

However, it's hard to find a clear explanation of these terms and how
they relate to each other in the documentation. The closest candidates
currently are:

1. `gitglossary`. This makes a good effort, but it's an alphabetically
    ordered dictionary and a dictionary is not a good way to learn
    concepts. You have to jump around too much and it's not possible to
    present the concepts in the order that they should be explained.
2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
   This is a nice document to have, but it's not necessary to learn how
   `update-index` works to understand Git's data model, and we should
   not be requiring users to learn how to use the "plumbing" commands
   if they want to learn what the term "index" or "object" means.
3. `gitrepository-layout`. This is a great resource, but it includes a
   lot of information about configuration and internal implementation
   details which are not related to the data model. It also does
   not explain how commits work.

The result of this is that Git users (even users who have been using
Git for 15+ years) struggle to read the documentation because they don't
know what the core terms mean, and it's not possible to add links
to help them learn more.

Add an explanation of Git's data model. Some choices I've made in
deciding what "core data model" means:

1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
   if those are intended to be user facing or if they're more like
   internal implementation details.
2. Don't talk about submodules other than by mentioning how they
   relate to trees. This is because Git has a lot of special features,
   and explaining how they all work exhaustively could quickly go
   down a rabbit hole which would make this document less useful for
   understanding Git's core behaviour.
3. Don't discuss the structure of a commit message
   (first line, trailers etc).
4. Don't mention configuration.
5. Don't mention the `.git` directory, to avoid getting too much into
   implementation details

Signed-off-by: Julia Evans <julia@jvns.ca>
@gitgitgadget
Copy link

gitgitgadget bot commented Oct 20, 2025

On the Git mailing list, "Kristoffer Haugsbakk" wrote (reply to this):

On Tue, Oct 14, 2025, at 23:12, Julia Evans via GitGitGadget wrote:
> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.
>
> However, it's hard to find a clear explanation of these terms and how
> they relate to each other in the documentation. The closest candidates
> currently are:
>[snip]

For some reason I get an error with `Documentation/doc-diff` when run
against 446c8a72 (Merge branch 'je/doc-data-model' into seen,
2025-10-16).  Here I’m comparing with `master`.

    $ ./doc-diff 4253630c6f07a4bdcc9aa62a50e26a4d466219d1 446c8a72be6cf1b6121e643590a9acacfc21c5fb
    Previous HEAD position was b20e48e0232 doc: add a explanation of Git's data model
    HEAD is now at 446c8a72be6 Merge branch 'je/doc-data-model' into seen
    make: Entering directory '<git repo>/Documentation/tmp-doc-diff/worktree'
    install -d -m 755 '<git repo>/Documentation/tmp-doc-diff/installed/446c8a72be6cf1b6121e643590a9acacfc21c5fb+/home/kristoffer/share/man/man3'
    (cd perl/build/man/man3 && tar cf - .) | \
    (cd '<git repo>/Documentation/tmp-doc-diff/installed/446c8a72be6cf1b6121e643590a9acacfc21c5fb+/home/kristoffer/share/man/man3' && umask 022 && tar xof -)
    make -C Documentation install-man
    make[1]: Entering directory '<git repo>/Documentation/tmp-doc-diff/worktree/Documentation'
        GEN cmd-list.made
        GEN doc.dep
        GEN asciidoc.conf
        ASCIIDOC git-add.xml
        ASCIIDOC git-config.xml
        ASCIIDOC git-diff-tree.xml
        ASCIIDOC git-fast-import.xml
        ASCIIDOC git-fetch.xml
        ASCIIDOC git-fsck.xml
        ASCIIDOC git-log.xml
        ASCIIDOC git-merge-tree.xml
        ASCIIDOC git-patch-id.xml
        ASCIIDOC git-pull.xml
        ASCIIDOC git-push.xml
        ASCIIDOC git-replay.xml
        ASCIIDOC git-repo.xml
        ASCIIDOC git-rev-list.xml
        ASCIIDOC git-rev-parse.xml
        ASCIIDOC git-shortlog.xml
        ASCIIDOC git-show.xml
        ASCIIDOC git-sparse-checkout.xml
        ASCIIDOC git-stash.xml
        ASCIIDOC git-tag.xml
        ASCIIDOC git-worktree.xml
        ASCIIDOC git.xml
        ASCIIDOC gitformat-loose.xml
        ASCIIDOC gitformat-pack.xml
        ASCIIDOC gitcli.xml
        ASCIIDOC gitcredentials.xml
        XMLTO gitdatamodel.7
        XMLTO git-add.1
        XMLTO git-diff-tree.1
        XMLTO git-fast-import.1
        XMLTO git-fetch.1
        XMLTO git-fsck.1
    xmlto: <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml does not validate (status 3)
    xmlto: Fix document syntax or use --skip-validation option
    <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:71: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
    <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:96: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
    <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:397: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
    Document <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml does not validate
    make[1]: *** [Makefile:380: gitdatamodel.7] Error 13
    make[1]: *** Waiting for unfinished jobs....
    make[1]: Leaving directory '<git repo>/Documentation/tmp-doc-diff/worktree/Documentation'
    make: *** [Makefile:3676: install-man] Error 2
    make: Leaving directory '<git repo>/Documentation/tmp-doc-diff/worktree'

The syntax looks correct.  So I don’t know what is wrong.  `make html`
works *and* makes the link.

At first look it might be to do with the anchor on a definition list but
I tried removing the anchors and expected to get an error for `blob`
next.  But that didn’t happen.

In short I don’t see what is special about `tree`.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 20, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> writes:

>     xmlto: <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml does not validate (status 3)
>     xmlto: Fix document syntax or use --skip-validation option
>     <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:71: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
>     <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:96: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
>     <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:397: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
>     Document <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml does not validate
>     make[1]: *** [Makefile:380: gitdatamodel.7] Error 13
>     make[1]: *** Waiting for unfinished jobs....
>     make[1]: Leaving directory '<git repo>/Documentation/tmp-doc-diff/worktree/Documentation'
>     make: *** [Makefile:3676: install-man] Error 2
>     make: Leaving directory '<git repo>/Documentation/tmp-doc-diff/worktree'
>
> The syntax looks correct.  So I don’t know what is wrong.  `make html`
> works *and* makes the link.
>
> At first look it might be to do with the anchor on a definition list but
> I tried removing the anchors and expected to get an error for `blob`
> next.  But that didn’t happen.
>
> In short I don’t see what is special about `tree`.

This seems to work it around without breaking .html generation too
badly for AsciiDoc and without breaking .7/.html generation for
Asciidoctor.  Generation of .7 were broken with AsciiDoc so we
cannot complain even if the result is suboptimal, but the generated
manpage with this patch using AsciiDoc did not look too bad, either.

I do not know AsciiDoc internals (and I am not particularly
interested to learn it now), but I am guessing that the bug is that
when it sees [[tree]], it tries to find an element to put id="tree",
but before it finds any approprifate one, it sees [[filemode]] and
uses the element it finds to hold id="filemode", losing sight of the
need to add id="tree" somewhere.



 Documentation/gitdatamodel.adoc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
index f49574dfae..7232fe3861 100644
--- a/Documentation/gitdatamodel.adoc
+++ b/Documentation/gitdatamodel.adoc
@@ -83,8 +83,10 @@ trees::
     A tree is how Git represents a directory. It lists, for each item in
     the tree:
 +
+1. The *file mode*, for example `100644`.
++
 [[file-mode]]
-1. The *file mode*, for example `100644`. The format is inspired by Unix
+The format is inspired by Unix
    permissions, but Git's modes are much more limited. Git only supports these file modes:
 +
   - `100644`: regular file (with type `blob`)
-- 
2.51.1-556-g06b2a500e9

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 20, 2025

This patch series was integrated into seen via git@d23a353.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 20, 2025

There was a status update in the "Cooking" section about the branch je/doc-data-model on the Git mailing list:

Add a new manual that describes the data model.

Comments?
source: <pull.1981.v3.git.1760476346040.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 20, 2025

This patch series was integrated into seen via git@8e0fcac.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 21, 2025

This patch series was integrated into seen via git@7c0ccde.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 22, 2025

There was a status update in the "Cooking" section about the branch je/doc-data-model on the Git mailing list:

Add a new manual that describes the data model.

Comments?
source: <pull.1981.v3.git.1760476346040.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 22, 2025

This patch series was integrated into seen via git@8b30bb2.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 23, 2025

This patch series was integrated into seen via git@6526b31.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 23, 2025

This patch series was integrated into seen via git@74d6583.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 23, 2025

There was a status update in the "Cooking" section about the branch je/doc-data-model on the Git mailing list:

Add a new manual that describes the data model.

Expecting a reroll.
cf. <0eb276ef-7b1a-4e79-93da-13a83226aa01@app.fastmail.com>
source: <pull.1981.v3.git.1760476346040.gitgitgadget@gmail.com>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant