ignores: handle non UTF-8 exclude files #2157

Closed
Matthieu-Beauchamp wants to merge 1 commit into git:master from Matthieu-Beauchamp:unicode-support-gitignore

Conversation

@Matthieu-Beauchamp

@Matthieu-Beauchamp Matthieu-Beauchamp commented Jan 3, 2026

CC: Matheus Tavares matheus.tavb@gmail.com
CC: Johannes Schindelin johannes.schindelin@gmx.de
cc: Torsten Bögershausen tboegi@web.de
cc: "brian m. carlson" sandals@crustytoothpaste.net
cc: Collin Funk collin.funk1@gmail.com
cc: Phillip Wood phillip.wood123@gmail.com

@gitgitgadget-git

Welcome to GitGitGadget

Hi @Matthieu-Beauchamp, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that either:

  • Your Pull Request has a good description, if it consists of multiple commits, as it will be used as cover letter.
  • Your Pull Request description is empty, if it consists of a single commit, as the commit message should be descriptive enough by itself.

You can CC potential reviewers by adding a footer to the PR description with the following syntax:

CC: Revi Ewer <revi.ewer@example.com>, Ill Takalook <ill.takalook@example.net>

NOTE: DO NOT copy/paste your CC list from a previous GGG PR's description,
because it will result in a malformed CC list on the mailing list. See
example.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this. Note that any reviewers CC'd via the list in the PR description will not actually be sent emails.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@gitgitgadget-git

There is an issue in commit ab8623c:
ignores: handle non UTF-8 exclude files

  • Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
    Indented lines, and lines without whitespace, are exempt

When reading exclude files, git assumes it is encoded in UTF-8 and will
fail to apply patterns if it isn't. This is a silent failure as no warning
or errors are shown to the users. This is a problem that can take a while
to diagnose as many users will not think of checking the encoding of their
file and may believe their patterns are wrong instead. Users may also
accidentally commit undesired files.

On Windows, this happens if a user uses Windows PowerShell to create the
file, which results in a UTF-16LE file with a BOM. This issue was discussed
here git-for-windows#3329. An example of
where a user was confused that his exclude file was not working is cited
git-for-windows#3227.

A minimal fix should at least warn the user if git cannot properly decode
the exclude file. Ideally, git would handle any given Unicode file.

First, check if a BOM is present. If it is, decode the file to UTF-8.
If no BOM is detected, then try to parse the file as UTF-8. If that fails,
attempt to decode the file using the working tree encoding of the file,
if any. If that fails, print a warning to tell the user that the exclude
file could not be decoded and skip the file.

This raises the issue that if the entire tree is encoded in, for example
UTF-16BE (no BOM), then even if the encoding is given in .gitattributes,
git would not be able to decode it. I believe that this is still
acceptable since a warning will be emitted for the file (since it has no
BOM, is not valid UTF-8 and no working tree encoding could be found).

One case that isn't handled is if a wrong encoding is given in the
attributes and the exclude file has no BOM and is not UTF-8. Using
iconv to convert an UTF16BE file to UTF-8 while specifying UTF-16LE
yields gibberish without an error and so this case is a silent failure
where no patterns will match.

Signed-off-by: Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail.com>
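
The first step described in the commit message above, checking for a BOM, can be sketched in POSIX shell in the style of t/lib-encoding.sh. This is only an illustration under stated assumptions, not the code from the patch: the function name `detect_bom` is hypothetical, and the sketch only inspects the first four bytes of the file.

```sh
# detect_bom FILE
# Hypothetical helper (not part of the patch): print the encoding implied
# by a leading byte order mark, or "none" if FILE does not start with one.
detect_bom () {
	# Dump at most the first four bytes as lowercase hex, e.g. "fffe6100".
	first_bytes=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
	case "$first_bytes" in
	0000feff) echo UTF-32BE ;;
	# FF FE 00 00 could also be a UTF-16LE BOM followed by U+0000; this
	# sketch resolves the ambiguity in favor of UTF-32LE, so the UTF-32
	# patterns must be tested before the UTF-16 ones.
	fffe0000) echo UTF-32LE ;;
	efbbbf*)  echo UTF-8 ;;
	feff*)    echo UTF-16BE ;;
	fffe*)    echo UTF-16LE ;;
	*)        echo none ;;
	esac
}
```

A file without a BOM (including UTF-16BE without a BOM, the case the commit message calls out) is reported as "none", so the later steps of the described fallback chain would still have to decide what to do with it.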
@Matthieu-Beauchamp Matthieu-Beauchamp force-pushed the unicode-support-gitignore branch from ab8623c to 239fc55 Compare January 3, 2026 18:47
@gitgitgadget-git

There is an issue in commit 239fc55:
ignores: handle non UTF-8 exclude files

  • Lines in the body of the commit messages should be wrapped between 60 and 76 characters.
    Indented lines, and lines without whitespace, are exempt

@dscho
Member

dscho commented Jan 3, 2026

/allow

@gitgitgadget-git

User Matthieu-Beauchamp is now allowed to use GitGitGadget.

@dscho
Member

dscho commented Jan 3, 2026

@Matthieu-Beauchamp I re-ran the previously failing handle-pr-push workflow, which now succeeded thanks to your fix!

@Matthieu-Beauchamp
Author

@dscho That's great, thank you for your help! 🎉

@Matthieu-Beauchamp
Author

/preview

@gitgitgadget-git

Preview email sent as pull.2157.git.git.1767477473709.gitgitgadget@gmail.com

@Matthieu-Beauchamp
Author

/preview

@gitgitgadget-git

Preview email sent as pull.2157.git.git.1767478358179.gitgitgadget@gmail.com

@Matthieu-Beauchamp
Author

/submit

@gitgitgadget-git

Submitted as pull.2157.git.git.1767478617198.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-2157/Matthieu-Beauchamp/unicode-support-gitignore-v1

To fetch this version to local tag pr-git-2157/Matthieu-Beauchamp/unicode-support-gitignore-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-2157/Matthieu-Beauchamp/unicode-support-gitignore-v1

@gitgitgadget-git

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Matthieu Beauchamp-Boulay via GitGitGadget"
<gitgitgadget@gmail.com> writes:

> From: Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail.com>
>
> When reading exclude files, git assumes it is encoded in UTF-8 and will
> fail to apply patterns if it isn't.

Is it true?  I thought we assume that the exclude patterns are
written in such a way to match the encoding of the pathnames,
whatever used on the platform that our calls to readdir(3) returns.
Some platforms may have compat/ code to convert these paths and
force use of UTF-8, but please do not write such platform local
conventions as if it were universal characteristics of our system.

"ignores" -> "exclude" on the title, as that is the canonical word
we use in the codebase to refer to the ignore mechanism.

@gitgitgadget-git

On the Git mailing list, Torsten Bögershausen wrote (reply to this):

On Sat, Jan 03, 2026 at 10:16:57PM +0000, Matthieu Beauchamp-Boulay via GitGitGadget wrote:
> From: Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail.com>
Thanks for contributing - some comments inline
> 
> When reading exclude files, git assumes it is encoded in UTF-8 and will
Question: The report cited below talks about ignore files.

> fail to apply patterns if it isn't. This is a silent failure as no warning
> or errors are shown to the users. This is a problem that can take a while
> to diagnose as many users will not think of checking the encoding of their
> file and may believe their patterns are wrong instead. Users may also
> accidentally commit undesired files.
Note:
git status is your friend.
Blindly committing without checking what is staged or not may
lead to unwanted results.

> 
> On Windows, this happens if a user uses Windows PowerShell to create the
> file, which results in a UTF-16LE file with a BOM.
>  This issue was discussed
> here https://github.com/git-for-windows/git/issues/3329. An example of
> where a user was confused that his exclude file was not working is cited
> https://github.com/git-for-windows/git/issues/3227.
A very short research indicates that powershell can be configured
to use UTF-8. I am not a powershell user, please correct if I am wrong.

> 
> A minimal fix should at least warn the user if git cannot properly decode
> the exclude file.
I think that reading an ignore file that contains a '\0' could/should
cause Git to complain. If someone asks me, most users are tempted to
ignore warnings for different reasons. Bailing out may feel more
impolite, but it makes it clearer that something is wrong.

>Ideally, git would handle any given Unicode file.
That is debatable.

> 
> First, check if a BOM is present. If it is, decode the file to UTF-8.
> If no BOM is detected, then try to parse the file as UTF-8. If that fails,
> attempt to decode the file using the working tree encoding of the file,
> if any. If that fails, print a warning to tell the user that the exclude
> file could not be decoded and skip the file.
> 
> This raises the issue that if the entire tree is encoded in, for example
> UTF-16BE (no BOM), then even if the encoding is given in .gitattributes,
> git would not be able to decode it.
"Able to decode": yes. But willing to do so: not with the patch, right?
> I believe that this is still
> acceptable since a warning will be emitted for the file (since it has no
> BOM, is not valid UTF-8 and no working tree encoding could be found).
> 
> One case that isn't handled is if a wrong encoding is given in the
> attributes and the exclude file has no BOM and is not UTF-8. Using
> iconv to convert an UTF16BE file to UTF-8 while specifying UTF-16LE
> yields gibberish without an error and so this case is a silent failure
> where no patterns will match.
One question is whether we should look at working_tree_encoding at all.
The other is how much UTF-16 handling of ignore or other files
we should have in Git.
It seems that this fix is for a very special case only?

From
https://github.com/git-for-windows/git/issues/3329
we read:
/******/
if (size > 1 && buf[0] == 0xff && buf[1] == 0xfe) {
    char *reencoded = reencode_string_len(buf, size, "UTF-8", "UTF16-LE-BOM", &size);
    if (!reencoded)
        die(_("could not convert contents of '%s' from UTF-16"), fname);
    free(buf);
    buf = reencoded;
}
/******/
(Which seems a simpler suggestion)
However, there is no UTF-16-LE-BOM in iconv
(at least in the majority of implementations),
so a better approach, totally untested, may be:

if (size >= 2 && buf[0] == 0xff && buf[1] == 0xfe) {
    char *reencoded = reencode_string_len(buf+2, size-2, "UTF-8", "UTF16", &size);
    if (!reencoded)
        die(_("could not convert contents of '%s' from UTF-16"), fname);
    free(buf);
    buf = reencoded;
}

This leads to some free thinking, especially when we look at
other implementations of Git:
Would it be better to simply bail out on UTF-16 files?
Technically, on all files with a '\0'.
[snip] 
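
The "strip the BOM, then convert" idea in the second snippet above can be tried outside of Git with standard tools. The following is a minimal shell sketch, not Git code: the function name `utf16le_bom_to_utf8` is hypothetical, and it assumes an `iconv` binary that knows the `UTF-16LE` and `UTF-8` names. It passes `UTF-16LE` explicitly rather than a generic UTF-16 name because, once the BOM is removed, a generic name may leave the byte order up to the iconv implementation.

```sh
# utf16le_bom_to_utf8 FILE
# Hypothetical helper: if FILE starts with the UTF-16LE BOM (FF FE), skip
# those two bytes and print the rest converted to UTF-8. Returns non-zero
# if there is no such BOM or if iconv rejects the input.
utf16le_bom_to_utf8 () {
	bom=$(od -An -tx1 -N2 "$1" | tr -d '[:space:]')
	test "$bom" = fffe || return 1
	# "tail -c +3" starts output at the third byte, i.e. after the BOM.
	tail -c +3 "$1" | iconv -f UTF-16LE -t UTF-8
}
```

For example, `utf16le_bom_to_utf8 .gitignore >.gitignore.utf8` would write a UTF-8 copy of a file created by Windows PowerShell, leaving the original untouched.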

@gitgitgadget-git

User Torsten Bögershausen <tboegi@web.de> has been added to the cc: list.

@gitgitgadget-git

On the Git mailing list, "brian m. carlson" wrote (reply to this):

On 2026-01-03 at 22:16:57, Matthieu Beauchamp-Boulay via GitGitGadget wrote:
> When reading exclude files, git assumes it is encoded in UTF-8 and will
> fail to apply patterns if it isn't. This is a silent failure as no warning
> or errors are shown to the users. This is a problem that can take a while
> to diagnose as many users will not think of checking the encoding of their
> file and may believe their patterns are wrong instead. Users may also
> accidentally commit undesired files.

This isn't actually true.  Git allows arbitrary byte sequences in the
file because Git allows filenames to have arbitrary byte sequences, just
like Unix.

> On Windows, this happens if a user uses Windows PowerShell to create the
> file, which results in a UTF-16LE file with a BOM. This issue was discussed
> here https://github.com/git-for-windows/git/issues/3329. An example of
> where a user was confused that his exclude file was not working is cited
> https://github.com/git-for-windows/git/issues/3227.

Ah, yes, here's the problem.  UTF-16LE is used on Windows, and on
Windows, Git stores pathnames as if they were converted into UTF-8, so
you do need to write the filenames in UTF-8 in the ignore file.

> A minimal fix should at least warn the user if git cannot properly decode
> the exclude file. Ideally, git would handle any given Unicode file.

As I mentioned, the file isn't necessarily in UTF-8 or Unicode.  Here's
an example shell script to demonstrate (requires a non-macOS Unix):

----
#!/bin/sh

rm -fr test-repo
git init --object-format=sha256 test-repo
cd test-repo
touch abc.txt
touch "$(printf '\220')"
printf '\220\n' >.gitignore
git add .
git status
git ls-files -io --exclude-standard
----

I'll point out that all of this is also true for things like config
files (which are also used in `.gitmodules`) and `.gitattributes` files.
If we wanted to make a change, we would be wise to make it everywhere.

However, if we wanted to force `.gitignore` to UTF-8, we'd need to have
an escape mechanism to write non-UTF-8 sequences, and as far as I know,
we don't.

> First, check if a BOM is present. If it is, decode the file to UTF-8.
> If no BOM is detected, then try to parse the file as UTF-8. If that fails,
> attempt to decode the file using the working tree encoding of the file,
> if any. If that fails, print a warning to tell the user that the exclude
> file could not be decoded and skip the file.

We do not accept and strip BOMs in UTF-8 files elsewhere (including in
things like `git diff` output), so we should not do so here, either.
For Unicode files, if there is no BOM, then the standard is that it's
assumed to automatically be UTF-8, so a BOM is superfluous and not
recommended.

> diff --git a/t/lib-encoding.sh b/t/lib-encoding.sh
> index 2dabc8c73e..1b1cc357ba 100644
> --- a/t/lib-encoding.sh
> +++ b/t/lib-encoding.sh
> @@ -23,3 +23,11 @@ write_utf32 () {
>  	fi &&
>  	iconv -f UTF-8 -t UTF-32
>  }
> +
> +write_encoded () {
> +  iconv -f UTF-8 -t "$1"
> +}
> +
> +write_bom () {
> +  echo "$@" | perl -pe 's/\s+//g; $_=pack("H*", $_)'
> +}
> \ No newline at end of file

We place newlines at the end of our text files unless there's a good
reason not to.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

@gitgitgadget-git

User "brian m. carlson" <sandals@crustytoothpaste.net> has been added to the cc: list.

@gitgitgadget-git

On the Git mailing list, Matthieu Beauchamp wrote (reply to this):

On Sat, Jan 3, 2026 at 9:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Matthieu Beauchamp-Boulay via GitGitGadget"
> <gitgitgadget@gmail.com> writes:
>
> > From: Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail.com>
> >
> > When reading exclude files, git assumes it is encoded in UTF-8 and will
> > fail to apply patterns if it isn't.
>
> Is it true?  I thought we assume that the exclude patterns are
> written in such a way to match the encoding of the pathnames,
> whatever used on the platform that our calls to readdir(3) returns.
> Some platforms may have compat/ code to convert these paths and
> force use of UTF-8, but please do not write such platform local
> conventions as if it were universal characteristics of our system.

I believe you are correct, I wrongly assumed git would always
manipulate UTF-8 paths.
The revision will need to take the platform into consideration.

> "ignores" -> "exclude" on the title, as that is the canonical word
> we use in the codebase to refer to the ignore mechanism.

Thank you, I will update in the revision.

@gitgitgadget-git

On the Git mailing list, Matthieu Beauchamp wrote (reply to this):

On Sun, Jan 4, 2026 at 12:35 PM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Sat, Jan 03, 2026 at 10:16:57PM +0000, Matthieu Beauchamp-Boulay via GitGitGadget wrote:
> > From: Matthieu Beauchamp-Boulay <matthieu.beauchamp.boulay@gmail.com>
> Thanks for contributing - some comments inline
> >
> > When reading exclude files, git assumes it is encoded in UTF-8 and will
> Question: The report cited below talks about ignore files.
>
> > fail to apply patterns if it isn't. This is a silent failure as no warning
> > or errors are shown to the users. This is a problem that can take a while
> > to diagnose as many users will not think of checking the encoding of their
> > file and may believe their patterns are wrong instead. Users may also
> > accidentally commit undesired files.
> Note:
> git status is your friend.
> Blindly committing without checking what is staged or not may
> lead to unwanted results.

Yes of course, I'll remove that last line as it is not the problem I'm
really trying to fix.

> >
> > On Windows, this happens if a user uses Windows PowerShell to create the
> > file, which results in a UTF-16LE file with a BOM.
> >  This issue was discussed
> > here https://github.com/git-for-windows/git/issues/3329. An example of
> > where a user was confused that his exclude file was not working is cited
> > https://github.com/git-for-windows/git/issues/3227.
> A very short research indicates that powershell can be configured
> to use UTF-8. I am not a powershell user, please correct if I am wrong.
>

Yes you are correct, but I want to address the issues for users who may not
realize that they used the wrong encoding when creating their exclude file.
For that case I don't see how the fact that powershell can be configured to
UTF-8 helps, aside from preventing repeating the same mistake.

> >
> > A minimal fix should at least warn the user if git cannot properly decode
> > the exclude file.
> I think that reading an ignore file that contains a '\0' could/should
> cause Git to complain. If someone asks me, most users are tempted to
> ignore warnings for different reasons. Bailing out may feel more
> impolite, but it makes it clearer that something is wrong.

While I agree that warnings may be ignored, I feel like a wrongly encoded
exclude file is not an error that warrants stopping git entirely.

As other reviewers mentioned, I wrongly assumed that the encoding would
be UTF-8. The idea of looking for a null byte in the exclude file may be
helpful, since any d_name from readdir(3) is null-terminated. Checking
for a null byte before the end of the file could be a simple way to
detect a bad exclude file.

> >Ideally, git would handle any given Unicode file.
> That is debatable.

Of course, I'll rephrase that part.

> >
> > First, check if a BOM is present. If it is, decode the file to UTF-8.
> > If no BOM is detected, then try to parse the file as UTF-8. If that fails,
> > attempt to decode the file using the working tree encoding of the file,
> > if any. If that fails, print a warning to tell the user that the exclude
> > file could not be decoded and skip the file.
> >
> > This raises the issue that if the entire tree is encoded in, for example
> > UTF-16BE (no BOM), then even if the encoding is given in .gitattributes,
> > git would not be able to decode it.
> "Able to decode": yes. But willing to do so: not with the patch, right?
> > I believe that this is still
> > acceptable since a warning will be emitted for the file (since it has no
> > BOM, is not valid UTF-8 and no working tree encoding could be found).
> >
> > One case that isn't handled is if a wrong encoding is given in the
> > attributes and the exclude file has no BOM and is not UTF-8. Using
> > iconv to convert an UTF16BE file to UTF-8 while specifying UTF-16LE
> > yields gibberish without an error and so this case is a silent failure
> > where no patterns will match.
> One question is whether we should look at working_tree_encoding at all.
> The other is how much UTF-16 handling of ignore or other files
> we should have in Git.
> It seems that this fix is for a very special case only?
>
> From
> https://github.com/git-for-windows/git/issues/3329
> we read:
> /******/
> if (size > 1 && buf[0] == 0xff && buf[1] == 0xfe) {
>     char *reencoded = reencode_string_len(buf, size, "UTF-8", "UTF16-LE-BOM", &size);
>     if (!reencoded)
>         die(_("could not convert contents of '%s' from UTF-16"), fname);
>     free(buf);
>     buf = reencoded;
> }
> /******/
> (Which seems a simpler suggestion)
> However, there is no UTF-16-LE-BOM in iconv
> (at least in the majority of implementations),
> so a better approach, totally untested, may be:
>
> if (size >= 2 && buf[0] == 0xff && buf[1] == 0xfe) {
>     char *reencoded = reencode_string_len(buf+2, size-2, "UTF-8", "UTF16", &size);
>     if (!reencoded)
>         die(_("could not convert contents of '%s' from UTF-16"), fname);
>     free(buf);
>     buf = reencoded;
> }
>
> This leads to some free thinking, especially when we look at
> other implementations of Git:
> Would it be better to simply bail out on UTF-16 files?
> Technically, on all files with a '\0'.
> [snip]

I was trying to cover more possible use cases, but this may not be a desired
behavior after all. Other reviewers pointed out that the exclude file may have
an arbitrary encoding that needs to match the encoding of the paths as read
by git when using readdir(3).

You are correct, UTF-16-LE-BOM is a 'fictional' encoding handled by git. Git
handles the BOM and iconv will be passed the UTF-16LE encoding instead.

I would've liked to be able to handle any wrongly encoded exclude files, but
it's more complicated than I originally thought. Checking for a null byte
could be a simple way to detect some wrong encodings.
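
The null-byte check mentioned above can be sketched with standard tools. This is a minimal illustration rather than a proposed patch, and the function name `has_nul_byte` is hypothetical. It relies on the fact that any ASCII-range character, including the newline that ends each pattern, contains a zero byte when encoded as UTF-16 or UTF-32, whereas a pattern that has to match a null-terminated pathname cannot contain one.

```sh
# has_nul_byte FILE
# Hypothetical helper: succeed (exit 0) if FILE contains at least one
# NUL byte, fail (exit 1) otherwise.
has_nul_byte () {
	# Deleting NULs changes the content only if there were any; cmp -s
	# compares the filtered stream on stdin ("-") against the original.
	! tr -d '\000' <"$1" | cmp -s - "$1"
}
```

Such a check only catches some wrong encodings: a file in a single-byte legacy encoding, for instance, contains no NUL bytes and would pass unnoticed.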

@gitgitgadget-git

On the Git mailing list, Matthieu Beauchamp wrote (reply to this):

On Sun, Jan 4, 2026 at 2:40 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On 2026-01-03 at 22:16:57, Matthieu Beauchamp-Boulay via GitGitGadget wrote:
> > When reading exclude files, git assumes it is encoded in UTF-8 and will
> > fail to apply patterns if it isn't. This is a silent failure as no warning
> > or errors are shown to the users. This is a problem that can take a while
> > to diagnose as many users will not think of checking the encoding of their
> > file and may believe their patterns are wrong instead. Users may also
> > accidentally commit undesired files.
>
> This isn't actually true.  Git allows arbitrary byte sequences in the
> file because Git allows filenames to have arbitrary byte sequences, just
> like Unix.

Yes thank you for pointing that out, I had some wrong assumptions about the
encodings.

> > On Windows, this happens if a user uses Windows PowerShell to create the
> > file, which results in a UTF-16LE file with a BOM. This issue was discussed
> > here https://github.com/git-for-windows/git/issues/3329. An example of
> > where a user was confused that his exclude file was not working is cited
> > https://github.com/git-for-windows/git/issues/3227.
>
> Ah, yes, here's the problem.  UTF-16LE is used on Windows, and on
> Windows, Git stores pathnames as if they were converted into UTF-8, so
> you do need to write the filenames in UTF-8 in the ignore file.
>

Yes, the conversion from UTF16-LE to UTF-8 would need to be platform
specific.

> > A minimal fix should at least warn the user if git cannot properly decode
> > the exclude file. Ideally, git would handle any given Unicode file.
>
> As I mentioned, the file isn't necessarily in UTF-8 or Unicode.  Here's
> an example shell script to demonstrate (requires a non-macOS Unix):
>
> ----
> #!/bin/sh
>
> rm -fr test-repo
> git init --object-format=sha256 test-repo
> cd test-repo
> touch abc.txt
> touch "$(printf '\220')"
> printf '\220\n' >.gitignore
> git add .
> git status
> git ls-files -io --exclude-standard
> ----
>
> I'll point out that all of this is also true for things like config
> files (which are also used in `.gitmodules`) and `.gitattributes` files.
> If we wanted to make a change, we would be wise to make it everywhere.
>
> However, if we wanted to force `.gitignore` to UTF-8, we'd need to have
> an escape mechanism to write non-UTF-8 sequences, and as far as I know,
> we don't.

Right, I don't think forcing UTF-8 everywhere is worth it for a relatively
simple issue. If I can find a portable way to determine that an encoding
is incorrect (and possibly reencode it), I could apply it to those other files
as well.

> > First, check if a BOM is present. If it is, decode the file to UTF-8.
> > If no BOM is detected, then try to parse the file as UTF-8. If that fails,
> > attempt to decode the file using the working tree encoding of the file,
> > if any. If that fails, print a warning to tell the user that the exclude
> > file could not be decoded and skip the file.
>
> We do not accept and strip BOMs in UTF-8 files elsewhere (including in
> things like `git diff` output), so we should not do so here, either.
> For Unicode files, if there is no BOM, then the standard is that it's
> assumed to automatically be UTF-8, so a BOM is superfluous and not
> recommended.

I meant checking for UTF-16 and UTF-32 BOMs and then converting to UTF-8,
I will clarify if this part is still in the revision.

> > diff --git a/t/lib-encoding.sh b/t/lib-encoding.sh
> > index 2dabc8c73e..1b1cc357ba 100644
> > --- a/t/lib-encoding.sh
> > +++ b/t/lib-encoding.sh
> > @@ -23,3 +23,11 @@ write_utf32 () {
> >       fi &&
> >       iconv -f UTF-8 -t UTF-32
> >  }
> > +
> > +write_encoded () {
> > +  iconv -f UTF-8 -t "$1"
> > +}
> > +
> > +write_bom () {
> > +  echo "$@" | perl -pe 's/\s+//g; $_=pack("H*", $_)'
> > +}
> > \ No newline at end of file
>
> We place newlines at the end of our text files unless there's a good
> reason not to.
> --
> brian m. carlson (they/them)
> Toronto, Ontario, CA

I will fix it, I would've assumed that clang-format would fix that.

@gitgitgadget-git

On the Git mailing list, "brian m. carlson" wrote (reply to this):

On 2026-01-06 at 20:45:56, Matthieu Beauchamp wrote:
> On Sun, Jan 4, 2026 at 2:40 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > Ah, yes, here's the problem.  UTF-16LE is used on Windows, and on
> > Windows, Git stores pathnames as if they were converted into UTF-8, so
> > you do need to write the filenames in UTF-8 in the ignore file.
> >
> 
> Yes, the conversion from UTF16-LE to UTF-8 would need to be platform
> specific.

We typically don't want platform-specific behaviour in Git.  Many Git
contributors do not work on Windows but we want things to work as much
as possible identically across all platforms because it makes
development easier, as well as making it easier for users to reason
about the project.  I, for one, don't have a Windows system (nor do I
want one) but I do want my Git code to just work there.

As an example, we still use a POSIX shell in aliases and other settings
on Windows despite the fact that PowerShell is built into Windows
because it means that aliases and similar functionality just work
correctly regardless of platform and it allows users to write a config
file that works everywhere.

Instead of trying to force Git to gracefully handle UTF-16 in its config
files, my strong recommendation is to adjust your PowerShell scripts to
use UTF-8 instead[0] or use a POSIX shell.  I'll note that Microsoft's
new Edit text editor[1] defaults to UTF-8 (and, except on Windows, LF
line endings), so I know that Microsoft understands that UTF-8 is the
proper encoding to use on the Internet today.

[0] https://stackoverflow.com/questions/5596982/using-powershell-to-write-a-file-in-utf-8-without-the-bom
[1] Available at https://github.com/microsoft/edit and apparently
shipped with Windows.  I will say that I was impressed at its
functionality for a 231 KiB binary footprint.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

@gitgitgadget-git

On the Git mailing list, Collin Funk wrote (reply to this):

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> Instead of trying to force Git to gracefully handle UTF-16 in its config
> files, my strong recommendation is to adjust your PowerShell scripts to
> use UTF-8 instead[0] or use a POSIX shell.  I'll note that Microsoft's
> new Edit text editor[1] defaults to UTF-8 (and, except on Windows, LF
> line endings), so I know that Microsoft understands that UTF-8 is the
> proper encoding to use on the Internet today.
>
> [1] Available at https://github.com/microsoft/edit and apparently
> shipped with Windows.  I will say that I was impressed at its
> functionality for a 231 KiB binary footprint.

Does it handle text that is not UTF-8 encoded?

An unfortunate trend that I have seen with Rust programs is that they
completely disregard the system's locale. E.g. using
LC_ALL=en_US.ISO-8859-1 and passing an "À" character as an option will
typically fail since it is encoded as 0xC0 which is not a valid UTF-8
character.

I figured it was worth bringing up since Git may want to think about it
some before introducing more Rust. I think it can be worked around by
using OsString [1], but I guess many people choose not to.

Collin

[1] https://doc.rust-lang.org/std/ffi/struct.OsString.html
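
To make the OsString point concrete, here is a minimal sketch, assuming a Unix target (it uses `std::os::unix`). The `describe` helper is a hypothetical name introduced only for illustration. It shows the ISO-8859-1 byte 0xC0 being carried in an `OsString` and inspected without assuming UTF-8; `std::env::args_os()` is the counterpart of `std::env::args()`, which panics on arguments that are not valid Unicode.

```rust
use std::ffi::{OsStr, OsString};
use std::os::unix::ffi::{OsStrExt, OsStringExt};

/// Describe an argument without assuming that it is UTF-8.
/// (`describe` is a hypothetical helper, not part of any library.)
fn describe(arg: &OsStr) -> String {
    match arg.to_str() {
        Some(text) => format!("valid UTF-8: {}", text),
        None => format!("not UTF-8, raw bytes: {:02X?}", arg.as_bytes()),
    }
}

fn main() {
    // "À" encoded as ISO-8859-1 is the single byte 0xC0.
    let latin1 = OsString::from_vec(vec![0xC0]);
    println!("{}", describe(&latin1));

    // args_os() yields OsString values, so non-UTF-8 arguments are kept
    // as-is instead of causing a panic.
    for arg in std::env::args_os().skip(1) {
        println!("{}", describe(&arg));
    }
}
```

The extra effort is that every place that wants a `&str` has to decide what to do in the `None` case instead of unwrapping.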

@gitgitgadget-git

User Collin Funk <collin.funk1@gmail.com> has been added to the cc: list.

@gitgitgadget-git

On the Git mailing list, Phillip Wood wrote (reply to this):

On 07/01/2026 01:35, Collin Funk wrote:
> An unfortunate trend that I have seen with Rust programs is that they
> completely disregard the system's locale. E.g. using
> LC_ALL=en_US.ISO-8859-1 and passing an "À" character as an option will
> typically fail since it is encoded as 0xC0 which is not a valid UTF-8
> character.
>
> I figured it was worth bringing up since Git may want to think about it
> some before introducing more Rust. I think it can be worked around by
> using OsString [1], but I guess many people choose not to.

Git will certainly want to continue to support non-UTF-8 encodings. That's perfectly possible in Rust, but in my (rather limited) experience it does take a bit more effort than the equivalent code using the standard library's String type. I find it particularly annoying that "cargo run" refuses to pass non-UTF-8 arguments to the program being run when the program has been carefully written to support them.

Thanks

Phillip

@gitgitgadget-git

User Phillip Wood <phillip.wood123@gmail.com> has been added to the cc: list.

@gitgitgadget-git

On the Git mailing list, Phillip Wood wrote (reply to this):

On 06/01/2026 20:32, Matthieu Beauchamp wrote:
> Yes you are correct, but I want to address the issues for users who may not
> realize that they used the wrong encoding when creating their exclude file.
> For that case I don't see how the fact that powershell can be configured to
> UTF-8 helps, aside from preventing repeating the same mistake.

My concern with that is that it ends up hampering collaboration with people using bash on Windows or a native shell on other platforms. If they append to a UTF-16 encoded .gitignore with "echo path >>.gitignore", you'll end up with a mix of encodings in the same file. Similarly, if you use PowerShell to append to an existing UTF-8 encoded file with "echo hello >>.gitignore", is the appended text UTF-16 encoded, resulting in mixed encodings in the same file?

Thanks

Phillip
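
The mixed-encoding concern can be illustrated with a small sketch. It assumes the existing file is UTF-16LE with a byte order mark and that the appended line is plain UTF-8; `utf16le_bytes` and `decode_utf16le_lossy` are hypothetical helpers written for this example, not Git code. Decoding the combined bytes as UTF-16 keeps the first pattern but turns the appended one into unrelated characters.

```rust
/// Encode `text` as UTF-16LE bytes, optionally preceded by a byte order mark.
fn utf16le_bytes(text: &str, with_bom: bool) -> Vec<u8> {
    let mut out = Vec::new();
    if with_bom {
        out.extend_from_slice(&[0xFF, 0xFE]);
    }
    for unit in text.encode_utf16() {
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}

/// Decode bytes as UTF-16LE, skipping a leading BOM. Invalid code units
/// become U+FFFD and a trailing odd byte is dropped.
fn decode_utf16le_lossy(bytes: &[u8]) -> String {
    let body = if bytes.starts_with(&[0xFF, 0xFE]) {
        &bytes[2..]
    } else {
        bytes
    };
    let units = body
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    std::char::decode_utf16(units)
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // A UTF-16LE exclude file with one pattern...
    let mut file = utf16le_bytes("build/\n", true);
    // ...to which another tool appends a UTF-8 line.
    file.extend_from_slice(b"target/\n");

    // Read as UTF-16, the appended pattern is no longer "target/".
    println!("{:?}", decode_utf16le_lossy(&file));
    // Read as UTF-8, the first pattern is interleaved with NUL bytes.
    println!("{:?}", String::from_utf8_lossy(&file));
}
```

Neither reading recovers both patterns, which is why a per-file encoding guess cannot fix a file that several tools have appended to.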

@gitgitgadget-git

On the Git mailing list, "brian m. carlson" wrote (reply to this):

On 2026-01-07 at 01:35:11, Collin Funk wrote:
> An unfortunate trend that I have seen with Rust programs is that they
> completely disregard the system's locale. E.g. using
> LC_ALL=en_US.ISO-8859-1 and passing an "À" character as an option will
> typically fail since it is encoded as 0xC0 which is not a valid UTF-8
> character.

Git does not usually directly read input and then convert it to other
encodings unless specifically asked to (e.g., `working-tree-encoding`),
so I fully expect that nothing will change there.  However, in many
cases, Git also currently does not honour LC_ALL, such as for commit
messages.

> I figured it was worth bringing up since Git may want to think about it
> some before introducing more Rust. I think it can be worked around by
> using OsString [1], but I guess many people choose not to.

The people who have been working on Rust have been very careful to not
make assumptions that all data is UTF-8, and I don't expect that to
change.

OsString is slightly problematic because it is effectively UTF-8-ish (on
Windows, it's actually WTF-8 and on Unix it allows arbitrary bytes) but
there is no portable way to get any consistent byte encoding out of it.
(In versions of Rust too new for us to use, there is a function that
provides a byte encoding but it's not guaranteed to be stable across
versions.)  I have some custom code in one of my branches to handle the
conversion to and from OsString to a consistent byte encoding using some
traits to paper over the operating system differences.

In general, I expect we will continue to use some C-based interfaces
(possibly called via Rust wrappers) because Rust also does not expose
things like file descriptors on Windows or the full range of stat or
other information we need.

One assumption I do think is safe to make is that arbitrary Unicode can
be printed to the terminal, such as in error messages.  Considering that
virtually everybody sets IUTF8 in Unix terminals and we effectively do
that right now with localized text, I think that's okay.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA
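
For readers unfamiliar with the approach, the following is a rough sketch of what a trait that papers over the operating system differences might look like. It is not the code from the branch mentioned above: the trait name `ToRawBytes` and its method are hypothetical, and the Windows branch is deliberately simplified to accept only valid Unicode rather than implementing WTF-8.

```rust
use std::ffi::OsStr;

/// Hypothetical trait giving a byte view of an OS string that does not
/// depend on Rust's internal representation.
trait ToRawBytes {
    /// Returns `None` when no byte form is available on this platform.
    fn to_raw_bytes(&self) -> Option<Vec<u8>>;
}

#[cfg(unix)]
impl ToRawBytes for OsStr {
    fn to_raw_bytes(&self) -> Option<Vec<u8>> {
        use std::os::unix::ffi::OsStrExt;
        // On Unix an OsStr is arbitrary bytes, so this never fails.
        Some(self.as_bytes().to_vec())
    }
}

#[cfg(windows)]
impl ToRawBytes for OsStr {
    fn to_raw_bytes(&self) -> Option<Vec<u8>> {
        // Simplification: only strings that are valid Unicode are
        // converted; names with unpaired surrogates yield None.
        self.to_str().map(|s| s.as_bytes().to_vec())
    }
}

fn main() {
    let name = OsStr::new("café.txt");
    println!("{:?}", name.to_raw_bytes());
}
```

Callers then work with `Vec<u8>` everywhere and only touch `OsStr` at the boundary with the operating system.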

@gitgitgadget-git

On the Git mailing list, Collin Funk wrote (reply to this):

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2026-01-07 at 01:35:11, Collin Funk wrote:
>> An unfortunate trend that I have seen with Rust programs is that they
>> completely disregard the system's locale. E.g. using
>> LC_ALL=en_US.ISO-8859-1 and passing an "À" character as an option will
>> typically fail since it is encoded as 0xC0 which is not a valid UTF-8
>> character.
>
> Git does not usually directly read input and then convert it to other
> encodings unless specifically asked to (e.g., `working-tree-encoding`),
> so I fully expect that nothing will change there.  However, in many
> cases, Git also currently does not honour LC_ALL, such as for commit
> messages.

That makes sense.

>> I figured it was worth bringing up since Git may want to think about it
>> some before introducing more Rust. I think it can be worked around by
>> using OsString [1], but I guess many people choose not to.
>
> The people who have been working on Rust have been very careful to not
> make assumptions that all data is UTF-8, and I don't expect that to
> change.

Great, glad that it was considered. I guess you have to worry about
crates, but I think I recall wide agreement that Git was going to be
careful with what it decides to use.

> OsString is slightly problematic because it is effectively UTF-8-ish (on
> Windows, it's actually WTF-8 and on Unix it allows arbitrary bytes) but
> there is no portable way to get any consistent byte encoding out of it.
> (In versions of Rust too new for us to use, there is a function that
> provides a byte encoding but it's not guaranteed to be stable across
> versions.)  I have some custom code in one of my branches to handle the
> conversion to and from OsString to a consistent byte encoding using some
> traits to paper over the operating system differences.

Interesting, good to know. Thanks.

Unrelated to encoding, but two other things I noticed about Rust. Before
main() SIGPIPE is set to SIG_IGN which can be seen with the programs
below:

    $ cat main.rs 
    use std::io::{self, Write};
    fn main() -> io::Result<()> {
        io::stdout().write_all(b"hello world\n")?;
        Ok(())
    }
    $ cat main.c
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    int
    main (void)
    {
      static const char message[] = "hello world\n";
      if (write (STDOUT_FILENO, message, sizeof message - 1) < 0)
        {
          fprintf (stderr, "%s\n", strerror (errno));
          return EXIT_FAILURE;
        }
      return EXIT_SUCCESS;
    }
    $ rustc main.rs
    $ gcc main.c
    $ ./main | :
    Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
    $ echo ${PIPESTATUS[@]}
    1 0
    $ ./a.out | :
    $ echo ${PIPESTATUS[@]}
    141 0

Before executing a program using the standard library, SIGPIPE will be
set to SIG_DFL. That is better than not doing that, but both behaviors
mean that the typical behavior of inheriting signal actions from the
parent process is impossible without hacks or an unstable feature that
has been unfortunately stagnant for years [1].

Before main() all standard file descriptors are also opened. While
reasonable in many cases, that is not the desired behavior for all programs.
Using the same example programs:

    $ ./main >&-
    $ echo $?
    0
    $ ./a.out >&-
    Bad file descriptor
    $ echo $?
    1

I'm not sure if either of those will affect 'git' at all, assuming it is
mostly library code that is called from C.

But it will likely have to be considered if someone wants to write a
program that goes in libexec that is executed by 'git'.

Collin

[1] https://dev-doc.rust-lang.org/beta/unstable-book/language-features/unix-sigpipe.html
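
One possible workaround for the SIGPIPE case, sketched under assumptions, is to treat a broken pipe as a quiet stop instead of a reported error. This does not restore inheritance of the signal disposition; it only avoids the error message. The `write_or_stop` function and the `ClosedPipe` type are hypothetical names for this example, and the exit status 141 assumes SIGPIPE is signal 13, as on Linux.

```rust
use std::io::{self, ErrorKind, Write};

/// Stand-in for a pipe whose reader has gone away (illustration only).
struct ClosedPipe;

impl Write for ClosedPipe {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::from(ErrorKind::BrokenPipe))
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Write `data`; return Ok(false) instead of an error if the reader has
/// closed the pipe. Other I/O errors are passed through unchanged.
fn write_or_stop<W: Write>(out: &mut W, data: &[u8]) -> io::Result<bool> {
    match out.write_all(data) {
        Ok(()) => Ok(true),
        Err(e) if e.kind() == ErrorKind::BrokenPipe => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    if !write_or_stop(&mut io::stdout(), b"hello world\n")? {
        // Mimic the conventional 128 + SIGPIPE status (13 assumed) that
        // a shell reports for a process killed by the signal.
        std::process::exit(141);
    }
    Ok(())
}
```

The process still exits normally rather than being killed by the signal, so a parent that checks for signal death would see a difference.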

webstech pushed a commit to gitgitgadget/git that referenced this pull request Mar 12, 2026
…orktree (git#2157)

Running `git rebase -x 'npm test'` from a worktree wreaked havoc on my
real repository: `core.bare = true` and test `url.*.insteadOf` settings
ended up in the shared `.git/config`, test commits landed on the real
HEAD, and a bogus `refs/notes/gitgitgadget` ref appeared out of nowhere.
For a terrifying moment I thought I had lost actual work.

The root cause turned out to be `git rebase --exec` setting `GIT_DIR` in
the environment. This leaked into the Node.js test processes, causing
`git init <target>` to silently reinitialize the *real* repository
instead of creating a fresh one in the target directory -- git simply
ignores the target argument when `GIT_DIR` is set.

Diagnosing this was a bit trickier than I hoped for. Adding
`GIT_CEILING_DIRECTORIES` alone did not help because `GIT_DIR` takes
precedence over repository discovery, making ceiling directories
irrelevant. A secondary leak hid in `misc-helper.test.ts` which
captured `process.env` at module load time, before `testCreateRepo` had
a chance to clear the offending variables.

This series clears `GIT_DIR`/`GIT_WORK_TREE` early, adds
`GIT_CEILING_DIRECTORIES` as defense-in-depth, introduces a
`validateWorkDir()` safety net that fails loudly if git would operate
outside the test directory, and fixes a pre-existing test that
accidentally depended on the enclosing repo's config. It also tells Jest
to ignore `.test-dir/` so stale files from corrupted runs don't cause
confusing failures on the next run.