Use PCRE for our fallback regex engine when regcomp_l is unavailable#4935
Use PCRE for our fallback regex engine when regcomp_l is unavailable#4935
Conversation
589ed7b to
5cbf617
Compare
|
Aaaaaand we've reached that point in a C project where our circular |
232611c to
3883800
Compare
|
I broke some things out into their own header files. I'm not exactly thrilled with this in |
|
Please consider pcre2 instead of pcre. |
I did consider it, as I mentioned in this PR description and commit messages. What's the benefit to pcre2? |
|
From a cursory look (FWIW PHP RFC), it seems the supported Unicode version is 10 (vs. 7), and there might be some minor behavioral differences. |
pks-t
left a comment
There was a problem hiding this comment.
I think using an external library like pcre or pcre2 is a good solution to our problem. Does the syntax match the normal POSIX syntax, though? From documentation of PCRE2 1:
When PCRE2 is called via these functions, it is only the API that is POSIX-like in style. The syntax and semantics of the regular expressions themselves are still those of Perl, subject to the setting of various PCRE2 options, as described below. "POSIX-like in style" means that the API approximates to the POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding domains it is probably even less compatible.
So to me, it doesn't sound like they really are compatible. I don't know whether the same applies to libpcre(1), too.
Otherwise, I'm not a big fan of including the dependency directly. I'd rather say that we should try to use system provided dependencies, in the following order: regcomp_l, libpcre/libpcre2, regcomp, builtin regcomp. The last two obviously still have the notable shortcomings, but developers are actually able to work around them by calling setlocale at the start of their program. And given that we didn't have any reports of that problem for such a long time, I expect that this is really only a quite limited problem. If a developer hits it, then he may just make sure to also distribute the required system libraries.
| * a Linking Exception. For full terms see the included COPYING file. | ||
| */ | ||
| #ifndef INCLUDE_posix_regex_h__ | ||
| #define INCLUDE_posix_regex_h__ |
There was a problem hiding this comment.
Why do we use a separate header for this? Couldn't it simply be part of "posix.h"? I guess it'll get clearer in the following commits
| #define p_regexec regexec | ||
| #define p_regfree regfree | ||
|
|
||
| #ifdef GIT_USE_REGCOMP_L |
There was a problem hiding this comment.
This is broken, right? If you directly included "posix_regex.h", this define might never be set due to the header missing an include of "common.h"->"features.h"
| @@ -0,0 +1,137 @@ | |||
| INCLUDE(CheckIncludeFile) | |||
There was a problem hiding this comment.
Honestly, I'm not much of a fan of including another dependency into our code which we have to keep up-to-date. I'd rather make this a compile-time option, in the following order:
- libc's
regcomp_l - system libpcre
- system libpcre2
- libc's
regcomp, with the known shortcomings - built-in
regcomp, with the known shortcomings
I don't feel like it's reasonable to expect developers to have to change the locale settings in order for compiled-in parts of libgit2 to work, though. I thought that was the whole point of making these changes.
We have to include some regex dependency. Today it's code from glibc that always misbehaves in some situations. If we change that dependency to be PCRE instead, we can potentially fix that issue. Note that PCRE (v1, not PCRE2) is also git's fallback. So even if there are slight incompatibilities between GNU and PCRE, they're the same incompatibilities that git has, meaning paradoxically, that PCRE's incompatibilities keep us more compatible with git. |
Well, the point is giving developers at least some way of working around the issues, agreed. In some cases, they might just change locales. If that's not possible, they may decide to use this new route of using PCRE as a dependency instead by packaging it along with their application.
That we need to use some regex dependency is obvious. My point is only whether we want to bundle the dependency into our own codebase or make use of a system-provided PCRE dependency. I'm more inclined to use a system-provided one instead of bundling it with us, as otherwise we are responsible to notice and update whenever important bugfixes happen in bundled dependencies.
Ah, good to know. Indeed, git recently gained the ability to also use PCRE2 instead of PCRE1. |
Hmm, okay. So what you're saying is ship the current You gave this list of priorities:
It sounds like you're saying that you want to keep shipping the glibc regcomp. I guess I don't understand why shipping one dependency (the current glibc |
|
On Fri, Jan 25, 2019 at 09:19:05AM -0800, Edward Thomson wrote:
> libc's regcomp_l
> system libpcre
> system libpcre2
> libc's regcomp, with the known shortcomings
> built-in regcomp, with the known shortcomings
It sounds like you're saying that you want to keep shipping the
glibc regcomp. I guess I don't understand why shipping one
dependency (the current glibc `regcomp`) is superior to
shipping a different dependency (PCRE). We have the same
problem where we are responsible for noticing and updating when
important bugfixes happen, but there's no opportunity for us to
fix #4528 on systems where we use our built-in regcomp (eg,
Windows).
This is indeed a very good point. I mean we could say "Let's drop
regex and don't include PCRE" to make them kind of equal, but
this would make the setup on Windows a lot harder.
Still, my point kind of still applies: it sucks to have bundled
dependencies. But I'd be happy if we simply allowed users of
libgit2 to choose between system system `regcomp_l`, system
`regcomp`, system PCRE or builtin PCRE. So two things I'd like to
have:
- We should have a compile-time switch that allows users to
explicitly choose which backend to use. If no preference is
given, then we should pick: system `regcomp_l`, system PCRE,
builtin PCRE, in this order. Plain `regcomp` should not
automatically be chosen due to the reported problems, but it
might be useful to allow users to explicitly request it.
- The ability to choose between either system-provided libpcre or
builtin libpcre. On Linux-based distros, I expect that
distributors would rather like to link against libpcre than
using the bundled one.
|
This seems like a fine plan to me. |
3883800 to
a4c5336
Compare
|
You need any help with regards to my proposal? Happy to help, as it's essentially my request :) |
56639a2 to
c94cb30
Compare
|
OK, I updated this to take a If that is not specified during the cmake, then we will now prefer system |
| # include "win32/w32_common.h" | ||
| # include "win32/win32-compat.h" | ||
| # include "win32/error.h" | ||
| # include "win32/version.h" |
There was a problem hiding this comment.
It might make sense to have a "win32/win32.h" header that automatically handles all win32-specific includes. This is not in the scope of this PR, though
| #include "posix.h" | ||
| #include "userdiff.h" | ||
|
|
||
| #if LC_ALL > 0 |
There was a problem hiding this comment.
This is a bit weird. Does Windows define LC_ALL to 0 or doesn't it define it at all? If the latter's the case, then we should definitely use #ifndef instead. I'll also upload a PR that enables -Wundef in the short-term future to generate a warning for such uses
There was a problem hiding this comment.
Windows defines LC_ALL as 0.
6cef68c to
8885931
Compare
Users can now select which regex implementation they want to use: one of the system `regcomp_l`, the system PCRE, the builtin PCRE or the system's `regcomp`. By default the system `regcomp_l` will be used if it exists, otherwise the system PCRE will be used. If neither of those exist, then the builtin PCRE implementation will be used. The system's `regcomp` is not used by default due to problems with locales.
Attempt to locate a system-installed version of PCRE and use its POSIX compatibility layer, if possible.
Use PCRE2 and its POSIX compatibility layer if requested by the user. Although PCRE2 is adequate for our needs, the PCRE2 POSIX layer as installed on Debian and Ubuntu systems is broken, so we do not opt-in to it by default to avoid breaking users on those platforms.
|
OK, updated this and rebased it. |
| #define p_regexec regexec | ||
| #define p_regfree regfree | ||
|
|
||
| #ifdef GIT_USE_REGCOMP_L |
| OPTION(ENABLE_WERROR "Enable compilation with -Werror" OFF) | ||
| OPTION(USE_BUNDLED_ZLIB "Use the bundled version of zlib" OFF) | ||
| OPTION(DEPRECATE_HARD "Do not include deprecated functions in the library" OFF) | ||
| SET(REGEX "" CACHE STRING "Regular expression implementation. One of regcomp_l, pcre, regcomp, or builtin.") |
There was a problem hiding this comment.
This variable name may be a bit problematic, as it is a keyword in CMake. We might want to rename it. "REGEX_BACKEND" would come to my mind, as there's precedent for this already, even though I seem to recall that you don't like the name "backend"
There was a problem hiding this comment.
I'm fine with regex backend (my avoidance of "backend" comes from using it without qualification. That is I'm fine with having a regex backend but to talk about a "backend" as something concrete sounds odd to me.)
| */ | ||
|
|
||
| #ifdef GIT_REGEX_PCRE | ||
| #ifdef GIT_REGEX_BUILTIN |
There was a problem hiding this comment.
Huh? Where do we define p_regex_ stuff for GIT_REGEX_PCRE now?
Edit: never mind, the following commit clears that up
| IF(REGEX STREQUAL "regcomp_l") | ||
| ADD_FEATURE_INFO(regex ON "using system regcomp_l") | ||
| SET(GIT_REGEX_REGCOMP_L 1) | ||
| ELSEIF(REGEX STREQUAL "pcre2") |
There was a problem hiding this comment.
Any reason why we shouldn't auto-detect the PCRE2 library IF(REGEX STREQUAL "")?
There was a problem hiding this comment.
Hmm, I tried to mention this in a comment, but it looks like it got buried. (It is mentioned in the commit that introduces PCRE2.)
Basically, the POSIX compatibility layer for PCRE2 is completely broken on Ubuntu and Debian. If you look at PCRE, you'll see that it has symbols of pcre_regcomp, and then pcreposix.h sets up that indirection as #define regcomp pcre_regcomp. This avoids conflicting with the regex functions that are in libc.
The PCRE2 packages only do half of this, they have symbols for PCRE2regcomp but they forgot to set up the indirection in the header files. (They just declare regcomp.) So you end up using the regular libc regex bits. Worse, you use the pcre2 header files and their #defines for constants, so you end up passing the wrong constants to libc, and things go very wrong.
So we allow people to set PCRE2 manually, in case they have a working installation, but we don't autodetect it to avoid breaking Debian/Ubuntu users.
There was a problem hiding this comment.
Hm. Couldn't we just use CMake to detect whether the system PCRE2 is broken or not with CHECK_FUNCTION_EXISTS?
There was a problem hiding this comment.
Hmm, maybe yes, but since they haven't fixed it yet, I don't even have a test case that I can verify that we're getting it correct (yet).
My preference would be to hold off until a (shipping) version of Debian/Ubuntu has this fixed and we can test it...
There was a problem hiding this comment.
Fair enough. By holding off you probably just refer to the proper detection of non-broken PCRE2, right?
There was a problem hiding this comment.
Yeah, keep it as-is where people can explicitly opt-in to PCRE2 support but not enable PCRE2 by default (yet).
There was a problem hiding this comment.
For the record, pcre2 has added compatibility aliases (#define regexec pcre2_regexec) in 10.33 (in r1064).
|
I'll update |
|
I'm still a bit hesitant with regards to the PCRE2 bits, but this is really nothing that should hold back this PR as we do not loose any functionality at all. So: go for it! |
|
Cool. I want to get this merged so that we can stomp on the warnings in Windows (a lot of them come from the existing regex implementation). We can iterate on the PCRE2 support. |
This avoids any misunderstanding with the REGEX keyword in cmake.
|
Updated to use |
|
By the way, just came to my mind: should we update Azure for this? Like, having at least one build explicitly using bundled PCRE and |
|
(relatedly, macOS also has both the POSIX2008.smth |
Explicitly enable the `builtin` regex backend and the PCRE backend for some Linux builds.
|
OK, I've added PCRE to the trusty docker image, and set some explicit regex configuration during builds. (One Linux build explicitly sets PCRE, one explicitly sets builtin, and now the macOS built explicitly sets |
Perfect, thanks! No more remarks from my side now, I promise. |
In that case, 🚢 . Thanks for the review everyone! |
libgit2 has bundled pcre (not pcre2) in libgit2/libgit2#4935 Before this change configure printed "regex, using bundled PCRE", after this change it prints "regex, using system PCRE".
libgit2 has bundled pcre (not pcre2) in libgit2/libgit2#4935 Before this change configure printed "regex, using bundled PCRE", after this change it prints "regex, using system PCRE".
This removes the regex implementation that we took from glibc and replaces it with PCRE. PCRE contains a POSIX regex compatibility layer and ignores the currently running locale settings by default, which ensures that
LC_COLLATEwill not alter our functionality.Now we will always use
regcomp_lon systems where it is available, and when it is not, we will fallback to our builtin PCRE implementation. This means that we will never call the systemregcompwhich could be influenced by environment variables.This adds the older PCRE 8.42 library, instead of the newer PCRE2 library. The original PCRE has a smaller footprint, which is nice for us, and is still supported and receiving bugfixes. Minor modifications have been made to remove the CLI components included with PCRE and to prefix the names of the POSIX compatibility functions and structures with
pcre_so that they do not collide with the system's library (if it has one).I've also cherry-picked the tests from #4560.