Fix GitDagBundle submodule clone not using SSH settings from connection#64879
Fix GitDagBundle submodule clone not using SSH settings from connection#64879shunsuke-sugita wants to merge 16 commits intoapache:mainfrom
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates GitDagBundle submodule operations to inherit SSH configuration from the Git connection by running git submodule sync/update under GitPython’s custom_environment (using the worktree Repo).
Changes:
- Wraps submodule sync/update calls with
GIT_SSH_COMMAND(when present inhook.env). - Introduces a no-op context (
nullcontext) when no SSH env override is needed.
| self._log.info("Initializing and updating submodules", repo_path=self.repo_path) | ||
| self.repo.git.submodule("sync", "--recursive") | ||
| self.repo.git.submodule("update", "--init", "--recursive", "--jobs", "1") | ||
| cm = nullcontext() |
There was a problem hiding this comment.
nullcontext must be in scope for this to run. If it isn’t already imported in this module, add from contextlib import nullcontext (or reference it as contextlib.nullcontext) to avoid a NameError at runtime.
There was a problem hiding this comment.
nullcontext is already imported at the top of this module: from contextlib import nullcontext (line 21)
| cm = nullcontext() | ||
| if self.hook and (cmd := self.hook.env.get("GIT_SSH_COMMAND")): | ||
| cm = self.repo.git.custom_environment(GIT_SSH_COMMAND=cmd) | ||
| with cm: |
There was a problem hiding this comment.
The name cm is very generic for a value that controls an environment override. Consider renaming it to something more descriptive (e.g., env_cm / ssh_env_cm) to make the intent clearer when reading and debugging this method.
There was a problem hiding this comment.
Variable name: cm is now ssh_env_cm in _fetch_bare_repo and _fetch_submodules to reflect that this context manager controls the SSH environment for git subprocesses. I left cm as-is for configure_hook_env() blocks since that’s broader hook setup.
| if self.hook and (cmd := self.hook.env.get("GIT_SSH_COMMAND")): | ||
| cm = self.repo.git.custom_environment(GIT_SSH_COMMAND=cmd) | ||
| with cm: | ||
| self._log.info("Initializing and updating submodules", repo_path=self.repo_path) | ||
| self.repo.git.submodule("sync", "--recursive") | ||
| self.repo.git.submodule("update", "--init", "--recursive", "--jobs", "1") |
There was a problem hiding this comment.
This change introduces new behavior (wrapping submodule sync/update with custom_environment when GIT_SSH_COMMAND is present). Add/extend a unit test to assert custom_environment(GIT_SSH_COMMAND=...) is invoked (and that submodule commands run within that context) when hook.env provides GIT_SSH_COMMAND.
There was a problem hiding this comment.
added tests in providers/git/tests/unit/git/bundles/test_git.py that assert custom_environment(GIT_SSH_COMMAND=...) when hook.env includes it, that submodule sync/update run inside that context (context manager enter/exit and call order on repo.git), and that custom_environment is not used when GIT_SSH_COMMAND is missing.
|
Hi @kaxil , sorry to bother you while you’re busy — just a gentle reminder on this PR. |
| mock_repo = mock.MagicMock() | ||
| ssh_ctx = mock.MagicMock() | ||
| mock_repo.git.custom_environment.return_value = ssh_ctx |
There was a problem hiding this comment.
These new tests create MagicMock() instances without a spec/autospec (for both the repo and the context manager). Using a spec helps catch typos and mismatched GitPython APIs (e.g. git.custom_environment, git.submodule) and prevents silently passing tests when production code changes. Consider switching these mocks to create_autospec(...) or MagicMock(spec=...) targeting the minimal interface you use here.
| mock_repo = mock.MagicMock() | ||
| bundle = GitDagBundle( |
There was a problem hiding this comment.
This test uses an unspec'd MagicMock() for bundle.repo. Adding a spec/autospec (even if only for the git attribute and its submodule method) will make the test better at detecting API misuse and reduce the chance of false positives.
| [mock.call("sync", "--recursive"), mock.call("update", "--init", "--recursive", "--jobs", "1")] | ||
| ) | ||
| assert mock_repo.git.mock_calls[0] == mock.call.custom_environment(GIT_SSH_COMMAND=expected_ssh_cmd) | ||
| assert mock_repo.git.mock_calls[1] == mock.call.submodule("sync", "--recursive") |
There was a problem hiding this comment.
This assertion will fail. With a with block, mock_repo.git.mock_calls is [call.custom_environment(...), call.custom_environment().__enter__(), call.submodule("sync", "--recursive"), call.submodule("update", ...), call.custom_environment().__exit__(None, None, None)], so index [1] is __enter__(), not the submodule call. I verified this empirically. Either drop these two index assertions (the assert_has_calls above already enforces submodule ordering) or use index [2].
There was a problem hiding this comment.
Implemented the suggestions from #64879 (comment)
See: 2dcfd1d
| mock_repo.git.submodule.assert_has_calls( | ||
| [mock.call("sync", "--recursive"), mock.call("update", "--init", "--recursive", "--jobs", "1")] | ||
| ) | ||
| assert mock_repo.git.mock_calls[0] == mock.call.custom_environment(GIT_SSH_COMMAND=expected_ssh_cmd) |
There was a problem hiding this comment.
This pair of mock_calls[0]/mock_calls[1] checks is redundant with assert_has_calls on the line above, which already enforces the order of submodule calls. Dropping both also fixes the broken [1] index. A cleaner check for "custom_environment is entered before submodule runs" is ssh_ctx.__enter__.assert_called_once() plus assert_has_calls, which you already have.
| self.repo.git.submodule("update", "--init", "--recursive", "--jobs", "1") | ||
| ssh_env_cm = nullcontext() | ||
| if self.hook and (cmd := self.hook.env.get("GIT_SSH_COMMAND")): | ||
| ssh_env_cm = self.repo.git.custom_environment(GIT_SSH_COMMAND=cmd) |
There was a problem hiding this comment.
Only GIT_SSH_COMMAND is forwarded via custom_environment, but configure_hook_env also populates hook.env with SSH_ASKPASS/DISPLAY/SSH_ASKPASS_REQUIRE for passphrase-protected keys. These work today because _fetch_submodules is always called inside an active configure_hook_env() in _initialize/refresh, which sets those on os.environ. If a future refactor drops that outer context, passphrase-protected keys will silently break while the non-passphrase path keeps working. Would you consider forwarding the full dict (custom_environment(**self.hook.env)) or at minimum a comment documenting that this depends on the outer configure_hook_env context?
|
@kaxil Thanks for the review. I’ve pushed updates addressing your feedback—please take another look when you have a moment. |
| def test_fetch_submodules_skips_custom_environment_without_git_ssh_command(self, mock_githook_class): | ||
| """When hook.env is empty, submodule update does not use custom_environment.""" | ||
| mock_hook = mock_githook_class.return_value | ||
| mock_hook.repo_url = "git@github.com:apache/airflow.git" | ||
| mock_hook.env = {} |
There was a problem hiding this comment.
The test name test_fetch_submodules_skips_custom_environment_without_git_ssh_command doesn’t match what the test actually verifies: it sets hook.env = {} and asserts custom_environment is not called. In production code, custom_environment is used whenever hook.env is a non-empty dict (even if it didn’t contain GIT_SSH_COMMAND). Rename this test to reflect the empty-env behavior, or change the fixture to use a non-empty env without GIT_SSH_COMMAND and assert the intended behavior.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| mock.call.submodule("sync", "--recursive"), | ||
| mock.call.submodule("update", "--init", "--recursive", "--jobs", "1"), | ||
| mock.call.exit(None, None, None), | ||
| ] |
There was a problem hiding this comment.
Add a blank line between these two test methods to match the surrounding formatting (one blank line between methods inside the test class). As-is, the next @mock.patch decorator starts immediately after the previous assertion block.
| ] | |
| ] |
| bundle._fetch_submodules() | ||
|
|
||
| mock_git.custom_environment.assert_called_once_with(GIT_SSH_COMMAND=expected_ssh_cmd) | ||
|
|
||
| parent = mock.Mock() | ||
| parent.attach_mock(mock_git.custom_environment, "custom_environment") | ||
| parent.attach_mock(ssh_ctx.__enter__, "enter") | ||
| parent.attach_mock(mock_git.submodule, "submodule") | ||
| parent.attach_mock(ssh_ctx.__exit__, "exit") | ||
|
|
||
| assert parent.mock_calls == [ | ||
| mock.call.custom_environment(GIT_SSH_COMMAND=expected_ssh_cmd), | ||
| mock.call.enter(), | ||
| mock.call.submodule("sync", "--recursive"), | ||
| mock.call.submodule("update", "--init", "--recursive", "--jobs", "1"), | ||
| mock.call.exit(None, None, None), | ||
| ] |
There was a problem hiding this comment.
parent.attach_mock(...) is done after bundle._fetch_submodules() has already run, so parent.mock_calls will not include the earlier calls (calls are only propagated to the parent at call time, when the child is already attached). This makes the call-order assertion unreliable/likely to fail. Attach the mocks to parent before invoking _fetch_submodules() (or assert ordering using the individual mocks’ mock_calls/method_calls instead).
What
GitDagBundlenow wrapsgit submodule sync/submodule updatewithGIT_SSH_COMMANDfrom the Git connection (via GitPythoncustom_environment), using the worktreeRepo(self.repo.git), not the bare mirror.Why
configure_hook_env()populateshook.env, and bare clone passes it withclone_from(..., env=...). Submodule commands are separate Git subprocesses and did not receiveGIT_SSH_COMMAND, so SSH URLs in.gitmodulescould fail (host key / identity) even when the main repo clone worked.Example failure (before this fix)
Submodule update could fail with errors such as:
Notes
_fetch_bare_repofororigin.fetchon the bare repo.submodules=True; no behavior change to other bundle modes.Was generative AI tooling used to co-author this PR?
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.