Fix GitDagBundle submodule clone not using SSH settings from connection#64879

Open
shunsuke-sugita wants to merge 16 commits into apache:main from shunsuke-sugita:provider-git-fix-fetch-submodul

@shunsuke-sugita

What

GitDagBundle now wraps git submodule sync / submodule update with GIT_SSH_COMMAND from the Git connection (via GitPython custom_environment), using the worktree Repo (self.repo.git), not the bare mirror.

Why

configure_hook_env() populates hook.env, and bare clone passes it with clone_from(..., env=...). Submodule commands are separate Git subprocesses and did not receive GIT_SSH_COMMAND, so SSH URLs in .gitmodules could fail (host key / identity) even when the main repo clone worked.
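The wrapping described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `fetch_submodules` and its parameters are hypothetical, and a `MagicMock` stands in for the worktree `repo.git` object so the snippet runs without GitPython or a real repository.

```python
from contextlib import nullcontext
from unittest import mock


def fetch_submodules(git, hook_env):
    """Run submodule sync/update, exporting GIT_SSH_COMMAND when it is set.

    `git` stands in for the worktree `repo.git` object and `hook_env` for
    `hook.env`; both names are illustrative assumptions.
    """
    ssh_env_cm = nullcontext()  # no-op when there is nothing to override
    if cmd := hook_env.get("GIT_SSH_COMMAND"):
        ssh_env_cm = git.custom_environment(GIT_SSH_COMMAND=cmd)
    with ssh_env_cm:
        git.submodule("sync", "--recursive")
        git.submodule("update", "--init", "--recursive", "--jobs", "1")


# Stand-in for repo.git; the key path is a placeholder value.
fake_git = mock.MagicMock()
fetch_submodules(fake_git, {"GIT_SSH_COMMAND": "ssh -i /path/to/key"})
```

With a real `Repo`, `custom_environment` applies the variables only to Git commands run inside the `with` block and restores the previous environment afterwards.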

Example failure (before this fix)

Submodule update could fail with errors such as:

2026-04-08T03:39:43.516370889Z Cloning into '{submodul_path}'...
2026-04-08T03:39:43.516372014Z Host key verification failed.
2026-04-08T03:39:43.516373389Z fatal: Could not read from remote repository.
2026-04-08T03:39:43.516374555Z
2026-04-08T03:39:43.516375430Z Please make sure you have the correct access rights
2026-04-08T03:39:43.516376389Z and the repository exists.
2026-04-08T03:39:43.516377430Z fatal: clone of '{submodule}' into submodule path '/tmp/airflow/dag_bundles/{submodule_path}s' failed
2026-04-08T03:39:43.516378680Z Failed to clone '{submodule}' a second time, aborting'
2026-04-08T03:39:43.518691472Z RuntimeError: Error pulling submodule from repository

Notes

  • Aligns with the existing pattern in _fetch_bare_repo for origin.fetch on the bare repo.
  • Unit tests still assert that submodule sync/update are invoked when submodules=True; no behavior change to other bundle modes.

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

@boring-cyborg

boring-cyborg Bot commented Apr 8, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@shunsuke-sugita shunsuke-sugita marked this pull request as draft April 9, 2026 06:23
@shunsuke-sugita shunsuke-sugita marked this pull request as ready for review April 9, 2026 06:24
@kaxil kaxil requested a review from Copilot April 10, 2026 19:55
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates GitDagBundle submodule operations to inherit SSH configuration from the Git connection by running git submodule sync/update under GitPython’s custom_environment (using the worktree Repo).

Changes:

  • Wraps submodule sync/update calls with GIT_SSH_COMMAND (when present in hook.env).
  • Introduces a no-op context (nullcontext) when no SSH env override is needed.

self._log.info("Initializing and updating submodules", repo_path=self.repo_path)
self.repo.git.submodule("sync", "--recursive")
self.repo.git.submodule("update", "--init", "--recursive", "--jobs", "1")
cm = nullcontext()

Copilot AI Apr 10, 2026

nullcontext must be in scope for this to run. If it isn’t already imported in this module, add from contextlib import nullcontext (or reference it as contextlib.nullcontext) to avoid a NameError at runtime.

Copilot uses AI. Check for mistakes.
Author

nullcontext is already imported at the top of this module: from contextlib import nullcontext (line 21)

Comment on lines +303 to +306
cm = nullcontext()
if self.hook and (cmd := self.hook.env.get("GIT_SSH_COMMAND")):
    cm = self.repo.git.custom_environment(GIT_SSH_COMMAND=cmd)
with cm:

Copilot AI Apr 10, 2026

The name cm is very generic for a value that controls an environment override. Consider renaming it to something more descriptive (e.g., env_cm / ssh_env_cm) to make the intent clearer when reading and debugging this method.

Author

@shunsuke-sugita shunsuke-sugita Apr 12, 2026

Variable name: cm is now ssh_env_cm in _fetch_bare_repo and _fetch_submodules to reflect that this context manager controls the SSH environment for git subprocesses. I left cm as-is for configure_hook_env() blocks since that’s broader hook setup.

fd68b54

Comment on lines +304 to +309
if self.hook and (cmd := self.hook.env.get("GIT_SSH_COMMAND")):
    cm = self.repo.git.custom_environment(GIT_SSH_COMMAND=cmd)
with cm:
    self._log.info("Initializing and updating submodules", repo_path=self.repo_path)
    self.repo.git.submodule("sync", "--recursive")
    self.repo.git.submodule("update", "--init", "--recursive", "--jobs", "1")

Copilot AI Apr 10, 2026

This change introduces new behavior (wrapping submodule sync/update with custom_environment when GIT_SSH_COMMAND is present). Add/extend a unit test to assert custom_environment(GIT_SSH_COMMAND=...) is invoked (and that submodule commands run within that context) when hook.env provides GIT_SSH_COMMAND.

Author

Added tests in providers/git/tests/unit/git/bundles/test_git.py that assert custom_environment(GIT_SSH_COMMAND=...) is called when hook.env includes it, that submodule sync/update run inside that context (context manager enter/exit and call order on repo.git), and that custom_environment is not used when GIT_SSH_COMMAND is missing.

7e67b4c

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@shunsuke-sugita
Author

Hi @kaxil, sorry to bother you while you’re busy; just a gentle reminder on this PR.
I’ve addressed the previous feedback and Copilot checks are passing now.
Could you take another look when you have time? Thanks!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment on lines +935 to +937
mock_repo = mock.MagicMock()
ssh_ctx = mock.MagicMock()
mock_repo.git.custom_environment.return_value = ssh_ctx

Copilot AI Apr 17, 2026

These new tests create MagicMock() instances without a spec/autospec (for both the repo and the context manager). Using a spec helps catch typos and mismatched GitPython APIs (e.g. git.custom_environment, git.submodule) and prevents silently passing tests when production code changes. Consider switching these mocks to create_autospec(...) or MagicMock(spec=...) targeting the minimal interface you use here.
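A small sketch of the difference a spec makes. The `GitCommands` class is a hypothetical minimal interface written only for this illustration; in the real tests the spec would more likely target the GitPython classes themselves.

```python
from unittest import mock


class GitCommands:
    """Hypothetical minimal interface: only the two methods the tests use."""

    def custom_environment(self, **env): ...

    def submodule(self, *args): ...


loose = mock.MagicMock()
loose.submodul("sync")  # typo, yet the unspec'd mock accepts it silently

strict = mock.MagicMock(spec=GitCommands)
strict.submodule("sync", "--recursive")  # attribute exists on the spec: allowed

try:
    strict.submodul("sync")  # typo: not part of the spec
    typo_caught = False
except AttributeError:
    typo_caught = True
```

Note that `spec=` only validates attribute names; `mock.create_autospec(GitCommands)` would additionally check call signatures.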

Author

This has been fixed on my side. See: 6346cb2

Comment on lines +966 to +967
mock_repo = mock.MagicMock()
bundle = GitDagBundle(

Copilot AI Apr 17, 2026

This test uses an unspec'd MagicMock() for bundle.repo. Adding a spec/autospec (even if only for the git attribute and its submodule method) will make the test better at detecting API misuse and reduce the chance of false positives.

Author

This has been fixed on my side. See: 6346cb2

[mock.call("sync", "--recursive"), mock.call("update", "--init", "--recursive", "--jobs", "1")]
)
assert mock_repo.git.mock_calls[0] == mock.call.custom_environment(GIT_SSH_COMMAND=expected_ssh_cmd)
assert mock_repo.git.mock_calls[1] == mock.call.submodule("sync", "--recursive")
Member

This assertion will fail. With a with block, mock_repo.git.mock_calls is [call.custom_environment(...), call.custom_environment().__enter__(), call.submodule("sync", "--recursive"), call.submodule("update", ...), call.custom_environment().__exit__(None, None, None)], so index [1] is __enter__(), not the submodule call. I verified this empirically. Either drop these two index assertions (the assert_has_calls above already enforces submodule ordering) or use index [2].
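The recorded sequence can be reproduced with a standalone snippet along these lines. It is only a sketch: the `with` block stands in for the body of `_fetch_submodules`, and the SSH command is a placeholder value.

```python
from unittest import mock

mock_repo = mock.MagicMock()

# Stand-in for the production code path under test.
with mock_repo.git.custom_environment(GIT_SSH_COMMAND="ssh -i /path/to/key"):
    mock_repo.git.submodule("sync", "--recursive")
    mock_repo.git.submodule("update", "--init", "--recursive", "--jobs", "1")

# MagicMock records the context manager's __enter__/__exit__ calls too,
# so they are interleaved with the submodule calls in mock_calls.
calls = mock_repo.git.mock_calls
for index, recorded in enumerate(calls):
    print(index, recorded)
```

Printing the list makes it easy to see which index each call lands on before writing index-based assertions.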

Author

@shunsuke-sugita shunsuke-sugita Apr 20, 2026

Implemented the suggestions from #64879 (comment)

See: 2dcfd1d

mock_repo.git.submodule.assert_has_calls(
[mock.call("sync", "--recursive"), mock.call("update", "--init", "--recursive", "--jobs", "1")]
)
assert mock_repo.git.mock_calls[0] == mock.call.custom_environment(GIT_SSH_COMMAND=expected_ssh_cmd)
Member

This pair of mock_calls[0]/mock_calls[1] checks is redundant with assert_has_calls on the line above, which already enforces the order of submodule calls. Dropping both also fixes the broken [1] index. A cleaner check for "custom_environment is entered before submodule runs" is ssh_ctx.__enter__.assert_called_once() plus assert_has_calls, which you already have.
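A sketch of what that simpler set of checks might look like, with the `with` block again standing in for the method under test and a placeholder SSH command. These assertions confirm that the context was entered and exited once and that the submodule calls happened in order; they do not by themselves pin down the relative order of `__enter__` and the first submodule call.

```python
from unittest import mock

mock_git = mock.MagicMock()
ssh_ctx = mock.MagicMock()
mock_git.custom_environment.return_value = ssh_ctx

# Stand-in for the body of the method under test.
with mock_git.custom_environment(GIT_SSH_COMMAND="ssh -i /path/to/key"):
    mock_git.submodule("sync", "--recursive")
    mock_git.submodule("update", "--init", "--recursive", "--jobs", "1")

ssh_ctx.__enter__.assert_called_once()
ssh_ctx.__exit__.assert_called_once()
mock_git.submodule.assert_has_calls(
    [
        mock.call("sync", "--recursive"),
        mock.call("update", "--init", "--recursive", "--jobs", "1"),
    ]
)
```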

Author

@shunsuke-sugita shunsuke-sugita Apr 20, 2026

This has been fixed on my side. See: 2dcfd1d

self.repo.git.submodule("update", "--init", "--recursive", "--jobs", "1")
ssh_env_cm = nullcontext()
if self.hook and (cmd := self.hook.env.get("GIT_SSH_COMMAND")):
    ssh_env_cm = self.repo.git.custom_environment(GIT_SSH_COMMAND=cmd)
Member

Only GIT_SSH_COMMAND is forwarded via custom_environment, but configure_hook_env also populates hook.env with SSH_ASKPASS/DISPLAY/SSH_ASKPASS_REQUIRE for passphrase-protected keys. These work today because _fetch_submodules is always called inside an active configure_hook_env() in _initialize/refresh, which sets those on os.environ. If a future refactor drops that outer context, passphrase-protected keys will silently break while the non-passphrase path keeps working. Would you consider forwarding the full dict (custom_environment(**self.hook.env)) or at minimum a comment documenting that this depends on the outer configure_hook_env context?
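Forwarding the full dict could look something like the sketch below. The helper name `submodule_env_cm` is hypothetical, the environment values are placeholders rather than what `configure_hook_env` actually produces, and a `MagicMock` replaces `repo.git` so the snippet runs on its own.

```python
from contextlib import nullcontext
from unittest import mock


def submodule_env_cm(git, hook_env):
    """Return a context manager exporting every entry of hook_env.

    Hypothetical helper: `git` stands in for `repo.git`, `hook_env` for
    `hook.env`. An empty dict falls back to a no-op context.
    """
    if hook_env:
        return git.custom_environment(**hook_env)
    return nullcontext()


# Placeholder values, for illustration only.
hook_env = {
    "GIT_SSH_COMMAND": "ssh -i /path/to/key",
    "SSH_ASKPASS": "/path/to/askpass",
    "SSH_ASKPASS_REQUIRE": "force",
    "DISPLAY": ":",
}

fake_git = mock.MagicMock()
with submodule_env_cm(fake_git, hook_env):
    fake_git.submodule("update", "--init", "--recursive", "--jobs", "1")
```

Passing the whole dict keeps the submodule path independent of whatever the outer context happens to set on `os.environ`.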

Author

This has been fixed on my side. See: fe31907

@shunsuke-sugita
Author

@kaxil Thanks for the review. I’ve pushed updates addressing your feedback—please take another look when you have a moment.

@shunsuke-sugita shunsuke-sugita requested a review from kaxil April 20, 2026 01:42
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment on lines +1006 to +1010
def test_fetch_submodules_skips_custom_environment_without_git_ssh_command(self, mock_githook_class):
    """When hook.env is empty, submodule update does not use custom_environment."""
    mock_hook = mock_githook_class.return_value
    mock_hook.repo_url = "git@github.com:apache/airflow.git"
    mock_hook.env = {}

Copilot AI Apr 20, 2026

The test name test_fetch_submodules_skips_custom_environment_without_git_ssh_command doesn’t match what the test actually verifies: it sets hook.env = {} and asserts custom_environment is not called. In production code, custom_environment is used whenever hook.env is a non-empty dict (even if it didn’t contain GIT_SSH_COMMAND). Rename this test to reflect the empty-env behavior, or change the fixture to use a non-empty env without GIT_SSH_COMMAND and assert the intended behavior.

Author

This has been fixed on my side. See: 6d27243

Comment thread providers/git/tests/unit/git/bundles/test_git.py Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

mock.call.submodule("sync", "--recursive"),
mock.call.submodule("update", "--init", "--recursive", "--jobs", "1"),
mock.call.exit(None, None, None),
]

Copilot AI Apr 20, 2026

Add a blank line between these two test methods to match the surrounding formatting (one blank line between methods inside the test class). As-is, the next @mock.patch decorator starts immediately after the previous assertion block.

Suggested change
]
]

Author

This has been fixed on my side. See: 76fd8ca

Comment on lines +962 to +978
bundle._fetch_submodules()

mock_git.custom_environment.assert_called_once_with(GIT_SSH_COMMAND=expected_ssh_cmd)

parent = mock.Mock()
parent.attach_mock(mock_git.custom_environment, "custom_environment")
parent.attach_mock(ssh_ctx.__enter__, "enter")
parent.attach_mock(mock_git.submodule, "submodule")
parent.attach_mock(ssh_ctx.__exit__, "exit")

assert parent.mock_calls == [
    mock.call.custom_environment(GIT_SSH_COMMAND=expected_ssh_cmd),
    mock.call.enter(),
    mock.call.submodule("sync", "--recursive"),
    mock.call.submodule("update", "--init", "--recursive", "--jobs", "1"),
    mock.call.exit(None, None, None),
]

Copilot AI Apr 20, 2026

parent.attach_mock(...) is done after bundle._fetch_submodules() has already run, so parent.mock_calls will not include the earlier calls (calls are only propagated to the parent at call time, when the child is already attached). This makes the call-order assertion unreliable/likely to fail. Attach the mocks to parent before invoking _fetch_submodules() (or assert ordering using the individual mocks’ mock_calls/method_calls instead).
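The effect of attachment timing can be seen in a standalone sketch like the one below. The helper `record_order` is hypothetical, the `with` block stands in for `_fetch_submodules()`, and the SSH command is a placeholder.

```python
from unittest import mock


def record_order(attach_before_call):
    """Return parent.mock_calls, attaching children before or after the calls."""
    git = mock.MagicMock()
    ssh_ctx = mock.MagicMock()
    git.custom_environment.return_value = ssh_ctx
    parent = mock.Mock()

    def attach():
        parent.attach_mock(git.custom_environment, "custom_environment")
        parent.attach_mock(ssh_ctx.__enter__, "enter")
        parent.attach_mock(git.submodule, "submodule")
        parent.attach_mock(ssh_ctx.__exit__, "exit")

    if attach_before_call:
        attach()

    # Stand-in for the code under test.
    with git.custom_environment(GIT_SSH_COMMAND="ssh -i /path/to/key"):
        git.submodule("sync", "--recursive")

    if not attach_before_call:
        attach()
    return parent.mock_calls


attached_first = record_order(attach_before_call=True)
attached_late = record_order(attach_before_call=False)
```

Only `attached_first` carries the ordered call history; in the late-attachment case the calls were made before `parent` knew about the children, so nothing was propagated to it.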

Author

This has been fixed on my side. See: 51f14ca

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
