Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge JSON responses from gh api #5652

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from
Open

Conversation

heaths
Copy link
Contributor

@heaths heaths commented May 16, 2022

Fixes #1268

@heaths
Copy link
Contributor Author

heaths commented May 16, 2022

A WIP that still needs some refactoring to avoid duplicate code. Also depends on darccio/mergo#210 getting merged or, if we want to stick with that library, vendoring or forking it. This was more a proof of concept but was hoping for some early feedback on the concept. With the mergo change this would handle both top-level JSON arrays or objects, so both REST and GraphQL responses.

@heaths
Copy link
Contributor Author

heaths commented May 17, 2022

@mislav I need to add some specific tests, but wanted to see - if you have time to review - if you feel this is on the right track. I maintained writing responses as they come without pagination so that the user can see something happening. I do ponder adding a progress indicator if paginating, though, because it's pretty long. The main concern with that is that, even writing to stderr, it's likely callers have redirected at least stdout if not stderr using . Writing progress could, therefore, be a breaking change.

EDIT: Actually, considering we'd only show progress if stderr wasn't redirected, this would probably be fine. I'll make the change, but certainly can pull it if there's concern.

@heaths heaths force-pushed the issue1268 branch 2 times, most recently from e042de9 to a7ca779 Compare May 17, 2022 07:29
@heaths heaths marked this pull request as ready for review May 17, 2022 12:45
@heaths heaths requested a review from a team as a code owner May 17, 2022 12:45
@heaths heaths requested review from mislav and removed request for a team May 17, 2022 12:45
@cliAutomation cliAutomation added the external pull request originating outside of the CLI core team label May 17, 2022
@cliAutomation cliAutomation added this to Needs review 🤔 in The GitHub CLI May 17, 2022
@heaths
Copy link
Contributor Author

heaths commented May 19, 2022

@mislav how much do we care about keeping errors printed in the same order. This is what we do currently:

{
  "message":"Not Found",
  "documentation_url":"https://docs.github.com/rest"
}
gh: Not Found (HTTP 404)
exit status 1

I'm trying to better refactor this so the code is cleaner and easier to maintain (basically, fewer branches in the code and having to pass fewer parameters) but trying to keep this error format is making that more difficult unless I just inline all the code into the for loop.

But, if the error format could be the following, there's a lot cleaner refactorings possible:

gh: Not Found (HTTP 404)
{
  "message":"Not Found",
  "documentation_url":"https://docs.github.com/rest"
}
exit status 1

Frankly, I like this format better anyway. That an error occurred is shown immediately, followed by more details from the server.

go.mod Outdated Show resolved Hide resolved
@heaths
Copy link
Contributor Author

heaths commented May 24, 2022

@mislav I've been using this in a few instances and not exhibited any problems. What do you think?

@heaths
Copy link
Contributor Author

heaths commented May 25, 2022

darccio/mergo#210 was merged so we can actually pick up v0.3.13 and not need a redirect.

@heaths
Copy link
Contributor Author

heaths commented Jun 1, 2022

@mislav @samcoe what do you think of this? Makes the output of paginated responses much easier to use.

Copy link
Member

@samcoe samcoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heaths Thanks for your patients on this review. Overall, I like this functionality and it feels straight forward with the mergo package. Just left a couple comments for your consideration.

pkg/cmd/api/api.go Outdated Show resolved Hide resolved
pkg/cmd/api/api.go Outdated Show resolved Hide resolved
@heaths
Copy link
Contributor Author

heaths commented Jun 18, 2022

@mislav @samcoe this is ready to review. Even before the refactoring, the response body would be read completely into a buffer anyway - and in some cases, multiple times - so I ended up allocating a single bytes.Reader that implements io.ReadSeeker so I could seek to the beginning as needed without additional allocations, which was possible before.

pkg/cmd/api/api.go Outdated Show resolved Hide resolved
Copy link
Member

@samcoe samcoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heaths This looks good to me. Thanks for the thorough work here.

The GitHub CLI automation moved this from Needs review 🤔 to Needs to be merged 🎉 Jun 27, 2022
@eddiebergman
Copy link

Hi, thanks for the work! If the pr is approved, can we get it merged to fix some downstream issues at pwntester/octo.nvim#302 :)

@heaths
Copy link
Contributor Author

heaths commented Aug 4, 2022

@samcoe @mislav is this good to merge for next release? It would sure be helpful.

@vilmibm
Copy link
Contributor

vilmibm commented Aug 22, 2022

We're still interested in this but wanted consensus; waiting on @mislav 's review

@heaths
Copy link
Contributor Author

heaths commented Sep 6, 2022

/ping @mislav

@LangLangBart
Copy link

LangLangBart commented Sep 13, 2022

(was about to ask this, thanks for already fixing it 😄 )

The only thing that was irritating was the wait time for large requests.
Example: Pulling all the starredRepositories from a user.

# test with user: mislav ~600 stars
time bin/gh api graphql --silent --paginate --raw-field userName="mislav" --raw-field query='query($userName: String!, $endCursor: String) { user(login: $userName) { starredRepositories(orderBy: { field: STARRED_AT direction: DESC }, first: 100, after: $endCursor) { totalCount pageInfo { hasNextPage, endCursor } nodes { nameWithOwner stargazerCount pushedAt description }}}}'

user=0.06s system=0.02s cpu=2% total=3.231


# test with user: sindresorhus ~2.8k stars
time bin/gh api graphql --silent --paginate --raw-field userName="sindresorhus" --raw-field query='query($userName: String!, $endCursor: String) { user(login: $userName) { starredRepositories(orderBy: { field: STARRED_AT direction: DESC }, first: 100, after: $endCursor) { totalCount pageInfo { hasNextPage, endCursor } nodes { nameWithOwner stargazerCount pushedAt description }}}}'

user=0.17s system=0.04s cpu=0% total=23.907

An indicator/spinner, such as is present in the search command, would be beneficial to the user to display that their query is still in progress.

spinner

@heaths
Copy link
Contributor Author

heaths commented Sep 13, 2022

@LangLangBart that's a good idea. It wouldn't have worked well before because a paged result would have to have reset the progress indicator on stderr with each page, but now that it's all read into memory and written to stdout in one shot would make this fairly trivial.

@samcoe
Copy link
Member

samcoe commented Sep 14, 2022

@heaths If it makes it easier to merge in trunk definitely feel free to squash your commits.

@heaths
Copy link
Contributor Author

heaths commented Sep 14, 2022

That would make it easier, but any chance of getting this merged soon?

@heaths
Copy link
Contributor Author

heaths commented Oct 18, 2022

@samcoe do you know if this is still of interest? The original issue got quite a few votes, and this is making automation more difficult especially when jq isn't installed by default on any platform, not to mention harder to get for Windows.

@heaths
Copy link
Contributor Author

heaths commented Nov 8, 2022

I see another couple releases have gone out. @vilmibm @samcoe @mislav is anything else holding this up? There's been a lot of interest in merging gh api responses so external tools - not available by default on various platforms, making scripting harder which gh api basically targets - and this does it fairly cheaply. There's a bit of refactoring, but the tests are mostly unchanged to properly test regressions.

@heaths
Copy link
Contributor Author

heaths commented Dec 12, 2022

@mislav @samcoe I rebased on trunk and squashed all commits into a single commit if that makes it easier to review, and certainly to rebase. Given the votes on #1268, seems this would be a welcomed changed.

@heaths
Copy link
Contributor Author

heaths commented Jan 15, 2023

@vilmibm, @samcoe I see there are merge conflicts again. Is it worth me spending more time reolving them - even if easier now that I squashed the commits? Or can someone just tell me if this will never be merged and I'll close it despite many others running into the problem this fixes, especially with little recourse on Windows.

@heaths
Copy link
Contributor Author

heaths commented Mar 13, 2023

@mislav you made mention of this the other day for someone else that logged a dup bug. I'm glad you're still interested in it, but can you help me understand what you'd like me to do to get it over the finish line? I can resolve the merge conflicts, but I've been doing that for many months and it takes time, though less after I rebased and squashed. I'm happy to resolve the conflicts if you're going to take this or otherwise leave comments so I can get this in suitable shape.

For anyone else that comes across this PR and not issue #1268, I used basically the same code here in an extension that will merge after the fact* without having to install utilities like jq that might be harder to acquire for your system e.g., Windows. Run:

gh ext install heaths/gh-merge-json
gh api "{endpoint}" --paginate | gh merge-json # returns valid JSON

* albeit slower, since gh already stringified the response(s), which the extension has jsonify again, merge, and re-stringify.

/cc @samcoe

Copy link
Contributor

@mislav mislav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, sorry for the late review. I tried multiple times to come up with suggestions on how to potentially make this PR better, but I didn't have good ideas, and I've put it off until later. A lot of time has passed since and here we are. I do really want to see this feature in, but even though I care about this a lot, I also want us to approach the implementation carefully and with as little disruption to the gh api command as possible.

My qualms with the implementation you've proposed here are:

  • Code size: quite a lot of existing code and non-pagination supporting code paths needed to be modified in order to support this feature.
  • No streaming support: we used to stream API responses directly from the network to stdout, but this implementation relies on buffering all the responses until all pages are fetched. That means that the client receiving the output of gh api cannot start processing JSON until everything has finished fetching.
  • Passing entire API responses through a 3rd-party "mergo" library: we used to return API responses unchanged, and now we're significantly massaging them through something which to me is a black box. I feel this is risky, since a bug in mergo can potentially corrupt API responses.
  • Deep-merging of GraphQL responses: GraphQL queries can also be paginated, but GraphQL queries can also contain multiple arrays in response data. The "mergo" library doesn't know which array is the one that is being paginated, so it will merge them all according to its deep merge strategy.

As part of this review, I wanted to explore an alternate implementation to see if it's even possible to address these points. I came up with #7190, where I'd value your thoughts when you find time. Specifically, the code spike tries to keep the implementation small, have the modifications only affect the --pagination code path, avoid a 3rd party library, and continue streaming API responses to stdout. This is done by wrapping resp.Body in a Reader that strategically omits the leading [ or trailing ] of JSON arrays. This seems to work on my testing, so please try it out! GraphQL pagination is intentionally non-addressed in this exploration since solving it would be non-trivial.

The GitHub CLI automation moved this from Needs to be merged 🎉 to Needs review 🤔 Mar 17, 2023
@heaths
Copy link
Contributor Author

heaths commented Mar 17, 2023

  1. I'll take a look at your PR soon. Conceptually, I like the approach and tried to come up with similar that would handle GraphQL, but see my thoughts below.
  2. I appreciate that this buffers responses and that's a behavioral change, but don't understand why a new third-party library is a concern when the vast majority of modules in go.mod also are. It seems to be a solid library with a lot of importers - over 3,200 - on pkg.go.dev. Again, though I appreciate the concern about buffering responses. As a mere suggestion, what about allowing people to opt into paging internally with the caveat responses will no longer be buffered? For the majority of use cases I imagine it would be of little consequence. Or maybe only do it for GraphQL paged responses.

I honestly can't think of any way to do this when an array is nested and you could have multiple arrays. With any nested arrays in a REST response, those would be properties of a resource (an array of resources or not) so only a top-level array would need "merging" and your approach (based on your description) would work fine there.

Alternatively, what about considering a built-in command to optionally buffer responses a la https://github.com/heaths/gh-merge-json? Ideally, output from gh api just works but at least not requiring a separate install may be nice. That said, #7173 simplifies that scenario so it may be moot.

@heaths
Copy link
Contributor Author

heaths commented Mar 17, 2023

Thinking about this more, given the GraphQL paginated code was invalid JSON anyway, does the fact it was streaming output really mattered? Either you're piping to something that is consuming JSON somehow (jq, PowerShell's ConvertFrom-Json, whatever) and would err on invalid JSON, or you're redirecting to a file in which case streaming doesn't really matter anyway.

For simple arrays that only REST could return, your approach in #7173 would seem to solve the problem but what about limiting this to just paginated GraphQL responses? Looking through your PR, I could use a similar approach that would require less refactoring and only impact a limited set of scenarios e.g., a reader (though, since instanced - perhaps passed in, which yours could be too) that under the right conditions does an internal read, unmarshals the JSON, merges, and when done, marshals back out to the original reader on last page.

@mislav
Copy link
Contributor

mislav commented Mar 28, 2023

2. but don't understand why a new third-party library is a concern when the vast majority of modules in go.mod also are. It seems to be a solid library with a lot of importers

I'm not insinuating that something is wrong with the library: it's just that gh api used to stream the actual, unmodified GitHub API results directly to stdout, but with this change, all --paginate results will have been deserialized into Go values, proxied through a 3rd-party library not maintained by us, and then serialized back into JSON. That, I feel, is a too radical of a departure from that gh api used to do. Imagine if curl did any of this before returning you the results! So if we could avoid that, I'd be super happy since it would mean that we are not compromising too much of the command's design.

I could use a similar approach that would require less refactoring and only impact a limited set of scenarios e.g., a reader (though, since instanced - perhaps passed in, which yours could be too) that under the right conditions does an internal read, unmarshals the JSON, merges, and when done, marshals back out to the original reader on last page.

That sounds good! I'd be down with that. Would you explore this approach when you have time?

@heaths
Copy link
Contributor Author

heaths commented Mar 29, 2023

That sounds good! I'd be down with that. Would you explore this approach when you have time?

@mislav sure I can explore it if there's an active interest in taking it (I'm certainly interested in solving it), but it might be easier to pick up from your changes in #7190 if you're planning to go ahead with that; and I can either close this PR and open anew, or rebase it entirely so the history is here, but I would then recommend treating it as a completely new PR i.e., don't review the diff.

@mislav mislav removed their assignment Jun 9, 2023
@andyfeller
Copy link
Contributor

andyfeller commented Nov 30, 2023

@heaths: what makes the most sense regarding this PR? Are you okay with closing it until such a time that either you or the CLI team can prioritize the work?

Given the age and triage efforts to ensure PRs don't grow stale, I appreciate you understanding the desire to ensure the CLI team is staying engaged while having a up to date view of work in flight.

If you want my $0.02, I think closing this and creating a new PR when that work can be prioritized while linking it back here for posterity makes the most sense. 🤷

By the way, nice to meet you and thank you for being a consistent CLI friend and contributor! 👋 ❤️ 🙇

@heaths
Copy link
Contributor Author

heaths commented Nov 30, 2023

What work do you need to prioritize? The PR has been done for ages. Seems the outstanding concern was using a third party module, which is sort of absurd given how many are already in use. This module has decent usage and it's straight forward. It could even be reimplimented, but that's not very idiomatic.

There's no rush of breaking changes here. The output is invalid JSON that would be valid. Any work around like slurping wouldn't break because there'd be nothing to slurp. I see no downside.

I could rebase this and make it conditional for GraphQL.

@andyfeller
Copy link
Contributor

Thanks for the follow up, @heaths 🙇

It seems this PR needs prioritization from both sides:

  1. core maintainers end to discuss these changes and how to move forward following Mislav's departure
  2. yours on addressing the conflicting files

There is a lot of history and mentions of the related issue in other places. I admit there is a lot to catch up on here.

@heaths
Copy link
Contributor Author

heaths commented Dec 4, 2023

Would a quick sync over Teams be acceptable (we're on the same network)? Most of the history is just me keeping this up to date (which gets harder with the PR age, hence trying to get buy-in from Mislav) and not getting it reviewed with requested changes for many months, but we could maybe go through this PR and recap quickly.

The quick gist is that Mislav wasn't keen on the mergo dependency I took after some investigation. I think concern about this being a breaking change - given it was already broken - was resolved, but there was still some concern about a behavioral change when logging headers/logs via --verbose since it'd only happen with the first or last (I forget which we settled on off hand) now. Personally, I think that likelihood is low. Given this doesn't emit valid JSON and everyone had to do work arounds (I even wrote an extension to make it easier on non-*nix) I'd be surprised if any were doing verbose logging in automation. Seems all that would likely have been quelled or, at the very least, difficult to associate with any particular request unless people used something like tee. I feel having this emit valid JSON so it's useable "out of the box" would be worth the potential for unlikely but possible breaking behavior changes with corner case behaviors (verbose logging).

EDIT: I sent you a message over Teams in case you wanted to chat briefly tomorrow morning.

@andyfeller
Copy link
Contributor

@heaths : sorry for being slow to respond, I'm going to raise this up with @samcoe and @williammartin, seeing if we can make time next week to get on the same page and get this settled 👍

@andyfeller andyfeller added the discuss Feature changes that require discussion primarily among the GitHub CLI team label Dec 8, 2023
@andyfeller
Copy link
Contributor

@heaths : Happy new year! Just making sure to deliver on pre-holidays conversation about keeping an eye on this PR so we can prioritize it and get it shipped. No pressure or anything, just wanting to deliver on the commitment on our end 🫂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Feature changes that require discussion primarily among the GitHub CLI team external pull request originating outside of the CLI core team
Projects
No open projects
The GitHub CLI
  
Needs review 🤔
Development

Successfully merging this pull request may close these issues.

Output a single JSON document from api --paginate
8 participants