
Fix: Limit the number of generated questions#13596

Merged
nerdai merged 2 commits into run-llama:main from tuomastik:limit-generated-questions-count
May 20, 2024

Conversation

@tuomastik (Contributor) commented May 20, 2024

Description

Limit the number of generated questions to avoid cases where more questions are generated than requested via the num_questions_per_chunk parameter. (A minimal sketch of the approach follows the module list below.)

Fixes #10694 and #10089

Added the fix to all the modules mentioning num_questions_per_chunk, except the following modules, which contain deprecated classes that likely do not have this issue because of the num parameter in their _agenerate_dataset method:

  • llama-index-core/llama_index/core/evaluation/dataset_generation.py
    • The module implements QueryResponseDataset, which is deprecated in favor of LabelledRagDataset
  • llama-index-legacy/llama_index/legacy/evaluation/dataset_generation.py
    • The module implements DatasetGenerator, which is deprecated in favor of RagDatasetGenerator
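
For context, the limiting step in each affected module amounts to truncating the parsed question list. The snippet below is a minimal sketch of that idea; the variable names and parsing are illustrative assumptions, not the PR's exact code.

# Illustrative sketch only; each affected module has its own parsing logic,
# but the fix boils down to a slice like the one near the end here.
response_text = "1. What is NYC?\n2. Where is NYC located?\n3. How many boroughs does NYC have?"
num_questions_per_chunk = 1

# Split the raw LLM response into cleaned, non-empty question lines.
questions = [line.strip() for line in response_text.split("\n") if line.strip()]

# Enforce the requested limit, discarding any surplus questions.
questions = questions[:num_questions_per_chunk]

print(questions)  # ['1. What is NYC?']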

New Package?

  • Yes
  • No

Version Bump?

  • Yes
  • No

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.ollama import Ollama
from llama_index.core.schema import TextNode

# A single text chunk to generate questions from.
node = TextNode(
    text="New York, often called New York City or simply NYC, "
    "is the most populous city in the United States, located at the "
    "southern tip of New York State on one of the world's largest "
    "natural harbors. The city comprises five boroughs, each of which "
    "is coextensive with a respective county. New York is a global center "
    "of finance and commerce, culture and technology, entertainment and "
    "media, academics and scientific output, and the arts and fashion, and, "
    "as home to the headquarters of the United Nations, is an important "
    "center for international diplomacy. New York City is the center of "
    "the world's principal metropolitan economy."
)

# A small local model, which can return more questions than requested.
llm = Ollama(model="gemma:2b", request_timeout=60.0)
num_questions_per_chunk = 1

# Repeat the generation to verify the limit is respected on every run.
for i in range(10):
    gen = RagDatasetGenerator(
        nodes=[node],
        llm=llm,
        num_questions_per_chunk=num_questions_per_chunk,
    )
    generated_questions = gen.generate_questions_from_nodes()
    print(
        f"Number of questions expected: {num_questions_per_chunk} - "
        f"Number of questions generated: {len(generated_questions.examples)}"
    )
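
With the fix in place, every iteration should print matching expected and generated counts (one question each here).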

Suggested Checklist:

  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • I ran make format; make lint to appease the lint gods

@dosubot (bot) added the size:S label (This PR changes 10-29 lines, ignoring generated files.) on May 20, 2024
@nerdai (Contributor) left a comment

Thanks @tuomastik! Looks good. Though let's not update legacy llama-index at this point (I've marked in my review where we can revert your changes).

I guess this doesn't cover the case where fewer questions than desired are generated, but this problem existed even before your PR.

Review comment threads (outdated):

  • llama-index-legacy/llama_index/legacy/finetuning/cross_encoders/dataset_gen.py
  • llama-index-legacy/llama_index/legacy/finetuning/embeddings/common.py
  • llama-index-legacy/llama_index/legacy/llama_dataset/generator.py
  • llama-index-legacy/pyproject.toml
@dosubot (bot) added the lgtm label (This PR has been approved by a maintainer) on May 20, 2024
…e questions are generated than requested by parameter 'num_questions_per_chunk'
@tuomastik force-pushed the limit-generated-questions-count branch from 50904de to a8e074f on May 20, 2024 13:08
@tuomastik (Contributor, Author) commented

@nerdai

I guess this doesn't cover the case where fewer questions than desired are generated, but this problem existed even before your PR.

That's correct. One approach to solving that issue would be to regenerate questions until the desired number of questions has been generated.
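
As a rough illustration of that approach, a retry helper might look like the sketch below. This is purely hypothetical code, not part of the PR; generate_once stands in for whatever function performs a single LLM generation call.

def generate_until_enough(generate_once, num_questions, max_attempts=5):
    # Hypothetical helper: call the question generator repeatedly until
    # enough questions have been collected or the attempts run out.
    questions = []
    for _ in range(max_attempts):
        questions.extend(generate_once())
        if len(questions) >= num_questions:
            break
    # Truncate any surplus so the caller never receives more than requested.
    return questions[:num_questions]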

@nerdai (Contributor) commented May 20, 2024

@tuomastik that makes sense. Maybe for now, we should at least raise a warning to the user that the LLM call resulted in fewer than the desired number of questions per chunk. Similar to here:

@dosubot (bot) added the size:M label (This PR changes 30-99 lines, ignoring generated files.) and removed the size:S label on May 20, 2024
@tuomastik (Contributor, Author) commented

Maybe for now, we should at least raise a warning to the user that the LLM call resulted in fewer than the desired number of questions per chunk.

✅ Done
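
The added warning presumably looks something like the sketch below; the exact message and variable names in the touched modules may differ, and the values here are only for illustration.

import warnings

# Illustrative values; in the real modules these come from the generator.
questions = ["What is NYC?"]
num_questions_per_chunk = 2

# Warn instead of failing silently when the LLM under-delivers.
if len(questions) < num_questions_per_chunk:
    warnings.warn(
        f"Generated {len(questions)} questions, but {num_questions_per_chunk} "
        "were requested."
    )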

@nerdai (Contributor) left a comment

Lgtm! Thanks @tuomastik!

@nerdai enabled auto-merge (squash) on May 20, 2024 19:44
@nerdai merged commit baa3e82 into run-llama:main on May 20, 2024

Development

Successfully merging this pull request may close these issues:

  • [Bug]: Unexpected Number of Questions Generated When Requesting FAQ Generation