
Fix: Limit the number of generated questions#13596

Merged
nerdai merged 2 commits into run-llama:main from tuomastik:limit-generated-questions-count
May 20, 2024

Conversation

@tuomastik (Contributor) commented May 20, 2024

Description

Limit the number of generated questions to avoid cases where more questions are generated than requested via the num_questions_per_chunk parameter. (A minimal sketch of the approach follows the module list below.)

Fixes #10694 and #10089

Added the fix to all the modules mentioning num_questions_per_chunk, except the following modules, which contain deprecated classes that likely do not have this issue because of the num parameter in their _agenerate_dataset method:

  • llama-index-core/llama_index/core/evaluation/dataset_generation.py
    • The module implements QueryResponseDataset, which is deprecated in favor of LabelledRagDataset
  • llama-index-legacy/llama_index/legacy/evaluation/dataset_generation.py
    • The module implements DatasetGenerator, which is deprecated in favor of RagDatasetGenerator
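
For context, the limiting step in each affected module amounts to truncating the parsed question list. The snippet below is a minimal sketch of that idea; the variable names and parsing are illustrative assumptions, not the PR's exact code.

# Illustrative sketch only; each affected module has its own parsing logic,
# but the fix boils down to a slice like the one near the end here.
response_text = "1. What is NYC?\n2. Where is NYC located?\n3. How many boroughs does NYC have?"
num_questions_per_chunk = 1

# Split the raw LLM response into cleaned, non-empty question lines.
questions = [line.strip() for line in response_text.split("\n") if line.strip()]

# Enforce the requested limit, discarding any surplus questions.
questions = questions[:num_questions_per_chunk]

print(questions)  # ['1. What is NYC?']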

New Package?

  • Yes
  • No

Version Bump?

  • Yes
  • No

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.ollama import Ollama
from llama_index.core.schema import TextNode

# A single text chunk to generate questions from.
node = TextNode(
    text="New York, often called New York City or simply NYC, "
    "is the most populous city in the United States, located at the "
    "southern tip of New York State on one of the world's largest "
    "natural harbors. The city comprises five boroughs, each of which "
    "is coextensive with a respective county. New York is a global center "
    "of finance and commerce, culture and technology, entertainment and "
    "media, academics and scientific output, and the arts and fashion, and, "
    "as home to the headquarters of the United Nations, is an important "
    "center for international diplomacy. New York City is the center of "
    "the world's principal metropolitan economy."
)

# A small local model, which can return more questions than requested.
llm = Ollama(model="gemma:2b", request_timeout=60.0)
num_questions_per_chunk = 1

# Repeat the generation to verify the limit is respected on every run.
for i in range(10):
    gen = RagDatasetGenerator(
        nodes=[node],
        llm=llm,
        num_questions_per_chunk=num_questions_per_chunk,
    )
    generated_questions = gen.generate_questions_from_nodes()
    print(
        f"Number of questions expected: {num_questions_per_chunk} - "
        f"Number of questions generated: {len(generated_questions.examples)}"
    )
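
With the fix in place, every iteration should print matching expected and generated counts (one question each here).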

Suggested Checklist:

  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • I ran make format; make lint to appease the lint gods

@dosubot (bot) added the size:S label (This PR changes 10-29 lines, ignoring generated files.) on May 20, 2024
@nerdai (Contributor) left a comment

Thanks @tuomastik! Looks good. Though let's not update legacy llama-index at this point (I've marked in my review where we can revert your changes).

I guess this doesn't cover the case where fewer questions than desired are generated, but this problem existed even before your PR.

Review comment threads (outdated):

  • llama-index-legacy/llama_index/legacy/finetuning/cross_encoders/dataset_gen.py
  • llama-index-legacy/llama_index/legacy/finetuning/embeddings/common.py
  • llama-index-legacy/llama_index/legacy/llama_dataset/generator.py
  • llama-index-legacy/pyproject.toml
@dosubot (bot) added the lgtm label (This PR has been approved by a maintainer) on May 20, 2024
…e questions are generated than requested by parameter 'num_questions_per_chunk'
@tuomastik force-pushed the limit-generated-questions-count branch from 50904de to a8e074f on May 20, 2024 13:08
@tuomastik (Contributor, Author) commented

@nerdai

I guess this doesn't cover the case where fewer questions than desired are generated, but this problem existed even before your PR.

That's correct. One approach to solving that issue would be to regenerate questions until the desired number of questions has been generated.
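
As a rough illustration of that approach, a retry helper might look like the sketch below. This is purely hypothetical code, not part of the PR; generate_once stands in for whatever function performs a single LLM generation call.

def generate_until_enough(generate_once, num_questions, max_attempts=5):
    # Hypothetical helper: call the question generator repeatedly until
    # enough questions have been collected or the attempts run out.
    questions = []
    for _ in range(max_attempts):
        questions.extend(generate_once())
        if len(questions) >= num_questions:
            break
    # Truncate any surplus so the caller never receives more than requested.
    return questions[:num_questions]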

@nerdai (Contributor) commented May 20, 2024

@tuomastik that makes sense. Maybe for now, we should at least raise a warning to the user that the LLM call resulted in fewer than the desired number of questions per chunk. Similar to here:

@dosubot (bot) added the size:M label (This PR changes 30-99 lines, ignoring generated files.) and removed the size:S label on May 20, 2024
@tuomastik (Contributor, Author) commented

Maybe for now, we should at least raise a warning to the user that the LLM call resulted in fewer than the desired number of questions per chunk.

✅ Done
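
The added warning presumably looks something like the sketch below; the exact message and variable names in the touched modules may differ, and the values here are only for illustration.

import warnings

# Illustrative values; in the real modules these come from the generator.
questions = ["What is NYC?"]
num_questions_per_chunk = 2

# Warn instead of failing silently when the LLM under-delivers.
if len(questions) < num_questions_per_chunk:
    warnings.warn(
        f"Generated {len(questions)} questions, but {num_questions_per_chunk} "
        "were requested."
    )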

@nerdai (Contributor) left a comment

Lgtm! Thanks @tuomastik!

@nerdai enabled auto-merge (squash) on May 20, 2024 19:44
@nerdai merged commit baa3e82 into run-llama:main on May 20, 2024

Development

Successfully merging this pull request may close these issues:

  • [Bug]: Unexpected Number of Questions Generated When Requesting FAQ Generation