DocQA Generator

Last Update: 05/02/2025

Overview

Easily upload or drag and drop multiple PDF and/or text files to automatically generate question-answer (QA) pairs. Choose from the following output formats based on your use case.

DocQA Generator Overview

DocSynth Single-Turn

Generates simple single-turn question-answer exchanges between a human and an assistant.

Format: json

{
  "conversations": [
    {"from": "human", "value": "Q"},
    {"from": "assistant", "value": "A"}
  ]
}

QA Format

Provides clean, minimalistic question-answer pairs suitable for lightweight applications.

Format: json

{
  "question": "Q",
  "answer": "A"
}

OpenAI Format

Generates data compatible with OpenAI's fine-tuning requirements, including an optional system prompt.

Format: json

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "Q"},
    {"role": "assistant", "content": "A"}
  ]
}

File Configuration

Customize how DocSynth generates question-answer (QA) pairs by configuring the following parameters:

Chunk Size

Default: 1900 words

Splits the input file into segments, with each chunk containing the specified number of words. This controls the context window for question generation.

Questions per Chunk

Default: 3

Defines how many QA pairs should be generated from each chunk. Increasing this number can yield more granular coverage of the source material.

Generation Count

Default: 1

Specifies how many separate JSONL files containing QA pairs will be generated from a single input file. Useful for creating multiple versions or batches of QA data.

Temperature

Default: 0.7

Controls the randomness and creativity of the generated QA pairs.

  • Lower values (e.g., 0.2–0.4) yield more deterministic and focused outputs.
  • Higher values (e.g., 0.8–1.0) produce more diverse and creative questions.

System Prompt

Provides high-level instructions to guide the generation process.

Use this field to define the tone, style, or focus of the QA pairs.

For example:

"Generate clear, concise questions suitable for high school students reviewing biology topics."

(OpenAI Format) System Message

Specifically configures the system role in the OpenAI-compatible output format. This message serves as a foundational instruction for how the assistant should behave when generating QA pairs.

Example Output:

{
  "messages": [
    {
      "role": "system",
      "content": "Generate concise and factual QA pairs based on the provided document, focusing on key concepts."
    },
    {
      "role": "user",
      "content": "Q"
    },
    {
      "role": "assistant",
      "content": "A"
    }
  ]
}

You can reference the general System Prompt here for consistency or provide unique, format-specific instructions tailored to fine-tuning and API compatibility.