Garbage In, Garbage Out: The Dataset
The Format: JSONL
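You cannot just feed a model a PDF. You must format your data into instruction pairs. Two common layouts are often called "Alpaca" (separate instruction, input, and output fields) and "ShareGPT" (a list of conversation turns); the example below uses the Alpaca layout. The data is stored in a `.jsonl` file where every line is a separate training example.
Example Structure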
{"instruction": "Classify this customer email.", "input": "I hate this product, return it now!", "output": "Sentiment: Negative. Action: Route to Retention Team."}
How much data do I need?
- For Style/Tone: 500 to 1,000 high-quality examples are often enough.
- For New Knowledge: Thousands or tens of thousands (but remember, use RAG for this instead).
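Before counting examples against the guidelines above, it helps to confirm that each line of the file is actually a usable record. The following is a minimal sketch, assuming the three-field layout from the example record; the `validate_jsonl` helper and `REQUIRED_KEYS` set are illustrative names, not part of any library.

```python
import json

# Assumption: records use the three fields shown in the example record.
REQUIRED_KEYS = {"instruction", "input", "output"}


def validate_jsonl(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines."""
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((lineno, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append((lineno, f"missing keys: {sorted(missing)}"))
            continue
        records.append(record)
    return records, errors


if __name__ == "__main__":
    sample = [
        '{"instruction": "Classify this customer email.", '
        '"input": "I hate this product, return it now!", '
        '"output": "Sentiment: Negative. Action: Route to Retention Team."}',
        '{"instruction": "Classify this customer email."}',
        "not json",
    ]
    records, errors = validate_jsonl(sample)
    print(f"{len(records)} valid examples, {len(errors)} rejected")
    for lineno, reason in errors:
        print(f"  line {lineno}: {reason}")
```

To check a real file, pass an open file handle instead of the sample list, for example `validate_jsonl(open("train.jsonl", encoding="utf-8"))`, where `train.jsonl` is a placeholder filename.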
Pro Tip: Synthetic Data
Don't write 1,000 examples by hand. Use GPT-4 to generate your training data. Give GPT-4 a few examples of your desired style and ask it to generate a JSONL file with 50 more examples. Repeat until you have a dataset.
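The generate-and-repeat loop described above can be sketched roughly as follows. This is only an outline under stated assumptions: `generate` is a hypothetical callable that sends a prompt to whatever model you use and returns its text reply (no specific vendor API is assumed), and the prompt wording, `build_prompt`, `parse_jsonl`, and `build_dataset` are illustrative. The sketch drops malformed lines and duplicates, since generated batches often contain both.

```python
import json

# Assumption: records use the three fields shown in the example record.
REQUIRED_KEYS = {"instruction", "input", "output"}


def build_prompt(seed_examples, n):
    """Build a few-shot prompt asking for n new examples as JSONL."""
    shots = "\n".join(json.dumps(ex) for ex in seed_examples)
    return (
        "Here are examples of the desired style, one JSON object per line:\n"
        f"{shots}\n"
        f"Write {n} new examples in the same format, one JSON object per line."
    )


def parse_jsonl(text):
    """Parse model output, keeping only lines that are well-formed records."""
    records = []
    for line in text.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank lines, prose, or broken JSON
        if isinstance(record, dict) and REQUIRED_KEYS <= record.keys():
            records.append(record)
    return records


def build_dataset(generate, seed_examples, target, batch_size=50, max_rounds=100):
    """Call generate(prompt) repeatedly until `target` unique examples exist.

    `generate` is a hypothetical stand-in for a model call: str -> str.
    `max_rounds` caps the loop in case the model keeps repeating itself.
    """
    seen, dataset = set(), []
    for _ in range(max_rounds):
        if len(dataset) >= target:
            break
        reply = generate(build_prompt(seed_examples, batch_size))
        for record in parse_jsonl(reply):
            # Deduplicate on the instruction/input pair.
            key = json.dumps([record["instruction"], record["input"]])
            if key not in seen and len(dataset) < target:
                seen.add(key)
                dataset.append(record)
    return dataset


def to_jsonl(dataset):
    """Serialize records as JSONL text, one record per line."""
    return "\n".join(json.dumps(record) for record in dataset) + "\n"
```

Deduplicating on the instruction and input pair is one possible design choice; a stricter pipeline might also compare outputs or filter near-duplicates. It is still worth reading a sample of the generated examples by hand before training on them.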