
API#

Download a wrapper for the API, in this case in Python: `pip install ollama`

We then define our model as a variable and verify its metadata:

import ollama
from src.models import Models, get_models, load_model

available = get_models()
print(available)
['llama_3_1_8b', 'llama_3_2_1b', 'llama_3_2_3b', 'mistral_sm_24b']
model = available[2]
print(f"Selected model: {Models.model_fields.get(model).default}")
model = load_model(model)
Selected model: hf.co/unsloth/Llama-3.2-3B-Instruct-GGUF:Q6_K
info = ollama.show(model).modelinfo
for k, v in info.items():
    print(f"{k}: {v}")
general.architecture: llama
general.basename: Llama-3.2
general.file_type: 18
general.finetune: Instruct
general.organization: Meta Llama
general.parameter_count: 3212749888
general.quantization_version: 2
general.size_label: 3B
general.type: model
llama.attention.head_count: 24
llama.attention.head_count_kv: 8
llama.attention.key_length: 128
llama.attention.layer_norm_rms_epsilon: 1e-05
llama.attention.value_length: 128
llama.block_count: 28
llama.context_length: 131072
llama.embedding_length: 3072
llama.feed_forward_length: 8192
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
tokenizer.ggml.bos_token_id: 128000
tokenizer.ggml.eos_token_id: 128009
tokenizer.ggml.merges: None
tokenizer.ggml.model: gpt2
tokenizer.ggml.padding_token_id: 128004
tokenizer.ggml.pre: llama-bpe
tokenizer.ggml.token_type: None
tokenizer.ggml.tokens: None
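
The metadata above is handy for quick sanity checks. A small sketch (with values hardcoded from the output above, so it runs without ollama) that derives the grouped-query-attention ratio and the parameter count in billions:

```python
# Values copied from the `ollama.show` output above (assumed accurate for this model).
info = {
    "llama.attention.head_count": 24,
    "llama.attention.head_count_kv": 8,
    "general.parameter_count": 3212749888,
}

# With grouped-query attention, several query heads share each KV head.
gqa_groups = info["llama.attention.head_count"] // info["llama.attention.head_count_kv"]
params_b = info["general.parameter_count"] / 1e9

print(f"GQA groups: {gqa_groups}")   # 3 query heads share each KV head
print(f"Parameters: {params_b:.2f}B")
```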

Generating responses#

from ollama import chat

prompt = "Give me a simple recipe for a delicious citrusy cake. Make sure units are in grams when it makes sense. Temperatures should be in C."

response = chat(
    model=model,
    messages=[
        {"role": "user", "content": prompt},
    ],
)
print(response.message.content)
Here's a simple recipe for a delicious citrusy cake that yields a moist and flavorful result:

**Citrus Syrup Cake with Lemon-Poppyseed Frosting**

**Cake:**

Ingredients:

* 250g all-purpose flour
* 200g granulated sugar
* 100g unsalted butter, softened
* 4 large eggs, at room temperature
* 120ml freshly squeezed orange juice
* 60ml freshly squeezed lemon juice
* 1 teaspoon grated lemon zest
* 1/2 teaspoon baking powder
* Salt to taste

Instructions:

1. Preheat the oven to 180°C (350°F) and grease two 8-inch round cake pans.
2. In a medium bowl, whisk together flour, sugar, baking powder, and salt.
3. In a large mixing bowl, whisk together softened butter, eggs, orange juice, lemon juice, and lemon zest.
4. Gradually add the dry ingredients to the wet ingredients, whisking until smooth.
5. Divide the batter evenly between the prepared pans and smooth the tops.
6. Bake for 25-30 minutes or until a toothpick inserted into the center comes out clean.

**Lemon-Poppyseed Frosting:**

Ingredients:

* 150g unsalted butter, softened
* 250g confectioners' sugar
* 2 tablespoons freshly squeezed lemon juice
* 1 teaspoon grated lemon zest
* 1 tablespoon poppy seeds

Instructions:

1. Beat the softened butter in a large mixing bowl until light and fluffy.
2. Gradually add the confectioners' sugar, beating until smooth.
3. Add the lemon juice and lemon zest, beating until combined.
4. Stir in the poppy seeds.

**Assembly:**

1. Once the cakes are completely cool, place one layer on a serving plate or cake stand.
2. Spread a thick layer of frosting on top of the first layer.
3. Place the second layer on top and frost the entire cake with the remaining frosting.

Enjoy your delicious citrusy cake!

Note: If you want to make this recipe more vibrant, you can add a few drops of yellow food coloring to the batter and frosting.

Controlling the responses…#

We can only do so much with raw text… Let’s up the controllability!

First off, there are a few common parameters that can be used to tune the outputs. Some relate to “creativity”, whereas others control the predictability and determinism of the outputs.

"num_ctx": "Maximum number of tokens the model can process in a single input."
"seed": "Random seed for deterministic generation."
"num_predict": "Maximum number of tokens to generate in output."
"top_k": "Limits sampling to the top K most probable tokens."
"top_p": "Limits sampling to the smallest set of tokens with cumulative probability >= top_p."
"temperature": "Controls randomness in generation; higher values increase randomness."
"repeat_penalty": "Penalty for repeated tokens to reduce repetition in output."

We can also add a system prompt (see the role in the messages below) to guide the model in the right direction. Being explicit about its role can help with both the behavior and the output format.

from ollama import ChatResponse, chat
from pydantic.types import JsonSchemaValue
from typing import Optional, List

system_prompt = "You are a helpful assistant that provides clear and concise answers to the user's needs. Always answer in a JSON format."

def generate(
    prompt: str,
    json_format: Optional[JsonSchemaValue] = None,
    model=model,
    system_prompt=system_prompt,
) -> str:
    response: ChatResponse = chat(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        options={
            "num_ctx": 4096,
            "num_predict": 1024,
            "top_k": 50,
            "top_p": 0.95,
            "temperature": 0.0,
            "seed": 0,  # this is not needed when temp is 0
            "repeat_penalty": 1.0,  # remain default for json outputs, from experience.
        },
        format=json_format,
        stream=False,
    )
    return response.message.content

print(generate(prompt))
```
{
  "recipe": {
    "name": "Citrusy Cake",
    "servings": 8-10,
    "ingredients": [
      {
        "name": "all-purpose flour",
        "quantity": "250g"
      },
      {
        "name": "granulated sugar",
        "quantity": "200g"
      },
      {
        "name": "unsalted butter, softened",
        "quantity": "150g"
      },
      {
        "name": "large eggs",
        "quantity": "3"
      },
      {
        "name": "freshly squeezed orange juice",
        "quantity": "120ml"
      },
      {
        "name": "zest of 1 orange",
        "quantity": "30g"
      },
      {
        "name": "zest of 1 lemon",
        "quantity": "30g"
      },
      {
        "name": "baking powder",
        "quantity": "10g"
      },
      {
        "name": "salt",
        "quantity": "5g"
      }
    ],
    "instructions": [
      {
        "step": "Preheat oven to 180°C. Grease two 20cm round cake pans and line the bottoms with parchment paper."
      },
      {
        "step": "In a medium bowl, whisk together flour, sugar, baking powder, and salt."
      },
      {
        "step": "In a large bowl, using an electric mixer, beat the butter until creamy. Add eggs one at a time, beating well after each addition."
      },
      {
        "step": "With the mixer on low speed, gradually add the flour mixture to the butter mixture in three parts, alternating with the orange juice, beginning and ending with the flour mixture. Beat just until combined."
      },
      {
        "step": "Divide the batter evenly between the prepared pans and smooth the tops."
      },
      {
        "step": "Bake for 25-30 minutes or until a toothpick inserted in the center comes out clean."
      },
      {
        "step": "Let the cakes cool in the pans for 5 minutes, then transfer to a wire rack to cool completely."
      }
    ]
  }
}
```

Even more control#

Now, the output already follows JSON thanks to the system prompt, but we cannot know beforehand which fields it contains. Let’s fix this by introducing a schema: a structured definition of our output.

We start building our schema through a typed BaseModel in pydantic, which will be converted to a grammar-like format called GBNF that you can read about here: ggerganov/llama.cpp

If you prefer not to use pydantic, you can pass a JSON schema directly, which again will be converted to GBNF.

Here is an example of a schema that forces the output to contain three fields (“questions”, “score”, “summary”), all very useful for extracting information from a larger document. Note how you can specify the types, constrain “score” to specific values through the enum keyword, and bound array sizes with minItems/maxItems.

schema = {
    "type": "object",
    "properties": {
        "questions": {
            "type": "array",
            "minItems": 1,
            "maxItems": 3,
            "items": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
        "score": {"type": "integer", "enum": [0, 1, 2, 3]},
        "summary": {"type": "string"},
    },
    "required": ["questions", "score", "summary"],
}
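
In practice the backend enforces these constraints during generation via the GBNF grammar, but a hand-rolled check makes the schema’s semantics concrete. A minimal sketch mirroring the three constraints above (illustration only, not a general JSON Schema validator):

```python
def validate_output(data: dict) -> list[str]:
    """Return a list of violations of the example schema above."""
    errors = []
    qs = data.get("questions")
    # "questions": array of 1-3 objects (minItems/maxItems)
    if not isinstance(qs, list) or not (1 <= len(qs) <= 3):
        errors.append("questions must be an array of 1-3 items")
    # "score": constrained by the enum keyword
    if data.get("score") not in (0, 1, 2, 3):
        errors.append("score must be one of 0, 1, 2, 3")
    # "summary": plain string
    if not isinstance(data.get("summary"), str):
        errors.append("summary must be a string")
    return errors

sample = {
    "questions": [{"question": "What is the main topic?"}],
    "score": 2,
    "summary": "A short summary.",
}
print(validate_output(sample))  # []
```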

However, the easiest and most programmatic way to handle this is to define interfaces that are automatically converted to a schema before being sent through to the llama.cpp API. We continue with the recipe data format!

from pydantic import BaseModel

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class RecipeInstruction(BaseModel):
    step: int
    description: str

class Recipe(BaseModel):
    title: str
    ingredients: List[Ingredient]
    instructions: List[RecipeInstruction]
    tools: List[str]

import json

# Parse the JSON output into a Python object.
# Avoid `eval` on model/API output: it is unsafe and chokes on JSON literals
# like `true` and `null`; `json.loads` handles the JSON-formatted output safely.
json.loads(generate(prompt, json_format=Recipe.model_json_schema()))
{'title': 'Citrusy Cake Recipe',
 'ingredients': [{'name': 'all-purpose flour', 'quantity': 250, 'unit': 'g'},
  {'name': 'granulated sugar', 'quantity': 200, 'unit': 'g'},
  {'name': 'unsalted butter, softened', 'quantity': 150, 'unit': 'g'},
  {'name': 'large eggs', 'quantity': 3, 'unit': ''},
  {'name': 'freshly squeezed orange juice', 'quantity': 120, 'unit': 'ml'},
  {'name': 'freshly squeezed lemon juice', 'quantity': 60, 'unit': 'ml'},
  {'name': 'zest of 1 orange', 'quantity': 20, 'unit': 'g'},
  {'name': 'zest of 1 lemon', 'quantity': 10, 'unit': 'g'},
  {'name': 'baking powder', 'quantity': 5, 'unit': 'g'},
  {'name': 'salt', 'quantity': 2, 'unit': 'g'}],
 'instructions': [{'step': 1,
   'description': 'Preheat oven to 180°C. Grease two 20cm round cake pans and line the bottoms with parchment paper.'},
  {'step': 2,
   'description': 'In a medium bowl, whisk together flour, sugar, baking powder, and salt.'},
  {'step': 3,
   'description': 'In a large bowl, using an electric mixer, beat the butter until creamy. Add eggs one at a time, beating well after each addition.'},
  {'step': 4,
   'description': 'With the mixer on low speed, gradually add the flour mixture to the butter mixture in three parts, alternating with the orange and lemon juices, beginning and ending with the flour mixture. Beat just until combined.'},
  {'step': 5,
   'description': 'Divide the batter evenly between the prepared pans and smooth the tops.'},
  {'step': 6,
   'description': 'Bake for 25-30 minutes or until a toothpick inserted in the center comes out clean.'},
  {'step': 7,
   'description': 'Remove from the oven and let cool in the pans for 5 minutes. Then, transfer to a wire rack to cool completely.'}],
 'tools': ['electric mixer',
  'whisk',
  'measuring cups',
  'measuring spoons',
  'parchment paper',
  'wire rack']}
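
Once parsed, the structured output is an ordinary Python dict, so downstream processing becomes trivial. A minimal sketch (using a truncated copy of the ingredients above) that renders a shopping list:

```python
# Truncated copy of the parsed recipe output above.
recipe = {
    "title": "Citrusy Cake Recipe",
    "ingredients": [
        {"name": "all-purpose flour", "quantity": 250, "unit": "g"},
        {"name": "large eggs", "quantity": 3, "unit": ""},
    ],
}

def shopping_list(recipe: dict) -> list[str]:
    """Render each ingredient as a '<quantity><unit> <name>' line."""
    return [
        f"{i['quantity']}{i['unit']} {i['name']}".strip()
        for i in recipe["ingredients"]
    ]

for line in shopping_list(recipe):
    print(line)
# 250g all-purpose flour
# 3 large eggs
```

This is the payoff of schema-constrained generation: fields and types are guaranteed, so no defensive parsing is needed.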