Tutorial: Build a multi-turn chat assistant with Foundry Local

In this tutorial, you build an interactive chat assistant that runs entirely on your device. The assistant maintains conversation context across multiple exchanges, so it remembers what you discussed earlier in the conversation. You use the Foundry Local SDK to select a model, define a system prompt, and stream responses token by token.

In this tutorial, you learn how to:

  • Set up a project and install the Foundry Local SDK
  • Browse the model catalog and select a model
  • Define a system prompt to shape assistant behavior
  • Implement multi-turn conversation with message history
  • Stream responses for a responsive experience
  • Clean up resources when the conversation ends

Prerequisites

  • A Windows, macOS, or Linux computer with at least 8 GB of RAM.

Samples repository

The complete sample code for this article is available in the Foundry Local GitHub repository. To clone the repository and navigate to the sample, run:

git clone https://github.com/microsoft/Foundry-Local.git
cd Foundry-Local/samples/cs/tutorial-chat-assistant

Install packages

If you're developing or shipping on Windows, select the Windows tab. The Windows package integrates with the Windows ML runtime — it provides the same API surface area with a wider breadth of hardware acceleration.

dotnet add package Microsoft.AI.Foundry.Local.WinML
dotnet add package OpenAI

The C# samples in the GitHub repository are preconfigured projects. If you're building from scratch, see the Foundry Local SDK reference for details on setting up a C# project with Foundry Local.

Browse the catalog and select a model

The Foundry Local SDK provides a model catalog that lists all available models. In this step, you initialize the SDK and select a model for your chat assistant.

  • Open Program.cs and replace its contents with the following code to initialize the SDK and select a model:

    CancellationToken ct = CancellationToken.None;
    
    var config = new Configuration
    {
        AppName = "foundry_local_samples",
        LogLevel = Microsoft.AI.Foundry.Local.LogLevel.Information
    };
    
    using var loggerFactory = LoggerFactory.Create(builder =>
    {
        builder.SetMinimumLevel(Microsoft.Extensions.Logging.LogLevel.Information);
    });
    var logger = loggerFactory.CreateLogger<Program>();
    
    // Initialize the singleton instance
    await FoundryLocalManager.CreateAsync(config, logger);
    var mgr = FoundryLocalManager.Instance;
    
    // Select and load a model from the catalog
    var catalog = await mgr.GetCatalogAsync();
    var model = await catalog.GetModelAsync("qwen2.5-0.5b")
        ?? throw new Exception("Model not found");
    
    await model.DownloadAsync(progress =>
    {
        Console.Write($"\rDownloading model: {progress:F2}%");
        if (progress >= 100f) Console.WriteLine();
    });
    
    await model.LoadAsync();
    Console.WriteLine("Model loaded and ready.");
    
    // Get a chat client
    var chatClient = await model.GetChatClientAsync();
    

    The GetModelAsync method accepts a model alias, which is a short friendly name that maps to a specific model in the catalog. The DownloadAsync method fetches the model weights to your local cache, and LoadAsync makes the model ready for inference.

Define a system prompt

A system prompt sets the assistant's personality and behavior. It's the first message in the conversation history and the model references it throughout the conversation.

Add a system prompt to shape how the assistant responds:

// Start the conversation with a system prompt
var messages = new List<ChatMessage>
{
    new ChatMessage
    {
        Role = "system",
        Content = "You are a helpful, friendly assistant. Keep your responses " +
                  "concise and conversational. If you don't know something, say so."
    }
};

Tip

Experiment with different system prompts to change the assistant's behavior. For example, you can instruct it to respond as a pirate, a teacher, or a domain expert.

Implement multi-turn conversation

A chat assistant needs to maintain context across multiple exchanges. You achieve this by keeping a list of all messages (system, user, and assistant) and sending the full list with each request. The model uses this history to generate contextually relevant responses.

Add a conversation loop that:

  • Reads user input from the console.
  • Appends the user message to the history.
  • Sends the complete history to the model.
  • Appends the assistant's response to the history for the next turn.
while (true)
{
    Console.Write("You: ");
    var userInput = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(userInput) ||
        userInput.Equals("quit", StringComparison.OrdinalIgnoreCase) ||
        userInput.Equals("exit", StringComparison.OrdinalIgnoreCase))
    {
        break;
    }

    // Add the user's message to conversation history
    messages.Add(new ChatMessage { Role = "user", Content = userInput });

    // Stream the response token by token
    Console.Write("Assistant: ");
    var fullResponse = string.Empty;
    var streamingResponse = chatClient.CompleteChatStreamingAsync(messages, ct);
    await foreach (var chunk in streamingResponse)
    {
        var content = chunk.Choices[0].Message.Content;
        if (!string.IsNullOrEmpty(content))
        {
            Console.Write(content);
            Console.Out.Flush();
            fullResponse += content;
        }
    }
    Console.WriteLine("\n");

    // Add the complete response to conversation history
    messages.Add(new ChatMessage { Role = "assistant", Content = fullResponse });
}

Each call to CompleteChatStreamingAsync receives the full message history. This is how the model "remembers" previous turns: it doesn't store any state between calls.

Add streaming responses

Streaming prints each token as it's generated, which makes the assistant feel more responsive. That's why the conversation loop calls CompleteChatStreamingAsync rather than the blocking CompleteChatAsync, which returns only after the entire response is generated.

Here is the streaming portion of the loop in isolation:

// Stream the response token by token
Console.Write("Assistant: ");
var fullResponse = string.Empty;
var streamingResponse = chatClient.CompleteChatStreamingAsync(messages, ct);
await foreach (var chunk in streamingResponse)
{
    var content = chunk.Choices[0].Message.Content;
    if (!string.IsNullOrEmpty(content))
    {
        Console.Write(content);
        Console.Out.Flush();
        fullResponse += content;
    }
}
Console.WriteLine("\n");

The streaming version accumulates the full response so it can be added to the conversation history after the stream completes.

Complete code

Replace the contents of Program.cs with the following complete code:

using Microsoft.AI.Foundry.Local;
using Betalgo.Ranul.OpenAI.ObjectModels.RequestModels;
using Microsoft.Extensions.Logging;

CancellationToken ct = CancellationToken.None;

var config = new Configuration
{
    AppName = "foundry_local_samples",
    LogLevel = Microsoft.AI.Foundry.Local.LogLevel.Information
};

using var loggerFactory = LoggerFactory.Create(builder =>
{
    builder.SetMinimumLevel(Microsoft.Extensions.Logging.LogLevel.Information);
});
var logger = loggerFactory.CreateLogger<Program>();

// Initialize the singleton instance
await FoundryLocalManager.CreateAsync(config, logger);
var mgr = FoundryLocalManager.Instance;

// Select and load a model from the catalog
var catalog = await mgr.GetCatalogAsync();
var model = await catalog.GetModelAsync("qwen2.5-0.5b")
    ?? throw new Exception("Model not found");

await model.DownloadAsync(progress =>
{
    Console.Write($"\rDownloading model: {progress:F2}%");
    if (progress >= 100f) Console.WriteLine();
});

await model.LoadAsync();
Console.WriteLine("Model loaded and ready.");

// Get a chat client
var chatClient = await model.GetChatClientAsync();

// Start the conversation with a system prompt
var messages = new List<ChatMessage>
{
    new ChatMessage
    {
        Role = "system",
        Content = "You are a helpful, friendly assistant. Keep your responses " +
                  "concise and conversational. If you don't know something, say so."
    }
};

Console.WriteLine("\nChat assistant ready! Type 'quit' to exit.\n");

while (true)
{
    Console.Write("You: ");
    var userInput = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(userInput) ||
        userInput.Equals("quit", StringComparison.OrdinalIgnoreCase) ||
        userInput.Equals("exit", StringComparison.OrdinalIgnoreCase))
    {
        break;
    }

    // Add the user's message to conversation history
    messages.Add(new ChatMessage { Role = "user", Content = userInput });

    // Stream the response token by token
    Console.Write("Assistant: ");
    var fullResponse = string.Empty;
    var streamingResponse = chatClient.CompleteChatStreamingAsync(messages, ct);
    await foreach (var chunk in streamingResponse)
    {
        var content = chunk.Choices[0].Message.Content;
        if (!string.IsNullOrEmpty(content))
        {
            Console.Write(content);
            Console.Out.Flush();
            fullResponse += content;
        }
    }
    Console.WriteLine("\n");

    // Add the complete response to conversation history
    messages.Add(new ChatMessage { Role = "assistant", Content = fullResponse });
}

// Clean up - unload the model
await model.UnloadAsync();
Console.WriteLine("Model unloaded. Goodbye!");

Run the chat assistant:

dotnet run

You see output similar to:

Downloading model: 100.00%
Model loaded and ready.

Chat assistant ready! Type 'quit' to exit.

You: What is photosynthesis?
Assistant: Photosynthesis is the process plants use to convert sunlight, water, and carbon
dioxide into glucose and oxygen. It mainly happens in the leaves, inside structures
called chloroplasts.

You: Why is it important for other living things?
Assistant: It's essential because photosynthesis produces the oxygen that most living things
breathe. It also forms the base of the food chain — animals eat plants or eat other
animals that depend on plants for energy.

You: quit
Model unloaded. Goodbye!

Notice how the assistant remembers context from previous turns — when you ask "Why is it important for other living things?", it knows you're still talking about photosynthesis.

Samples repository

The complete sample code for this article is available in the Foundry Local GitHub repository. To clone the repository and navigate to the sample, run:

git clone https://github.com/microsoft/Foundry-Local.git
cd Foundry-Local/samples/js/tutorial-chat-assistant

Install packages

If you're developing or shipping on Windows, select the Windows tab. The Windows package integrates with the Windows ML runtime — it provides the same API surface area with a wider breadth of hardware acceleration.

npm install foundry-local-sdk-winml openai

Browse the catalog and select a model

The Foundry Local SDK provides a model catalog that lists all available models. In this step, you initialize the SDK and select a model for your chat assistant.

  1. Create a file called index.js.

  2. Add the following code to initialize the SDK and select a model:

    // Initialize the Foundry Local SDK
    const manager = FoundryLocalManager.create({
        appName: 'foundry_local_samples',
        logLevel: 'info'
    });
    
    // Select and load a model from the catalog
    const model = await manager.catalog.getModel('qwen2.5-0.5b');
    
    await model.download((progress) => {
        process.stdout.write(`\rDownloading model: ${progress.toFixed(2)}%`);
    });
    console.log('\nModel downloaded.');
    
    await model.load();
    console.log('Model loaded and ready.');
    
    // Create a chat client
    const chatClient = model.createChatClient();
    

    The getModel method accepts a model alias, which is a short friendly name that maps to a specific model in the catalog. The download method fetches the model weights to your local cache, and load makes the model ready for inference.

Define a system prompt

A system prompt sets the assistant's personality and behavior. It's the first message in the conversation history and the model references it throughout the conversation.

Add a system prompt to shape how the assistant responds:

// Start the conversation with a system prompt
const messages = [
    {
        role: 'system',
        content: 'You are a helpful, friendly assistant. Keep your responses ' +
                 'concise and conversational. If you don\'t know something, say so.'
    }
];

Tip

Experiment with different system prompts to change the assistant's behavior. For example, you can instruct it to respond as a pirate, a teacher, or a domain expert.

Implement multi-turn conversation

A chat assistant needs to maintain context across multiple exchanges. You achieve this by keeping a list of all messages (system, user, and assistant) and sending the full list with each request. The model uses this history to generate contextually relevant responses.

Add a conversation loop that:

  • Reads user input from the console.
  • Appends the user message to the history.
  • Sends the complete history to the model.
  • Appends the assistant's response to the history for the next turn.
while (true) {
    const userInput = await askQuestion('You: ');
    if (userInput.trim().toLowerCase() === 'quit' ||
        userInput.trim().toLowerCase() === 'exit') {
        break;
    }

    // Add the user's message to conversation history
    messages.push({ role: 'user', content: userInput });

    // Stream the response token by token
    process.stdout.write('Assistant: ');
    let fullResponse = '';
    await chatClient.completeStreamingChat(messages, (chunk) => {
        const content = chunk.choices?.[0]?.message?.content;
        if (content) {
            process.stdout.write(content);
            fullResponse += content;
        }
    });
    console.log('\n');

    // Add the complete response to conversation history
    messages.push({ role: 'assistant', content: fullResponse });
}

Each call to completeStreamingChat receives the full message history. This is how the model "remembers" previous turns: it doesn't store any state between calls.

Add streaming responses

Streaming prints each token as it's generated, which makes the assistant feel more responsive. That's why the conversation loop calls completeStreamingChat rather than the blocking completeChat, which returns only after the entire response is generated.

Here is the streaming portion of the loop in isolation:

// Stream the response token by token
process.stdout.write('Assistant: ');
let fullResponse = '';
await chatClient.completeStreamingChat(messages, (chunk) => {
    const content = chunk.choices?.[0]?.message?.content;
    if (content) {
        process.stdout.write(content);
        fullResponse += content;
    }
});
console.log('\n');

The streaming version accumulates the full response so it can be added to the conversation history after the stream completes.

Complete code

Create a file named index.js and add the following complete code:

import { FoundryLocalManager } from 'foundry-local-sdk';
import * as readline from 'readline';

// Initialize the Foundry Local SDK
const manager = FoundryLocalManager.create({
    appName: 'foundry_local_samples',
    logLevel: 'info'
});

// Select and load a model from the catalog
const model = await manager.catalog.getModel('qwen2.5-0.5b');

await model.download((progress) => {
    process.stdout.write(`\rDownloading model: ${progress.toFixed(2)}%`);
});
console.log('\nModel downloaded.');

await model.load();
console.log('Model loaded and ready.');

// Create a chat client
const chatClient = model.createChatClient();

// Start the conversation with a system prompt
const messages = [
    {
        role: 'system',
        content: 'You are a helpful, friendly assistant. Keep your responses ' +
                 'concise and conversational. If you don\'t know something, say so.'
    }
];

// Set up readline for console input
const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout
});

const askQuestion = (prompt) => new Promise((resolve) => rl.question(prompt, resolve));

console.log('\nChat assistant ready! Type \'quit\' to exit.\n');

while (true) {
    const userInput = await askQuestion('You: ');
    if (userInput.trim().toLowerCase() === 'quit' ||
        userInput.trim().toLowerCase() === 'exit') {
        break;
    }

    // Add the user's message to conversation history
    messages.push({ role: 'user', content: userInput });

    // Stream the response token by token
    process.stdout.write('Assistant: ');
    let fullResponse = '';
    await chatClient.completeStreamingChat(messages, (chunk) => {
        const content = chunk.choices?.[0]?.message?.content;
        if (content) {
            process.stdout.write(content);
            fullResponse += content;
        }
    });
    console.log('\n');

    // Add the complete response to conversation history
    messages.push({ role: 'assistant', content: fullResponse });
}

// Clean up - unload the model
await model.unload();
console.log('Model unloaded. Goodbye!');
rl.close();

Run the chat assistant:

node index.js

You see output similar to:

Downloading model: 100.00%
Model downloaded.
Model loaded and ready.

Chat assistant ready! Type 'quit' to exit.

You: What is photosynthesis?
Assistant: Photosynthesis is the process plants use to convert sunlight, water, and carbon
dioxide into glucose and oxygen. It mainly happens in the leaves, inside structures
called chloroplasts.

You: Why is it important for other living things?
Assistant: It's essential because photosynthesis produces the oxygen that most living things
breathe. It also forms the base of the food chain — animals eat plants or eat other
animals that depend on plants for energy.

You: quit
Model unloaded. Goodbye!

Notice how the assistant remembers context from previous turns — when you ask "Why is it important for other living things?", it knows you're still talking about photosynthesis.

Samples repository

The complete sample code for this article is available in the Foundry Local GitHub repository. To clone the repository and navigate to the sample, run:

git clone https://github.com/microsoft/Foundry-Local.git
cd Foundry-Local/samples/python/tutorial-chat-assistant

Install packages

If you're developing or shipping on Windows, select the Windows tab. The Windows package integrates with the Windows ML runtime — it provides the same API surface area with a wider breadth of hardware acceleration.

pip install foundry-local-sdk-winml openai

Browse the catalog and select a model

The Foundry Local SDK provides a model catalog that lists all available models. In this step, you initialize the SDK and select a model for your chat assistant.

  1. Create a file called main.py.

  2. Add the following code to initialize the SDK and select a model:

    # Initialize the Foundry Local SDK
    config = Configuration(app_name="foundry_local_samples")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance
    
    # Select and load a model from the catalog
    model = manager.catalog.get_model("qwen2.5-0.5b")
    model.download(lambda progress: print(f"\rDownloading model: {progress:.2f}%", end="", flush=True))
    print()
    model.load()
    print("Model loaded and ready.")
    
    # Get a chat client
    client = model.get_chat_client()
    

    The get_model method accepts a model alias, which is a short friendly name that maps to a specific model in the catalog. The download method fetches the model weights to your local cache, and load makes the model ready for inference.

Define a system prompt

A system prompt sets the assistant's personality and behavior. It's the first message in the conversation history and the model references it throughout the conversation.

Add a system prompt to shape how the assistant responds:

# Start the conversation with a system prompt
messages = [
    {
        "role": "system",
        "content": "You are a helpful, friendly assistant. Keep your responses "
                   "concise and conversational. If you don't know something, say so."
    }
]

Tip

Experiment with different system prompts to change the assistant's behavior. For example, you can instruct it to respond as a pirate, a teacher, or a domain expert.
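Changing the persona means changing only the first entry in the conversation history; everything else in the loop stays the same. As an illustration (this prompt text is an example of ours, not part of the sample), a pirate-themed assistant looks like this:

```python
# Alternative system prompt: same message structure, different persona.
# Only the first entry in the conversation history changes.
pirate_messages = [
    {
        "role": "system",
        "content": "You are a salty pirate captain. Answer every question "
                   "accurately, but phrase it the way a pirate would."
    }
]

# The rest of the conversation history is built exactly as before.
pirate_messages.append({"role": "user", "content": "What is photosynthesis?"})
```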

Implement multi-turn conversation

A chat assistant needs to maintain context across multiple exchanges. You achieve this by keeping a list of all messages (system, user, and assistant) and sending the full list with each request. The model uses this history to generate contextually relevant responses.

Add a conversation loop that:

  • Reads user input from the console.
  • Appends the user message to the history.
  • Sends the complete history to the model.
  • Appends the assistant's response to the history for the next turn.
while True:
    user_input = input("You: ")
    if user_input.strip().lower() in ("quit", "exit"):
        break

    # Add the user's message to conversation history
    messages.append({"role": "user", "content": user_input})

    # Stream the response token by token
    print("Assistant: ", end="", flush=True)
    full_response = ""
    for chunk in client.complete_streaming_chat(messages):
        content = chunk.choices[0].message.content
        if content:
            print(content, end="", flush=True)
            full_response += content
    print("\n")

    # Add the complete response to conversation history
    messages.append({"role": "assistant", "content": full_response})

Each call to complete_streaming_chat receives the full message history. This is how the model "remembers" previous turns: it doesn't store any state between calls.
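Because the full history is resent on every request, a long conversation eventually exceeds the model's context window. One common mitigation, sketched here as a plain helper function (not part of the Foundry Local SDK), is to keep the system prompt and drop all but the most recent turns:

```python
def trim_history(messages, max_turns=8):
    """Keep the system prompt plus the last `max_turns` user/assistant messages.

    `messages` uses the same list-of-dicts shape as the conversation loop.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

# Example: a system prompt plus 10 exchanges (20 user/assistant messages).
history = [{"role": "system", "content": "You are helpful."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=8)
# The system prompt survives, followed by the 8 most recent messages.
```

You'd call a helper like this just before each request. Keep in mind the trade-off: the model genuinely forgets trimmed turns, so some applications summarize old turns into a single message instead of discarding them.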

Add streaming responses

Streaming prints each token as it's generated, which makes the assistant feel more responsive. That's why the conversation loop calls complete_streaming_chat rather than the blocking complete_chat, which returns only after the entire response is generated.

Here is the streaming portion of the loop in isolation:

# Stream the response token by token
print("Assistant: ", end="", flush=True)
full_response = ""
for chunk in client.complete_streaming_chat(messages):
    content = chunk.choices[0].message.content
    if content:
        print(content, end="", flush=True)
        full_response += content
print("\n")

The streaming version accumulates the full response so it can be added to the conversation history after the stream completes.
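The accumulation pattern itself is independent of the SDK. This sketch simulates a stream with a plain generator (the fragments here are stand-ins, not the SDK's chunk types) to show why the pieces are displayed immediately but joined before going into the history:

```python
def fake_stream():
    # Stand-in for a streaming chat call: yields response fragments in order.
    for piece in ["Photo", "synthesis", " is", " how", " plants", " make", " food."]:
        yield piece

full_response = ""
for content in fake_stream():
    print(content, end="", flush=True)   # show each fragment as it arrives
    full_response += content             # accumulate the complete text
print()

# Only the complete string belongs in the conversation history, e.g.:
# messages.append({"role": "assistant", "content": full_response})
```

Appending individual fragments to the history would split one assistant turn into many partial messages, which would confuse the model on the next request.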

Complete code

Create a file named main.py and add the following complete code:

import asyncio
from foundry_local_sdk import Configuration, FoundryLocalManager


async def main():
    # Initialize the Foundry Local SDK
    config = Configuration(app_name="foundry_local_samples")
    FoundryLocalManager.initialize(config)
    manager = FoundryLocalManager.instance

    # Select and load a model from the catalog
    model = manager.catalog.get_model("qwen2.5-0.5b")
    model.download(lambda progress: print(f"\rDownloading model: {progress:.2f}%", end="", flush=True))
    print()
    model.load()
    print("Model loaded and ready.")

    # Get a chat client
    client = model.get_chat_client()

    # Start the conversation with a system prompt
    messages = [
        {
            "role": "system",
            "content": "You are a helpful, friendly assistant. Keep your responses "
                       "concise and conversational. If you don't know something, say so."
        }
    ]

    print("\nChat assistant ready! Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ")
        if user_input.strip().lower() in ("quit", "exit"):
            break

        # Add the user's message to conversation history
        messages.append({"role": "user", "content": user_input})

        # Stream the response token by token
        print("Assistant: ", end="", flush=True)
        full_response = ""
        for chunk in client.complete_streaming_chat(messages):
            content = chunk.choices[0].message.content
            if content:
                print(content, end="", flush=True)
                full_response += content
        print("\n")

        # Add the complete response to conversation history
        messages.append({"role": "assistant", "content": full_response})

    # Clean up - unload the model
    model.unload()
    print("Model unloaded. Goodbye!")


if __name__ == "__main__":
    asyncio.run(main())

Run the chat assistant:

python main.py

You see output similar to:

Downloading model: 100.00%
Model loaded and ready.

Chat assistant ready! Type 'quit' to exit.

You: What is photosynthesis?
Assistant: Photosynthesis is the process plants use to convert sunlight, water, and carbon
dioxide into glucose and oxygen. It mainly happens in the leaves, inside structures
called chloroplasts.

You: Why is it important for other living things?
Assistant: It's essential because photosynthesis produces the oxygen that most living things
breathe. It also forms the base of the food chain — animals eat plants or eat other
animals that depend on plants for energy.

You: quit
Model unloaded. Goodbye!

Notice how the assistant remembers context from previous turns — when you ask "Why is it important for other living things?", it knows you're still talking about photosynthesis.
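At the moment you type quit in the session above, the messages list holds five entries: the system prompt plus two user/assistant pairs. A condensed sketch of its shape (content abbreviated):

```python
# Shape of the conversation history after the two exchanges above.
messages = [
    {"role": "system",    "content": "You are a helpful, friendly assistant. ..."},
    {"role": "user",      "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "Photosynthesis is the process plants use ..."},
    {"role": "user",      "content": "Why is it important for other living things?"},
    {"role": "assistant", "content": "It's essential because photosynthesis produces ..."},
]

roles = [m["role"] for m in messages]
```

The second question never mentions photosynthesis by name; the model resolves "it" only because the earlier turns travel with the request.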

Samples repository

The complete sample code for this article is available in the Foundry Local GitHub repository. To clone the repository and navigate to the sample, run:

git clone https://github.com/microsoft/Foundry-Local.git
cd Foundry-Local/samples/rust/tutorial-chat-assistant

Install packages

If you're developing or shipping on Windows, select the Windows tab. The Windows package integrates with the Windows ML runtime — it provides the same API surface area with a wider breadth of hardware acceleration.

cargo add foundry-local-sdk --features winml
cargo add tokio --features full
cargo add tokio-stream anyhow serde_json

Browse the catalog and select a model

The Foundry Local SDK provides a model catalog that lists all available models. In this step, you initialize the SDK and select a model for your chat assistant.

  • Open src/main.rs and replace its contents with the following code to initialize the SDK and select a model:

    // Initialize the Foundry Local SDK
    let manager = FoundryLocalManager::create(FoundryLocalConfig::new("chat-assistant"))?;
    
    // Select and load a model from the catalog
    let model = manager.catalog().get_model("qwen2.5-0.5b").await?;
    
    if !model.is_cached().await? {
        println!("Downloading model...");
        model
            .download(Some(|progress: f64| {
                print!("\r  {progress:.1}%");
                io::stdout().flush().ok();
            }))
            .await?;
        println!();
    }
    
    model.load().await?;
    println!("Model loaded and ready.");
    
    // Create a chat client
    let client = model.create_chat_client().temperature(0.7).max_tokens(512);
    

    The get_model method accepts a model alias, which is a short friendly name that maps to a specific model in the catalog. The download method fetches the model weights to your local cache, and load makes the model ready for inference.

Define a system prompt

A system prompt sets the assistant's personality and behavior. It's the first message in the conversation history and the model references it throughout the conversation.

Add a system prompt to shape how the assistant responds:

// Start the conversation with a system prompt
let mut messages: Vec<ChatCompletionRequestMessage> = vec![
    ChatCompletionRequestSystemMessage::from(
        "You are a helpful, friendly assistant. Keep your responses \
         concise and conversational. If you don't know something, say so.",
    )
    .into(),
];

Tip

Experiment with different system prompts to change the assistant's behavior. For example, you can instruct it to respond as a pirate, a teacher, or a domain expert.

Implement multi-turn conversation

A chat assistant needs to maintain context across multiple exchanges. You achieve this by keeping a vector of all messages (system, user, and assistant) and sending the full list with each request. The model uses this history to generate contextually relevant responses.

Add a conversation loop that:

  • Reads user input from the console.
  • Appends the user message to the history.
  • Sends the complete history to the model.
  • Appends the assistant's response to the history for the next turn.
loop {
    print!("You: ");
    io::stdout().flush()?;

    let mut input = String::new();
    stdin.lock().read_line(&mut input)?;
    let input = input.trim();

    if input.eq_ignore_ascii_case("quit") || input.eq_ignore_ascii_case("exit") {
        break;
    }

    // Add the user's message to conversation history
    messages.push(ChatCompletionRequestUserMessage::from(input).into());

    // Stream the response token by token
    print!("Assistant: ");
    io::stdout().flush()?;
    let mut full_response = String::new();
    let mut stream = client.complete_streaming_chat(&messages, None).await?;
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        if let Some(choice) = chunk.choices.first() {
            if let Some(ref content) = choice.delta.content {
                print!("{content}");
                io::stdout().flush()?;
                full_response.push_str(content);
            }
        }
    }
    println!("\n");

    // Add the complete response to conversation history
    let assistant_msg: ChatCompletionRequestMessage = serde_json::from_value(
        serde_json::json!({"role": "assistant", "content": full_response}),
    )?;
    messages.push(assistant_msg);
}

Each call to complete_streaming_chat receives the full message history. This is how the model "remembers" previous turns: it doesn't store any state between calls.

Add streaming responses

Streaming prints each token as it's generated, which makes the assistant feel more responsive. That's why the conversation loop calls complete_streaming_chat rather than the blocking complete_chat, which returns only after the entire response is generated.

Here is the streaming portion of the loop in isolation:

// Stream the response token by token
print!("Assistant: ");
io::stdout().flush()?;
let mut full_response = String::new();
let mut stream = client.complete_streaming_chat(&messages, None).await?;
while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    if let Some(choice) = chunk.choices.first() {
        if let Some(ref content) = choice.delta.content {
            print!("{content}");
            io::stdout().flush()?;
            full_response.push_str(content);
        }
    }
}
println!("\n");

The streaming version accumulates the full response so it can be added to the conversation history after the stream completes.

Complete code

Replace the contents of src/main.rs with the following complete code:

use foundry_local_sdk::{
    ChatCompletionRequestMessage,
    ChatCompletionRequestSystemMessage, ChatCompletionRequestUserMessage,
    FoundryLocalConfig, FoundryLocalManager,
};
use std::io::{self, BufRead, Write};
use tokio_stream::StreamExt;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Initialize the Foundry Local SDK
    let manager = FoundryLocalManager::create(FoundryLocalConfig::new("chat-assistant"))?;

    // Select and load a model from the catalog
    let model = manager.catalog().get_model("qwen2.5-0.5b").await?;

    if !model.is_cached().await? {
        println!("Downloading model...");
        model
            .download(Some(|progress: f64| {
                print!("\r  {progress:.1}%");
                io::stdout().flush().ok();
            }))
            .await?;
        println!();
    }

    model.load().await?;
    println!("Model loaded and ready.");

    // Create a chat client
    let client = model.create_chat_client().temperature(0.7).max_tokens(512);

    // Start the conversation with a system prompt
    let mut messages: Vec<ChatCompletionRequestMessage> = vec![
        ChatCompletionRequestSystemMessage::from(
            "You are a helpful, friendly assistant. Keep your responses \
             concise and conversational. If you don't know something, say so.",
        )
        .into(),
    ];

    println!("\nChat assistant ready! Type 'quit' to exit.\n");

    let stdin = io::stdin();
    loop {
        print!("You: ");
        io::stdout().flush()?;

        let mut input = String::new();
        stdin.lock().read_line(&mut input)?;
        let input = input.trim();

        if input.eq_ignore_ascii_case("quit") || input.eq_ignore_ascii_case("exit") {
            break;
        }

        // Add the user's message to conversation history
        messages.push(ChatCompletionRequestUserMessage::from(input).into());

        // Stream the response token by token
        print!("Assistant: ");
        io::stdout().flush()?;
        let mut full_response = String::new();
        let mut stream = client.complete_streaming_chat(&messages, None).await?;
        while let Some(chunk) = stream.next().await {
            let chunk = chunk?;
            if let Some(choice) = chunk.choices.first() {
                if let Some(ref content) = choice.delta.content {
                    print!("{content}");
                    io::stdout().flush()?;
                    full_response.push_str(content);
                }
            }
        }
        println!("\n");

        // Add the complete response to conversation history
        let assistant_msg: ChatCompletionRequestMessage = serde_json::from_value(
            serde_json::json!({"role": "assistant", "content": full_response}),
        )?;
        messages.push(assistant_msg);
    }

    // Clean up - unload the model
    model.unload().await?;
    println!("Model unloaded. Goodbye!");

    Ok(())
}

Run the chat assistant:

cargo run

You see output similar to:

Downloading model...
  100.0%
Model loaded and ready.

Chat assistant ready! Type 'quit' to exit.

You: What is photosynthesis?
Assistant: Photosynthesis is the process plants use to convert sunlight, water, and carbon
dioxide into glucose and oxygen. It mainly happens in the leaves, inside structures
called chloroplasts.

You: Why is it important for other living things?
Assistant: It's essential because photosynthesis produces the oxygen that most living things
breathe. It also forms the base of the food chain — animals eat plants or eat other
animals that depend on plants for energy.

You: quit
Model unloaded. Goodbye!

Notice how the assistant remembers context from previous turns — when you ask "Why is it important for other living things?", it knows you're still talking about photosynthesis.

Clean up resources

The model weights remain in your local cache after you unload a model. This means the next time you run the application, the download step is skipped and the model loads faster. No extra cleanup is needed unless you want to reclaim disk space.