Agent Fine-Tune

YEAR

2025

ABOUT

Poly (poly.app) is currently building an intelligent cloud file system. We're redesigning the traditional file storage experience to give you smarter, more native ways to search, browse and organize your files.


The objective was to fine-tune an AI agent that understands all your media, so users can find anything in natural language, summarize a folder of documents, automatically organize, tag & rename files, and more.

View ↗

KEY INITIATIVES

LLM tone dissection study, prompt–response crafting, user studies

LLM Tone Analysis

[EVALUATION CRITERIA]

How does one even start to deconstruct tone? To classify a model’s tone with any precision, I focused on seven key tonal traits — each defined as a contrasting pair, to allow for clear, quantifiable outcomes.


Traits ranged from Formality (Formal vs. Informal) to Inquisitiveness (Inquisitive vs. Declarative).

View ↗
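As a rough illustration of the rubric (a minimal Python sketch, not the actual evaluation tooling), the trait pairs can be captured as a simple mapping. Only the pairs named in this write-up are listed; the counterpart of Complex and the pair labels are assumptions filled in for the example.

```python
# Sketch of the contrasting trait-pair rubric used for tone classification.
# Pairs are those named in this case study; "Simple" as the counterpart of
# "Complex" is an assumption, and the remaining pair is omitted.
TRAIT_PAIRS = {
    "Formality":       ("Formal", "Informal"),
    "Inquisitiveness": ("Inquisitive", "Declarative"),
    "Confidence":      ("Confident", "Tentative"),
    "Persuasion":      ("Persuasive", "Suggestive"),
    "Empathy":         ("Low Empathy", "High Empathy"),
    "Complexity":      ("Complex", "Simple"),  # counterpart assumed
}
```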

[EVALUATION SET]

The next step was selecting the prompts for evaluation. I curated a set of 26 queries designed to surface the model’s tone and personality — ranging from simple questions like “Who are you?” to more complex ones that test the model's error handling.

View ↗
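To make the structure concrete, a slice of such an evaluation set might look like the sketch below; the categories and most of the queries shown are illustrative stand-ins, not the actual 26-item set.

```python
# Illustrative slice of an evaluation set: each query is tagged with the
# tonal or behavioral aspect it is meant to surface. Entries are examples,
# not the study's actual queries.
EVAL_SET = [
    {"id": 1, "category": "identity",       "query": "Who are you?"},
    {"id": 2, "category": "error_handling", "query": "Summarize the attachment I sent you."},
    {"id": 3, "category": "opinion",        "query": "Which of my two drafts is better, and why?"},
    # ... remaining queries
]
```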

[RESPONSE SAMPLE SET]

Response generation involved two steps. First, I ran inference across the five models under comparison (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash, Inflection-2.5, and Grok 2) using the full evaluation set.

View ↗
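The generation pass itself is straightforward to sketch: every model answers every query, and the raw responses are stored for annotation. The snippet below is a minimal outline; `complete()` is a placeholder for whichever provider SDK serves each model, not a real unified API.

```python
# Minimal sketch of the response-generation pass: each model in the
# comparison answers every query in the evaluation set, and the raw
# responses are written out for later trait annotation.
import json

MODELS = ["gpt-4o", "claude-3.5-sonnet", "gemini-1.5-flash", "inflection-2.5", "grok-2"]

def complete(model: str, prompt: str) -> str:
    """Placeholder for the per-provider inference call."""
    raise NotImplementedError

def generate_samples(eval_set, path="responses.jsonl"):
    with open(path, "w") as f:
        for model in MODELS:
            for item in eval_set:
                record = {
                    "model": model,
                    "query_id": item["id"],
                    "query": item["query"],
                    "response": complete(model, item["query"]),
                }
                f.write(json.dumps(record) + "\n")
```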

[RESULTS]

Then, each response was evaluated (with support from models like OpenAI's o1) to identify which tonal traits were present and which specific phrases evoked them. Trait counts per model were tallied and visualized to surface distinct patterns.

View ↗
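The tallying step reduces to counting annotations. Assuming each response has been labeled with the traits it exhibits (by an LLM judge plus manual review), a sketch of the aggregation might look like this:

```python
# Sketch of the aggregation step: count how often each tonal trait appears
# in a model's annotated responses, so the distributions can be charted.
from collections import Counter, defaultdict

def tally_traits(annotated_responses):
    """annotated_responses: iterable of {"model": str, "traits": [str, ...]}"""
    counts = defaultdict(Counter)
    for record in annotated_responses:
        counts[record["model"]].update(record["traits"])
    return counts

# e.g. counts["gpt-4o"]["Formal"] would come out to 12 in this study's tally.
```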

[KEY OBSERVATIONS]

Some of these patterns corresponded closely with the broader perceived behaviors of the models. For example…

GPT-4o: Adaptable, with the most balanced distribution across trait pairs

E.g. Formal (12) vs Informal (12), Confident (8) vs Tentative (11), Inquisitive (7) vs Declarative (8)


Claude 3.5 Sonnet: Prompt deflection bias, with the most sharply contrasting trait pairs

E.g. Inquisitive (17) vs Declarative (2), Confident (4) vs Tentative (14), Persuasive (2) vs Suggestive (13), and Low Empathy (1) vs High Empathy (15)


Gemini 1.5 Flash: Supportive, non-confrontational

E.g. Strong in Suggestive (14) and High Empathy (14)


Inflection-2.5: Strongly conversation-oriented

E.g. Highest in Informal (15), lowest in Complex (6) among all models


Grok-2: Character layer very distinct from its Predictive Ground Layer

E.g. Strongly favors Formal (16) and Complex (16), while being low in Informal (6)

View ↗

[CHARACTER ANALYSIS]

These patterns were then used to further analyze and profile each model’s character and personality — highlighting distinctive behaviors, key phrases, and tonal alignment based on user sentiment. This was particularly helpful when it came time to craft our own agent's Voice Guideline, where I was able to determine Optimal Tone Weights.

View ↗
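One way to picture the resulting Voice Guideline artifact is as a set of target weights per trait pair; the shape below is a hypothetical sketch and the numbers are placeholders, not the weights actually chosen for Poly's agent.

```python
# Hypothetical shape of "Optimal Tone Weights": a target balance per trait
# pair used to steer response writing. Values here are placeholders.
TONE_WEIGHTS = {
    "Formality":       {"Formal": 0.4, "Informal": 0.6},
    "Inquisitiveness": {"Inquisitive": 0.5, "Declarative": 0.5},
    "Confidence":      {"Confident": 0.7, "Tentative": 0.3},
    "Empathy":         {"Low Empathy": 0.2, "High Empathy": 0.8},
}
```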

[TONE DISSECTION]

More importantly, it let me break down each tonal trait using specific examples from the sample set, grouped into clear, referenceable categories and settings. That proved especially handy when crafting responses for fine-tuning our own model, giving me concrete ways to work in traits like Inquisitiveness or Persuasion.

View ↗

Prompt Library

[PROMPT DESIGN]

When it came time to write our own prompt–response pairs, I could quickly benchmark our initial outputs against the original 26 queries from the Evaluation Set.


It also gave me a chance to expand the input set nearly 5X, from 26 to 119 queries, introducing new categories tailored to our agent, such as File System Actions, User Intent, Dialogue Flow, Context Management, and Error Handling.

View ↗

[CHARACTERISTICS]

I also put together a set of characteristic tags to clearly define the model’s intended actions and behaviors — covering aspects like search types, disambiguation cases, context handling, and judgment calls.

View ↗

[RESPONSE WRITING]

The final step involved hand-crafting each response to reflect the previously defined tone, while embedding the model's search and file read/edit actions inline, interleaved with the text of each response.

View ↗
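Structurally, each hand-written pair can be thought of as alternating text and action segments, annotated with the characteristic tags described above. The schema, tool names, and content below are illustrative only, not Poly's actual training format.

```python
# Sketch of a single fine-tuning example: tagged query, plus a response that
# interleaves written text with the agent's search / file actions.
example = {
    "query": "Find last month's invoices and put them in one folder.",
    "tags": ["search:metadata", "action:move", "judgment:grouping"],
    "response": [
        {"type": "text", "content": "Sure, let me look for invoices from last month."},
        {"type": "action", "tool": "search_files",
         "args": {"query": "invoice", "modified_within": "last_month"}},
        {"type": "action", "tool": "move_files",
         "args": {"destination": "/Invoices/Last Month"}},
        {"type": "text", "content": "Done! They're all in one folder now."},
    ],
}
```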

Agentic Capabilities User Study

[USER STUDY]

While validating the hypothesis of integrating the agent into the file system, I also conducted a series of user studies to assess use cases, interface concepts, and other key aspects. Many of the user-identified use cases were later adapted directly into queries for the prompt-response pairs.

View ↗

[PRODUCT]

By breaking down the tonal patterns of top models and crafting Poly’s own voice through tailored prompt-response pairs, we’ve brought a fully functioning agent to life — now in Private Beta.

View ↗