Large Language Model Settings: Temperature, Top P and Max Tokens

By Albert Mao

Jan 15, 2024

This article summarizes the key configuration settings of large language models, explains the advantages and disadvantages of large context windows, and provides practical examples of how changing these settings affects LLM output on various tasks.

By design, large language models generate output probabilistically, predicting the most likely next token or word based on their training, the prompt, and their settings. In our previous blogs, we discussed how prompting techniques induce human-like reasoning in LLMs, while Retrieval Augmented Generation expands these capabilities even further by giving the model access to external knowledge sources.

However, adjusting LLM settings, or parameters, influences the final output irrespective of the context and prompts fed to the model. Mastering these settings is essential to get the most out of working with large language models and steer them toward the expected behavior. Below, we discuss fundamental LLM parameters, including temperature, top P, max tokens, and the context window, and how they impact model output.

Understanding Temperature 

Simply put, temperature is a parameter, typically ranging from 0 to 1 (some APIs accept values up to 2), that controls the randomness of LLM responses. The higher the temperature, the more diverse and creative the output will be. Conversely, when the temperature is low, an LLM delivers more conservative and deterministic results.

For example, when the temperature is set to zero, an LLM will produce the same output over and over again when given the same prompt. As the temperature parameter is raised, an LLM becomes more creative and offers more diverse outputs. However, when the temperature gets too high, the output can lose meaning and become erratic or nonsensical. 
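To make this concrete, here is a minimal sketch of how temperature scaling is typically applied during sampling: the model's raw scores (logits) are divided by the temperature before being turned into a probability distribution. The four-token vocabulary and logit values are invented purely for illustration and do not come from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature):
    """Pick a next-token index from raw logits after temperature scaling."""
    if temperature == 0:
        return int(np.argmax(logits))            # greedy: always the most likely token
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())        # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy scores for four candidate next tokens.
logits = [2.0, 1.5, 0.3, -1.0]
for t in (0, 0.2, 1.0):
    samples = [sample_with_temperature(logits, t) for _ in range(10)]
    print(f"temperature={t}: {samples}")         # higher t -> more varied picks
```

At temperature 0 every draw returns the same token; as the temperature rises, lower-scoring tokens start to appear in the samples.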

Default temperature settings vary across LLMs, depend on whether the model is accessed via API or a web interface, and are typically not disclosed. For ChatGPT (GPT-3.5 and GPT-4), for example, community discussions most often cite temperature settings of 0.7 and 0.8.

Understanding Top P 

Top P, also known as nucleus sampling, is another parameter that affects the randomness of LLM output. It sets a cumulative probability threshold: only the smallest set of candidate tokens whose combined probability reaches this threshold is considered when the model generates the next token. The lower this parameter is, the more focused and predictable LLM responses are, while a higher top P value increases randomness and produces more diverse output.

For example, when top P is set to 0.1, the LLM generates deterministic, focused output, whereas setting top P to 0.8 allows less constrained and more creative responses.
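The sketch below illustrates the nucleus sampling idea with a made-up five-token distribution: tokens are sorted by probability, the smallest set whose cumulative probability reaches top P is kept, and the next token is drawn only from that set.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_top_p(probs, top_p):
    """Nucleus sampling: draw only from the smallest set of tokens whose
    cumulative probability reaches top_p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                        # token indices, most probable first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    nucleus = order[:cutoff]                               # the candidate set ("nucleus")
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

# Made-up next-token distribution over a five-token vocabulary.
probs = [0.45, 0.30, 0.15, 0.07, 0.03]
print(sample_top_p(probs, top_p=0.1))   # nucleus = top token only -> effectively deterministic
print(sample_top_p(probs, top_p=0.8))   # nucleus = top three tokens -> more variety
```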

Understanding Context Limits

While temperature and top P define the randomness of LLM responses, they don't set any size boundaries for the input accepted or the output generated by the model. LLM performance and data processing capacity are instead governed by two other parameters: the context window and max tokens.

Measured in tokens, which can be whole words, subwords, or even single characters, the context window is the amount of text an LLM can take into account at once, covering both the prompt and the response it generates. The max tokens parameter, meanwhile, caps the number of tokens the model is allowed to produce in its output; the prompt tokens plus the generated tokens must together fit within the context window.
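As a rough illustration, the snippet below checks whether a prompt plus the requested output budget fits inside a context window. It uses OpenAI's tiktoken library as one example tokenizer; the 8,192-token limit matches the GPT-4 figure cited below, and the prompt text is invented for the example.

```python
import tiktoken  # OpenAI's open-source tokenizer library (pip install tiktoken)

CONTEXT_WINDOW = 8192     # e.g. the GPT-4 limit discussed below
MAX_OUTPUT_TOKENS = 500   # how many tokens we plan to let the model generate

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-class models
prompt = "Summarize the attached quarterly report in three bullet points."
prompt_tokens = len(enc.encode(prompt))

# Prompt tokens plus the requested output must fit within the context window.
if prompt_tokens + MAX_OUTPUT_TOKENS > CONTEXT_WINDOW:
    print("Input too long: shorten the prompt or lower the max tokens setting.")
else:
    headroom = CONTEXT_WINDOW - prompt_tokens - MAX_OUTPUT_TOKENS
    print(f"{prompt_tokens} prompt tokens; {headroom} tokens of headroom remain.")
```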

Why is a large context window an advantage?

Generally speaking, the larger the context window is, the more information an LLM can "remember" when generating its output, which is crucial for providing coherent and accurate responses. Once the input exceeds the LLM context window, the model "forgets" earlier input, which can potentially result in irrelevant responses and lower-quality output.

At the same time, the size of an LLM's context window limits which prompt engineering approaches are feasible. Some techniques, such as Tree-of-Thoughts or Retrieval Augmented Generation (RAG), are only effective with large context windows, which allow feeding the model enough information to produce a higher-quality response.

As large language models develop, so do their context windows, constantly expanding the number of tokens a model can process at once. For example, while GPT-4 has a context limit of 8,192 tokens, the newer GPT-4 Turbo model supports a 128K context window. Anthropic has followed suit, expanding Claude's context limit from 9,000 tokens in the original model to 200K tokens in Claude 2.1.

When does a large context window not result in better output?

Although a large context window allows an LLM to accept long inputs, contributing to more relevant responses, expanding context limits is not always an advantage. First, the longer the input, the more time it takes the model to produce a response. Second, the computational cost of processing a large context grows as well.

Finally, Liu et al. (2023) demonstrated that large language models struggle to access information in long input contexts: they perform best when relevant information is placed at the beginning or the end of the input, and worst when it sits in the middle of a long context.

How to Adjust LLM Settings?

As shown above, adjusting temperature, top P, and max tokens can help tune an LLM to produce more relevant and accurate results. Lowering the temperature or top P to 0.1 or 0.2 reduces randomness and yields more focused answers, which is useful for tasks like code generation or data analysis scripting. Conversely, increasing temperature or top P to 0.7 or 0.8 produces more creative and diverse output, which is desirable for creative writing or storytelling. Importantly, it is generally recommended to change either temperature or top P, but not both.

Meanwhile, adjusting the max tokens parameter tailors the length of the LLM output, which is useful when configuring an LLM to produce either short-form responses for chatbots or longer-form content such as articles. Other cases where adjusting max tokens helps include code generation for software development and producing summaries of longer documents.
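As one concrete (if provider-specific) example, the sketch below uses the OpenAI Python SDK to apply these recommendations: a low temperature and a tight max tokens cap for a code-generation request, and a higher top P with a larger output budget for a creative-writing request. The model name, prompts, and numbers are illustrative; any API that exposes temperature, top P, and max tokens works similarly.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Focused, near-deterministic settings for a code-generation task.
code = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a Python function that validates email addresses."}],
    temperature=0.2,   # low randomness -> precise, repeatable output
    max_tokens=300,    # keep the answer short
)
print(code.choices[0].message.content)

# Looser settings for a creative-writing task; note that only top_p is raised,
# while temperature is left at its default, per the "one or the other" guideline.
story = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a short story about a lighthouse keeper."}],
    top_p=0.8,
    max_tokens=800,    # allow a longer response
)
print(story.choices[0].message.content)
```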

Learn More with VectorShift

The output of large language models can be shaped in many ways beyond prompt engineering, for example by tuning their configuration parameters. While LLMs expose a number of adjustable settings, controlling these three – temperature, top P, and max tokens – enables users to effectively configure an LLM for various tasks.

At the same time, the no-code functionality and SDK interfaces offered on the VectorShift platform make it much easier to leverage AI capabilities and adjust the output of large language models. For more information on LLM applications and adjustments, please don't hesitate to get in touch with our team or request a free demo.

© 2023 VectorShift, Inc. All Rights Reserved.
