Completions
Given a prompt, the model will return one or more predicted completions, and can also return the probabilities of alternative tokens at each position.
Create completion
POST https://api.goose.ai/v1/engines/{engine_id}/completions
Creates a new completion for the provided prompt and parameters.
Path parameters
engine_id | string | Required
The ID of the engine to use for this request.
Request Body
prompt | string or array | Optional | Defaults to <|endoftext|>
UPDATED: The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays.
If multiple prompts are provided in a single request, they will be acted upon concurrently.
Note that <|endoftext|> is the document separator that the model sees during training, so if a prompt is not specified the model will generate as if from the beginning of a new document.
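Example (a minimal sketch only, not part of this reference: the engine name, Bearer-token authentication, environment variable, and response shape are assumptions):

# Minimal completion request; engine name and auth handling are illustrative.
import os
import requests

API_KEY = os.environ["GOOSEAI_API_KEY"]   # hypothetical environment variable
ENGINE_ID = "gpt-neo-20b"                 # example engine; substitute your own

resp = requests.post(
    f"https://api.goose.ai/v1/engines/{ENGINE_ID}/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "Once upon a time", "max_tokens": 16},
    timeout=30,
)
resp.raise_for_status()
# Assuming an OpenAI-style response body with a choices array:
print(resp.json()["choices"][0]["text"])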
n | integer | Optional | Defaults to 1
NEW: The number of completions to generate per prompt. If more than 1 completion is requested, they will be fulfilled concurrently; e.g. if two prompts are provided with a requested n of 5, 10 completions will be returned, 5 for each prompt.
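As a sketch of the example above, a request body with two prompts and an n of 5 (yielding 10 completions) could be written as the following Python payload; the prompt strings are purely illustrative:

# Two prompts with n=5 -> 10 completions, 5 per prompt.
payload = {
    "prompt": ["The goose said", "The geese replied"],
    "n": 5,
    "max_tokens": 16,
}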
max_tokens | integer | Optional | Defaults to 16
The maximum number of tokens to generate in the completion.
The token count of your prompt plus max_tokens cannot exceed the model's context length. Most models have a context length of 2048 tokens.
min_tokens | integer | Optional | Defaults to 1
NEW: The minimum number of tokens to generate in the completion. This mostly interacts with stop sequences: if a stop token is seen before min_tokens tokens have been generated, it is emitted as ordinary output rather than ending the completion. Once the generation is longer than min_tokens, stop tokens take effect.
temperature | number | Optional | Defaults to 1.0
Number between 0 and 1.0. What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications, and 0 (argmax sampling) for ones with a well-defined answer.
We generally recommend altering this or top_p but not both.
top_p | number | Optional | Defaults to 1.0
Number between 0 and 1.0. An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
We generally recommend altering this or temperature but not both.
logit_bias | map | Optional | Defaults to null
NEW: Biases the specified tokens for or against appearing. The map is a JSON object where the keys are token IDs (which can be looked up in the Tokenizer) and the values are floating-point numbers ranging from -100 to 100. The biases are added to the logits before sampling, and the effect will vary from model to model.
NOTE: Token IDs are not directly mappable between gpt-neo-20b and other models due to the different tokenizers in use.
Example:
Suppose we wanted to ban the output of <|endoftext|> and increase the likelihood of a newline, \n:
gpt-neo-20b:
{...
"logit_bias": {"0": -100, "187": 0.5},
...}
Other Models:
{...
"logit_bias": {"50256": -100, "198": 0.5},
...}
stop | string or array | Optional | Defaults to null
NEW: Stops the completion when the string, or any of the strings in the array, is encountered.
NOTE: Tokens often include a preceding space, so you may need to provide multiple forms. Matching is performed on the sequence of tokenized strings rather than the strings themselves.
Example:
{...
"stop": " goose",
...}
{...
"stop": [" geese", " gooose", "<|endoftext|>"],
...}
top_k | integer | Optional | Defaults to 0
Truncates the sampling pool to the top_k most likely tokens (top-k sampling). A value of 0 disables the truncation.
tfs | number | Optional | Defaults to 1.0
Number between 0 and 1.0. Similar to nucleus sampling, but it sets its cutoff point based on the cumulative sum of the accelerations (second derivatives) of the sorted token probabilities rather than the probabilities themselves.
top_a | number | Optional | Defaults to 1.0
NEW: Number between 0 and 1.0. Removes all tokens that have probability below the threshold: limit = pow(max(probs), 2.0) * top_a
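As a sketch, the stated threshold applied to a plain list of probabilities (the helper below is illustrative, not part of the API):

# Illustrative top_a filter: drop tokens whose probability is below
# limit = pow(max(probs), 2.0) * top_a, as described above.
def top_a_filter(probs, top_a):
    limit = max(probs) ** 2.0 * top_a
    return [i for i, p in enumerate(probs) if p >= limit]

# With top_a=0.5 the limit is 0.6**2 * 0.5 = 0.18, so only the first two
# tokens survive:
print(top_a_filter([0.6, 0.2, 0.15, 0.05], top_a=0.5))  # -> [0, 1]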
typical_p | number | Optional | Defaults to 1.0
NEW: Number between 0 and 1.0. Selects tokens according to the expected amount of information they contribute. Ref: Typical Decoding for Natural Language Generation
stream | boolean | Optional | Defaults to false
Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.
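Example (a hedged sketch of consuming the stream; the auth header, environment variable, engine name, and per-event JSON shape are assumptions):

# Reads the data-only server-sent events described above, stopping at
# the data: [DONE] terminator.
import json
import os
import requests

API_KEY = os.environ["GOOSEAI_API_KEY"]   # hypothetical environment variable
ENGINE_ID = "gpt-neo-20b"                 # example engine; substitute your own

with requests.post(
    f"https://api.goose.ai/v1/engines/{ENGINE_ID}/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "Once upon a time", "max_tokens": 64, "stream": True},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        # Assuming each event carries a partial completion in choices[0].text:
        print(event["choices"][0]["text"], end="", flush=True)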
logprobs | integer | Optional | Defaults to null
Include the log probabilities on the logprobs most likely tokens, as well as the chosen tokens. For example, if logprobs is 5, the API will return a list of the 5 most likely tokens. The API will always return the logprob of the sampled token, so there may be up to logprobs+1 elements in the response.
echo | boolean | Optional | Defaults to false
Echo back the prompt in addition to the completion.
presence_penalty | number | Optional | Defaults to 0
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
frequency_penalty | number | Optional | Defaults to 0
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
repetition_penalty | number | Optional | Defaults to 1.0, disabled
NEW: Number between 0 and 8.0. HuggingFace's repetition penalty implementation, which uses a divisor. Ref: CTRL - A Conditional Transformer Language Model for Controllable Generation
repetition_penalty_slope | number | Optional | Defaults to 0, disabled
NEW: Number between 0 and 1.0. Slope applied to the repetition penalty: m * (x*2-1) / (1 + abs(x*2-1) * (m - 1)), x = [0, 1]
Ref: Wolfram Alpha Equation
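As a small illustration, the quoted equation can be evaluated directly; the function below merely restates that formula and is not the service implementation:

# Restates the slope curve: m * (x*2 - 1) / (1 + abs(x*2 - 1) * (m - 1)).
def repetition_penalty_slope(x, m):
    t = x * 2 - 1
    return m * t / (1 + abs(t) * (m - 1))

# Sampled across the normalized range x in [0, 1] with m = 0.5:
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(x, round(repetition_penalty_slope(x, m=0.5), 3))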
repetition_penalty_range | number | Optional | Defaults to 0, disabled
NEW: Number between 0 and 2048. The token range over which repetition_penalty and repetition_penalty_slope are applied.