## Models
GPT-Neo is an implementation of model- and data-parallel GPT-2- and GPT-3-like models by EleutherAI, built on Mesh TensorFlow for distributed support and designed specifically for TPUs.
Causal language modelling is the task of predicting the token that follows a sequence of tokens. In this setting, the model attends only to the left context (the tokens to the left of the current position) (HuggingFace, n.d.), which makes it well suited to generation tasks.
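As a quick illustration of left-to-right generation, here is a minimal sketch using the `transformers` library and the public `EleutherAI/gpt-neo-125M` checkpoint (used here only for illustration; substitute a fine-tuned checkpoint as needed):

```python
# Minimal sketch: causal (left-to-right) generation with a GPT-Neo checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")

# Each new token is predicted from the tokens to its left only.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```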
Training is done using the training scripts available here.
For fine-tuning GPT-Neo-125M on the CodeClippy dataset we used the AdamW optimizer (beta1=0.9, beta2=0.95) with a GPT-3-like learning rate schedule (4k warmup steps from 0 to 5e-5 followed by 50k cosine decay steps down to 5e-6), weight decay 0.1, batch size 1024, and sequence length 2048. The relatively large batch size and low learning rate with a long warmup were chosen to avoid aggressive updates and preserve the knowledge contained in the pretrained GPT-Neo weights.
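A minimal sketch of this optimizer and schedule using `optax` (illustrative only; the exact implementation lives in the training scripts linked above):

```python
# Illustrative sketch of the schedule described above: 4k linear warmup steps
# from 0 to 5e-5, then 50k cosine decay steps down to 5e-6, with AdamW
# (beta1=0.9, beta2=0.95) and weight decay 0.1.
import optax

# optax's `decay_steps` is the total schedule length, i.e. warmup + decay.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=5e-5,
    warmup_steps=4_000,
    decay_steps=4_000 + 50_000,
    end_value=5e-6,
)

optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.95,
    weight_decay=0.1,
)
```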
For fine-tuning GPT-Neo-125M on the APPS dataset we used the AdamW optimizer (beta1=0.9, beta2=0.98) with a linear learning rate schedule (800 warmup steps from 0 to the peak LR followed by linear decay to 0; peak LR values ranged over [1e-5, 1e-4]), weight decay 0.1, batch size 256, and sequence length 1024. We trained the model for 5 epochs and selected the best checkpoint by validation loss. The language modelling objective for the APPS dataset is modified to backpropagate the loss only through the tokens of the code solution (refer to Hendrycks et al. for more details).
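A minimal sketch of this masked objective (the helper below is hypothetical, not taken from the training scripts; `-100` is the ignore index used by PyTorch's cross-entropy loss, so only solution tokens contribute to the loss):

```python
# Sketch of the modified objective: labels for the problem-description tokens
# are set to -100 (PyTorch's cross-entropy ignore index), so the loss is
# backpropagated only through the tokens of the code solution.
import torch

def build_labels(prompt_ids, solution_ids, pad_token_id, max_length=1024):
    input_ids = (prompt_ids + solution_ids)[:max_length]
    labels = ([-100] * len(prompt_ids) + solution_ids)[:max_length]

    # Pad both sequences to the fixed sequence length; padding is ignored too.
    pad = max_length - len(input_ids)
    input_ids = input_ids + [pad_token_id] * pad
    labels = labels + [-100] * pad
    return torch.tensor(input_ids), torch.tensor(labels)
```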
For fine-tuning GPT-Neo-1.3B on the APPS dataset we used the Adafactor optimizer with a linear learning rate schedule (5k warmup steps from 0 to 2e-5 followed by linear decay to 0), weight decay 0.1, batch size 24, and sequence length 1024. The choice of hyperparameters for the 1.3B model was partly dictated by hardware limitations. We trained the model for 5 epochs and selected the best checkpoint by validation loss.
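A minimal sketch of the 1.3B schedule with `optax` (illustrative; the total step count is an assumption, since it depends on the dataset size, batch size 24, and the 5 training epochs):

```python
# Illustrative sketch: Adafactor with 5k linear warmup steps from 0 to 2e-5
# followed by a linear decay to 0. `total_steps` is an assumed value.
import optax

warmup_steps = 5_000
total_steps = 60_000  # assumption, for illustration only
peak_lr = 2e-5

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                              transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0,
                              transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adafactor(learning_rate=schedule, weight_decay_rate=0.1)
```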
- To view the model cards, please click the links provided in the `Model` column below.
Model | Dataset Used | pass@1 | pass@2 | pass@5 | pass@10 |
---|---|---|---|---|---|
gpt-neo-125M | The Pile | 0.12% | 0.24% | 0.61% | 1.22% |
gpt-neo-125M | APPS (Train) | 0.06% | 0.12% | 0.30% | 0.61% |
gpt-neo-125M | APPS (Train + Test) | TBD | TBD | TBD | TBD |
gpt-neo-1.3B | APPS (Train) | TBD | TBD | TBD | TBD |
gpt-neo-1.3B | APPS (Train + Test) | TBD | TBD | TBD | TBD |
gpt-neo-125M | Code Clippy Data | 0.00% | 0.00% | 0.00% | 0.00% |
gpt-neo-125M | Code Clippy Data (Deduplicated) | 0.00% | 0.00% | 0.00% | 0.00% |
gpt-neo-125M | Code Search Net Challenge (All) | 0.00% | 0.00% | 0.00% | 0.00% |
gpt-neo-125M | Code Search Net Challenge (Python) | 0.00% | 0.00% | 0.00% | 0.00% |
gpt-neo-125M (trained from scratch) | Code Clippy Data (Deduplicated) (All) | 0.00% | 0.00% | 0.00% | 0.00% |