0xideas/whale-gpt

Hello!

This repo is a first attempt to build WhaleGPT, a transformer 'language' model for whale language. The data and codas used in this model come from Sharma et al. At this point the data is likely insufficient for a model that is directly useful, so this work should be considered experimental.

The approach is as follows:

  1. The data used in Sharma et al. contains click sequences of up to 28 inter-click intervals, i.e. 29 clicks, but the codas used to encode these click sequences have at most 9 inter-click intervals, i.e. up to 10 clicks. Longer click sequences cannot be encoded as a single coda, and simply discarding surplus click intervals would be rash. We have therefore developed an algorithm to encode click sequences into codas, contained in scripts/0_extract_codas.py. This algorithm encodes all click sequences, here called 'vocalizations', into sequences of codas; this applies to all vocalizations, including those consisting of 10 clicks or fewer. The algorithm takes the mean relative interval sequence for each coda (i.e. the mean time percentile of each click in the coda) and recursively divides the vocalization into codas or 'surplus' clicks (which can usually be interpreted as ornamentation). The candidate subdivisions are scored by Manhattan distance, and the sequence of codas with the minimum total distance is taken as the encoding of the vocalization into codas (see the first sketch after this list).
  2. These coda sequences are then encoded into dialogue form using scripts/1_create_dialogue.py. Two main decisions govern how these overlapping coda sequences by multiple whales are represented in discrete form: (1) Codas that begin sufficiently close in time are considered simultaneous and encoded in a single row; the threshold used to make this determination is somewhat arbitrary and currently set to 0.3 seconds. (2) Some recordings contain two or more whales, and this can be represented in multiple ways. Here, we adopt a 'me' vs 'other' encoding: for each whale contained in a recording, the codas emitted by that whale are encoded in the columns "Coda1", "Ornamentation1" and "Duration1", while the codas emitted by any of the other whales are encoded in the columns "Coda2", "Ornamentation2" and "Duration2". When either the primary whale or the other whales are silent, this is encoded with an additional token '98' (see the second sketch after this list). The data used for modelling can be found at data/whale-dialogues.csv.
  3. The language model itself is a decoder-only transformer with 126k parameters that autoregressively models "Coda1", "Ornamentation1", "Duration1", "Coda2", "Ornamentation2" and "Duration2". Each incremental output of these variables is generated from the previous 25 values of all of these variables (see the third sketch below). We use the sequifier package, which enables easy configuration, training and inference for models of this type.
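
To make the coda-extraction step concrete, here is a minimal, illustrative sketch of the scoring scheme. It is not the repository's actual implementation (that lives in scripts/0_extract_codas.py); the coda templates and the surplus-click penalty used here are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical coda templates: label -> mean relative click-time sequence
# (mean time percentile of each click). Real templates come from
# scripts/00_create_coda_means.py.
CODA_TEMPLATES = {
    "4R": [0.0, 0.33, 0.66, 1.0],
    "5R": [0.0, 0.25, 0.5, 0.75, 1.0],
}
SURPLUS_PENALTY = 1.0  # assumed cost of labelling a click as 'surplus'


def relative_positions(click_times):
    """Rescale absolute click times within a window to the [0, 1] interval."""
    t = np.asarray(click_times, dtype=float)
    return (t - t[0]) / (t[-1] - t[0])


def manhattan_score(click_times, template):
    """Manhattan distance between a candidate click window and a coda template."""
    return float(np.abs(relative_positions(click_times) - np.asarray(template)).sum())


def best_segmentation(click_times, start=0, memo=None):
    """Recursively split a vocalization into codas and surplus clicks,
    returning (total_score, [(label, start, end), ...]) with minimal total score."""
    if memo is None:
        memo = {}
    if start >= len(click_times):
        return 0.0, []
    if start in memo:
        return memo[start]
    # Option A: treat the next click as a surplus (ornamentation) click.
    rest_score, rest_segs = best_segmentation(click_times, start + 1, memo)
    best = (SURPLUS_PENALTY + rest_score, [("surplus", start, start + 1)] + rest_segs)
    # Option B: a coda template starts at this click.
    for label, template in CODA_TEMPLATES.items():
        end = start + len(template)
        if end > len(click_times):
            continue
        score = manhattan_score(click_times[start:end], template)
        rest_score, rest_segs = best_segmentation(click_times, end, memo)
        candidate = (score + rest_score, [(label, start, end)] + rest_segs)
        if candidate[0] < best[0]:
            best = candidate
    memo[start] = best
    return best
```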
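
The second sketch illustrates the 'me' vs 'other' dialogue encoding from step 2. The real logic lives in scripts/1_create_dialogue.py; the input record format here is an assumption made for illustration, while the 0.3-second simultaneity threshold, the column names and the silence token '98' follow the description above.

```python
SILENCE = 98          # token used when a side emits no coda in a row
SIMULTANEITY_S = 0.3  # onsets closer than this are treated as simultaneous


def build_dialogue_rows(codas, focal_whale):
    """codas: list of dicts with keys 'whale', 'start', 'coda', 'ornamentation'
    and 'duration', sorted by 'start'. Returns one row per (near-)simultaneous
    group of onsets, encoded from the perspective of `focal_whale`."""
    rows = []
    i = 0
    while i < len(codas):
        # Collect all codas whose onsets fall within the simultaneity window.
        group = [codas[i]]
        while i + 1 < len(codas) and codas[i + 1]["start"] - group[0]["start"] < SIMULTANEITY_S:
            i += 1
            group.append(codas[i])
        # For simplicity, take the first matching coda on each side.
        mine = next((c for c in group if c["whale"] == focal_whale), None)
        other = next((c for c in group if c["whale"] != focal_whale), None)
        rows.append({
            "Coda1": mine["coda"] if mine else SILENCE,
            "Ornamentation1": mine["ornamentation"] if mine else SILENCE,
            "Duration1": mine["duration"] if mine else SILENCE,
            "Coda2": other["coda"] if other else SILENCE,
            "Ornamentation2": other["ornamentation"] if other else SILENCE,
            "Duration2": other["duration"] if other else SILENCE,
        })
        i += 1
    return rows
```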
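
The third sketch illustrates the autoregressive setup in step 3: each step's six variables are predicted from the previous 25 steps of all six variables. The actual configuration, training and inference are handled by sequifier, so this is purely explanatory.

```python
import numpy as np

CONTEXT = 25
COLUMNS = ["Coda1", "Ornamentation1", "Duration1",
           "Coda2", "Ornamentation2", "Duration2"]


def make_training_windows(dialogue):
    """dialogue: array of shape (n_steps, 6), one column per variable in COLUMNS.
    Yields (context, target) pairs with context of shape (25, 6) and target of
    shape (6,), i.e. the next step predicted from the 25 preceding steps."""
    x = np.asarray(dialogue)
    for t in range(CONTEXT, len(x)):
        yield x[t - CONTEXT:t], x[t]
```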

Several generated "whale dialogues" can be found in outputs/predictions/sequifier-dialogue-best-5000-predictions.csv, where each sequenceId value corresponds to a single "dialogue". Currently, most of them end up with both whales repeatedly emitting the coda "5", either while the other whale is silent or at the same time.
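
One possible way to inspect these generated dialogues, assuming the prediction CSV carries the same column names as the modelled variables (pandas is used here only for illustration and is not a stated dependency of this repo):

```python
import pandas as pd

preds = pd.read_csv(
    "outputs/predictions/sequifier-dialogue-best-5000-predictions.csv"
)
# Each sequenceId is one generated "dialogue"; the coda columns shown here
# are assumed to match the modelled variable names.
for seq_id, dialogue in preds.groupby("sequenceId"):
    print(seq_id)
    print(dialogue[["Coda1", "Coda2"]].to_string(index=False))
```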

The roughly 9k observations contained in this dataset are clearly insufficient to create a useful model, but this modelling work should serve as an encouragement for additional data collection and a basis for future model development.

The development of this model can be reproduced with the following steps on macOS (and likely most Linux distributions). Since all intermediate artefacts are also contained in this repository, the later steps can also be executed individually, after first running steps 1, 2, 3 and 7.

  1. conda create --name whale-gpt python=3.11 -y
  2. conda activate whale-gpt
  3. pip install sequifier==0.4.0.0 scikit-learn
  4. python scripts/00_create_coda_means.py
  5. python scripts/0_extract_codas.py
  6. python scripts/1_create_dialogue.py
  7. sequifier preprocess
  8. sequifier train
  9. sequifier infer
