Semantle, created by David Turner, is an interesting spin on the recent trend of word guessing games. Instead of comparing spellings, Semantle scores each guess by its semantic similarity to the secret word. Here I threw together a bit of code to visually explore a game of Semantle using a UMAP representation of the underlying word2vec word embedding. You can follow along with my bumbling Semantle guesses or, probably more fun, visualize your own.
All of the packages used here can be installed with pip; all except Babyplots are also available via conda.
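For reference, a single pip command along these lines should cover the dependencies (the package names are my assumption of the PyPI names, in particular umap-learn for the umap import):

pip install gensim pandas numpy umap-learn tqdm babyplots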
# import re
# from hashlib import sha1
import gensim
import pandas as pd
import umap.umap_ as umap
import numpy as np
from numpy import log10
from tqdm import tqdm
from babyplots import Babyplot
First we load the full word2vec model (not included in the repository, but you can find it here).
# model = gensim.models.KeyedVectors.load_word2vec_format("../GoogleNews-vectors-negative300.bin", binary=True)
Semantle has a list of allowed words and a list of banned words. We need to filter the model by these so that our UMAP is representative (filtering code adapted from the original Semantle source code). These lists are also not included in this repository, but you can find them in the Semantle repository.
# allowable_words = set()
# with open("words_alpha.txt") as walpha:
#     for line in walpha.readlines():
#         allowable_words.add(line.strip())

# banned_hashes = set()
# with open("banned.txt") as f:
#     for line in f:
#         banned_hashes.add(line.strip())

# simple_word = re.compile("^[a-z]*$")
# words = []
# for word in model.key_to_index:
#     if simple_word.match(word) and word in allowable_words:
#         h = sha1()
#         h.update(("banned" + word).encode("ascii"))
#         hash = h.hexdigest()
#         if hash not in banned_hashes:
#             words.append(word)

# len(words)
Now we create a subset of the word2vec model with just the allowed vectors (this takes quite a while, because I add the vectors one by one instead of in batches, which would be faster; a sketch of a batch variant follows below).
# w2v_allowed = gensim.models.keyedvectors.KeyedVectors(300)
# for word in tqdm(words):
#     v = model.get_vector(word)
#     w2v_allowed.add_vector(word, v)
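As an aside, the loop above could probably be replaced by a single batch call; a minimal sketch, assuming the gensim 4.x KeyedVectors API (add_vectors) and the full model and words list from the cells above, commented out here like the rest of the one-time preprocessing:

# w2v_allowed = gensim.models.keyedvectors.KeyedVectors(300)
# vectors = model[words]  # one (len(words), 300) array instead of many single lookups
# w2v_allowed.add_vectors(words, vectors)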
Save it, so we don't have to do it again.
# w2v_allowed.save_word2vec_format("allowed_word2vec.bin", binary=True)
Now we load the model with only the allowed words.
model = gensim.models.KeyedVectors.load_word2vec_format("allowed_word2vec.bin", binary=True)
This example is based on Semantle #53, where the secret word was "shot". So we first get the 1000 closest words to "shot".
secret_word = "shot"
topn = 1000
top_words = model.most_similar(secret_word, topn=topn)
top_words.insert(0, (secret_word, 1))
included_words = [x[0] for x in top_words]
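As a quick sanity check, these similarities are what Semantle (to my understanding) reports as the score, scaled by 100; for example, for the runner-up in top_words:

# Cosine similarity between the secret word and a close word, scaled like a Semantle score
# (the x100 scaling is my assumption about how the game reports it).
print(model.similarity(secret_word, "shots") * 100)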
Now we load my guesses. You can also do this for your own Semantle attempts by replacing the words in guesses.txt (and the secret word).
guesses = []
with open("guesses.txt", "r") as guesses_file:
for line in guesses_file:
guesses.append(line.rstrip("\n"))
Next, we get the 1000 closest words around each guess, keeping only words that are not already included.
n_words_around_guess = 1000
for w in tqdm(guesses):
    if w in included_words:
        continue
    try:
        words_around_guess = model.most_similar(w, topn=n_words_around_guess)
        words_around_guess = [x for x in words_around_guess if x[0] not in included_words]
        top_words += words_around_guess
        included_words += [x[0] for x in words_around_guess]
    except KeyError:
        continue
len(top_words)
100%|██████████| 60/60 [00:04<00:00, 12.51it/s]
20826
Put the secret word at the end of the guess list, so that the plotted path of guesses ends at the solution.
guesses.append(secret_word)
Here, we get the vectors for the selected words...
model_reduced = model[[w[0] for w in top_words]]
model_reduced.shape
(20826, 300)
... and run the UMAP dimensionality reduction.
reducer = umap.UMAP(metric='cosine', n_neighbors=15, min_dist=0.05, random_state=42, n_components=3)
embedding = reducer.fit_transform(model_reduced)
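Since UMAP on roughly 20,000 points can take a while, you may want to cache the result; a small optional sketch (the file name is just an example):

np.save("embedding.npy", embedding)      # cache the reduced coordinates
# embedding = np.load("embedding.npy")   # reload later instead of re-running UMAP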
Finally, we create a dataframe to organize the data for visualization...
d = pd.DataFrame(embedding, columns=['umap1', 'umap2', 'umap3'])
d['word'] = [w[0] for w in top_words]
d['similarity'] = [w[1] for w in top_words]
d['log_similarity'] = d['similarity'].apply(log10)
d['word_index'] = np.arange(len(d)) + 1
d['log_word_index'] = d['word_index'].apply(log10)
d['word_index_rev'] = len(d) - d['word_index']
d['log_word_index_rev'] = 1 - d["log_word_index"]
d.head()
|   | umap1 | umap2 | umap3 | word | similarity | log_similarity | word_index | log_word_index | word_index_rev | log_word_index_rev |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.525807 | 8.440066 | 6.365793 | shot | 1.000000 | 0.000000 | 1 | 0.000000 | 20825 | 1.000000 |
| 1 | 11.597922 | 8.653831 | 6.365990 | shots | 0.694082 | -0.158589 | 2 | 0.301030 | 20824 | 0.698970 |
| 2 | 11.313740 | 8.286745 | 6.431290 | shooting | 0.646509 | -0.189426 | 3 | 0.477121 | 20823 | 0.522879 |
| 3 | 11.395639 | 8.349465 | 6.409628 | shoot | 0.602124 | -0.220314 | 4 | 0.602060 | 20822 | 0.397940 |
| 4 | 11.395675 | 8.422284 | 6.300252 | fired | 0.552951 | -0.257313 | 5 | 0.698970 | 20821 | 0.301030 |
... get the coordinates of the guessed words ...
guesses = sorted(set([g for g in guesses if g in d["word"].tolist()]), key=guesses.index)
d_guesses = d.loc[d["word"].isin(guesses)]
d_guesses = d_guesses.set_index("word").loc[guesses].reset_index()
d_guesses["order"] = d_guesses.index
... and create the babyplots visualization of the UMAP and the guessed path. Drag the mouse over the plot to rotate and shift+scroll to zoom in and out.
bp = Babyplot(background_color="#262020ff")
bp.add_plot_from_dataframe(
    d,
    "pointCloud",
    "values",
    "log_word_index_rev",
    ["umap1", "umap2", "umap3"],
    {
        "colorScale": "Spectral"
    }
)
bp.add_plot_from_dataframe(d_guesses, "line", "values", "order", ["umap1", "umap2", "umap3"], {
    "colorScale": "YlGnBu",
    "labels": d_guesses["word"].tolist(),
    "labelSize": 80,
    "labelColor": "white",
    "colorScaleInverted": True
})
bp