Word2Vec Model¶
Date: July 2024
Status: Ongoing experiment
People involved: Leon van Wissen
Citation:
- van Wissen, Leon, and GLOBALISE. “GLOBALISE Word2vec Experiment”. Zenodo, March 17, 2025. https://doi.org/10.5281/zenodo.15038313.
Introduction¶
We trained a Word2Vec model on the GLOBALISE Transcriptions, creating vector representations of words based on their context. By leveraging this model you can:
- Find Spelling Variants and Synonyms: Discover alternative spellings or synonyms of a word by identifying those with similar vector representations. This is particularly useful for early modern texts with inconsistent orthography.
- Contextual Similarity: Locate words that frequently appear in similar contexts, shedding light on semantic relationships. For instance, the term `plantage` (plantation) might reveal associations with specific crops or geographic regions.
- Advanced Semantic Queries: Perform tasks such as analogy generation (e.g., `noten` is to `banda` as [X] is to `ceylon`) and compute word similarities. These functionalities help researchers uncover patterns and insights from the corpus that are difficult to detect manually.
This notebook guides you through loading and running our pretrained model and provides some examples of queries.
Short User Guide¶
Option 1 (Google Colab):
- Open this notebook in Google Colab.
- Run the cells below to load the model and start querying the corpus.
Option 2 (local):
Download this notebook and our pretrained model, follow the cells below, and start exploring the corpus.
Download links:
- Notebook: https://github.com/globalise-huygens/lab.globalise.huygens.knaw.nl/blob/main/docs/experiments/GLOBALISE_Word2Vec_Lab.ipynb
- Pretrained model (100 dimensions, 645MB): https://surfdrive.surf.nl/files/index.php/s/XmUIlsy33vpRdCX
Note: download and unzip the `GLOBALISE.word2vec.zip` file into a `data` directory in the same folder as this notebook for the cells below to work. The cells below will do this for you if you have the `wget` and `unzip` commands available.
If you haven’t used Jupyter notebooks before, we recommend looking up a user guide online. Anaconda is an easy-to-use package. Make sure you have the required libraries (such as Gensim) installed in the Python environment you’re using.
Download the Pretrained Model and install Gensim¶
The cell below downloads the pretrained model and installs the Gensim library. If you are running this notebook locally, you can also download the model from the link above and place it in a `data` directory in the same folder as this notebook.
! wget --content-disposition https://surfdrive.surf.nl/files/index.php/s/XmUIlsy33vpRdCX/download
! unzip GLOBALISE.word2vec.zip
! pip install gensim==4.3.3 -U
Something recently changed in Google Colab, so if you're running the notebook there, also execute the cell below. This will interrupt the session and restart it with the correct packages installed. Please note that you will get a message that "Your session crashed for an unknown reason", which can be ignored.
Alternatively, click 'Runtime' in the top menu, then 'Restart session', and continue with the rest of the notebook.
# Colab only
exit()
import os
import sys
import logging
import pickle
from gensim.models import Word2Vec, KeyedVectors
vector_size = 100
w2v = KeyedVectors.load_word2vec_format(f"data/GLOBALISE_{vector_size}.word2vec")
Analyzing the Corpus¶
Below are some examples of how to use the model. You can substitute the words with any word you like, as long as it is in the vocabulary of the model/corpus. Everything needs to be in lowercase. See the Gensim documentation for more information on how to use the Word2Vec model: https://radimrehurek.com/gensim/models/word2vec.html.
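If you are not sure whether a word occurs in the model, you can check the vocabulary first. The cell below is a minimal sketch for doing so, assuming the model has been loaded into `w2v` as above; the example word `plantage` is only an illustration and may or may not be present in the vocabulary.

# Optional: check whether a (lowercased) word is in the model's vocabulary
query = "plantage"  # illustrative example; replace with any word you want to look up

if query in w2v.key_to_index:
    print(f"'{query}' is in the vocabulary ({len(w2v.key_to_index):,} words in total)")
else:
    print(f"'{query}' is not in the vocabulary; try a spelling variant")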
The first cells use the `most_similar` function of the Word2Vec model (`w2v`) to find and print the `topn` most similar words to a given word.
for i in w2v.most_similar("pantchialang", topn=100):
    print(i[0], end=" | ")
pantchialling | pantjall | dehaij | pantch | depantjall | patchiall | pantchiall | challang | debijl | noodhulp | goudsoeker | pantsch | haaij | tapko | pantchialt | jaarvogel | depantchiall | jongedirk | buijtel | krankte | windbuijl | depantjallang | patchiallang | zuykermaalder | pantchallang | depantch | onbeschaamdh | copjagt | chialling | patchalling | boshaan | pantchiallings | salpetersoeker | overmaas | pantjalang | bonneratte | chialop | onbeschaamtheijt | pantc | patchall | patjallang | arnoldina | losboots | pantchall | desnoek | zijdeteeld | woelwater | suijkermaalder | bancq | depatchiall | kruisser | depant | debarcq | nacheribon | sorgdrager | zijdewoom | glisgis | beschutter | vantchiall | delosboot | garnaal | chailoup | beschermer | zordaan | galwet | casuaris | pandjallang | casuarus | pantj | schipio | galeij | oostendenaer | ontang | patch | burk | losboot | smapt | panthialling | bethij | breguantijn | depatch | coffijthuijn | pantsjall | contong | moesthuijn | ramsgatte | jallang | zuijerbeek | onbeschaamtheijd | pantchalling | panthiallang | pittoor | zuijkermaalder | chialoop | tanjongpour | vrctoria | vesuvius | pinxterbloem | chiloup | pantschiallang |
for i in w2v.most_similar("intje", topn=100):
    print(i[0], end=" | ")
jntje | maleijer | dul | maleyer | anachoda | bappa | salim | jntie | malijer | malaijer | malim | boeginees | jurragan | parnakan | iavaan | iuragan | intie | cadier | sadulla | carim | mochamat | abdul | samat | parnackan | javaan | arabier | assan | nachoda | javaen | soedin | bouginees | mohamat | abdulla | achmat | talip | iurragan | inw | kinko | balier | zait | jnw | lim | sleman | juragan | saijit | garrang | rahim | bagus | oeij | tjina | anach | njo | jabar | boeang | tjan | mahama | karim | boeijong | aboe | jnwoonder | ganie | campar | tja | garang | balijer | troena | kamat | mallijer | anak | chin | sait | cassim | machoda | boejong | soekoer | roekoe | nio | samara | oemar | poea | lebe | hoko | miskien | vrijbalier | maijang | hoeko | salee | sech | samsoe | boegenees | naghoda | koko | gonting | tenoedin | mandarees | oesien | troeno | draman | sinko | jamal |
for i in w2v.most_similar("caneel", topn=100):
    print(i[0], end=" | ")
canneel | arreecq | arreeck | cardamom | geschilden | balen | cardaman | cardamon | areecq | arreek | geschilt | overjarigen | areek | geschilde | ruijnas | bast | kurkuma | wortelen | wortel | cardemom | cannel | saffragamse | arreeq | groven | jndigo | incorrecte | ammenams | schelders | plantjes | curcuma | areeck | ougsten | affpacken | zaije | runas | schillens | moernagelen | cauwa | wortels | smakeloose | koehuijden | klenen | indigo | gekookten | zalpeter | canneer | saije | calpentijnsen | cragtelose | endeneese | canneelschilders | cheijlonsen | kannee | reuck | baelen | baalen | kanneel | pingos | sacken | varssen | anijl | ruinas | ammonams | tabacq | zaat | cauris | amm | ruias | cardanom | fijnen | cardamam | coffijbonen | cardamoin | arreck | bhaalen | zaijen | nagelen | caneell | embaleeren | bladeren | berberijen | coffijboonen | overjarige | kleenen | fordeelen | zaad | onrijpe | noten | pken | specerije | gamsen | geschild | caaneel | roggevellen | endeneesche | ingesamelden | oliteiten | peerlen | pepen | elijhanten |
The cell below takes the 100 most similar words to "amsterdam" and prints those with a similarity score of 0.4 or higher.
for i, p in w2v.most_similar("amsterdam", topn=100):
    if p >= 0.4:
        print(i, end=" | ")
sterdam | middelburg | amsterd | amst | zeeland | amster | amsterdm | amstm | rotterdam | delft | amsteldam | enkhuijsen | zeland | middelburgh | utrecht | ams | amsterda | amste | gravenhage | terdam | zeelant | zeiland | derwapen | enchuijsen | dam | delff | maddelburg | middelb | enckhuijsen | amstedam | enkhuijzen | aamsterdam | delfft | presidiale | enkhuisen | seeland | enckhuijzen | geredreseert | vlissingen | rdam | praesidiale | amsterdan | hage | costeux | zeelandt | wappan | hoorn | rotterdant | delburg | delf | delst | behangsels | inzeland | middelbrerg | enkhuizen | proefidiaale | praecidiale | ceulen | boodh | caamer | enckhuijs | dewees | behanghsel | amsterstam | temiddelburg | enkhuysen | zieland | alkmaar | meddelburg | cognoissemet | rotter | sdh | carode | uijtgevaren | middelburgin | kameer | delvt | leijden | zeel | praesideale | amstd | uijtregt | utregt | hoplooper | enchuy | terkamer | rabbinel | vlissinge | diale | kaner | veere | arnhem | confernee | praesidiaale | haarlem | kamier | enehuysen | siemermeer | middeburg | amstdam |
The following cell again uses the `most_similar` function of the Word2Vec model (`w2v`), this time to find and print words similar to a given set of "positive" and "negative" terms. The vector representations of the `positive` words contribute positively to the similarity computation, while those of the `negative` words contribute negatively. In this example, we use this method to find words similar to "weder" in the meaning of "weather", and not in the meaning of "again".
for i in w2v.most_similar(
    positive=["weder", "weer", "regen"], negative=["wederom", "alweder"], topn=100
):
    print(i[0], end=" | ")
weir | reegen | wint | zeewint | lugt | windt | noorde | winden | waaijende | stroom | buijen | winde | sneeuw | doorwaijen | zuijde | lucht | regenbuijen | waijende | wind | coeltjens | suijdweste | koude | vlagen | handsaem | weerligt | dewind | regenagtig | tegenstroom | doorwaijende | sonneschijn | regenen | stilte | koelte | regens | coelte | lught | hitte | stijve | lughje | zeewind | wintje | weste | warme | onstuijmig | reegenen | stroomen | koelten | zonnestraalen | delugt | warmte | handsaam | buijdige | travaden | doorbreken | inbreeken | moussom | doorwaaijende | reegende | travadig | doorstaande | doorkomende | hette | buijig | luchje | felle | afwatering | starke | kentering | overdag | stormwinden | reegens | wzw | westelijke | vloet | variable | coeltje | calte | tegenwinden | ooste | goedweer | oostelijke | noordweste | zot | waaijde | deijning | aartrijk | noordelijk | valwinden | ongestadige | doorwaaijen | slijve | suijde | caelte | lugties | firmament | regende | coeste | travodig | coelende | doorbrake |
Ships in the Dutch East India Company (VOC) fleet were often named after places. The following cell uses the `closer_than` function to find all words in the Word2Vec model's vocabulary whose vector representations are closer to a specified word ("eendracht", here meant as the name of a ship) than to another word ("tilburg", here meant as the name of a place) in terms of cosine similarity. This identifies words that share a stronger contextual association with "eendracht" (the ship) than with "tilburg" (the place), which ideally filters out terms referring to places and yields a list of potential ship names, or at least of words that are more likely to be associated with ships.
words = w2v.closer_than("eendracht", "tilburg")
output = " | ".join(words[:100])
print(output)
ende | naer | oock | noch | int | nae | schepen | retour | vant | camer | gecomen | fluijt | volck | becomen | welck | jacht | hoorn | japan | rotterdam | coninck | ditto | jagt | wint | compe | godt | lant | eijlanden | derwaerts | end | vertreck | landt | goa | geladen | stadt | tschip | bat | comende | maent | opt | chaloup | maecken | ladinge | japara | delft | oocq | gearriveert | genaemt | gemaeckt | weijnich | coningh | rhede | langh | waermede | daermede | ene | macht | ancker | originele | jnt | eijlant | nassauw | augustij | vrede | quartieren | wapen | voijagie | cattij | middagh | achter | opde | vaderlant | portugees | geseijde | leeuw | dirck | cargasoen | verwachten | mauritius | rijck | chialoep | dach | namentlijck | eijlandt | geladene | vlissingen | jachten | battavia | gelyck | seecker | wingurla | gescheept | amst | portugesen | iapan | comste | stondt | nederlants | arent | nacht | vercocht
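Besides ranking neighbours, you can also compute the cosine similarity between two individual words directly, for instance to check how the two reference words above relate to each other. The sketch below uses Gensim's `similarity` function; the exact scores depend on the trained model, and the second pair ("eendracht" and "hoorn") is only an illustration.

# Cosine similarity between individual word pairs
print(w2v.similarity("eendracht", "tilburg"))
print(w2v.similarity("eendracht", "hoorn"))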
Analogy generation can provide insights into historical semantics and how certain terms relate to one another in specific domains. To do this, you can use the `most_similar` function with a combination of "positive" and "negative" word vectors. For example, running the following cell yields the ten best-fitting words (based on their vector representations) for the analogy "noten is to banda as [X] is to ceylon".
results = w2v.most_similar(positive=["noten", "ceylon"], negative=["banda"], topn=10)
print("'noten' is to 'banda' as the following are to 'ceylon':")
for word, similarity in results:
    print(f"{word} (similarity: {similarity:.4f})")
'noten' is to 'banda' as the following are to 'ceylon':
cardamon (similarity: 0.5871)
baalen (similarity: 0.5764)
ruijnas (similarity: 0.5662)
baaltjes (similarity: 0.5633)
kardamom (similarity: 0.5629)
chiancossen (similarity: 0.5529)
cardamam (similarity: 0.5451)
wortelen (similarity: 0.5399)
caneel (similarity: 0.5394)
cardaman (similarity: 0.5378)
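The remainder of this notebook documents how the pretrained model was created: downloading the GLOBALISE transcriptions, pre-processing the text files, and training the Word2Vec model with Gensim. You only need to run the cells below if you want to reproduce or adapt the training yourself; for querying the pretrained model, the cells above suffice.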
import os
import sys
import logging
import pickle
from gensim.models import Word2Vec, KeyedVectors
from gensim.corpora.textcorpus import TextDirectoryCorpus
from gensim.corpora.dictionary import Dictionary
from gensim.parsing.preprocessing import (
    remove_stopword_tokens,
    remove_short_tokens,
    lower_to_unicode,
    strip_multiple_whitespaces,
)
from gensim.utils import deaccent, simple_tokenize, effective_n_jobs
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)
logging.getLogger().setLevel(logging.INFO)
# Setting
vector_size = 100
Downloading the Data¶
The data can be downloaded from the GLOBALISE Dataverse: https://datasets.iisg.amsterdam/dataverse/globalise. For this experiment, we’re working with version v2.0 of the transcriptions dataset:
- GLOBALISE project, 2024, "VOC transcriptions v2 - GLOBALISE", https://hdl.handle.net/10622/LVXSBW, IISH Data Collection
The project conveniently provides a file with pointers to all txt files in this dataset, which we can use to download them automatically with `wget`. First we download the file with the pointers, then we use it to fetch all txt files. This can take a while.
! mkdir -p data && wget https://datasets.iisg.amsterdam/api/access/datafile/33172?gbrecs=true -O data/globalise_transcriptions_v2_txt.tab --content-disposition
! mkdir -p data/txt && wget -i data/globalise_transcriptions_v2_txt.tab -P data/txt/ --content-disposition
Pre-processing¶
We now have a collection of text files, in which each file contains the text of one inventory number.
The files need a bit of pre-processing before we can work with them. What needs to be done:
- Remove all lines starting with `#+`. These are comments and not part of the text.
def preprocess_txt(file_path):
    print("Processing", file_path)

    # Open the textfile
    with open(file_path) as infile:
        text = infile.read()

    lines = []
    for line in text.split("\n"):
        if line.startswith("#+ "):
            continue
        else:
            lines.append(line)

    text = "\n".join(lines)

    # Save the cleaned version
    with open(file_path, "w") as outfile:
        outfile.write(text)
FOLDER = "data/txt"
for f in os.listdir(FOLDER):
    filepath = os.path.join(FOLDER, f)
    preprocess_txt(filepath)
Processing¶
Now that we have the data in a usable format, we can start processing it. We will use the Gensim library to train a Word2Vec model on the text data. For this, we first create a corpus object that will be used to feed text to the model. We use a custom implementation of the `gensim.corpora.textcorpus.TextDirectoryCorpus` class so that the vocabulary is not pruned at the default cutoff (the standard `prune_at` setting).
logger = logging.getLogger(__name__)
class CustomTextDirectoryCorpus(TextDirectoryCorpus):
    """
    Custom class to set the `prune_at` gensim.Dictionary parameter.
    """

    def __init__(
        self,
        input,
        dictionary=None,
        metadata=False,
        character_filters=None,
        tokenizer=None,
        token_filters=None,
        min_depth=0,
        max_depth=None,
        pattern=None,
        exclude_pattern=None,
        lines_are_documents=False,
        encoding="utf-8",
        dictionary_prune_at=2_000_000,
        **kwargs,
    ):
        self._min_depth = min_depth
        self._max_depth = sys.maxsize if max_depth is None else max_depth
        self.pattern = pattern
        self.exclude_pattern = exclude_pattern
        self.lines_are_documents = lines_are_documents
        self.encoding = encoding
        self.dictionary_prune_at = dictionary_prune_at

        self.input = input
        self.metadata = metadata

        self.character_filters = character_filters
        if self.character_filters is None:
            self.character_filters = [
                lower_to_unicode,
                deaccent,
                strip_multiple_whitespaces,
            ]

        self.tokenizer = tokenizer
        if self.tokenizer is None:
            self.tokenizer = simple_tokenize

        self.token_filters = token_filters
        if self.token_filters is None:
            self.token_filters = [remove_short_tokens, remove_stopword_tokens]

        self.length = None
        self.dictionary = None
        self.init_dictionary(dictionary)

        super(CustomTextDirectoryCorpus, self).__init__(
            input, self.dictionary, metadata, **kwargs
        )

    def init_dictionary(self, dictionary):
        """Initialize/update dictionary.

        Parameters
        ----------
        dictionary : :class:`~gensim.corpora.dictionary.Dictionary`, optional
            If a dictionary is provided, it will not be updated with the given corpus on initialization.
            If None - new dictionary will be built for the given corpus.

        Notes
        -----
        If self.input is None - make nothing.

        """
        self.dictionary = dictionary if dictionary is not None else Dictionary()

        if self.input is not None:
            if dictionary is None:
                logger.info("Initializing dictionary")
                metadata_setting = self.metadata
                self.metadata = False
                self.dictionary.add_documents(
                    self.get_texts(), prune_at=self.dictionary_prune_at
                )
                self.metadata = metadata_setting
            else:
                logger.info("Input stream provided but dictionary already initialized")
        else:
            logger.warning(
                "No input document stream provided; assuming dictionary will be initialized some other way."
            )
class SentencesIterator:
    def __init__(self, generator_function):
        self.generator_function = generator_function
        self.generator = self.generator_function()

    def __iter__(self):
        # reset the generator
        self.generator = self.generator_function()
        return self

    def __next__(self):
        result = next(self.generator)
        if result is None:
            raise StopIteration
        else:
            return result
With the above code we can generate our own `corpus` object with a larger dictionary size than Gensim's default. We set it to 20M, since we are also interested in the less frequently occurring words (e.g. spelling variants). We can filter on minimum frequency later.
corpus = CustomTextDirectoryCorpus(FOLDER, dictionary_prune_at=20_000_000)
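As an optional check, you can inspect how many unique tokens ended up in the corpus dictionary; the exact number depends on the data you downloaded.

# Number of unique tokens collected in the corpus dictionary
print(len(corpus.dictionary))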
Now let’s save the corpus object to disk, so we can reuse it later without having to re-run the pre-processing steps. Comment or uncomment the respective lines below to either save the corpus object or load it from disk.
with open("data/corpus.pkl", "wb") as f:
    pickle.dump(corpus, f)

# with open("data/corpus.pkl", "rb") as f:
#     corpus = pickle.load(f)
The next step is to train the Word2Vec model. The model needs to iterate over the corpus multiple times (once to build the vocabulary and once per training epoch), so we wrap the corpus’s `get_texts` generator in a restartable iterator:
texts = SentencesIterator(corpus.get_texts)
Now, let’s create a Word2Vec embedding. You can set the number of workers to your CPU count (minus 1). Again, this can take a while.
You can experiment with the parameters of the Word2Vec model, such as the vector size, window size, and minimum frequency, but this can lead to a bigger model, longer training time, and not necessarily better results.
workers = effective_n_jobs(max(os.cpu_count() - 1, 1))
w2v = Word2Vec(
    texts, vector_size=vector_size, window=5, min_count=5, workers=workers, epochs=5
)
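As a quick sanity check, you can query the freshly trained model in the same way as the pretrained one at the top of this notebook; the trained vectors live in `w2v.wv`. The query word below is only an illustration and assumes it occurs in the corpus vocabulary.

# Query the newly trained vectors (the KeyedVectors live in w2v.wv)
for word, score in w2v.wv.most_similar("amsterdam", topn=10):
    print(word, round(score, 3), sep="\t")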
Now, let’s save the embedding for future use.
w2v.wv.save_word2vec_format(f"data/GLOBALISE_{vector_size}.word2vec")