Word2Vec Model¶
Date: July 2024
Status: Ongoing experiment
People involved: Leon van Wissen
Citation:
- van Wissen, Leon, and GLOBALISE. “GLOBALISE Word2vec Experiment”. Zenodo, March 17, 2025. https://doi.org/10.5281/zenodo.15038313.
Introduction¶
We trained a Word2Vec model on the GLOBALISE Transcriptions, creating vector representations of words based on their context. By leveraging this model you can:
- Find Spelling Variants and Synonyms: Discover alternative spellings or synonyms of a word by identifying those with similar vector representations. This is particularly useful for early modern texts with inconsistent orthography.
- Contextual Similarity: Locate words that frequently appear in similar contexts, shedding light on semantic relationships. For instance, the term `plantage` (plantation) might reveal associations with specific crops or geographic regions.
- Advanced Semantic Queries: Perform tasks such as analogy generation (e.g., `noten` is to `banda` as [X] is to `ceylon`) and compute word similarities. These functionalities help researchers uncover patterns and insights from the corpus that are difficult to detect manually.
This notebook guides you through loading and running our pretrained model and provides some examples of queries.
Short User Guide¶
Option 1 (Google Colab):
- Open this notebook in Google Colab.
- Run the cells below to load the model and start querying the corpus.
Option 2 (local):
Download this notebook and our pretrained model, follow the cells below, and start exploring the corpus.
Download links:
- Notebook: https://github.com/globalise-huygens/lab.globalise.huygens.knaw.nl/blob/main/docs/experiments/GLOBALISE_Word2Vec_Lab.ipynb
- Pretrained model (100 dimensions, 645MB): https://surfdrive.surf.nl/files/index.php/s/XmUIlsy33vpRdCX
Note: download and unzip the `GLOBALISE.word2vec.zip` file into a `data` directory in the same folder as this notebook for the cells below to work. The cells below will do this for you if you have the `wget` and `unzip` commands available.
If you haven’t used Jupyter notebooks before, we recommend looking up a user guide online. Anaconda is an easy-to-use package. Make sure you have the required libraries (such as Gensim) installed in the Python environment you’re using.
Download the Pretrained Model and install Gensim¶
The cell below downloads the pretrained model and installs the Gensim library. If you are running this notebook locally, you can also download the model from the link above and place it in a `data` directory in the same folder as this notebook.
! wget --content-disposition https://surfdrive.surf.nl/files/index.php/s/XmUIlsy33vpRdCX/download
! unzip GLOBALISE.word2vec.zip
! pip install gensim==4.3.3 -U
Something recently changed in Google Colab, so if you're running the notebook there, also execute the cell below. This will interrupt the session and restart it with the correct packages installed. Please note that you will get a message that "Your session crashed for an unknown reason", which can be ignored.
Alternatively, click 'Runtime' in the top menu, then 'Restart session', and continue with the rest of the notebook.
# Colab only
exit()
import os
import sys
import logging
import pickle
from gensim.models import Word2Vec, KeyedVectors
vector_size = 100
w2v = KeyedVectors.load_word2vec_format(f"data/GLOBALISE_{vector_size}.word2vec")
Analyzing the Corpus¶
Below are some examples of how to use the model. You can substitute the words with any word you like, as long as it is in the vocabulary of the model/corpus. Everything needs to be in lowercase. See the Gensim documentation for more information on how to use the Word2Vec model: https://radimrehurek.com/gensim/models/word2vec.html.
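If you are not sure whether a word occurs in the model, you can check the vocabulary first. The cell below is a minimal sketch for doing so, assuming the model has been loaded into `w2v` as above; the example word `plantage` is only an illustration and may or may not be present in the vocabulary.

# Optional: check whether a (lowercased) word is in the model's vocabulary
query = "plantage"  # illustrative example; replace with any word you want to look up

if query in w2v.key_to_index:
    print(f"'{query}' is in the vocabulary ({len(w2v.key_to_index):,} words in total)")
else:
    print(f"'{query}' is not in the vocabulary; try a spelling variant")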
The first cells use the `most_similar` function of the Word2Vec model (`w2v`) to find and print the `topn` most similar words to a given word.
for i in w2v.most_similar("pantchialang", topn=100):
    print(i[0], end=" | ")
pantchialling | pantjall | dehaij | pantch | depantjall | patchiall | pantchiall | challang | debijl | noodhulp | goudsoeker | pantsch | haaij | tapko | pantchialt | jaarvogel | depantchiall | jongedirk | buijtel | krankte | windbuijl | depantjallang | patchiallang | zuykermaalder | pantchallang | depantch | onbeschaamdh | copjagt | chialling | patchalling | boshaan | pantchiallings | salpetersoeker | overmaas | pantjalang | bonneratte | chialop | onbeschaamtheijt | pantc | patchall | patjallang | arnoldina | losboots | pantchall | desnoek | zijdeteeld | woelwater | suijkermaalder | bancq | depatchiall | kruisser | depant | debarcq | nacheribon | sorgdrager | zijdewoom | glisgis | beschutter | vantchiall | delosboot | garnaal | chailoup | beschermer | zordaan | galwet | casuaris | pandjallang | casuarus | pantj | schipio | galeij | oostendenaer | ontang | patch | burk | losboot | smapt | panthialling | bethij | breguantijn | depatch | coffijthuijn | pantsjall | contong | moesthuijn | ramsgatte | jallang | zuijerbeek | onbeschaamtheijd | pantchalling | panthiallang | pittoor | zuijkermaalder | chialoop | tanjongpour | vrctoria | vesuvius | pinxterbloem | chiloup | pantschiallang |
for i in w2v.most_similar("intje", topn=100):
    print(i[0], end=" | ")
jntje | maleijer | dul | maleyer | anachoda | bappa | salim | jntie | malijer | malaijer | malim | boeginees | jurragan | parnakan | iavaan | iuragan | intie | cadier | sadulla | carim | mochamat | abdul | samat | parnackan | javaan | arabier | assan | nachoda | javaen | soedin | bouginees | mohamat | abdulla | achmat | talip | iurragan | inw | kinko | balier | zait | jnw | lim | sleman | juragan | saijit | garrang | rahim | bagus | oeij | tjina | anach | njo | jabar | boeang | tjan | mahama | karim | boeijong | aboe | jnwoonder | ganie | campar | tja | garang | balijer | troena | kamat | mallijer | anak | chin | sait | cassim | machoda | boejong | soekoer | roekoe | nio | samara | oemar | poea | lebe | hoko | miskien | vrijbalier | maijang | hoeko | salee | sech | samsoe | boegenees | naghoda | koko | gonting | tenoedin | mandarees | oesien | troeno | draman | sinko | jamal |
for i in w2v.most_similar("caneel", topn=100):
    print(i[0], end=" | ")
canneel | arreecq | arreeck | cardamom | geschilden | balen | cardaman | cardamon | areecq | arreek | geschilt | overjarigen | areek | geschilde | ruijnas | bast | kurkuma | wortelen | wortel | cardemom | cannel | saffragamse | arreeq | groven | jndigo | incorrecte | ammenams | schelders | plantjes | curcuma | areeck | ougsten | affpacken | zaije | runas | schillens | moernagelen | cauwa | wortels | smakeloose | koehuijden | klenen | indigo | gekookten | zalpeter | canneer | saije | calpentijnsen | cragtelose | endeneese | canneelschilders | cheijlonsen | kannee | reuck | baelen | baalen | kanneel | pingos | sacken | varssen | anijl | ruinas | ammonams | tabacq | zaat | cauris | amm | ruias | cardanom | fijnen | cardamam | coffijbonen | cardamoin | arreck | bhaalen | zaijen | nagelen | caneell | embaleeren | bladeren | berberijen | coffijboonen | overjarige | kleenen | fordeelen | zaad | onrijpe | noten | pken | specerije | gamsen | geschild | caaneel | roggevellen | endeneesche | ingesamelden | oliteiten | peerlen | pepen | elijhanten |
The cell below takes the 100 most similar words to "amsterdam" and prints those with a similarity score of 0.4 or higher.
for i, p in w2v.most_similar("amsterdam", topn=100):
    if p >= 0.4:
        print(i, end=" | ")
sterdam | middelburg | amsterd | amst | zeeland | amster | amsterdm | amstm | rotterdam | delft | amsteldam | enkhuijsen | zeland | middelburgh | utrecht | ams | amsterda | amste | gravenhage | terdam | zeelant | zeiland | derwapen | enchuijsen | dam | delff | maddelburg | middelb | enckhuijsen | amstedam | enkhuijzen | aamsterdam | delfft | presidiale | enkhuisen | seeland | enckhuijzen | geredreseert | vlissingen | rdam | praesidiale | amsterdan | hage | costeux | zeelandt | wappan | hoorn | rotterdant | delburg | delf | delst | behangsels | inzeland | middelbrerg | enkhuizen | proefidiaale | praecidiale | ceulen | boodh | caamer | enckhuijs | dewees | behanghsel | amsterstam | temiddelburg | enkhuysen | zieland | alkmaar | meddelburg | cognoissemet | rotter | sdh | carode | uijtgevaren | middelburgin | kameer | delvt | leijden | zeel | praesideale | amstd | uijtregt | utregt | hoplooper | enchuy | terkamer | rabbinel | vlissinge | diale | kaner | veere | arnhem | confernee | praesidiaale | haarlem | kamier | enehuysen | siemermeer | middeburg | amstdam |
The following cell again uses the `most_similar` function of the Word2Vec model (`w2v`), this time to find and print words similar to a given set of "positive" and "negative" terms. The vector representations of the `positive` words contribute positively to the similarity computation, while those of the `negative` words contribute negatively. In this example, we use this method to find words similar to "weder" in the meaning of "weather", and not in the meaning of "again".
for i in w2v.most_similar(
    positive=["weder", "weer", "regen"], negative=["wederom", "alweder"], topn=100
):
    print(i[0], end=" | ")
weir | reegen | wint | zeewint | lugt | windt | noorde | winden | waaijende | stroom | buijen | winde | sneeuw | doorwaijen | zuijde | lucht | regenbuijen | waijende | wind | coeltjens | suijdweste | koude | vlagen | handsaem | weerligt | dewind | regenagtig | tegenstroom | doorwaijende | sonneschijn | regenen | stilte | koelte | regens | coelte | lught | hitte | stijve | lughje | zeewind | wintje | weste | warme | onstuijmig | reegenen | stroomen | koelten | zonnestraalen | delugt | warmte | handsaam | buijdige | travaden | doorbreken | inbreeken | moussom | doorwaaijende | reegende | travadig | doorstaande | doorkomende | hette | buijig | luchje | felle | afwatering | starke | kentering | overdag | stormwinden | reegens | wzw | westelijke | vloet | variable | coeltje | calte | tegenwinden | ooste | goedweer | oostelijke | noordweste | zot | waaijde | deijning | aartrijk | noordelijk | valwinden | ongestadige | doorwaaijen | slijve | suijde | caelte | lugties | firmament | regende | coeste | travodig | coelende | doorbrake |
Ships in the Dutch East India Company (VOC) fleet were often named after places. The following cell uses the `closer_than` function to find all words in the Word2Vec model's vocabulary whose vector representations are closer to a specified word ("eendracht", here meant as the name of a ship) than to another word ("tilburg", here meant as the name of a place) in terms of cosine similarity. This identifies words that share a stronger contextual association with "eendracht" (the ship) than with "tilburg" (the place), which ideally filters out terms referring to places and yields a list of potential ship names, or at least of words that are more likely to be associated with ships.
words = w2v.closer_than("eendracht", "tilburg")
output = " | ".join(words[:100])
print(output)
ende | naer | oock | noch | int | nae | schepen | retour | vant | camer | gecomen | fluijt | volck | becomen | welck | jacht | hoorn | japan | rotterdam | coninck | ditto | jagt | wint | compe | godt | lant | eijlanden | derwaerts | end | vertreck | landt | goa | geladen | stadt | tschip | bat | comende | maent | opt | chaloup | maecken | ladinge | japara | delft | oocq | gearriveert | genaemt | gemaeckt | weijnich | coningh | rhede | langh | waermede | daermede | ene | macht | ancker | originele | jnt | eijlant | nassauw | augustij | vrede | quartieren | wapen | voijagie | cattij | middagh | achter | opde | vaderlant | portugees | geseijde | leeuw | dirck | cargasoen | verwachten | mauritius | rijck | chialoep | dach | namentlijck | eijlandt | geladene | vlissingen | jachten | battavia | gelyck | seecker | wingurla | gescheept | amst | portugesen | iapan | comste | stondt | nederlants | arent | nacht | vercocht
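Besides ranking neighbours, you can also compute the cosine similarity between two individual words directly, for instance to check how the two reference words above relate to each other. The sketch below uses Gensim's `similarity` function; the exact scores depend on the trained model, and the second pair ("eendracht" and "hoorn") is only an illustration.

# Cosine similarity between individual word pairs
print(w2v.similarity("eendracht", "tilburg"))
print(w2v.similarity("eendracht", "hoorn"))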
Analogy generation can provide insights into historical semantics and how certain terms relate to one another in specific domains. To do this, you can use the `most_similar` function with a combination of "positive" and "negative" word vectors. For example, running the following cell yields the ten best-fitting words (based on their vector representations) for the analogy "noten is to banda as [X] is to ceylon".
results = w2v.most_similar(positive=["noten", "ceylon"], negative=["banda"], topn=10)
print("'noten' is to 'banda' as the following are to 'ceylon':")
for word, similarity in results:
    print(f"{word} (similarity: {similarity:.4f})")
'noten' is to 'banda' as the following are to 'ceylon':
cardamon (similarity: 0.5871)
baalen (similarity: 0.5764)
ruijnas (similarity: 0.5662)
baaltjes (similarity: 0.5633)
kardamom (similarity: 0.5629)
chiancossen (similarity: 0.5529)
cardamam (similarity: 0.5451)
wortelen (similarity: 0.5399)
caneel (similarity: 0.5394)
cardaman (similarity: 0.5378)
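The remainder of this notebook documents how the pretrained model was created: downloading the GLOBALISE transcriptions, pre-processing the text files, and training the Word2Vec model with Gensim. You only need to run the cells below if you want to reproduce or adapt the training yourself; for querying the pretrained model, the cells above suffice.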
import os
import sys
import logging
import pickle
from gensim.models import Word2Vec, KeyedVectors
from gensim.corpora.textcorpus import TextDirectoryCorpus
from gensim.corpora.dictionary import Dictionary
from gensim.parsing.preprocessing import (
    remove_stopword_tokens,
    remove_short_tokens,
    lower_to_unicode,
    strip_multiple_whitespaces,
)
from gensim.utils import deaccent, simple_tokenize, effective_n_jobs
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)
logging.getLogger().setLevel(logging.INFO)
# Setting
vector_size = 100
Downloading the Data¶
The data can be downloaded from the GLOBALISE Dataverse: https://datasets.iisg.amsterdam/dataverse/globalise. For this experiment, we’re working with version v2.0 of the transcriptions dataset:
- GLOBALISE project, 2024, "VOC transcriptions v2 - GLOBALISE", https://hdl.handle.net/10622/LVXSBW, IISH Data Collection
The project conveniently provides a file with pointers to all txt files in this dataset, which we can use to download them automatically with `wget`. First we download the file with the pointers, then we use it to fetch all txt files. This can take a while.
! mkdir -p data && wget https://datasets.iisg.amsterdam/api/access/datafile/33172?gbrecs=true -O data/globalise_transcriptions_v2_txt.tab --content-disposition
! mkdir -p data/txt && wget -i data/globalise_transcriptions_v2_txt.tab -P data/txt/ --content-disposition
Pre-processing¶
We now have a collection of text files, in which each file contains the text of one inventory number.
The files need a bit of pre-processing before we can work with them. What needs to be done:
- Remove all lines starting with `#+`. These are comments and not part of the text.
def preprocess_txt(file_path):
    print("Processing", file_path)

    # Open the textfile
    with open(file_path) as infile:
        text = infile.read()

    lines = []
    for line in text.split("\n"):
        if line.startswith("#+ "):
            continue
        else:
            lines.append(line)

    text = "\n".join(lines)

    # Save the cleaned version
    with open(file_path, "w") as outfile:
        outfile.write(text)
FOLDER = "data/txt"
for f in os.listdir(FOLDER):
    filepath = os.path.join(FOLDER, f)
    preprocess_txt(filepath)
Processing¶
Now that we have the data in a usable format, we can start processing it. We will use the Gensim library to train a Word2Vec model on the text data. For this, we first create a corpus object that will be used to feed text to the model. We use a custom implementation of the `gensim.corpora.textcorpus.TextDirectoryCorpus` class so that the vocabulary is not pruned at the default cutoff (the standard `prune_at` setting).
logger = logging.getLogger(__name__)
class CustomTextDirectoryCorpus(TextDirectoryCorpus):
    """
    Custom class to set the `prune_at` gensim.Dictionary parameter.
    """

    def __init__(
        self,
        input,
        dictionary=None,
        metadata=False,
        character_filters=None,
        tokenizer=None,
        token_filters=None,
        min_depth=0,
        max_depth=None,
        pattern=None,
        exclude_pattern=None,
        lines_are_documents=False,
        encoding="utf-8",
        dictionary_prune_at=2_000_000,
        **kwargs,
    ):
        self._min_depth = min_depth
        self._max_depth = sys.maxsize if max_depth is None else max_depth
        self.pattern = pattern
        self.exclude_pattern = exclude_pattern
        self.lines_are_documents = lines_are_documents
        self.encoding = encoding
        self.dictionary_prune_at = dictionary_prune_at

        self.input = input
        self.metadata = metadata

        self.character_filters = character_filters
        if self.character_filters is None:
            self.character_filters = [
                lower_to_unicode,
                deaccent,
                strip_multiple_whitespaces,
            ]

        self.tokenizer = tokenizer
        if self.tokenizer is None:
            self.tokenizer = simple_tokenize

        self.token_filters = token_filters
        if self.token_filters is None:
            self.token_filters = [remove_short_tokens, remove_stopword_tokens]

        self.length = None
        self.dictionary = None
        self.init_dictionary(dictionary)

        super(CustomTextDirectoryCorpus, self).__init__(
            input, self.dictionary, metadata, **kwargs
        )

    def init_dictionary(self, dictionary):
        """Initialize/update dictionary.

        Parameters
        ----------
        dictionary : :class:`~gensim.corpora.dictionary.Dictionary`, optional
            If a dictionary is provided, it will not be updated with the given corpus on initialization.
            If None - new dictionary will be built for the given corpus.

        Notes
        -----
        If self.input is None - make nothing.

        """
        self.dictionary = dictionary if dictionary is not None else Dictionary()

        if self.input is not None:
            if dictionary is None:
                logger.info("Initializing dictionary")
                metadata_setting = self.metadata
                self.metadata = False
                self.dictionary.add_documents(
                    self.get_texts(), prune_at=self.dictionary_prune_at
                )
                self.metadata = metadata_setting
            else:
                logger.info("Input stream provided but dictionary already initialized")
        else:
            logger.warning(
                "No input document stream provided; assuming dictionary will be initialized some other way."
            )
class SentencesIterator:
    def __init__(self, generator_function):
        self.generator_function = generator_function
        self.generator = self.generator_function()

    def __iter__(self):
        # reset the generator
        self.generator = self.generator_function()
        return self

    def __next__(self):
        result = next(self.generator)
        if result is None:
            raise StopIteration
        else:
            return result
With the above code we can generate our own `corpus` object with a larger dictionary size than Gensim's default. We set it to 20M, since we are also interested in the less frequently occurring words (e.g. spelling variants). We can filter on minimum frequency later.
corpus = CustomTextDirectoryCorpus(FOLDER, dictionary_prune_at=20_000_000)
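As an optional check, you can inspect how many unique tokens ended up in the corpus dictionary; the exact number depends on the data you downloaded.

# Number of unique tokens collected in the corpus dictionary
print(len(corpus.dictionary))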
Now let’s save the corpus object to disk, so we can reuse it later without having to re-run the pre-processing steps. Comment or uncomment the respective lines below to either save the corpus object or load it from disk.
with open("data/corpus.pkl", "wb") as f:
    pickle.dump(corpus, f)

# with open("data/corpus.pkl", "rb") as f:
#     corpus = pickle.load(f)
The next step is to train the Word2Vec model. The model needs to iterate over the corpus multiple times (once to build the vocabulary and once per training epoch), so we wrap the corpus’s `get_texts` generator in a restartable iterator:
texts = SentencesIterator(corpus.get_texts)
Now, let’s create a Word2Vec embedding. You can set the number of workers to your CPU count (minus 1). Again, this can take a while.
You can experiment with the parameters of the Word2Vec model, such as the vector size, window size, and minimum frequency, but this can lead to a bigger model, longer training time, and not necessarily better results.
workers = effective_n_jobs(max(os.cpu_count() - 1, 1))
w2v = Word2Vec(
    texts, vector_size=vector_size, window=5, min_count=5, workers=workers, epochs=5
)
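As a quick sanity check, you can query the freshly trained model in the same way as the pretrained one at the top of this notebook; the trained vectors live in `w2v.wv`. The query word below is only an illustration and assumes it occurs in the corpus vocabulary.

# Query the newly trained vectors (the KeyedVectors live in w2v.wv)
for word, score in w2v.wv.most_similar("amsterdam", topn=10):
    print(word, round(score, 3), sep="\t")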
Now, let’s save the embedding for future use.
w2v.wv.save_word2vec_format(f"data/GLOBALISE_{vector_size}.word2vec")