Word2Vec Experiment¶
Download links:
- Notebook: https://github.com/globalise-huygens/lab.globalise.huygens.knaw.nl/blob/main/docs/experiments/GLOBALISE_Word2Vec_Lab.ipynb
- Pretrained model (100 dimensions, 645MB): https://surfdrive.surf.nl/files/index.php/s/XmUIlsy33vpRdCX
Downloading this notebook will allow you to experiment with a Word2Vec model based on the GLOBALISE Transcription (V2). You can train the model on your own data, or use the pretrained model to find similar words.
If you use the pretrained model, download and unzip GLOBALISE.word2vec.zip into a data directory in the same folder as this notebook. Then run only the first cell and skip to the 'Loading a pretrained model' section to load the model and start experimenting.
import os
import sys
import logging
import pickle
from gensim.models import Word2Vec, KeyedVectors
from gensim.corpora.textcorpus import TextDirectoryCorpus
from gensim.corpora.dictionary import Dictionary
from gensim.parsing.preprocessing import (
    remove_stopword_tokens,
    remove_short_tokens,
    lower_to_unicode,
    strip_multiple_whitespaces,
)
from gensim.utils import deaccent, simple_tokenize, effective_n_jobs
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)
logging.getLogger().setLevel(logging.INFO)
# Setting
vector_size = 100
Data¶
The data can be downloaded from the GLOBALISE Dataverse: https://datasets.iisg.amsterdam/dataverse/globalise. For this experiment, we're working with version V2.0 of the Transcription dataset:
- GLOBALISE project, 2024, "VOC transcriptions v2 - GLOBALISE", https://hdl.handle.net/10622/LVXSBW, IISH Data Collection
The project conveniently provides a file with pointers to all txt files in this dataset, which lets us download them automatically with wget. First we fetch the file with pointers, then we use it to download all txt files. This can take a while.
! mkdir -p data && wget https://datasets.iisg.amsterdam/api/access/datafile/33172?gbrecs=true -O data/globalise_transcriptions_v2_txt.tab --content-disposition
--2024-07-22 22:01:53-- https://datasets.iisg.amsterdam/api/access/datafile/33172?gbrecs=true Resolving datasets.iisg.amsterdam (datasets.iisg.amsterdam)... 195.169.88.174 Connecting to datasets.iisg.amsterdam (datasets.iisg.amsterdam)|195.169.88.174|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 399793 (390K) [text/plain] Saving to: ‘data/globalise_transcriptions_v2_txt.tab’ data/globalise_tran 100%[===================>] 390,42K --.-KB/s in 0,06s 2024-07-22 22:01:53 (6,54 MB/s) - ‘data/globalise_transcriptions_v2_txt.tab’ saved [399793/399793]
! mkdir -p data/txt && wget -i data/globalise_transcriptions_v2_txt.tab -P data/txt/ --content-disposition
Pre-processing¶
We now have a collection of text files, one per inventory number.
The files need a bit of pre-processing before we can work with them. What needs to be done:
- Remove all lines starting with #+. These are comments and not part of the text.
def preprocess_txt(file_path):
    print("Processing", file_path)
    # Open the textfile
    with open(file_path) as infile:
        text = infile.read()
    lines = []
    for line in text.split("\n"):
        if line.startswith("#+ "):
            continue
        else:
            lines.append(line)
    text = "\n".join(lines)
    # Save the cleaned version
    with open(file_path, "w") as outfile:
        outfile.write(text)
FOLDER = "data/txt"
for f in os.listdir(FOLDER):
    filepath = os.path.join(FOLDER, f)
    preprocess_txt(filepath)
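The loop above overwrites the downloaded files in place. As a minimal sketch, the same filtering can be kept in a pure function that works on a string, so it can be checked on a small sample before any file is touched. The name `strip_comment_lines` and the sample text are illustrative and not part of the notebook; like the original code, it matches the prefix `#+ ` including the trailing space.

```python
def strip_comment_lines(text, prefix="#+ "):
    """Return `text` without the lines that start with `prefix`."""
    kept = [line for line in text.split("\n") if not line.startswith(prefix)]
    return "\n".join(kept)


# Hypothetical sample, made up for illustration.
sample = "#+ some comment\nfirst line of text\n#+ another comment\nsecond line"
cleaned = strip_comment_lines(sample)
print(cleaned)
```

Because the function does not write anything, running it twice on the same input is harmless.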
Processing¶
Now that we have the data in a usable format, we can start processing it. We will use the Gensim library to train a Word2Vec model on the text data. For this, we first create a Corpus object that will be used to feed text to the model. We use a custom subclass of gensim.corpora.textcorpus.TextDirectoryCorpus so that we can raise the cutoff on the number of words in the vocabulary (the Dictionary's prune_at parameter, which is 2,000,000 in the standard settings).
logger = logging.getLogger(__name__)
class CustomTextDirectoryCorpus(TextDirectoryCorpus):
    """
    Custom class to set the `prune_at` gensim.Dictionary parameter.
    """
    def __init__(
        self,
        input,
        dictionary=None,
        metadata=False,
        character_filters=None,
        tokenizer=None,
        token_filters=None,
        min_depth=0,
        max_depth=None,
        pattern=None,
        exclude_pattern=None,
        lines_are_documents=False,
        encoding="utf-8",
        dictionary_prune_at=2_000_000,
        **kwargs,
    ):
        self._min_depth = min_depth
        self._max_depth = sys.maxsize if max_depth is None else max_depth
        self.pattern = pattern
        self.exclude_pattern = exclude_pattern
        self.lines_are_documents = lines_are_documents
        self.encoding = encoding
        self.dictionary_prune_at = dictionary_prune_at
        self.input = input
        self.metadata = metadata
        self.character_filters = character_filters
        if self.character_filters is None:
            self.character_filters = [
                lower_to_unicode,
                deaccent,
                strip_multiple_whitespaces,
            ]
        self.tokenizer = tokenizer
        if self.tokenizer is None:
            self.tokenizer = simple_tokenize
        self.token_filters = token_filters
        if self.token_filters is None:
            self.token_filters = [remove_short_tokens, remove_stopword_tokens]
        self.length = None
        self.dictionary = None
        self.init_dictionary(dictionary)
        super(CustomTextDirectoryCorpus, self).__init__(
            input, self.dictionary, metadata, **kwargs
        )
    def init_dictionary(self, dictionary):
        """Initialize/update dictionary.
        Parameters
        ----------
        dictionary : :class:`~gensim.corpora.dictionary.Dictionary`, optional
            If a dictionary is provided, it will not be updated with the given corpus on initialization.
            If None - new dictionary will be built for the given corpus.
        Notes
        -----
        If self.input is None - make nothing.
        """
        self.dictionary = dictionary if dictionary is not None else Dictionary()
        if self.input is not None:
            if dictionary is None:
                logger.info("Initializing dictionary")
                metadata_setting = self.metadata
                self.metadata = False
                self.dictionary.add_documents(
                    self.get_texts(), prune_at=self.dictionary_prune_at
                )
                self.metadata = metadata_setting
            else:
                logger.info("Input stream provided but dictionary already initialized")
        else:
            logger.warning(
                "No input document stream provided; assuming dictionary will be initialized some other way."
            )
class SentencesIterator:
    def __init__(self, generator_function):
        self.generator_function = generator_function
        self.generator = self.generator_function()
    def __iter__(self):
        # reset the generator
        self.generator = self.generator_function()
        return self
    def __next__(self):
        result = next(self.generator)
        if result is None:
            raise StopIteration
        else:
            return result
With the above code we can generate our own 'corpus' object with a larger dictionary size than Gensim's default. We set it to 20M (ten times the default of 2M), since we are also interested in less frequent words (e.g. spelling variants). We can filter on minimum frequency later on.
corpus = CustomTextDirectoryCorpus(FOLDER, dictionary_prune_at=20_000_000)
2024-07-22 22:14:08,818 : INFO : Initializing dictionary 2024-07-22 22:14:09,148 : INFO : adding document #0 to Dictionary<0 unique tokens: []> 2024-07-22 22:36:27,988 : INFO : built Dictionary<10195707 unique tokens: ['__o', '_os', 'aad', 'aag', 'aagwit']...> from 6893 documents (total 694347987 corpus positions) 2024-07-22 22:36:27,988 : INFO : Input stream provided but dictionary already initialized
Now let's save the corpus object to disk, so we can use it later on and don't have to re-run the pre-processing steps. Comment and uncomment the respective code below to run the pre-processing steps or load the corpus object from disk.
with open("data/corpus.pkl", "wb") as f:
    pickle.dump(corpus, f)
# with open("data/corpus.pkl", "rb") as f:
#     corpus = pickle.load(f)
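Instead of commenting and uncommenting the two variants, the choice can be made at run time. The sketch below assumes a hypothetical helper `load_or_build` (not part of the notebook or of Gensim) that loads the pickle when it exists and otherwise builds and saves the object.

```python
import os
import pickle


def load_or_build(path, build):
    """Load a pickled object from `path`, or build and pickle it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


# Possible use in this notebook (assumes the class and FOLDER defined above):
# corpus = load_or_build(
#     "data/corpus.pkl",
#     lambda: CustomTextDirectoryCorpus(FOLDER, dictionary_prune_at=20_000_000),
# )
```

Unpickling only works when the class definitions above have been executed in the current session, since pickle stores a reference to the class rather than its code.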
The next step is to train the Word2Vec model. For this, we need to feed it the corpus object multiple times. We do so by initializing an iterator:
texts = SentencesIterator(corpus.get_texts)
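Why the wrapper is needed can be shown with a toy example: a plain generator is exhausted after one pass, whereas an object that creates a fresh generator in `__iter__` can be read again and again. The class `Restartable` and the function `toy_texts` below are simplified stand-ins made up for illustration; they are not the notebook's SentencesIterator or the real corpus.

```python
def toy_texts():
    """Stand-in for corpus.get_texts: yields one token list per document."""
    yield ["schip", "retour"]
    yield ["caneel", "balen"]


class Restartable:
    """Iterable that calls the generator function anew on every pass."""

    def __init__(self, generator_function):
        self.generator_function = generator_function

    def __iter__(self):
        return self.generator_function()


one_shot = toy_texts()
first_pass = list(one_shot)
second_pass = list(one_shot)  # empty: the generator is exhausted

restartable = Restartable(toy_texts)
passes = [list(restartable) for _ in range(3)]  # three full passes
print(len(first_pass), len(second_pass), [len(p) for p in passes])
# prints: 2 0 [2, 2, 2]
```

Training reads the texts once to build the vocabulary and then once per epoch, so the input has to survive several passes.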
Now, let's create a Word2Vec embedding. You can set the number of workers to your CPU count (minus 1). Again, this can take a while.
You can experiment with the parameters of the Word2Vec model, such as the vector size, window size, and minimum frequency, but this can lead to a bigger model, longer training time, and not necessarily better results.
workers = effective_n_jobs(max(os.cpu_count() - 1, 1))
w2v = Word2Vec(
    texts, vector_size=vector_size, window=5, min_count=5, workers=workers, epochs=5
)
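Of these parameters, min_count has the most direct effect on model size: tokens that occur fewer times than min_count in the whole corpus get no vector. The sketch below illustrates the idea on made-up token lists with a hypothetical helper `vocabulary_size`; it only counts tokens and does not call Gensim.

```python
from collections import Counter


def vocabulary_size(texts, min_count):
    """Count the distinct tokens that occur at least `min_count` times."""
    counts = Counter(token for text in texts for token in text)
    return sum(1 for c in counts.values() if c >= min_count)


# Made-up token lists for illustration only.
toy = [
    ["amsterdam", "amsterdam", "amsterd", "schip"],
    ["amsterdam", "schip", "amst"],
]
sizes = {m: vocabulary_size(toy, m) for m in (1, 2, 3)}
print(sizes)
# prints: {1: 4, 2: 2, 3: 1}
```

The logs in this notebook show the same effect at scale: the dictionary holds 10,195,707 unique tokens, while the model trained with min_count=5 stores 1,369,250 vectors. Lowering min_count keeps more rare spelling variants at the cost of a larger model.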
Now, let's save the embedding for future use. (Similarly, there is a function to load a previously saved model again below)
w2v.wv.save_word2vec_format(f"data/GLOBALISE_{vector_size}.word2vec")
2024-07-23 00:23:10,897 : INFO : storing 1369250x100 projection weights into data/GLOBALISE.word2vec
Loading a pretrained model¶
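There is a slight difference between what we generated above (a full Word2Vec model) and what we load below (only its word vectors). To streamline the commands, we load the vectors as a KeyedVectors object. This is also the place to load a previously trained model, such as the pretrained one. Note that the logged output refers to data/GLOBALISE.word2vec, whereas the code builds the name data/GLOBALISE_{vector_size}.word2vec; adjust the path if your file is named differently.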
w2v = KeyedVectors.load_word2vec_format(f"data/GLOBALISE_{vector_size}.word2vec")
2024-07-26 11:53:24,202 : INFO : loading projection weights from data/GLOBALISE.word2vec 2024-07-26 11:54:03,938 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (1369250, 100) matrix of type float32 from data/GLOBALISE.word2vec', 'binary': False, 'encoding': 'utf8', 'datetime': '2024-07-26T11:54:03.938194', 'gensim': '4.3.0', 'python': '3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]', 'platform': 'Linux-5.15.0-116-generic-x86_64-with-glibc2.35', 'event': 'load_word2vec_format'}
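The file name in the log output (data/GLOBALISE.word2vec) differs from the one the code builds (data/GLOBALISE_{vector_size}.word2vec), so which of the two exists depends on how the file was produced or unzipped. A minimal sketch, assuming these two candidate names and a hypothetical helper `first_existing`, picks whichever file is present before loading:

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first path in `candidates` that is an existing file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


vector_size = 100  # same setting as at the top of the notebook
model_path = first_existing(
    [
        f"data/GLOBALISE_{vector_size}.word2vec",
        "data/GLOBALISE.word2vec",
    ]
)
print(model_path)  # None when neither file is present
```

The resulting path can then be passed as a string to KeyedVectors.load_word2vec_format in place of the hard-coded file name.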
Analysis¶
Now, this is where the fun starts. We can use the Word2Vec model to find similar words to a given word and thereby find words that share the same semantics and context.
See the Gensim documentation for more information on how to use the Word2Vec model: https://radimrehurek.com/gensim/models/word2vec.html. Below are some examples of how to use the model. You can substitute the words with any word you like, as long as it is in the vocabulary of the model/corpus. Everything needs to be in lowercase.
for i in w2v.most_similar("pantchialang", topn=100):
    print(i[0], end=" | ")
pantchialling | pantjall | dehaij | pantch | depantjall | patchiall | pantchiall | challang | debijl | noodhulp | goudsoeker | pantsch | haaij | tapko | pantchialt | jaarvogel | depantchiall | jongedirk | buijtel | krankte | windbuijl | depantjallang | patchiallang | zuykermaalder | pantchallang | depantch | onbeschaamdh | copjagt | chialling | patchalling | boshaan | pantchiallings | salpetersoeker | overmaas | pantjalang | bonneratte | chialop | onbeschaamtheijt | pantc | patchall | patjallang | arnoldina | losboots | pantchall | desnoek | zijdeteeld | woelwater | suijkermaalder | bancq | depatchiall | kruisser | depant | debarcq | nacheribon | sorgdrager | zijdewoom | glisgis | beschutter | vantchiall | delosboot | garnaal | chailoup | beschermer | zordaan | galwet | casuaris | pandjallang | casuarus | pantj | schipio | galeij | oostendenaer | ontang | patch | burk | losboot | smapt | panthialling | bethij | breguantijn | depatch | coffijthuijn | pantsjall | contong | moesthuijn | ramsgatte | jallang | zuijerbeek | onbeschaamtheijd | pantchalling | panthiallang | pittoor | zuijkermaalder | chialoop | tanjongpour | vrctoria | vesuvius | pinxterbloem | chiloup | pantschiallang |
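Since queries must be lowercase and present in the vocabulary, it can help to normalise a word and test for membership before calling most_similar, which otherwise raises a KeyError for unknown words. The sketch below approximates the lowercasing and accent stripping applied to the corpus using only the standard library; the helper names and the toy vocabulary are made up for illustration.

```python
import unicodedata


def normalise_query(word):
    """Lowercase and strip accents, approximating the corpus character filters."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def in_vocabulary(vectors, word):
    """True if the normalised word is present; `vectors` must support `in`."""
    return normalise_query(word) in vectors


toy_vocabulary = {"caneel": 0, "amsterdam": 1}  # stand-in for w2v.key_to_index
print(normalise_query("Canéel"), in_vocabulary(toy_vocabulary, "Canéel"))
# prints: caneel True
```

With the loaded model, `in_vocabulary(w2v.key_to_index, "Canéel")` would perform the same check against the real vocabulary.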
for i, p in w2v.most_similar("amsterdam", topn=100):
    if p >= 0.4:
        print(i, end=" ")
sterdam middelburg amsterd amst zeeland amster amsterdm amstm rotterdam delft amsteldam enkhuijsen zeland middelburgh utrecht ams amsterda amste gravenhage terdam zeelant zeiland derwapen enchuijsen dam delff maddelburg middelb enckhuijsen amstedam enkhuijzen aamsterdam delfft presidiale enkhuisen seeland enckhuijzen geredreseert vlissingen rdam praesidiale amsterdan hage costeux zeelandt wappan hoorn rotterdant delburg delf delst behangsels inzeland middelbrerg enkhuizen proefidiaale praecidiale ceulen boodh caamer enckhuijs dewees behanghsel amsterstam temiddelburg enkhuysen zieland alkmaar meddelburg cognoissemet rotter sdh carode uijtgevaren middelburgin kameer delvt leijden zeel praesideale amstd uijtregt utregt hoplooper enchuy terkamer rabbinel vlissinge diale kaner veere arnhem confernee praesidiaale haarlem kamier enehuysen siemermeer middeburg amstdam
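The threshold filter used above can be wrapped in a small function so that the neighbours are returned together with their scores instead of only being printed. This is a sketch: `similar_above` is a hypothetical helper, and `ToyVectors` is a stand-in with invented scores so that the example runs without the model.

```python
def similar_above(vectors, word, threshold=0.4, topn=100):
    """Return (neighbour, similarity) pairs whose similarity is >= threshold."""
    return [
        (neighbour, similarity)
        for neighbour, similarity in vectors.most_similar(word, topn=topn)
        if similarity >= threshold
    ]


class ToyVectors:
    """Minimal stand-in exposing only `most_similar`, with made-up scores."""

    def most_similar(self, word, topn=10):
        scores = [("amsterd", 0.9), ("middelburg", 0.6), ("behangsels", 0.3)]
        return scores[:topn]


print(similar_above(ToyVectors(), "amsterdam"))
# prints: [('amsterd', 0.9), ('middelburg', 0.6)]
```

Passing the loaded `w2v` object instead of `ToyVectors()` applies the same filter to the real neighbours.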
for i in w2v.most_similar("intje", topn=100):
    print(i[0], end=" | ")
jntje | maleijer | dul | maleyer | anachoda | bappa | salim | jntie | malijer | malaijer | malim | boeginees | jurragan | parnakan | iavaan | iuragan | intie | cadier | sadulla | carim | mochamat | abdul | samat | parnackan | javaan | arabier | assan | nachoda | javaen | soedin | bouginees | mohamat | abdulla | achmat | talip | iurragan | inw | kinko | balier | zait | jnw | lim | sleman | juragan | saijit | garrang | rahim | bagus | oeij | tjina | anach | njo | jabar | boeang | tjan | mahama | karim | boeijong | aboe | jnwoonder | ganie | campar | tja | garang | balijer | troena | kamat | mallijer | anak | chin | sait | cassim | machoda | boejong | soekoer | roekoe | nio | samara | oemar | poea | lebe | hoko | miskien | vrijbalier | maijang | hoeko | salee | sech | samsoe | boegenees | naghoda | koko | gonting | tenoedin | mandarees | oesien | troeno | draman | sinko | jamal |
for i in w2v.most_similar("caneel", topn=100):
    print(i[0], end=" | ")
canneel | arreecq | arreeck | cardamom | geschilden | balen | cardaman | cardamon | areecq | arreek | geschilt | overjarigen | areek | geschilde | ruijnas | bast | kurkuma | wortelen | wortel | cardemom | cannel | saffragamse | arreeq | groven | jndigo | incorrecte | ammenams | schelders | plantjes | curcuma | areeck | ougsten | affpacken | zaije | runas | schillens | moernagelen | cauwa | wortels | smakeloose | koehuijden | klenen | indigo | gekookten | zalpeter | canneer | saije | calpentijnsen | cragtelose | endeneese | canneelschilders | cheijlonsen | kannee | reuck | baelen | baalen | kanneel | pingos | sacken | varssen | anijl | ruinas | ammonams | tabacq | zaat | cauris | amm | ruias | cardanom | fijnen | cardamam | coffijbonen | cardamoin | arreck | bhaalen | zaijen | nagelen | caneell | embaleeren | bladeren | berberijen | coffijboonen | overjarige | kleenen | fordeelen | zaad | onrijpe | noten | pken | specerije | gamsen | geschild | caaneel | roggevellen | endeneesche | ingesamelden | oliteiten | peerlen | pepen | elijhanten |
for i in w2v.most_similar(
    positive=["weder", "weer", "regen"], negative=["wederom", "alweder"], topn=100
):
    print(i[0], end=" | ")
weir | reegen | wint | zeewint | lugt | windt | noorde | winden | waaijende | stroom | buijen | winde | sneeuw | doorwaijen | zuijde | lucht | regenbuijen | waijende | wind | coeltjens | suijdweste | koude | vlagen | handsaem | weerligt | dewind | regenagtig | tegenstroom | doorwaijende | sonneschijn | regenen | stilte | koelte | regens | coelte | lught | hitte | stijve | lughje | zeewind | wintje | weste | warme | onstuijmig | reegenen | stroomen | koelten | zonnestraalen | delugt | warmte | handsaam | buijdige | travaden | doorbreken | inbreeken | moussom | doorwaaijende | reegende | travadig | doorstaande | doorkomende | hette | buijig | luchje | felle | afwatering | starke | kentering | overdag | stormwinden | reegens | wzw | westelijke | vloet | variable | coeltje | calte | tegenwinden | ooste | goedweer | oostelijke | noordweste | zot | waaijde | deijning | aartrijk | noordelijk | valwinden | ongestadige | doorwaaijen | slijve | suijde | caelte | lugties | firmament | regende | coeste | travodig | coelende | doorbrake |
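The query above combines positive and negative words, presumably to steer the ambiguous "weder" and "weer" towards their weather sense and away from "again". The following pure-Python sketch shows the general idea behind such a query on invented two-dimensional vectors: positive words are added and negative words subtracted (after length normalisation), and the remaining words are ranked by cosine similarity to the result. It is an approximation for illustration, not Gensim's implementation, and all names and numbers are made up.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def query_vector(vectors, positive, negative):
    """Average the unit vectors, counting negative words with weight -1."""
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            total[i] += weight * x
    return [x / len(weighted) for x in total]


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank all other words by cosine similarity to the combined query vector."""
    query = query_vector(vectors, positive, negative)
    used = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in used]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-d vectors: axis 0 stands for "weather", axis 1 for "again".
toy = {
    "weer": [1.0, 1.0],
    "regen": [1.0, 0.0],
    "wederom": [0.0, 1.0],
    "wind": [0.9, 0.1],
    "nogmaals": [0.1, 0.9],
}
ranking = most_similar(toy, positive=["weer", "regen"], negative=["wederom"])
print([word for word, _ in ranking])
# prints: ['wind', 'nogmaals']
```

In the toy data, subtracting "wederom" cancels the "again" component of "weer", so the weather-related word ends up on top.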
w2v.closer_than("eendracht", "tilburg")
['ende', 'naer', 'oock', 'noch', 'int', 'nae', 'schepen', 'retour', 'vant', 'camer', 'gecomen', 'fluijt', 'volck', 'becomen', 'welck', 'jacht', 'hoorn', 'japan', 'rotterdam', 'coninck', 'ditto', 'jagt', 'wint', 'compe', 'godt', 'lant', 'eijlanden', 'derwaerts', 'end', 'vertreck', 'landt', 'goa', 'geladen', 'stadt', 'tschip', 'bat', 'comende', 'maent', 'opt', 'chaloup', 'maecken', 'ladinge', 'japara', 'delft', 'oocq', 'gearriveert', 'genaemt', 'gemaeckt', 'weijnich', 'coningh', 'rhede', 'langh', 'waermede', 'daermede', 'ene', 'macht', 'ancker', 'originele', 'jnt', 'eijlant', 'nassauw', 'augustij', 'vrede', 'quartieren', 'wapen', 'voijagie', 'cattij', 'middagh', 'achter', 'opde', 'vaderlant', 'portugees', 'geseijde', 'leeuw', 'dirck', 'cargasoen', 'verwachten', 'mauritius', 'rijck', 'chialoep', 'dach', 'namentlijck', 'eijlandt', 'geladene', 'vlissingen', 'jachten', 'battavia', 'gelyck', 'seecker', 'wingurla', 'gescheept', 'amst', 'portugesen', 'iapan', 'comste', 'stondt', 'nederlants', 'arent', 'nacht', 'vercocht', 'haven', 'zielen', 'caap', 'gedestineert', 'goens', 'fregat', 'vaert', 'oosten', 'galjoot', 'iagt', 'bouton', 'admirael', 'baij', 'aengecomen', 'eenich', 'orangie', 'besendinge', 'nde', 'batt', 'joncken', 'toecomende', 'journael', 'fortuijn', 'conincx', 'volladen', 'enkhuijsen', 'gevaren', 'amsterd', 'engeland', 'diemen', 'japanse', 'suratte', 'texel', 'souratte', 'vaeren', 'moluccos', 'hooch', 'cleen', 'vandaer', 'vaertuijgh', 'aent', 'noorden', 'vloote', 'windt', 'loo', 'noort', 'havenen', 'jager', 'jans', 'ganges', 'gedestineerde', 'jagtje', 'mallacca', 'westen', 'reyse', 'twelcke', 'helena', 'claes', 'swarten', 'sumatra', 'suijcker', 'speelman', 'datmen', 'concordia', 'lam', 'burgh', 'bon', 'geseijlt', 'geseijt', 'wech', 'ouer', 'wacht', 'princes', 'vijant', 'uijren', 'vertreckt', 'lanck', 'vaderlandt', 'sunda', 'suratta', 'koninck', 'dittos', 'naderhant', 'mette', 'spiegel', 'eendragt', 'welvaren', 'graden', 'metten', 'wester', 'cleijn', 'oorloge', 
'hollandia', 'avondt', 'mars', 'chaloupen', 'queda', 'hoecker', 'engelsz', 'malcanderen', 'maen', 'ternate', 'lichten', 'eenighe', 'velsen', 'boeckhouder', 'hoeck', 'soodanich', 'eenelijck', 'aer', 'vijandt', 'fluijten', 'toegecomen', 'leijt', 'becoomen', 'pauw', 'pallas', 'stierman', 'seijlen', 'jaght', 'bengala', 'beer', 'gelicht', 'seijl', 'zuijt', 'nangasackij', 'lants', 'nederlantse', 'eijndelijck', 'anthonio', 'jonck', 'facture', 'anckers', 'monterende', 'esperance', 'schagen', 'jegenwoordich', 'opperhooffden', 'monteert', 'geprojecteert', 'eylanden', 'taijouan', 'persien', 'aengebracht', 'brant', 'macquian', 'straet', 'schep', 'ladingh', 'portugese', 'ouglij', 'alreede', 'lis', 'insgelijcx', 'maes', 'besettinge', 'macao', 'haas', 'geraeckt', 'fluijtje', 'caep', 'zeelandia', 'gemaect', 'joris', 'twelcq', 'selvige', 'spoedigh', 'doort', 'zeelant', 'nassau', 'retourneren', 'overgecomen', 'oorsaecke', 'bantham', 'middach', 'nodich', 'rechte', 'naert', 'martij', 'costi', 'welcq', 'overnes', 'voorss', 'waert', 'maendt', 'siams', 'verrichten', 'cauw', 'middelburgh', 'sillida', 'snoek', 'tidoor', 'brugge', 'zeijlende', 'voornt', 'ouwerkerk', 'naervolgende', 'conincq', 'aenstonts', 'wederomme', 'elck', 'gesecht', 'fredrick', 'cast', 'volcht', 'diamant', 'jaerlijcx', 'dieren', 'mouson', 'compste', 'opden', 'genoech', 'batavier', 'vaertuijch', 'geraecken', 'middelb', 'reste', 'chial', 'reviere', 'adrichem', 'castricum', 'tegenwoordich', 'vracht', 'margaretha', 'overschie', 'damme', 'spijk', 'grave', 'horst', 'dicht', 'biema', 'engelant', 'terstont', 'jagten', 'doornik', 'tocht', 'gecocht', 'siecken', 'lagh', 'amboijna', 'redelijck', 'nieuwland', 'verovert', 'arriveren', 'capelle', 'verwacht', 'gecregen', 'cargasoenen', 'houdt', 'scheepie', 'raap', 'ooc', 'hillegonda', 'overgescheept', 'onderwegen', 'hollantse', 'voorder', 'nassouw', 'vruchten', 'brandt', 'oma', 'westhoven', 'ongeluck', 'spiering', 'sijmon', 'goes', 'fluijtschip', 'selue', 'reij', 'geankert', 'bort', 
'hercules', 'verbij', 'hoet', 'gelijcq', 'gezeijlt', 'gesamentlijck', 'vrijburg', 'manilha', 'havens', 'thoff', 'oudt', 'naderhandt', 'negombo', 'walcheren', 'opperdoes', 'breda', 'pegu', 'solor', 'joncke', 'mallebaer', 'chiampan', 'voorspoet', 'redout', 'naerde', 'batavise', 'boa', 'jappan', 'briel', 'abbekerk', 'sont', 'tijdingh', 'weynich', 'geseth', 'meijnden', 'vliegende', 'woerden', 'vaderlantse', 'genaempt', 'wilhem', 'achteren', 'paliacatta', 'steecken', 'alst', 'boero', 'zeelandt', 'langewijk', 'riviere', 'formosa', 'adelborst', 'mettet', 'mett', 'caron', 'leck', 'tjagt', 'ano', 'snauw', 'broeck', 'tjacht', 'erasmus', 'manilhas', 'witten', 'beverwijk', 'utrecht', 'quaemen', 'gouda', 'cats', 'atchin', 'ingeladen', 'enckhuijsen', 'verricht', 'coen', 'vlaming', 'claesz', 'larique', 'brandenburg', 'grondt', 'molucco', 'aendoen', 'come', 'duijnen', 'vuijt', 'armade', 'iacoba', 'boucken', 'firando', 'moij', 'unie', 'dolphijn', 'daerse', 'foreest', 'kat', 'thof', 'joncq', 'geruchten', 'haen', 'cha', 'flora', 'amb', 'spoedich', 'keulen', 'saterdagh', 'janssen', 'campen', 'aencomste', 'vaerwater', 'date', 'standt', 'haerlem', 'fluyt', 'chaloep', 'jaccatra', 'dienstich', 'bassora', 'veroverde', 'deense', 'aenboort', 'castel', 'baije', 'assenburg', 'reeckeninge', 'buuren', 'hoochte', 'paert', 'besendingh', 'bellasoor', 'hurdt', 'africa', 'engelandt', 'barentsz', 'drije', 'gemant', 'victorie', 'coeverden', 'gescheepte', 'etmael', 'borsselen', 'baeij', 'samson', 'inlants', 'zunda', 'besuijden', 'nieuwerkerk', 'valck', 'blijdorp', 'chialoepen', 'eten', 'redelijcke', 'mitsgaeders', 'delfs', 'oosthuijsen', 'sulckx', 'henrick', 'westerbeek', 'brugh', 'nas', 'tland', 'vaem', 'riff', 'antonio', 'hollant', 'nederlantsche', 'schulp', 'cruijs', 'verongelucken', 'wassenaar', 'londen', 'nachts', 'beschermer', 'geraackt', 'outshoorn', 'maendagh', 'wercq', 'nederlant', 'opperstierman', 'rijckloff', 'gecombineert', 'aengeweest', 'naght', 'voorburg', 'strandt', 'waveren', 'reduijt', 
'cregen', 'noordbeek', 'draak', 'jaques', 'suijder', 'ellemeet', 'nova', 'nuijts', 'gelaeden', 'pool', 'besoecken', 'vosmaar', 'castor', 'retourneert', 'oct', 'schepe', 'comptanten', 'herstelde', 'onderzeijl', 'wickenburg', 'popkensburg', 'oorloch', 'steenhoven', 'vlamingh', 'bonne', 'seijnden', 'werwaerts', 'straalen', 'bijt', 'theodora', 'kercke', 'hogersmilde', 'tanjongpoura', 'datelijck', 'terra', 'iager', 'derwarts', 'dto', 'rotterd', 'aengeland', 'cijlon', 'soot', 'gideon', 'gevolcht', 'vertrocke', 'samarangh', 'hoedanich', 'gestevent', 'hollandt', 'herwarts', 'meerman', 'wesel', 'alphen', 'raeckende', 'iacht', 'twapen', 'alsvooren', 'parsia', 'ledigh', 'amstel', 'enchuijsen', 'masulipatam', 'rosingijn', 'vane', 'vestinge', 'papenburg', 'ontladen', 'valkenisse', 'reeckeningh', 'voorgemelte', 'tarnaten', 'gangh', 'laeden', 'hoochsten', 'ock', 'breecken', 'robbertus', 'booth', 'peerl', 'gouw', 'eylant', 'schelde', 'langhs', 'boordt', 'wendela', 'eenhoorn', 'voorland', 'faam', 'visvliet', 'beoosten', 'iagtje', 'salm', 'clachten', 'gevaeren', 'perijckel', 'redel', 'naet', 'seijlon', 'alsem', 'bocht', 'besich', 'leyden', 'groeningen', 'haerder', 'carthago', 'vruchteloos', 'maleijo', 'block', 'hardt', 'rhoon', 'wijck', 'genoechsaem', 'oostrust', 'waerdich', 'pollux', 'verseeckeringe', 'gesicht', 'mane', 'aancomste', 'geduijrende', 'besettingh', 'portugael', 'passerende', 'raadhuijs', 'rijcke', 'cargo', 'vertrecq', 'lampon', 'scholtenburg', 'palimbangh', 'deensche', 'amerongen', 'veroorsaeckt', 'enge', 'bogaart', 'choromandel', 'chirrebon', 'delff', 'bijweg', 'landskroon', 'alsnoch', 'bevindingh', 'menichte', 'sint', 'beseth', 'vijants', 'ridderkerk', 'westerveld', 'schiedam', 'beekvliet', 'comptant', 'padmos', 'retourneerende', 'naart', 'geseijden', 'aencomende', 'nederlandt', 'aden', 'becommen', 'maccauw', 'byden', 'crab', 'cronenburg', 'aenkomste', 'vlucht', 'eylandt', 'haes', 'waerden', 'purmer', 'correcorren', 'tack', 'majt', 'alsnu', 'verseijlt', 'zuratta', 
'slach', 'scheepken', 'strijen', 'middelwout', 'verwachtende', 'zeepaard', 'brachten', 'beijeren', 'dregterland', 'geertruij', 'casar', 'javan', 'ceres', 'aencomen', 'hendricksz', 'tpatria', 'snachts', 'gevolght', 'ida', 'eijl', 'jonghst', 'triton', 'mandorijn', 'voorwaer', 'ome', 'corts', 'spaens', 'meerlust', 'dircksz', 'merwe', 'voorschooten', 'patanij', 'hogenes', 'borneo', 'eem', 'vertimmeren', 'vlote', 'nederhoven', 'rosenburg', 'jnlants', 'oranje', 'fregatten', 'ria', 'naerden', 'maldives', 'beveland', 'bodt', 'selffde', 'daechs', 'nass', 'negrij', 'purmerlust', 'uijr', 'overcomste', 'vaderlants', 'goas', 'ceulen', 'veroveren', 'jerusalem', 'defluijt', 'limburg', 'geseyde', 'hopvogel', 'rebecca', 'azia', 'chiampans', 'punct', 'veere', 'gaasperdam', 'uno', 'crap', 'lastdrager', 'sleewijk', 'verbrant', 'suijt', 'suijckeren', 'phenix', 'vicq', 'padtbrugge', 'lach', 'januario', 'aengelant', 'zuijderburg', 'medebrengende', 'geanckert', 'malcander', 'merckt', 'coomende', 'voorstaende', 'aes', 'maleijen', 'osacca', 'vreeland', 'leste', 'suijd', 'batavi', 'daghregister', 'partie', 'terdam', 'schoonderloo', 'cruijssen', 'maleije', 'calpentijn', 'vriesland', 'geus', 'renswoude', 'kerkwijk', 'westerdijxhorn', 'delfshaven', 'oostende', 'geroerde', 'vlagh', 'zeel', 'gissingh', 'geberchte', 'duijnenburg', 'sar', 'fluijtie', 'herstelder', 'duijven', 'schuijtwijk', 'thuys', 'aatchin', 'tcasteel', 'verongeluckte', 'vojagie', 'vrientschap', 'horstendaal', 'loenderveen', 'dordrecht', 'verseijlen', 'liggen', 'milde', 'madrast', 'aleppo', 'pantchiall', 'nieuwstad', 'barbara', 'meijenberg', 'dieshoek', 'carpentier', 'gevlucht', 'brack', 'weeck', 'schoonauwen', 'bombahia', 'macas', 'geertruijd', 'voorschoten', 'amadabath', 'mathijs', 'beeck', 'schellag', 'spoedichste', 'sparenrijk', 'graeff', 'jacatra', 'uijtgaen', 'voyagie', 'diemermeer', 'stieren', 'mourits', 'jorisz', 'gemaeckte', 'gestadich', 'aetchin', 'jonghste', 'schips', 'junius', 'crooswijk', 'andragirij', 'paauw', 
'bellesoor', 'berkenroode', 'mijdregt', 'robijn', 'naede', 'deventer', 'sielen', 'kroonenburg', 'mondt', 'gheijn', 'ziam', 'hulst', 'palembangh', 'ael', 'verwachte', 'vegt', 'midts', 'alrede', 'superintendent', 'horssen', 'pampus', 'laer', 'zuyd', 'aengebrachte', 'caeb', 'hittoe', 'raecken', 'daman', 'prattenburg', 'frederick', 'vlotter', 'lichte', 'gecombineerd', 'landts', 'doradus', 'jamby', 'vroech', 'putmans', 'bril', 'maetsuijcker', 'maccao', 'delfland', 'kronenburg', 'bengaele', 'makassar', 'salanghoor', 'proostwijk', 'rescontre', 'perde', 'rosairo', 'tydinge', 'gemerct', 'ijsselmonde', 'giroffel', 'kiefhoek', 'eend', 'portugiesen', 'joncquen', 'tijdelijk', 'velzen', 'duijvenvoorde', 'belois', 'vervolch', 'gerescontreert', 'lieffde', 'tulpenburg', 'wart', 'gracht', 'admiraals', 'poedecherij', 'rotter', 'scheijbeek', 'aengeb', 'lacca', 'quelangh', 'veldhoen', 'gesocht', 'arriv', 'alblasserdam', 'meijnde', 'stolle', 'maeckten', 'ginck', 'geluckigh', 'dinsdagh', 'vlieger', 'tijger', 'xula', 'haij', 'naegelen', 'haring', 'hoochste', 'brenght', 'veerman', 'swaen', 'swol', 'diana', 'bentvelt', 'laus', 'bogaert', ...]
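The call closer_than("eendracht", "tilburg") returns every word that lies closer to "eendracht" than "tilburg" does, which explains the length of the list above. The list also appears to follow the order of the vocabulary rather than being sorted by similarity, given that it starts with very common words. A pure-Python sketch of the idea on invented vectors (not Gensim's code; all names and numbers are made up):

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def closer_than(vectors, key1, key2):
    """Keys whose distance to key1 is smaller than the distance key1-key2."""
    reference = cosine_distance(vectors[key1], vectors[key2])
    return [
        key
        for key, vector in vectors.items()
        if key != key1 and cosine_distance(vectors[key1], vector) < reference
    ]


# Made-up 2-d vectors for illustration only.
toy = {
    "eendracht": [1.0, 0.0],
    "schip": [0.9, 0.2],
    "fluijt": [0.8, 0.4],
    "tilburg": [0.5, 0.5],
    "caneel": [0.0, 1.0],
}
print(closer_than(toy, "eendracht", "tilburg"))
# prints: ['schip', 'fluijt']
```

With the real model the result can run to thousands of words, so taking `len(...)` of the result or slicing it is often more practical than printing it in full.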
for i in w2v.most_similar("regen", topn=100):
    print(i[0], end=" | ")
reegen | regens | reegens | droogte | continueelen | regentijd | hitte | hette | afwatering | continuelen | felle | travaden | regentijt | heete | koude | opperwater | sonneschijn | weste | swaren | buijen | vloed | overvloedigen | regenbuijen | coortse | gestadigen | onstuijmig | doorbreken | regenen | afwateringe | weir | stilte | waijende | westelijke | pides | onweer | vlagen | warmte | tegenwinden | noorde | regenagtig | koortsen | aardbeving | ooste | winden | stormwinden | sneeuw | stroomen | oostelijke | veroorsaakte | dampen | opkomende | stormen | doorwaijende | valwinden | oostewinden | vloet | waaijende | koors | stiltens | winde | vaarbaar | lucht | doorwaijen | hete | stromen | getij | doorblasende | regenagtigh | tegenwind | fellen | geduurigen | delugt | swaaren | moussom | reegentijd | vuurberg | suijdweste | stilten | inbreeken | ongestadig | aanwakkerende | overkroptheijt | aartrijk | westelijcke | ongemeenen | defelle | kentering | broeijende | weerligt | continuele | afwateringen | swaeren | doorstaande | schrale | zuijde | begonde | verlopen | reegenen | noordweste | handsaam |
for i in w2v.most_similar("schipbreuk", topn=100):
    print(i[0], end=" | ")
machteloos | jammerlijk | honger | uijtgestaen | breuk | ongemak | schade | tempeest | onweer | accident | aardbevingh | ongeluk | orcaan | onveer | amsters | calamiteijt | koorts | woedend | crimiineelen | brandinge | gesucceld | schaade | ongemack | onweder | schaede | vreese | deerlyk | storm | uitgestaan | swaaren | ongeluck | bhuij | rampe | elende | besprongen | tanjepoer | ellendig | orcanen | vesemente | zeerampen | vrees | nootweer | stormen | geblasen | ongebal | geplaegt | gewopen | aardbeving | vesementen | presumeerden | uijtwendig | aardbevinge | jarigie | storme | hartseer | monding | deerlijck | vreeze | travade | gewaeijt | stooting | affront | eytmatrauw | stoting | ellende | mack | arrepo | deerlijk | bloedigen | naod | vrese | travaat | vruchte | uijtgestaane | louter | swaeren | smaadheden | travaet | pachtert | travaden | holgaende | smaet | dewijlen | flaauwten | aensegen | boegh | onweeder | belaglijk | nagejaagt | gaets | hongersnoot | hottentoosen | inflamatie | onderlek | losson | nederlage | rotterdame | tcelamse | verbolgen | jaagt |
for i in w2v.most_similar("pieter", topn=100):
    print(i[0], end=" | ")
gerrit | cornelis | paulus | ian | leendert | barent | jan | evert | andries | lourens | claas | marten | daniel | roeloff | anthonij | dirk | harmanus | theunis | lambert | joost | sijmon | roelof | albert | martinus | gillis | michiel | matthijs | jacob | maerten | govert | maarten | harmen | abraham | iacobus | johannes | volkert | carsten | barend | dirck | rijnier | huijbert | jacobus | hendrick | jasper | abram | egbert | jurriaan | christiaen | sijbrand | verhoef | siewert | arnoldus | laurens | samuel | anthoni | iacob | nicolaas | meijndert | marinus | lucas | coert | iurriaan | hermanus | gerbrant | bartholomeus | henderik | iohannes | isaak | jochem | christiaan | eldert | harman | amos | reijer | guilliam | ioost | gilles | david | antonij | hendrik | reijndert | corthals | hend | bartel | aarnout | arent | casper | joris | jurriaen | coenraet | johannis | adam | adriaen | noach | adriaan | poulus | warnar | anthony | wessel | iurriaen |