under construction 🚧
what?
- retrieve learnt words from the duolingo website
- format them into a csv
- use ollama and local llms to categorise the words
- output anki-friendly flashcards
- optional: automatically update the anki app with these?
1. duolingo website
- https://www.duolingo.com/practice-hub/words
- save cookies locally to be able to log in automatically
- main tools for web-scraping:
- beautiful soup
- selenium
- output:
- .csv with list of words and translations
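a minimal sketch of the "save cookies locally" idea — the file name is an assumption, and the cookie list is whatever Selenium's driver.get_cookies() returns (a list of dicts):

```python
import json
from pathlib import Path

COOKIE_FILE = Path("duolingo_cookies.json")  # assumed file name

def save_cookies(cookies, path=COOKIE_FILE):
    """Persist the list of cookie dicts (e.g. driver.get_cookies()) as JSON."""
    path.write_text(json.dumps(cookies))

def load_cookies(path=COOKIE_FILE):
    """Return previously saved cookies, or [] on first run.

    With Selenium, re-add them via driver.add_cookie(c) after opening
    the duolingo domain, then reload the page to be logged in.
    """
    return json.loads(path.read_text()) if path.exists() else []
```

on the first run you'd log in manually once and call save_cookies(driver.get_cookies()); later runs restore the session without re-entering credentials.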
done refreshing… let's pause for 5 secs.
let's go to the words page:
https://www.duolingo.com/practice-hub/words
...
New words learnt count: 3034 words
Congrats you have learnt 1175 new words!
... i.e. might need to load the page 23 times to get all of them :)

,sp,en
0,cansadísimo,really tired
1,encantaron,(?) did you love
2,grandísimo,"really big, very big, really large"

2. data processing
goal: clean and enhance the data with more properties, such as gender, type, customised properties etc.
i. data pre-processing
- clean and format data → consistent format
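a sketch of what "consistent format" could mean here — the exact rules (trim whitespace, drop empty rows, drop case-insensitive duplicates) are assumptions:

```python
def clean_words(rows):
    """Normalise scraped (word, translation) pairs.

    Assumed rules: strip whitespace, drop rows with an empty word, and
    keep only the first occurrence of each word (case-insensitive).
    """
    seen = set()
    cleaned = []
    for word, translation in rows:
        word, translation = word.strip(), translation.strip()
        if not word or word.lower() in seen:
            continue
        seen.add(word.lower())
        cleaned.append((word, translation))
    return cleaned
```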
ii. external database
- if available, can be used as a primary step to merge with
- limitations:
- different formats
- reliability
- incompleteness
iii. “internal” processing
- can be the main processing step if good enough, or as a complementary step if ii. external database is better
2 methods:
either semi-manual
- e.g. copy-paste into an online chatbot with some specific prompt engineering
- observation: the online chatbots (chatgpt, deepseek) are more effective than the local models…
… or automated
- spacy: open-source library for NLP tasks (POS = Part-of-Speech) → simpler if you just need the grammatical category
- observation: not super accurate, not convincing.
import spacy

# assumes the spanish model is installed: python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

data = []
for word in words:
    doc = nlp(word)
    token = doc[0] if len(doc) > 0 else None
    pos = token.pos_ if token else "UNKNOWN"
    # Try to extract gender from morphological features (Morph)
    gender = token.morph.get("Gender") if token else None
    gender_val = gender[0].lower() if gender else "none"
    data.append({
        "Word": word,
        "Gender": gender_val,
        "Type": pos.upper()
    })
- ollama: platform that enables llms to run locally (e.g. Mistral) → more "overkill"
- simple prompt i.e. zero-shot prompting
- with examples i.e. few-shot prompting
- observation: not perfectly accurate or consistent.
import ollama  # assumes the ollama python client and a running local ollama server

results = []
for word in words:
    # prompt is deliberately in spanish: "classify the word or phrase '…':
    # is it a verb, noun, adjective, adverb, preposition, conjunction,
    # pronoun, determiner, interjection, or numeral? answer only with the type."
    prompt = (
        f"Clasifica la palabra o frase '{word}' en español: "
        "¿es un verbo, sustantivo, adjetivo, adverbio, preposición, "
        "conjunción, pronombre, determinante, interjección, o numeral? "
        "Responde solo con el tipo."
    )
    response = ollama.chat(
        model='mistral',
        messages=[{'role': 'user', 'content': prompt}]
    )
    pos = response['message']['content'].strip()
    results.append({'Word': word, 'Type': pos})
- final observation: unfortunately, after testing, neither is truly satisfying… the online chatbots are more effective.
iv. final anki formatting
- decide which properties to include:
- simple: word, translation
- more detailed: word, translation, gender, type
,sp,en,gender,type,
0,prestar,"gave, pay, pays",none,verb,
1,hockey,hockey,masc,noun,
2,navegar,"surf, sailing, sail",none,verb,
3,estampilla,stamp,fem,noun,
4,velero,"sailboat, sail",masc,noun,
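one way to finish the pipeline with no extra libraries: anki can import plain tab-separated text, so the csv above converts directly. the column names (sp, en, gender, type) are taken from the sample; the card layout (front = word, back = translation plus gender/type in brackets) is an assumption:

```python
import csv

def csv_to_anki_tsv(csv_path, tsv_path):
    """Convert the word CSV into a tab-separated file Anki can import.

    Front = the Spanish word, back = translation, with gender/type
    appended in brackets when present (values of "none" are skipped).
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    with open(tsv_path, "w", encoding="utf-8") as f:
        for r in rows:
            back = r["en"]
            extra = ", ".join(
                x for x in (r.get("gender"), r.get("type")) if x and x != "none"
            )
            if extra:
                back += f" ({extra})"
            f.write(f"{r['sp']}\t{back}\n")
```

for the "automatically update anki" idea, two options could be worth a look: the genanki library builds .apkg deck files directly, and the AnkiConnect add-on exposes a local HTTP API to push notes into a running anki instance.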