Arabic Language Scrapper / Datasets for LLMs

Pricing

Pay per event

This scraper is designed to collect and structure rich Arabic lexical data from authoritative Arabic dictionary sources. It extracts Arabic words along with their definitions, contextual meanings, related words, synonyms, and morphological derivations.

Developer

Amr Ashour

Maintained by Community

Last modified

3 months ago

โ“ What is Arabic Language Scrapper / Datasets for LLMs Scraper?

A web scraping and data processing pipeline that builds high-quality, structured Arabic lexical datasets from online Arabic dictionary sources. It extracts words, meanings, contextual usages, related terms, synonyms, and morphological root derivations, transforming traditionally unstructured dictionary content into machine-readable datasets suitable for LLM training, NLP research, search engines, and language applications.

📚 What data can I extract?

  • Word: The main Arabic word entry

  • Primary meanings: Dictionary definitions

  • Contextual meanings: Usage-based definitions and examples

  • Synonyms: Words with similar meaning

  • Related words: Associated or semantically linked terms

  • Morphological derivations: Root-based derivatives (e.g., كاتب, مكتوب from ك-ت-ب)

  • Root form: The triliteral or quadriliteral Arabic root

  • Source URL: Reference page for traceability

๐ŸŒ Why scrape Arabic Language?

Arabic is one of the most widely spoken languages globally, yet high-quality structured Arabic linguistic datasets remain limited. Most authoritative dictionary resources exist only as unstructured web content.

By scraping and structuring Arabic lexical data, this project enables:

  • Better Arabic understanding in LLMs

  • Higher quality Arabic NLP pipelines

  • Semantic search in Arabic applications

  • Knowledge graph creation

  • Educational and linguistic research tools

This project helps close the Arabic data gap in AI.

📥 Input

{
  "maxItems": 10
}
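
The same input can also be built and submitted from code. The following is a minimal sketch, not the author's documented workflow: it assumes `maxItems` caps the number of dictionary entries returned (the page does not state this explicitly) and uses the `apify-client` Python package. The helper names `build_run_input` and `run_actor`, the token, and the actor ID are placeholders introduced for illustration; check the client calls against the version you have installed.

```python
import json


def build_run_input(max_items: int = 10) -> dict:
    """Build the actor input; `maxItems` is the only field shown on this page."""
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 1:
        raise ValueError("max_items must be a positive integer")
    return {"maxItems": max_items}


def run_actor(token: str, actor_id: str, max_items: int = 10) -> list:
    """Start a run and collect its dataset items (needs `pip install apify-client`)."""
    # Imported lazily so build_run_input() works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=build_run_input(max_items))
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Serializes to the same object as the input shown above.
print(json.dumps(build_run_input(10)))

# Hypothetical call; both placeholders must be replaced with your own values:
# items = run_actor("<APIFY_TOKEN>", "<username>/<actor-name>", max_items=10)
```

Keeping input construction separate from the network call makes the input easy to validate and test without an Apify account.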

📤 Output

{
  "word": "الوسم",
  "link": "https://www.almaany.com/ar/dict/ar-ar/",
  "title": "تعريف و معنى الوسم في معجم المعاني الجامع - معجم عربي عربيتعريف و معنى الوسم في قاموس الكل. قاموس عربي عربي",
  "meanings_list": [
    {
      "title": "الوسم: (مصطلحات)",
      "description": [
        "بفتح فسكون جمع وسوم من وسم ( انظر: وسام ) ، أثر الكي بالميسم. والسمة: العلامة. (فقهية)"
      ]
    },
    {
      "title": "وَسَّمَ: (فعل)",
      "description": [
        "وسَّمَ يُوسِّم ، توسيمًا ، فهو مُوسِّم ، والمفعول مُوسَّم",
        "وسَّم فلانًا: أعطاه أو منحه وسامًا"
      ]
    }
  ],
  "contextual_examples": {
    "title": "أمثلة سياقية: الوسم، جمل ورد بها الوسم",
    "examples": [
      "وسمي النبالة َ بالملاحمِ تتسمْ …................. وسمي الصبابة َ بالعواطف تخلدُ (شعر الشاعر: أحمد شوقي )",
      "ولو وسمَ الناسُ الجباهَ بمدحهِ …................. إذاً لاستلذوا الوسمَ والوسمُ يؤلمُ (شعر الشاعر: ابن الرومي )",
      "وَهِيَ الْمَحامِدُ أَبْقَتْ خامِلاً أَبَداً …................. منْ لمْ تسمْ وسما ملكٌ بها وسما (شعر الشاعر:ابن حيوس )"
    ]
  },
  "similar_words": {
    "title": "كلمات ذات صلة",
    "words": [
      "اِتِّسام",
      "أَوْسِمة",
      "أوسَم",
      "تَوَسَّمَ",
      "توسيم"
    ]
  },
  "related_words": {
    "title": "كلمات قريبة",
    "words": [
      "الوسط الهندسي بين مقدارين",
      "الوسط الهندسي لطولين أو عددين",
      "الوسط من الشيء"
    ]
  },
  "word_derivative": {
    "title": "انظر معنى وَسْمٌ مشتقات و تحليل الْوَسْم",
    "derivative": "الْوَسْم : كلمة أصلها الاسم (وَسْمٌ) في صورة مفرد مذكر وجذرها (وسم) وجذعها (وسم) وتحليلها (ال + وسم)"
  }
}
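
In the sample record the root is not a separate field; it appears inside the `word_derivative.derivative` sentence. The sketch below shows one way a consumer might flatten a record into a row suitable for JSONL export and pull out the root and stem. It is an illustration only: `flatten_record` is a hypothetical helper, and the two regular expressions assume every derivative sentence labels the root with "وجذرها" and the stem with "وجذعها", which is inferred from this single sample.

```python
import json
import re

# Assumption from the one sample record: the root follows "وجذرها" and the
# stem follows "وجذعها", each wrapped in parentheses.
ROOT_RE = re.compile(r"وجذرها \(([^)]+)\)")
STEM_RE = re.compile(r"وجذعها \(([^)]+)\)")


def flatten_record(record: dict) -> dict:
    """Reduce one scraped record to a flat, JSONL-friendly row."""
    derivative = (record.get("word_derivative") or {}).get("derivative", "")
    root = ROOT_RE.search(derivative)
    stem = STEM_RE.search(derivative)
    # Collect every definition string across all senses in meanings_list.
    definitions = [
        text
        for sense in record.get("meanings_list") or []
        for text in sense.get("description", [])
    ]
    return {
        "word": record.get("word"),
        "root": root.group(1) if root else None,
        "stem": stem.group(1) if stem else None,
        "definitions": definitions,
        "examples": (record.get("contextual_examples") or {}).get("examples", []),
        "source": record.get("link"),
    }


# Trimmed copy of the sample record, for illustration only.
sample = {
    "word": "الوسم",
    "link": "https://www.almaany.com/ar/dict/ar-ar/",
    "meanings_list": [
        {"title": "وَسَّمَ: (فعل)", "description": ["وسَّم فلانًا: أعطاه أو منحه وسامًا"]}
    ],
    "word_derivative": {
        "derivative": "الْوَسْم : كلمة أصلها الاسم (وَسْمٌ) في صورة مفرد مذكر وجذرها (وسم) وجذعها (وسم) وتحليلها (ال + وسم)"
    },
}

row = flatten_record(sample)
# ensure_ascii=False keeps the Arabic text readable in the exported line.
jsonl_line = json.dumps(row, ensure_ascii=False)
print(len(row["definitions"]), "definition(s) extracted")
```

Missing sections fall back to empty values rather than raising, since not every dictionary entry is guaranteed to carry examples or a derivative analysis.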