Mubawab.ma Housing Scraper

Scrapes Moroccan real estate listings from mubawab.ma and outputs a structured dataset ready for ML model training (price prediction, classification).

Pricing: from $1.00 / 1,000 results

Rating: 0.0 (0 reviews)

Developer: LIAICHI MUSTAPHA (Maintained by Community)

Actor stats: 0 bookmarks · 12 total users · 4 monthly active users · last modified 15 days ago

The Moroccan Housing Dataset — an open-source Apify Actor that scrapes real estate listings from mubawab.ma and produces a flat, ML-ready dataset modelled after the classic California Housing dataset (Géron, Hands-On ML, Chapter 2).




What it does

Morocco's real estate market lacks structured, machine-readable public data. This actor closes that gap by crawling mubawab.ma — Morocco's largest property portal — and extracting every listing into a single CSV/JSON dataset suitable for:

  • 🏠 Price prediction models (regression)
  • 📍 Geo-spatial analysis by city and neighborhood
  • 📊 Market trend dashboards
  • 🤖 AI / LLM-powered property assistants

The scraper uses a two-phase Playwright crawl (search results → detail pages) and persists output through the Apify storage API so you can export CSV/JSON directly from the platform or via API with zero extra tooling.


Output dataset

Every scraped listing maps to one row with these fields:

| Field | Type | Description |
|---|---|---|
| priceDh | number \| null | Target variable — price in Moroccan Dirhams (MAD) |
| pricePerM2 | number \| null | Derived: price ÷ surface (MAD/m²) |
| surfaceM2 | number \| null | Living area in m² |
| numRooms | integer \| null | Bedrooms |
| numBathrooms | integer \| null | Bathrooms |
| floor | integer \| null | Floor (0 = ground floor / RDC) |
| propertyType | string \| null | appartement, villa, maison, riad, … |
| standing | string \| null | economique, moyen_standing, haut_standing |
| state | string \| null | neuf, bon_etat, a_renover, en_cours_de_construction |
| city | string \| null | Lowercase ASCII name, e.g. casablanca |
| neighborhood | string \| null | Sub-area within the city |
| transactionType | string \| null | vente or location |
| url | string | Direct link to the listing on mubawab.ma |
| title | string \| null | Raw listing title |
| scrapedAt | string | ISO-8601 scrape timestamp |

Sample record

{
  "priceDh": 1250000,
  "pricePerM2": 12500,
  "surfaceM2": 100,
  "numRooms": 3,
  "numBathrooms": 2,
  "floor": 3,
  "propertyType": "appartement",
  "standing": "moyen_standing",
  "state": "bon_etat",
  "city": "casablanca",
  "neighborhood": "maârif",
  "transactionType": "vente",
  "url": "https://www.mubawab.ma/fr/a/12345/appartement-a-vendre-casablanca",
  "title": "Appartement à vendre à Maârif, Casablanca",
  "scrapedAt": "2025-03-27T14:32:00.000Z"
}
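The pricePerM2 field is derived rather than scraped. A minimal sketch of how such a derivation could work, with the null-safety the `number | null` schema implies (the function name is ours, not from the actor's source):

```javascript
// Hypothetical sketch of the derived pricePerM2 field: price ÷ surface,
// rounded, with null propagation when either input is missing or invalid.
function derivePricePerM2(priceDh, surfaceM2) {
  if (typeof priceDh !== "number" || typeof surfaceM2 !== "number" || surfaceM2 <= 0) {
    return null; // matches the `number | null` schema above
  }
  return Math.round(priceDh / surfaceM2);
}

console.log(derivePricePerM2(1250000, 100)); // 12500, as in the sample record
console.log(derivePricePerM2(1250000, null)); // null
```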

Quick start

Option A — Run on Apify (no setup needed)

  1. Open the actor on the Apify Store
  2. Click Try for free
  3. Configure inputs in the visual form
  4. Click Start → export results as CSV or JSON once the run completes

Option B — Run locally

Prerequisites: Node.js 20+, Apify CLI

# 1. Install the CLI
npm install -g apify-cli
# 2. Clone this repo
git clone https://github.com/MuLIAICHI/Mubawab-Housing-Scraper.git
cd Mubawab-Housing-Scraper
# 3. Install dependencies
npm install
# 4. Quick test — 10 listings only
apify run --input='{"maxListings": 10, "transactionType": "vente"}'
# 5. Full run — all 9 cities, up to 5 000 listings
apify run

Results are saved locally under storage/datasets/mubawab-housing/.

Option C — Deploy to your Apify account

apify login # Enter your Apify API token
apify push # Build & upload the actor

Then run and schedule from console.apify.com.


Input configuration

Configure the actor via the Apify Console form or by passing a JSON input:

| Parameter | Type | Default | Description |
|---|---|---|---|
| transactionType | string | "vente" | "vente" · "location" · "both" |
| cities | string[] | (all 9 cities) | Filter to specific cities, e.g. ["casablanca", "rabat"] |
| propertyTypes | string[] | 4 main types | appartements · villas · maisons · riads · terrains · bureaux · commerces |
| maxListings | integer | 5000 | Hard cap on detail pages scraped (0 = unlimited) |
| maxConcurrency | integer | 5 | Parallel browser tabs (max 20) |
| startUrls | array | [] | Override seed URLs; leave empty for auto-generation |
| proxyConfiguration | object | Apify Residential | Proxy settings — residential proxy is strongly recommended |

Example input

{
  "transactionType": "vente",
  "cities": ["casablanca", "marrakech", "rabat"],
  "propertyTypes": ["appartements", "villas"],
  "maxListings": 1000,
  "maxConcurrency": 5,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Apify Console output

After a run completes, the Output tab in Apify Console shows four named links:

| Output | Description |
|---|---|
| Housing listings (Overview) | All scraped records in a table view (city, type, price, surface, rooms, URL) |
| ML-ready dataset | Same records restricted to the 12 ML feature columns — export this as CSV for model training |
| Run statistics | JSON with total listings, pages visited, null-rates per field, elapsed time |
| Debug HTML snapshots | HTML captured when a page could not be parsed — useful for debugging after site updates |

ML usage example (Python)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
# 1. Load dataset exported from Apify as CSV (ML Dataset view)
df = pd.read_csv("mubawab_dataset.csv")
# 2. Drop rows missing the target variable
df = df.dropna(subset=["priceDh", "surfaceM2"])
# 3. Encode categoricals
df = pd.get_dummies(df, columns=["propertyType", "standing", "state", "city", "transactionType"])
# 4. Feature engineering — Géron-style derived features
df["roomsPerM2"] = df["numRooms"] / df["surfaceM2"]
feature_cols = [c for c in df.columns if c not in ["priceDh", "pricePerM2", "neighborhood", "url", "title", "scrapedAt"]]
X = df[feature_cols].fillna(0)
y = df["priceDh"]
# 5. Train & evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):,.0f} MAD")

Architecture

.
├── .actor/
│   ├── actor.json                  ← Actor metadata + schema references
│   ├── input_schema.json           ← Typed input form for Apify Console
│   ├── output_schema.json          ← Output tab links (dataset + KV store)
│   ├── dataset_schema.json         ← Field definitions + two table views
│   └── key_value_store_schema.json ← KV store collections (stats / snapshots)
├── src/
│   ├── main.js                     ← Entry point: reads input, seeds URLs, starts crawler
│   ├── router.js                   ← Crawlee router with LISTING_PAGE + DETAIL_PAGE labels
│   ├── parsers/
│   │   ├── listingPage.js          ← Extracts listing URLs + next-page link from search results
│   │   └── detailPage.js           ← Extracts all 15 schema fields from a property detail page
│   └── utils/
│       └── normalize.js            ← Pure functions: parsePrice(), parseSurface(), normalizeCity()
├── Dockerfile                      ← Apify Playwright image (Node.js 20 + Chromium)
├── package.json
└── README.md
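To make the role of normalize.js concrete, here are illustrative sketches of the kind of pure helpers it exposes. The function names come from the tree above; the parsing rules shown are assumptions about mubawab.ma's text formatting, not the actual source:

```javascript
// "1 250 000 DH" -> 1250000; returns null when no digits are present
function parsePrice(text) {
  if (!text) return null;
  const digits = text.replace(/[^\d]/g, "");
  return digits ? Number(digits) : null;
}

// "100 m²" -> 100; first run of digits in the string
function parseSurface(text) {
  if (!text) return null;
  const match = text.match(/\d+/);
  return match ? Number(match[0]) : null;
}

// "Fès" -> "fes": lowercase ASCII, per the `city` field description
function normalizeCity(name) {
  if (!name) return null;
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining accents
    .toLowerCase()
    .trim();
}

console.log(parsePrice("1 250 000 DH")); // 1250000
console.log(normalizeCity("Fès")); // "fes"
```

Keeping these as pure functions makes them trivial to unit-test, which is one of the "good first issues" listed below in Contributing.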

Crawl flow

main.js ──builds seed URLs──► LISTING_PAGE handler
        ┌─────▼──────────────────────────┐
        │ Parse search result page       │
        │ Extract listing URLs           │
        │ Follow rel="next" pagination   │
        └─────┬──────────────────────────┘
              │ enqueue detail URLs
        ┌─────▼──────────────────────────┐
        │ DETAIL_PAGE handler            │
        │ detailPage.js extracts fields  │
        │ normalize.js cleans values     │
        │ Actor.pushData() → dataset     │
        └────────────────────────────────┘
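The LISTING_PAGE → DETAIL_PAGE hand-off above can be illustrated in miniature. The real handler uses Playwright selectors; this selector-free sketch just pulls detail-page URLs (the /fr/a/<id>/… pattern seen in the sample record) out of sample markup we invented for the example:

```javascript
// Simplified stand-in for the LISTING_PAGE step: collect unique
// detail-page URLs from a page's HTML. Real code uses Playwright locators.
function extractListingUrls(html) {
  const urls = html.match(/https:\/\/www\.mubawab\.ma\/fr\/a\/\d+\/[\w-]+/g);
  return urls ? [...new Set(urls)] : []; // de-duplicate, preserve order
}

// Hypothetical markup for illustration only
const sampleHtml = `
  <a href="https://www.mubawab.ma/fr/a/12345/appartement-a-vendre-casablanca">…</a>
  <a href="https://www.mubawab.ma/fr/a/67890/villa-a-vendre-rabat">…</a>`;

console.log(extractListingUrls(sampleHtml).length); // 2
```

Each extracted URL would then be enqueued with the DETAIL_PAGE label so the second handler can scrape the 15 schema fields.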

Key technical decisions

  • Playwright (not Cheerio) — mubawab.ma is JS-rendered; a headless browser is required
  • Multiple CSS selector fallbacks — the site uses different HTML structures for individual listings vs. project/ensemble listings
  • Polite delays — 500–800 ms between requests to avoid rate-limiting
  • Named dataset mubawab-housing — makes the output easy to find and retrieve via API
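The "polite delays" decision can be sketched as a small jitter helper; the helper names are ours, not from the actor's source:

```javascript
// Random delay in the 500-800 ms window mentioned above,
// so requests don't fire at a fixed, detectable cadence.
function politeDelayMs(minMs = 500, maxMs = 800) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Called between requests, e.g. at the top of each handler
async function politePause() {
  await sleep(politeDelayMs());
}
```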

Cities & property types covered

Cities (default): Casablanca · Marrakech · Rabat · Agadir · Tanger · Fès · Meknès · Oujda · Tétouan

Property types: Appartements · Villas · Maisons · Riads · Terrains · Bureaux · Commerces

Pass any subset via the cities and propertyTypes input fields.
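For illustration, seed-URL generation from these two inputs might look like the sketch below. The URL pattern is an assumption made for the example; the actual patterns live in src/main.js:

```javascript
// Hypothetical seed builder: one search URL per city x property-type pair.
// The path scheme shown here is illustrative, not mubawab.ma's real one.
function buildSeedUrls(cities, propertyTypes, transactionType) {
  const action = transactionType === "location" ? "louer" : "vendre";
  const urls = [];
  for (const city of cities) {
    for (const type of propertyTypes) {
      urls.push(`https://www.mubawab.ma/fr/sc/${city}/${type}-a-${action}`);
    }
  }
  return urls;
}

const seeds = buildSeedUrls(["casablanca", "rabat"], ["appartements"], "vente");
console.log(seeds.length); // 2, one per city x property-type pair
```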


Proxy recommendation

mubawab.ma blocks datacenter IPs. Using Apify Residential Proxy (the default) is strongly recommended for production runs. A free Apify account includes a proxy trial.

Without a proxy, you will encounter CAPTCHAs and 403 errors.


Contributing

Contributions are welcome! Here is how to get started:

  1. Fork this repository
  2. Create a feature branch: git checkout -b feat/your-feature
  3. Make your changes and run a quick local test:
    $ apify run --input='{"maxListings": 5}'
  4. Open a Pull Request with a clear description of what changed and why

Good first issues

  • Add support for additional Moroccan cities (agadir, beni-mellal, laayoune…)
  • Improve null-rate for standing and state fields on project listings
  • Add listing_id extraction from the URL slug
  • Write unit tests for normalize.js (Jest or Vitest)

Please open an issue before starting large changes.


License

© 2025 Mustapha LIAICHI. See the LICENSE file for details.


Built with Crawlee · Playwright · Apify SDK