generate_maxdiff_data

pymc_marketing.customer_choice.synthetic_data.generate_maxdiff_data(n_respondents=200, n_items=20, n_tasks_per_resp=12, subset_size=4, true_utilities=None, sigma_respondent=0.6, item_correlation=None, items=None, random_seed=None)

Generate synthetic MaxDiff (best-worst scaling) data.

Simulates a MaxDiff survey in which each respondent completes n_tasks_per_resp tasks, each showing a random subset of subset_size items drawn uniformly from the full pool of n_items. The respondent picks the best and worst items from the subset according to the Louviere sequential best-worst model.
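The choice step for a single task can be sketched as follows. This is a minimal illustration, assuming the common sequential formulation in which the best item is drawn with softmax probabilities over the shown utilities and the worst item is then drawn from the remaining items with softmax probabilities over the negated utilities. The helper name simulate_task is hypothetical and this is not the library's internal code.

```python
import numpy as np


def simulate_task(utilities, subset_size, rng):
    """Simulate one best-worst task (hypothetical helper, not library code)."""
    n_items = len(utilities)
    # Draw the shown items uniformly without replacement.
    shown = rng.choice(n_items, size=subset_size, replace=False)
    u = utilities[shown]

    # Best pick: softmax over the shown utilities.
    p_best = np.exp(u - u.max())
    p_best /= p_best.sum()
    best_pos = rng.choice(subset_size, p=p_best)

    # Worst pick: softmax over the negated utilities of the remaining items.
    remaining = np.delete(np.arange(subset_size), best_pos)
    neg_u = -u[remaining]
    p_worst = np.exp(neg_u - neg_u.max())
    p_worst /= p_worst.sum()
    worst_pos = rng.choice(remaining, p=p_worst)

    return shown, shown[best_pos], shown[worst_pos]


rng = np.random.default_rng(0)
utilities = rng.normal(size=20)  # stand-in for one respondent's utilities
shown, best, worst = simulate_task(utilities, subset_size=4, rng=rng)
```

Because the worst item is drawn only from what remains after the best pick, the two choices in a task can never coincide.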

Parameters:
n_respondents : int, default 200

Number of respondents.

n_items : int, default 20

Full item pool size.

n_tasks_per_resp : int, default 12

Tasks shown per respondent.

subset_size : int, default 4

Items shown per task (must be <= n_items).

true_utilities : np.ndarray, optional

Ground-truth item utilities of length n_items. If None, drawn from Normal(0, 1). The last item’s utility is shifted to 0 to match the default identification constraint.

sigma_respondent : float, default 0.6

Scale of per-respondent item-level deviations (standard deviation). Set to 0 for a homogeneous-preferences population.

item_correlation : np.ndarray, optional

Shape (n_items, n_items) correlation matrix for the per-respondent utility deviations. Must be symmetric, positive semi-definite, with ones on the diagonal. When supplied, respondent deviations are drawn from MVNormal(0, diag(σ) @ item_correlation @ diag(σ)); otherwise deviations are drawn independently (diagonal covariance). Use this to generate correlated ground truth for validating MaxDiffMixedLogit(full_covariance=True) recovery.
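A valid item_correlation matrix and the resulting covariance can be sketched with NumPy alone. The example below is illustrative only: it assumes every item shares the same standard deviation (taken here as 0.6, the default sigma_respondent) and uses a constant off-diagonal correlation of 0.5; it does not call the library.

```python
import numpy as np

n_items = 4
sigma = np.full(n_items, 0.6)  # assumed: equal standard deviation per item

# Illustrative correlation matrix with constant off-diagonal correlation.
rho = 0.5
item_correlation = np.full((n_items, n_items), rho)
np.fill_diagonal(item_correlation, 1.0)

# Check the documented requirements: symmetric, unit diagonal, PSD.
assert np.allclose(item_correlation, item_correlation.T)
assert np.allclose(np.diag(item_correlation), 1.0)
assert np.linalg.eigvalsh(item_correlation).min() >= -1e-10

# Covariance = diag(sigma) @ R @ diag(sigma).
cov = np.diag(sigma) @ item_correlation @ np.diag(sigma)

# Draw correlated deviations and confirm the correlation is reproduced.
rng = np.random.default_rng(0)
deviations = rng.multivariate_normal(np.zeros(n_items), cov, size=5000)
empirical_corr = np.corrcoef(deviations, rowvar=False)
```

Running the same three checks on a hand-built matrix before passing it as item_correlation is a cheap way to catch an invalid input early.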

items : list[str], optional

Item names (length n_items). Defaults to ["item_0", ...].

random_seed : np.random.Generator or int, optional

Random state for reproducibility.

Returns:
task_df : pd.DataFrame

Long-format data with columns respondent_id, task_id, item_id, is_best, is_worst. One row per shown item per task.

ground_truth : dict

Dictionary with keys "utilities", "respondent_utilities", "sigma_respondent", "item_correlation", and "items". utilities is the population-level ground truth (reference item at 0); respondent_utilities holds the per-respondent values used for simulation; item_correlation is the (n_items, n_items) correlation matrix used, which is np.eye(n_items) when item_correlation was not supplied.

Notes

Subsets are drawn uniformly without replacement. Real MaxDiff studies typically use balanced incomplete block designs (BIBD) for efficiency; this generator trades that balance for simplicity and is adequate for parameter-recovery testing.
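The effect of uniform subset sampling on item exposure can be seen in a small NumPy sketch. It is a stand-in for the generator's sampling step under the default sizes, not the library's own code, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_tasks, subset_size = 20, 12, 4

# One respondent's tasks: each row is a subset drawn without replacement.
subsets = np.stack(
    [rng.choice(n_items, size=subset_size, replace=False) for _ in range(n_tasks)]
)

# How often each item was shown; a balanced design would equalize these counts.
appearances = np.bincount(subsets.ravel(), minlength=n_items)
```

With 12 tasks of 4 items each, a respondent sees 48 item slots spread over 20 items, so per-respondent exposure counts generally differ across items, which is the imbalance a BIBD is designed to remove.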

To verify that MaxDiffMixedLogit(full_covariance=True) recovers the latent correlation structure, generate data with a non-identity item_correlation and compare the posterior mean of corr_matrix against ground_truth["item_correlation"].