Six Human-Centered Artificial Intelligence Grand Challenges

Authors
Özlem Özmen Garibay, Brent Winslow, Salvatore Andolina, Margherita Antona, Anja Bodenschatz, Constantinos K. Coursaris, Gregory Falco, Stephen M. Fiore, Iván Garibay, Keri Grieman, John C. Havens, Marina Jirotka, Hernisa Kacorri, Waldemar Karwowski, Joseph T. Kider, Joseph A. Konstan, Sean Koon, Mónica López-González, Iliana Maifeld-Carucci, Sean McGregor, Gavriel Salvendy, Ben Shneiderman, Constantine Stephanidis, Christina Strobel, Carolyn Ten Holter, Wei Xu
Journal
International Journal of Human-Computer Interaction
Year
2023
Citations
419

TL;DR

This paper identifies six grand challenges for developing AI that is ethical, fair, and enhances human well-being. It reports no experimental data, effect sizes, or testable interventions: it is a consensus-based position paper, not a study you can directly replicate in a self-experiment.

What they tested

This is not an experimental study. The authors convened 26 experts from academia, industry, and government to reach consensus on the most pressing challenges for human-centered artificial intelligence (HCAI). The closest thing to an "intervention" is the expert elicitation process itself, and the "outcome" is a list of six grand challenges with associated research directions. There are no comparators, control conditions, or quantitative outcome measures.

Who was studied

26 experts in human-centered AI, drawn from:

Academia (universities in the US, Europe, and Asia)

Industry (technology companies, consulting firms)

Government (research agencies, policy bodies)

No demographic details (age, gender, years of experience, specific disciplines) are reported. The paper does not describe how experts were selected, whether they were representative of the broader HCAI community, or whether any experts declined participation.

How they measured it

No instruments or scales were used. The methodology is a qualitative consensus-building process:

Experts participated in a series of workshops and discussions

The authors synthesized the discussions into six grand challenges

No formal voting, ranking, or quantitative prioritization is reported

No inter-rater reliability, agreement metrics, or validation procedures are reported

The paper does not measure any human outcomes, behaviors, or system performance. It is a framework paper, not an empirical study.

Methodology

**Study design:** This is a consensus-based position paper, sometimes called a "grand challenges" paper or a "vision" paper. It is not a systematic review, meta-analysis, randomized controlled trial, or any form of empirical research.

**Process:** The authors report that the six challenges emerged from "an international collaboration across academia, industry and government" and represent "the consensus views of a group of 26 experts." No details are provided about:

How experts were recruited or selected

Whether the process was structured (e.g., Delphi method, nominal group technique) or unstructured (open discussion)

How disagreements were resolved

Whether the experts were blinded to each other's views

How many rounds of discussion occurred

The duration of the process (weeks? months?)

Whether any literature review or evidence synthesis preceded the discussions

**What this design can prove:** This design can produce a list of topics that a specific group of experts considers important. It can identify areas where experts agree there is a need for more research. It can generate hypotheses and research agendas.

**What this design cannot prove:**

That these six challenges are the most important ones (no systematic prioritization)

That these challenges are empirically grounded (no data collection or analysis)

That any specific intervention or approach works (no experiments)

That the expert group's views represent the broader scientific community (no sampling frame)

That the challenges are feasible to address (no implementation testing)

**Major methodological weaknesses:**

No systematic literature review to ground the challenges in existing evidence

No structured consensus methodology (e.g., Delphi) with documented rounds and agreement thresholds

No reporting of dissenting views or minority opinions

No preregistration of the process

No quantitative prioritization (e.g., ranking, rating, or voting)

No conflict of interest disclosures for individual experts

No description of how the authors' own biases may have shaped the synthesis
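For readers unfamiliar with what "documented rounds and agreement thresholds" would involve, here is a minimal sketch of one rating round in a Delphi-style process. It does not reconstruct anything the authors did: the item names, the ratings, the 1–5 importance scale, and the 75% agreement threshold are all invented for illustration.

```python
from statistics import median

def consensus_report(ratings, agree_at=4, threshold=0.75):
    """Summarize one rating round.

    ratings   -- dict mapping an item name to a list of expert ratings (1-5)
    agree_at  -- a rating at or above this value counts as "important"
    threshold -- share of experts required to call it consensus (assumed value)
    """
    report = {}
    for item, scores in ratings.items():
        share = sum(1 for s in scores if s >= agree_at) / len(scores)
        report[item] = {
            "median": median(scores),
            "agreement": share,
            "consensus": share >= threshold,
        }
    return report

# Hypothetical ratings from eight experts; not data from the paper.
example_ratings = {
    "challenge_a": [5, 4, 4, 5, 3, 4, 5, 4],
    "challenge_b": [2, 3, 5, 4, 1, 3, 4, 2],
}
round_one = consensus_report(example_ratings)
for item, summary in round_one.items():
    print(item, summary)
```

Reporting a table like this for each round would let readers see which items reached agreement, which did not, and how far opinions diverged.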

Key findings

The paper presents six grand challenges, each with associated research directions. No quantitative results, effect sizes, confidence intervals, or p-values are reported because no data were collected.

**Grand Challenge 1: AI should be centered in human well-being**

Research directions: Define and measure well-being in AI contexts; develop AI that promotes flourishing rather than just avoiding harm; study long-term impacts of AI on life satisfaction, autonomy, and social connection.

**Grand Challenge 2: AI should be designed responsibly**

Research directions: Create frameworks for responsible AI development; address unintended consequences like algorithmic bias, job displacement, and environmental impact; develop methods for value alignment.

**Grand Challenge 3: AI should respect privacy**

Research directions: Develop privacy-preserving AI techniques (e.g., federated learning, differential privacy); create transparent data governance models; study trade-offs between privacy and AI performance.

**Grand Challenge 4: AI should follow human-centered design principles**

Research directions: Involve end-users throughout the AI lifecycle; design for transparency, interpretability, and explainability; ensure AI systems are accessible and inclusive.

**Grand Challenge 5: AI should be subject to appropriate governance and oversight**

Research directions: Develop regulatory frameworks that balance innovation with protection; create auditing and certification mechanisms; establish accountability for AI outcomes.

**Grand Challenge 6: AI should interact with individuals while respecting human cognitive capacities**

Research directions: Design AI that augments rather than replaces human cognition; avoid cognitive overload, automation bias, and deskilling; study human-AI teaming and trust.

Effect magnitude

Not applicable. No quantitative effects are reported. The paper does not test any intervention or measure any outcome. The "effect" is a conceptual framework — the six challenges themselves — and there is no way to quantify their importance, feasibility, or impact from this paper alone.

Limitations

**What the authors acknowledge:**

The challenges are based on expert consensus, not empirical data

The list is not exhaustive; other challenges may exist

The research directions are high-level and need operationalization

Implementation will require interdisciplinary collaboration

**What a critical reader would note:**

**No evidence base:** The paper does not cite any systematic review or meta-analysis to support the claim that these are the most pressing challenges. It is entirely opinion-based.

**Selection bias:** No criteria are given for how the 26 experts were selected. The group may overrepresent certain perspectives (e.g., Western, academic, male) and underrepresent others (e.g., Global South, industry practitioners, affected communities).

**No dissent reported:** Consensus processes typically document areas of disagreement. This paper presents the challenges as if all 26 experts agreed on every point, which is unlikely.

**No prioritization:** All six challenges are presented as equally important. There is no ranking, no weighting, and no guidance on which to tackle first.

**No actionable specificity:** The research directions are broad (e.g., "develop methods for value alignment") and do not provide testable hypotheses, measurable outcomes, or concrete interventions.

**No timeline or feasibility assessment:** The paper does not estimate how long it might take to address these challenges, what resources would be required, or what success would look like.

**Potential for self-fulfilling prophecy:** By declaring these "grand challenges," the paper may shape research funding and priorities without evidence that these are the right problems to solve.

**No conflict of interest disclosures:** Individual experts' affiliations are listed, but no conflicts of interest are declared. Some experts may have financial interests in certain AI approaches or technologies.

Practical takeaways

**Important caveat:** This paper does not report any experimental data, so it cannot directly inform a self-experiment. The practical takeaways below are derived from the research directions the paper proposes, but they are speculative and not evidence-based.

For someone running their own n=1 experiment related to human-centered AI:

**What to test (specific intervention and dose):**

Test whether using an AI tool that provides explanations for its recommendations (vs. a black-box AI) changes your trust, decision quality, or cognitive load.

Test whether using a privacy-preserving AI (e.g., on-device processing) vs. cloud-based AI affects your willingness to use the tool or your perceived privacy risk.

Test whether taking regular "AI-free" breaks (e.g., 1 hour per day) affects your well-being, productivity, or sense of autonomy.

**Minimum meaningful duration:**

For trust and cognitive load: 2–4 weeks per condition, with at least 5–10 interactions per day.

For well-being and autonomy: 4–8 weeks per condition, as these outcomes change slowly.

For privacy perceptions: 1–2 weeks per condition, as attitudes can shift quickly with experience.

**What to measure (specific metrics):**

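**Trust:** Use a trust-in-automation questionnaire and express the score on a 0–100 scale (higher = more trust); if the instrument uses Likert items, rescale the mean to 0–100. Measure daily after using the AI.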

**Cognitive load:** Use the NASA Task Load Index (NASA-TLX, 0–100, higher = more load). Measure after each AI interaction.

**Decision quality:** Track accuracy or speed of decisions made with vs. without AI assistance. Use a simple task like classifying images or answering trivia.

**Well-being:** Use the Positive and Negative Affect Schedule (PANAS, 10–50 per subscale, higher = more affect). Measure weekly.

**Privacy concern:** Use the Internet Users' Information Privacy Concerns scale (IUIPC, 1–7, higher = more concern). Measure at baseline and end of each condition.

**Autonomy:** Use the Basic Psychological Need Satisfaction scale (autonomy subscale, 1–7, higher = more autonomy). Measure weekly.
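The metrics above use different ranges, so it helps to score them the same way every time. The sketch below shows three scoring helpers under stated assumptions: the unweighted "raw" NASA-TLX (the mean of the six subscale ratings, rather than the pairwise-weighted version), a sum score for a block of Likert items such as one PANAS subscale, and a linear rescaling of a Likert mean to 0–100. The function names and example values are hypothetical.

```python
TLX_SUBSCALES = (
    "mental", "physical", "temporal", "performance", "effort", "frustration",
)

def raw_tlx(ratings):
    """Unweighted mean of the six NASA-TLX subscale ratings (each 0-100).

    The performance subscale is anchored so that higher = worse
    performance, so a higher mean always means more workload.
    """
    missing = [name for name in TLX_SUBSCALES if name not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    values = [ratings[name] for name in TLX_SUBSCALES]
    if any(not 0 <= v <= 100 for v in values):
        raise ValueError("each rating must be between 0 and 100")
    return sum(values) / len(values)

def subscale_sum(responses, low=1, high=5):
    """Sum of item responses, e.g. ten 1-5 items give a 10-50 score."""
    if any(not low <= r <= high for r in responses):
        raise ValueError(f"each response must be between {low} and {high}")
    return sum(responses)

def rescale_likert(score, low=1, high=7):
    """Map a Likert mean on [low, high] linearly onto 0-100."""
    return 100 * (score - low) / (high - low)

# Hypothetical entries from a single day.
workload = raw_tlx({
    "mental": 70, "physical": 10, "temporal": 55,
    "performance": 30, "effort": 60, "frustration": 45,
})
positive_affect = subscale_sum([4, 3, 5, 4, 3, 4, 2, 5, 4, 3])
trust_0_100 = rescale_likert(5.5)
print(workload, positive_affect, trust_0_100)  # 45.0 37 75.0
```

Whichever scoring rules you choose, fix them before data collection and do not change them between conditions.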

**Key confounds to control for:**

**Order effects:** Randomize the order of conditions (e.g., explainable AI first vs. black-box first) or use a crossover design with a washout period of at least 3 days.

**Task difficulty:** Keep the task type and difficulty constant across conditions. If the AI helps with different tasks in each condition, the conditions are not comparable.

**Time of day:** Conduct AI interactions at the same time each day to control for circadian effects on cognition and mood.

**Prior experience:** If you have used similar AI tools before, note your baseline familiarity. Consider a 1-week familiarization period before data collection.

**Expectation effects:** You may expect the explainable AI to be better. Use a blinding procedure where possible (e.g., don't reveal which condition you're in until after the experiment).

**External stressors:** Track major life events (work deadlines, illness, travel) that could affect well-being and confound results.
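To make the order-effects control concrete, the following sketch generates a day-by-day plan for a two-condition crossover with a randomized order and a washout block between conditions. The condition labels, the 14-day condition length, the 3-day washout, the start date, and the `build_schedule` helper are assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta

def build_schedule(conditions, days_per_condition, washout_days, start, seed=None):
    """Return a list of (day, label) pairs for a randomized crossover.

    The condition order is shuffled once. A washout block is placed
    between consecutive conditions, not before the first or after the last.
    """
    rng = random.Random(seed)
    order = list(conditions)
    rng.shuffle(order)

    schedule = []
    day = start
    for i, condition in enumerate(order):
        if i > 0:
            for _ in range(washout_days):
                schedule.append((day, "washout"))
                day += timedelta(days=1)
        for _ in range(days_per_condition):
            schedule.append((day, condition))
            day += timedelta(days=1)
    return schedule

# Hypothetical plan: explainable vs. black-box AI, 2 weeks each.
plan = build_schedule(
    conditions=["explainable", "black_box"],
    days_per_condition=14,
    washout_days=3,
    start=date(2025, 1, 6),
    seed=42,
)
for day, label in plan[:3]:
    print(day.isoformat(), label)
print("total days:", len(plan))
```

With these assumed values the plan spans 31 days (14 + 3 + 14). Generating the schedule before the experiment starts, and then following it, keeps you from choosing the order based on how you feel that week.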

**What a positive result would look like:**

**For explainable AI:** A 10–15 point increase in trust scores (on the 0–100 scale) and a 5–10 point decrease in NASA-TLX scores compared to black-box AI, with consistent effects across at least 3 weeks.

**For privacy-preserving AI:** A 1–2 point increase in willingness to use the tool (on a 1–7 scale) and a 0.5–1 point decrease in privacy concern scores compared to cloud-based AI.

**For AI-free breaks:** A 5–10 point increase in positive affect and a 3–5 point decrease in negative affect on PANAS, plus a 0.5–1 point increase in autonomy scores, compared to weeks without breaks.

**General rule:** Look for effects that are consistent across at least 70% of measurement days and exceed your baseline variability. Calculate your standard deviation during a 1-week baseline period; a positive result should differ from the baseline mean by at least 1 standard deviation in the expected direction (higher for trust, lower for cognitive load or privacy concern).
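The general rule can be checked mechanically. The sketch below is one possible reading of it, not a validated statistical test: it expresses the shift in the condition mean in units of the baseline standard deviation, and counts the share of condition days that beat the baseline mean. The `higher_is_better` flag is an added assumption so that outcomes where a drop is the hoped-for result (such as cognitive load) are handled too; the scores are invented.

```python
from statistics import mean, stdev

def evaluate(baseline, condition, higher_is_better=True, min_share=0.70):
    """Apply the 1-SD and 70%-of-days rules to daily scores."""
    base_mean = mean(baseline)
    base_sd = stdev(baseline)  # sample SD; needs at least 2 baseline days
    if base_sd == 0:
        raise ValueError("baseline has no variability; collect more data")
    sign = 1 if higher_is_better else -1

    # Shift of the condition mean in baseline SDs, in the hoped-for direction.
    shift_sd = sign * (mean(condition) - base_mean) / base_sd

    # Share of condition days that beat the baseline mean in that direction.
    good_days = sum(1 for x in condition if sign * (x - base_mean) > 0)
    share = good_days / len(condition)

    return {
        "shift_sd": shift_sd,
        "share_of_days": share,
        "positive": shift_sd >= 1.0 and share >= min_share,
    }

# Hypothetical daily trust scores (0-100): one baseline week, ten condition days.
baseline_trust = [55, 60, 58, 62, 57, 59, 61]
condition_trust = [66, 70, 64, 58, 69, 71, 67, 65, 72, 68]
result = evaluate(baseline_trust, condition_trust)
print(result)
```

A one-week baseline gives only seven points, so the standard deviation estimate is itself noisy; treat a borderline result as inconclusive rather than positive.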

**What this paper cannot tell you:**

Whether any of these interventions actually work (no data)

What dose or duration is optimal (no experiments)

What the mechanism of action might be (no theory testing)

Whether effects generalize to other AI tools, tasks, or populations (no replication)

**Bottom line:** Use this paper as inspiration for hypotheses to test in your own experiments, but do not treat its recommendations as evidence-based. The six grand challenges are expert opinions, not scientific findings. If you want to run a self-experiment on human-centered AI, start with a specific, testable intervention (like explainability or privacy-preserving design) and measure concrete outcomes (trust, cognitive load, well-being) over at least 2–4 weeks per condition. Control for order effects, task difficulty, and expectation bias. And remember: your n=1 results will not prove or disprove the grand challenges — they will only tell you what works for you.
