Research Article

GenAI-HITL Development and Validity Review of Reading-Writing Constructed-Response Tasks

Park Goun

Korea National University of Education

Published: January 2025 · Vol. 60, No. 4 · pp. 129-174

DOI: https://doi.org/10.20880/kler.2025.60.4.129

Abstract

This study developed constructed-response tasks aligned with the 2022 revised Korean Language Arts (“Reading and Writing”) achievement standards using a Generative AI-Human-in-the-Loop (HITL) approach and reported preliminary validity evidence from expert-criterion ratings. A three-stage protocol integrating context engineering and Chain-of-Thought generated a linked package of texts, prompts, rubrics, and explanations. Eighteen in-service Korean language teachers from 13 regions rated the outputs on 12 items across three domains. The overall mean was high (M = 4.32), with the strongest ratings for standard alignment and structural coherence; internal consistency and inter-rater agreement were acceptable. Learner-level appropriateness was lower, indicating limits in capturing non-formal factors (developmental stage, classroom context) despite effective operationalisation of formalised curriculum elements. The findings suggest AI outputs can serve as teacher-adjustable drafts, and that human-AI collaboration may strengthen the balance between formal and substantive validity.
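
The abstract does not say which statistics were computed. As a minimal sketch, assuming internal consistency is summarised with Cronbach's alpha over a raters-by-items matrix, the overall mean and alpha could be obtained as below. The function name `cronbach_alpha` and the rating values are hypothetical, invented only for illustration; they are not the study's data or procedure.

```python
from statistics import mean, variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a matrix with one row per rater and one column per item."""
    k = len(ratings[0])  # number of items
    # Sample variance of each item (column) across raters.
    item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Sample variance of each rater's total score.
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point ratings (4 raters x 3 items), NOT the study's data.
ratings = [
    [5, 4, 4],
    [4, 4, 3],
    [5, 5, 4],
    [3, 3, 2],
]

overall_mean = mean(score for row in ratings for score in row)
alpha = cronbach_alpha(ratings)
print(f"overall mean = {overall_mean:.2f}, alpha = {alpha:.2f}")
```

With the study's design, the matrix would have 18 rows (teachers) and 12 columns (rating items); inter-rater agreement would need a separate statistic, which this sketch does not cover.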

Keywords: automatic item generation; generative AI; human-AI collaboration; constructed-response assessment; Reading and Writing; 2022 revised curriculum

Publication History

Published 2025-01-01
