GenAI-HITL Development and Validity Review of Reading-Writing Constructed-Response Tasks

Park Goun

Korea National University of Education

Korea Business Review, Vol. 60, No. 4, pp. 129-174 (2025)

DOI: 10.20880/kler.2025.60.4.129

Abstract

This study developed constructed-response tasks aligned with the 2022 revised Korean Language Arts (“Reading and Writing”) achievement standards using a Generative AI-Human-in-the-Loop (HITL) approach and reported preliminary validity evidence from expert-criterion ratings. A three-stage protocol integrating context engineering and Chain-of-Thought generated a linked package of texts, prompts, rubrics, and explanations. Eighteen in-service Korean language teachers from 13 regions rated the outputs on 12 items across three domains. The overall mean was high (M = 4.32), with the strongest ratings for standard alignment and structural coherence; internal consistency and inter-rater agreement were acceptable. Learner-level appropriateness was lower, indicating limits in capturing non-formal factors (developmental stage, classroom context) despite effective operationalisation of formalised curriculum elements. The findings suggest AI outputs can serve as teacher-adjustable drafts, and that human-AI collaboration may strengthen the balance between formal and substantive validity.

Keywords

automatic item generation; generative AI; human-AI collaboration; constructed-response assessment; Reading and Writing; 2022 revised curriculum
