essay scoring online

Free Online Paper Grader Calculator: Rate Your Essay In Seconds

Get Expert Help

StudyCrumb is a globally trusted company delivering academic writing assistance. Backed by qualified writers, we provide unique academic papers tailored to clients' specific needs.

Take your writing to a whole new level with our editing and proofreading services. Our academic proofreaders will fine-tune your essay and make it impeccable.

Why Choose StudyCrumb

All you have to do is type in or paste your text below this instruction and click "Check text" to get all the results. Click on the highlighted spelling errors or grammar improvements.

Grade My Essay for Free, StudyCrumb

Reasons to use our grade my paper calculator

I can grade my paper free online! Yes! It finally happened! (We hope you don’t mind us continuing our monologue from the perspective of a student.) But why exactly would I use this particular website that grades papers?

Three-in-one solution: I can finally check what grade my writing is going to get, evaluate the readability of my essay, and check my paper for plagiarism at the same time! It's like a readability checker, plagiarism checker, and writing checker in one paper grading tool.
No registration: Finally, a decent service that does not require your passport details and the names of your pets to operate. At least somewhere my private life stays private.
I pay nothing: No fees or hidden payments – my money stays in my pocket. With the measly sum I have each month for expenses, I can buy more food. God bless those altruists for helping students of the world!
Easy to use: My computer can’t handle another app, and there is no space on the hard drive. All that hassle and fuss are long gone. Now I can check my texts everywhere, without downloading anything. One second and I know everything I need to know.

Use Student Essay Scorer Online to Improve Your Writing

Now you have found a perfect grader tool for free essay scoring. After you write something, just insert your final version into the box on our website that grades your essay. Based on the feedback and smart suggestions you receive, you will be able to fix typos and spelling errors and increase the readability and quality of your work. Writing was never an easy thing to master, so any help is welcome, especially if that help comes at the right time and provides the right amount of information. This exact balance is what makes our tool so great. It does not overpower you with red markers and warning signs. In a casual, friendly tone, it says “here are some of the mistakes I’ve noticed. Would you like to solve them?” Finally, you can stop looking for other ways to “score my essay”.

Haven't started writing? Delegate your "do my essay" task to StudyCrumb and get supreme academic service.

Rate My Paper Free: Grade Any Type of Academic Writing

“Help me rate my writing! Please, rate my paper grammar!” That’s how one morning began for us a few years back. An email from a student, depicting the unjust reality of college academic writing. We saw it as an opportunity to help, so the development of a proficient online content checker began. After a number of sleepless hours connecting AI to machine learning, it was done. Finally, a beautiful unicorn. The one and only, friendlier one among paper raters. Now, our software is capable of proofreading, plagiarism and grammar checking, and formatting every type of written document there is. Any level of difficulty, fully automatic rating, available 24/7. Here are some short descriptions of our most popular grading tasks.

Automatic Essay Grading

Free essay review online is completely automatic now! No more need to press those prehistoric buttons; everything happens in the background. It happens so fast that results will appear on the screen faster than you can say “review my essay free please”. Advanced information technologies and algorithms are always ready to serve you. Essay evaluator online is free, easy to use, and yields fantastic results. It will show your weaknesses and offer smart suggestions on how to improve your writing. Isn’t that what every student wants? A clear and unobtrusive experience. A modern product to satisfy somewhat redundant needs and fit annoying requirements.

There is only one case when you won't need a paper grader. Academic works delivered by our college paper writing service are so great that you won't need any essay rater.

Online Research Paper Grader

Access this research paper rater free online and get your article professionally assessed in the blink of an eye! No more “where can I grade my paper free” questions – you have the website, you know what to do. Do it! Don’t even try submitting your article without checking it. No commission will allow you to fix your mistakes after the submission. And what if the plagiarism percentage is too high? Trust us, you don’t want all that. Do you want a clean entry with high scores? Then use our free college paper grader to improve your texts right now! In case you haven't written your project, try our research paper services. This way you will get a high-quality paper that meets all requirements.

Thesis Grader

“I am a happy student now, my favorite thesis rater can now rate my thesis!” Those are the words we expect to hear from you soon. Your thesis deadline is getting closer, and we hope you have started working on it already. If you have not, don’t wait too long and hire an experienced thesis writer. Time is running out, as always. After you type the last letter, take some time to evaluate your thesis. Check for mistakes and spelling errors, and assess plagiarism and readability. Fortunately, you now know just the right place to do it – StudyCrumb! Check it, improve it, and get your A+!

Who Can Use Our Essay Rater to Grade Papers

Who do you think uses our essay tester? Aliens? No! Average people, just like you. There are plenty of people who need their texts checked and corrected. Since it’s hard to find a part of modern life or a profession where writing of some sort is not involved, just about everyone uses it. Parents use it as a school paper grader to help their kids. Teachers and professors use it as a college essay grader. No modern educational institution can live without essay or paper rating. However, it is necessary to discuss specifics, get to those details, and look in every nook and cranny. Let’s have a glimpse at three main categories of our users.

Online Paper Grader for Students

Grading college papers is a pain for every student out there. But writing those papers is even worse. You have to come up with an idea, turn the idea into words, words into sentences, and so on. And even after you’re done, you have one more step – grading the paper. You can skip it, but then how would you know your weaknesses? Please use our grading papers calculator to check your essays so you can always get the best marks and stay on top!

Free Essay Grading Software for Teachers

Almost every teacher has a lot of essays to check, so a free essay grader for teachers can change the game! No need to check them manually; just copy and paste a student's text into our website and get an instant score. A paper grader for teachers can become the main way of evaluating students. Also, a teacher can specify which service students should use so everyone will be on the same page when it comes to essay or paper quality.

Online Paper Rater for Writers

Writers rarely need to rate an essay, and free paper graders are not their usual choice. Writers need a powerful instrument that can evaluate on a far more complex level and provide deep insights, and the tool should account for that. However, we managed to tune our tool just about right so writers could use it for their needs without being slapped in the face with the truth. Now, thousands of writers check their texts here and improve them with our help.

Tired of writing your own essays?

How to use our free essay revisor online

Do you see this big area in the middle? Type in or copy-paste your text into the box. Check whether your text meets size requirements.

Online essay revision free is done automatically in the background. After evaluation, results and grades will appear on the screen.

Evaluate your mistakes, correct them, and improve your writing skills! Feel free to edit your essay right in the input window.

Why Choose Our Free Online Essay Grader Tool?

Feel free to score essays online without breaking the bank. It gets even better – do all that and much more without spending a single penny.

Beautifully crafted design of our automatic essay grader free online is a true feast for the eyes. A refined and intuitive interface is miles ahead of the competition.

Grading papers online has never been so fast. Blazing speeds with the professional quality of assessment. Enjoy the best of both worlds right after the input.

Big Brother is surely watching, but will never know that you grade your essay before sending it. No data is stored on servers or sold.

Features of Free Paper Grader for Students

Grammar grader is one of the core functions of our tool. Without correct spelling and sentence construction, even the smartest text will look boring. Smart algorithms and advanced Artificial Intelligence see tiny little mistakes in words and sentences.

Improve originality of your work by checking it in our essay revisor free of charge! Enormous databases and the latest advancements in machine learning can find even the slightest resemblances between essays. So, pay attention to what you’re copying.

Can our online essay scorer free people from boring texts? Yes, it can. Investigate your essays even further with readability scoring. Try keeping your text on point at all times. Brevity is the soul of wit, as they say.

Complete analysis and assessment are available in the final online essay review. Just glance at it and get precise and in-depth information about your writing skills. With some time and effort, you will definitely get better!

FAQ About Automatic Paper Grader

1. Is your essay grader free?

We are proud to be a fully free website that grades essays. We strongly believe that every student must have the ability to grade and rate their essays before sending them. Our tools also serve another purpose – improving writing quality among teachers and scholars of universities and colleges.

2. Who can revise my paper for free?

Our paper grader for free will do it! Instead of employing editors and writers, we gave this job to intelligent machines. The quality is better, more tasks can be done simultaneously, and we manage to keep our tool absolutely and utterly free! Looking forward to working with you!

3. Do I need to register to grade my writing?

Fortunately, no registration is needed for online paper grader free. Your personal information stays personal. We don’t care who you are. All we care about is providing the highest quality proofreading and text rating at zero price. Just paste your essay and get instant results!

4. How to make my paper better?

After you get feedback from paper grading software, look at your weak spots. Determine main problems and try fixing them one at a time. To fix grammar, pay more attention to what you are reading online. For fixing plagiarism – rewrite your text or use our rewriter tool. You got the gist?

Use our ultrapowerful, fully free paper rater to accurately grade your essay before submitting it. Get deep and extensive feedback for perfecting your written assignments.

One of the most popular searches among students is “grade my essay free”. It is not hard to understand them. Colleges all over the world are now loading their students with absurd amounts of essays. Dozens of research papers per academic year, writing all day long. And with all that pressure, students are forced to maintain good grades. Essay topics never change, but original thoughts are still expected from students. How is that fair? Let’s imagine a situation. I am a college student. Each day I wake up at 6 am and start writing. Finally, three hours of hard labor bear fruit – an essay. I can’t submit it straight away. I have to rate my essay online so I can fix all problems and resolve all issues. Only after I grade my college essay on a trusted website can I send it to my professor and be sure of getting a good grade.

Entrust your task to StudyCrumb and get a paper tailored to your needs.

Now you can revise essay online free without registration or spending money. Follow these three simple steps.

Finding a good free essay grader online is a real pain for each student. Some services provide miserably small feedback. Others are too detailed and overloaded. During the development of our tool, we did our best to eliminate all mistakes of our competitors. Here are four reasons to choose our tool.

Just like Pokémon, paper grader online free services have their own unique features. Choosing the right one can significantly increase your writing efficiency and skills. If you want your papers and essays to be amazing, you have to select our writing rater. Here are some of its features to back us up on this:

Some of you probably have some questions left regarding automated essay scoring online. Please, check these answers below:

Essay Grader Solution

Responsive learning environments typically involve frequent formative assessments in order to gauge how well students are absorbing classroom instruction. In order to handle the corresponding high volume of paper grading, many teachers rely on test grader apps that can expedite the scoring process.

However, in English, history, and other humanities classes that tend to have more essay assignments, oral presentations, and project-based work, teacher-graded rubrics are a more effective approach to evaluating performance and comprehension. That appears to rule out standard bubble form graders for teachers in those subject areas, leaving them without any grading assistance at all.

Scoring with rubrics

Rubrics establish a guide for evaluating the quality of student work. Whether scoring an essay or research paper, a live performance or art project, or other student-constructed responses, rubrics clearly delineate the various components of the assignment to be graded and the degree of success achieved within each of those areas.

These expectations are communicated to the student at the beginning of the assignment and then scored accordingly by the teacher upon its completion. The dilemma that arises is how to simplify and speed up that grading process when score determination must be done directly by the teacher.
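As a rough illustration of the idea that a rubric breaks an assignment into components and records the degree of success in each, here is a minimal Python sketch. The `Criterion` class, the `score_rubric` function, the criterion names, and the point values are all hypothetical and chosen only for this example; they do not describe any particular grading product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One graded component of an assignment (hypothetical structure)."""
    name: str
    max_points: int


def score_rubric(criteria, awarded):
    """Return (earned, possible, percent) for teacher-awarded points."""
    earned = 0
    for criterion in criteria:
        points = awarded[criterion.name]  # KeyError if a component was skipped
        if not 0 <= points <= criterion.max_points:
            raise ValueError(
                f"{criterion.name}: {points} is outside 0..{criterion.max_points}"
            )
        earned += points
    possible = sum(c.max_points for c in criteria)
    return earned, possible, round(100 * earned / possible, 1)


# Illustrative rubric and scores, invented for this example.
essay_rubric = [
    Criterion("Thesis", 4),
    Criterion("Evidence", 4),
    Criterion("Organization", 4),
    Criterion("Mechanics", 4),
]
teacher_scores = {"Thesis": 3, "Evidence": 4, "Organization": 2, "Mechanics": 3}
result = score_rubric(essay_rubric, teacher_scores)
print(result)
```

The teacher still decides every score; the code only validates the entries and adds them up, which is the part of the process that can be automated.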

ASSESSMENT MADE EASY

Because GradeCam was the brainchild of experienced teachers, creating a solution for handling time-consuming rubric assignments was a priority. Obviously, there is a certain amount of teacher time required to score these assignments that simply can’t be avoided, but there is also a way to streamline this process and save time on the backend.

Rather than using student-completed answer forms like with regular tests, GradeCam allows teachers to create teacher-completed rubric forms that can be quickly and easily filled in using “The Bingo Method” and then scanned and recorded automatically. This speeds up the assignment and transfer of grades, as well as the data generation necessary to review and respond to areas of concern.

Easy Grader Highlights:

Try Gradient Teacher Premium free for 60 days.

Find the solution that’s right for you.

Revolutionize Your Writing Process with Smodin AI Grader: A Smarter Way to get feedback and achieve academic excellence!

For Students

Stay ahead of the curve, with objective feedback and tools to improve your writing.

Your Virtual Tutor

Harness the expertise of a real-time virtual teacher who will guide every paragraph in your writing process, ensuring you produce an A+ masterpiece in a fraction of the time.

Unbiased Evaluation

Ensure an impartial and objective assessment, removing any potential bias or subjectivity that may be an influence in traditional grading methods.

Perfect your assignments

With the “Write with AI” tool, transform your ideas into words with a few simple clicks. Excel at all your essays, assignments, and reports, and witness your writing skills soar to new heights.

For teachers

Revolutionize your Teaching Methods

Spend less on grading

Embrace the power of efficiency and instant feedback with our cutting-edge tool, designed to save you time while providing a fair and unbiased evaluation, delivering consistent and objective feedback.

Reach out to more students

Upload documents in bulk and establish your custom assessment criteria, ensuring a tailored evaluation process. Expand your reach and impact by engaging with more students.

Focus on what you love

Let AI Grading handle the heavy lifting of assessments for you. With its data-driven algorithms and standardized criteria, it takes care of all your grading tasks, freeing up your valuable time to do what you're passionate about: teaching.

Grader Rubrics

Pick the systematic frameworks that work as guidelines for assessing and evaluating the quality, proficiency, and alignment of your work, allowing for consistent and objective grading without any bias.

Analytical Thinking

Originality

Organization

Focus Point

Write with AI

Set your tone and keywords, and generate brilliance through your words

AI Grader Average Deviation from Real Grade

Our AI grader matches human scores 82% of the time*

AI Scores are 100% consistent**

Deviation from real grade (10 point scale)

Graph: A dataset of essays was graded by professional graders on a scale of 1-10 and cross-referenced against the detailed criteria within the rubric to determine their real scores. Deviation was defined as the variation of a score from the real score. The graph contains an overall score (the average of all criteria) as well as each individual criterion. The criteria are the premade criteria available on Smodin's AI Grader, listed in the graph as column headings. The custom rubrics were made using Smodin's AI Grader custom criteria generator to produce each criterion listed in Smodin's premade criteria (the same criteria as the column headings). The overall score for Smodin Premade Rubrics matched human scores 73% of the time with our advanced AI, while custom rubrics generated by Smodin's custom rubric generator matched human grades 82% of the time with our advanced AI. The average deviation from the real scores for all criteria is shown above.

* Rubrics created using Smodin's AI custom criteria matched human scores 82% of the time on the advanced AI setting. Smodin's premade criteria matched human scores 73% of the time. When the AI score differed from the human scores, 86% of the time the score only differed by 1 point on a 10 point scale.

** The AI grader provides 100% consistency, meaning that the same essay will produce the same score every time it's graded. All grades used in the data were repeated 3 times and produced 100% consistency across all 3 grading attempts.
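The kinds of figures quoted in these notes (how often two graders match exactly, how many of the mismatches are off by only one point, and the average deviation) can be computed from paired scores. The Python sketch below shows one plausible way to do so. The function name `agreement_stats` and the sample scores are invented for illustration; this is not Smodin's evaluation code or data.

```python
from statistics import mean


def agreement_stats(human, ai):
    """Compare two equal-length lists of integer scores on a 10-point scale."""
    if len(human) != len(ai) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [abs(h - a) for h, a in zip(human, ai)]
    mismatches = [d for d in diffs if d != 0]
    return {
        # Share of essays where both graders gave the same score.
        "exact_match_rate": diffs.count(0) / len(diffs),
        # Among the essays that differ, share that differ by exactly 1 point.
        "off_by_one_share_of_mismatches": (
            sum(1 for d in mismatches if d == 1) / len(mismatches)
            if mismatches else 0.0
        ),
        # Average absolute difference across all essays.
        "mean_abs_deviation": mean(diffs),
    }


# Made-up scores for illustration only.
human_scores = [7, 8, 5, 9, 6, 7, 4, 8, 6, 7]
ai_scores = [7, 8, 6, 9, 6, 5, 4, 8, 6, 7]
stats = agreement_stats(human_scores, ai_scores)
print(stats)
```

With these sample lists, eight of ten scores match, and one of the two mismatches is off by a single point. Consistency across repeated grading runs would be a separate check: score the same essay several times and confirm the results are identical.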

AI Feedback

Unleash the Power of Personalized Feedback: Elevate Your Writing with the Ultimate Web-based Feedback Tool

Elevate your essay writing skills with Smodin AI Grader, the ultimate AI-powered essay grader tool, and achieve the success you deserve. Whether you are a student looking to improve your grades or a teacher looking to provide valuable feedback to your students, Smodin has got you covered. Get objective feedback to improve your essays and excel at writing like never before! Don't miss this opportunity to transform your essay-writing journey and unlock your full potential.

Smodin AI Grader: The Best AI Essay Grader for Writing Improvement

As a teacher or as a student, working with essays can be a daunting task. It takes time, effort, and a lot of attention to detail. But what if there was a tool that could make the process easier? Meet Smodin AI Grader, the best AI essay grader on the market, which provides objective feedback and helps you improve your writing skills.

Objective Feedback with Smodin - The Best AI Essay Grader

Traditional grading methods can often be subjective, with different teachers providing vastly different grades for the same piece of writing. Smodin eliminates this problem by providing consistent and unbiased feedback, ensuring that all students are evaluated fairly. With advanced algorithms, Smodin can analyze and grade essays in real-time, providing instant feedback on strengths and weaknesses.

Improve Your Writing Skills with Smodin - The Best AI Essay Grader

Smodin can analyze essays quickly and accurately, providing detailed feedback on different aspects of your writing, including structure, grammar, vocabulary, and coherence. It identifies areas that need improvement and provides suggestions on how to make your writing more effective. For example, if Smodin detects that your essay has a weak thesis statement, it will provide suggestions on how to improve it. If it detects that your essay has poor grammar, it will provide suggestions on how to correct the errors. This makes it easier for you to improve your essay, get better grades, and become a better writer.

Smodin AI Grader for Teachers - The Best Essay Analysis Tool

For teachers, Smodin can be a valuable tool for grading essays quickly and efficiently, providing detailed feedback to students, and helping them improve their writing skills. With Smodin AI Grader, teachers can grade essays in real-time, identify common errors, and provide suggestions on how to correct them.

Smodin AI Grader for Students - The Best Essay Analysis Tool

For students, Smodin can be a valuable tool for improving your writing skills and getting better grades. By analyzing your essay's strengths and weaknesses, Smodin can help you identify areas that need improvement and provide suggestions on how to make your writing more effective. This can be especially useful for students who are struggling with essay writing and need extra help and guidance.

Increase your productivity - The Best AI Essay Grader

Using Smodin can save you a lot of time and effort. Instead of spending hours grading essays manually or struggling to improve your writing without feedback, you can use Smodin to get instant and objective feedback, allowing you to focus on other important tasks.

Smodin is the best AI essay grader on the market that uses advanced algorithms to provide objective feedback and help improve writing skills. With its ability to analyze essays quickly and accurately, Smodin can help students and teachers alike to achieve better results in essay writing.

Automated Essay Scoring

26 papers with code • 1 benchmark • 1 dataset

Essay scoring: Automated Essay Scoring is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by the following four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.

Source: A Joint Model for Multimodal Document Quality Assessment

Most implemented papers

Automated Essay Scoring based on Two-Stage Learning

Current state-of-the-art feature-engineered and end-to-end Automated Essay Scoring (AES) methods are proven to be unable to detect adversarial samples, e.g., essays composed of permuted sentences and prompt-irrelevant essays.

A Neural Approach to Automated Essay Scoring

nusnlp/nea • EMNLP 2016

SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring

Our new method proposes a new SkipFlow mechanism that models relationships between snapshots of the hidden representations of a long short-term memory (LSTM) network as it reads.

Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input

Youmna-H/Coherence_AES • NAACL 2018

We demonstrate that current state-of-the-art approaches to Automated Essay Scoring (AES) are not well-suited to capturing adversarially crafted input of grammatical but incoherent sequences of sentences.

Co-Attention Based Neural Network for Source-Dependent Essay Scoring

This paper presents an investigation of using a co-attention based neural network for source-dependent essay scoring.

Language models and Automated Essay Scoring

In this paper, we present a new comparative study on automatic essay scoring (AES).

Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems

midas-research/calling-out-bluff • 14 Jul 2020

This number is increasing further due to COVID-19 and the associated automation of education and testing.

Prompt Agnostic Essay Scorer: A Domain Generalization Approach to Cross-prompt Automated Essay Scoring

Cross-prompt automated essay scoring (AES) requires the system to use non target-prompt essays to award scores to a target-prompt essay.

Many Hands Make Light Work: Using Essay Traits to Automatically Score Essays

To find out which traits work best for different types of essays, we conduct ablation tests for each of the essay traits.

EXPATS: A Toolkit for Explainable Automated Text Scoring

octanove/expats • 7 Apr 2021

Automated text scoring (ATS) tasks, such as automated essay scoring and readability assessment, are important educational applications of natural language processing.

IELTS Podcast

Pass IELTS with expert help.

Get Your IELTS Essay Checked For Free!

Everything you need to know about how it works, why you need it – and how it helped our student Sunny to improve her writing score.

Our free IELTS essay checker will give you the personalized writing help you need to improve your writing score.

Don’t waste time with fake ‘VIP’ courses or expensive ‘experts’. The IELTS Writing test is tough, and you want to make sure you do everything you can to improve your score.

There are a lot of different things that go into writing a good IELTS essay, and it can be difficult to keep track of everything.

Our free IELTS essay checker will help you identify the areas where you need improvement for task 1 and task 2 so that you can score higher on your next exam.

Click here to check out our free IELTS Essay checker

What is an IELTS writing checker?

Sunny was, like a lot of IELTS students, worrying about improving her score – and she learned that the smartest students use the right tools.

An IELTS writing checker is a tool or service that assesses and evaluates the quality of an IELTS writing task.

The checker evaluates the writing skills of the test taker, including the clarity, coherence, organization, and accuracy of the written text.

An IELTS writing checker may provide feedback on various aspects of the essay, including grammar, vocabulary, spelling, punctuation, and sentence structure. Additionally, they may give suggestions on how to improve the essay and achieve a higher score.

Why should you use an IELTS writing checker?

Sunny had previously tried to use a well-known online grammar checker which had helped her writing sound better – but she needed an IELTS-specific tool.

An IELTS writing checker can be a helpful tool for anyone who is preparing to take the IELTS exam. Here are a few reasons why you might want to use an IELTS writing checker:

To get feedback on your writing: This is especially helpful if you are studying for the exam on your own and don’t have a teacher or tutor to give you feedback.
To improve your score: By identifying the specific areas that you need to work on in order to improve your score on the writing section of the exam.
To practice writing under timed conditions: The IELTS writing checker can simulate the conditions of the actual exam by giving you a prompt and a time limit to complete your essay. You should complete Task 1 in 20 minutes and write at least 150 words. Task 2 should take 40 minutes, with at least 250 words. This can help you get used to writing under pressure and develop your time management skills.
To learn from your mistakes: The IELTS writing checker can highlight the mistakes you make in your writing, such as grammar errors, spelling mistakes, or problems with sentence structure. By learning from your mistakes, you can avoid making the same errors in the future.
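
The word minimums and time limits mentioned in this list are easy to check on your own practice essays. Here is a minimal Python sketch, assuming that splitting on whitespace is a close enough approximation of how words are counted; the official counting rules may differ, and the names `TASKS` and `check_length` are hypothetical.

```python
# Minimum word counts and suggested times for the two writing tasks,
# as stated in the list above.
TASKS = {
    1: {"min_words": 150, "minutes": 20},
    2: {"min_words": 250, "minutes": 40},
}


def check_length(essay: str, task: int) -> dict:
    """Count whitespace-separated words and report any shortfall."""
    if task not in TASKS:
        raise ValueError("task must be 1 or 2")
    words = len(essay.split())  # assumption: simple whitespace word count
    required = TASKS[task]["min_words"]
    return {
        "words": words,
        "required": required,
        "shortfall": max(0, required - words),
        "suggested_minutes": TASKS[task]["minutes"],
    }


# A 120-word placeholder text stands in for a real practice essay.
report = check_length("word " * 120, task=1)
print(report)
```

A shortfall of zero means the draft meets the minimum length. It says nothing about quality, which is what the feedback from a checker or a teacher is for.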

How to use the IELTS writing checker effectively?

Familiarize yourself with the marking criteria: Before you start using the IELTS writing checker, it’s important to understand the criteria that the examiners use to mark your writing. This will help you understand what you need to focus on to improve your score.
Practice writing regularly: To get the most out of the IELTS writing checker, it’s important to practice writing regularly. This will help you improve your writing skills and give you more opportunities to use the checker.
Analyze your mistakes: When the writing checker highlights your mistakes, take the time to analyze them and understand why you made them. This will help you avoid making the same mistakes in the future.
Use the feedback to improve your writing: The IELTS writing checker provides feedback on your writing, so use it to your advantage. Take note of the areas where you need to improve and make the necessary changes to your writing.
Work on your time management: During the IELTS exam, time management is crucial. To prepare for this, try to complete your writing tasks within the allotted time and use the writing checker to check your work quickly.
Don’t rely on the IELTS writing checker entirely: While the writing checker is a useful tool, it’s important to remember that it’s not perfect. Use it as a guide, but don’t rely on it entirely. Always use your own judgement and common sense when it comes to your writing.

What are some common mistakes made in IELTS writing?

There are several common mistakes that candidates make in IELTS writing. Here are a few:

Not addressing the task properly: One of the most common mistakes that candidates make is not addressing the task properly. They may write a well-organized and grammatically correct essay, but if it does not answer the question asked in the prompt, they will not get a good score. For example, if the prompt asks ‘What are the advantages and disadvantages of owning a car?’ and you only write about the advantages of having a car, you cannot score high on task achievement.
Poor grammar and spelling: Another common mistake is poor grammar and spelling errors. Candidates should aim to write in grammatically correct sentences and avoid spelling mistakes. These errors can significantly impact the overall score.
Lack of coherence and cohesion: Candidates should ensure that their writing is coherent and cohesive. The essay should have a logical flow, and ideas should be linked together using appropriate transition words and phrases.
Inappropriate word choice: Using inappropriate words or vocabulary can also result in a lower score. Candidates should aim to use a range of vocabulary, but it should be used appropriately in context.
Not meeting the word count: Candidates must meet the word count requirements for each task. Writing too few or too many words can result in a lower score.
Not organizing the essay properly: Candidates should aim to organize their essay into clear paragraphs with a clear introduction, body, and conclusion.
Plagiarism: Plagiarism is a serious offence and can lead to disqualification from courses and exams. If you plagiarise practice IELTS essays, it’s hard to know what you are capable of writing by yourself anyway.

How can the IELTS writing checker help you improve your score?

The IELTS writing check can be a valuable tool in helping you improve your writing skills and ultimately, your IELTS score. Here are a few ways in which the IELTS writing check can assist you:

Feedback on your strengths and weaknesses: The IELTS writing check provides you with personalized feedback on your writing, including an assessment of your strengths and weaknesses. This feedback can help you identify the areas you need to work on to improve your score.

For example, Sunny hadn’t realised that she had a habit of writing ‘however’ in the middle of paragraphs and misspelling ‘because’, ‘in conclusion’ and ‘instead’. Now, she is much more aware of these mistakes.

Identification of common mistakes: The IELTS writing check can also help you identify common mistakes that you may be making in your writing, such as grammar, vocabulary, or sentence structure errors.
Practice opportunities: The IELTS writing check provides you with practice opportunities to work on your writing skills. By submitting practice essays for review, you can receive feedback on your writing and work on improving your weaknesses.
Familiarization with the IELTS exam format: The IELTS writing check can help you become more familiar with the IELTS exam format and requirements. This can help you feel more confident and prepared for the actual exam.
Customized study plan: Based on the feedback provided by the IELTS writing check, you can develop a customized study plan to focus on the areas that need improvement. This can help you maximize your study time and improve your score more efficiently.

Use our free IELTS essay checker to improve your writing score

Taking the IELTS test can be daunting, but with help from our free essay checker, you can bring your score up.

By identifying common mistakes and offering personalized advice on how to correct them, our tool will help you sharpen your writing skills so that you are ready for the real thing.

Whether you’re preparing for a general or academic IELTS exam, using our free essay check tool is an essential part of taking the test and improving your overall international education experience.

Boost Your Scores: Try Our Online IELTS Essay Checker Today

Writing good essays is a big part of the IELTS test. To get a good score, your essay must be strong. But how can you know? That’s where our tool can help. It’s named the online IELTS essay checker, and you can try it here.

This tool is like a friendly teacher. You show it your essay, and it tells you what is good and what needs to be better. This way, you learn fast. The best thing? It saves you money.

Many students spend a lot of money on classes or buying books. But our tool is less costly. And it helps you right away. After you learn from our free essay checks on this page, use our online IELTS essay checker. It can help you know how to make your writing stronger.

In short, our online IELTS essay checker is here to guide you. It’s simple, it helps fast, and it won’t take much money. If you want to do well in the IELTS and not spend a lot, our tool is a great choice. Good luck and happy writing!

Frequently Asked Questions (FAQ)

How does an IELTS essay check work?

Once you upload an essay to the essay checker, the grammar checker scans your text and highlights IELTS essay issues within your document so you can see them in context.

Your feedback will include detailed explanations so you can understand why the text was flagged. Other highlighted areas will include examples of how the issues can be fixed.

Is an essay checker worth it?

Yes. You can get instant feedback without having to wait for a teacher to mark your essay. Sign up for more IELTS Materials here.

GR 10 Use a variety of complex and simple sentences

GR 11 Check your essay for errors.

Check your IELTS essay online

Improve your IELTS writing score within two weeks.

40,285 students have used our tool to improve their band scores without paying for expensive tutoring. The service checks your IELTS essay in seconds.

The best new way to check your essay.

Why you will love it

Get a band score online
Find ideas quickly
Improve IELTS writing grammar
Money back guarantee

Achieving my dream score seemed impossible until I found this site. The detailed feedback on essays was crucial.

Gill Avneet Kaur

IELTS Writing Result: 7.5

What can it edit?

Writing9 scans your text for all types of mistakes, from typos to sentence structure problems and beyond.

Perfect evaluation

Hundreds of algorithms will assess your writing according to 4 evaluation criteria. Writing9 helps you find the weak points of your essay and make it flawless.

Helpful hints

After you write your essay, you will get helpful tips showing you how to make it better, so you can work toward a band score above 7.

Topic Ideas & Vocabulary Boost

Get ideas and useful words for your essay topic. Make your text more interesting and show off your vocabulary. Your essay will shine!

How does it work?

Type or paste your essay
Press the 'check essay' button
Get a band score instantly

Amazing, right?

What people say about Writing9

Join 40,285 people who love Writing9

🚀 Improve your writing skills today

Unlimited checking of essays
Instant feedback
Highlighting & analysing mistakes
Advanced grammar checker
Weakness discovery
Personalized suggestions
Ideas and vocabulary generator
IELTS Speaking Simulator
E-Book "The ultimate guide to get a target band score of 7+"
App to Improve Speaking Skills
Premium support

The instant feedback feature is a game-changer. It pinpointed my mistakes and drastically improved my skills.

Palihapitiya Inesh

IELTS Writing Result: 7

There's really no risk in your purchase!

Is there a free trial?
Who corrects my essays?
Can I trust the service?
What tasks can I check?
How often can I use your service to check charts, letters and essays?
Can I check an essay on my own topic?
Will my writing checking result be similar to the score I will get on the IELTS exam?
Can your service help improve my writing skills overall?

Start checking your IELTS essays today.

What is automated essay scoring?

Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment. In fact, it’s been around far longer than “machine learning” and “artificial intelligence” have been buzzwords in the general public! The field of psychometrics has been doing such groundbreaking work for decades.

So how does AES work, and how can you apply it?

The first and most critical thing to know is that there is not an algorithm that “reads” the student essays. Instead, you need to train an algorithm. That is, if you are a teacher and don’t want to grade your essays, you can’t just throw them in an essay scoring system. You have to actually grade the essays (or at least a large sample of them) and then use that data to fit a machine learning algorithm. Data scientists use the term train the model, which sounds complicated, but if you have ever done simple linear regression, you have experience with training models.

There are three steps for automated essay scoring:

Establish your data set. Begin by gathering a substantial collection of student essays, ensuring a diverse range of topics and writing styles. Each essay should be meticulously graded by human experts to create a reliable and accurate benchmark. This data set forms the foundation of your automated scoring system, providing the necessary examples for the machine learning model to learn from.
Determine the features. Identify the key features that will serve as predictor variables in your model. These features might include grammar, syntax, vocabulary usage, coherence, structure, and argument strength. Carefully selecting these attributes is crucial as they directly impact the model’s ability to assess essays accurately. The goal is to choose features that are indicative of overall writing quality and are relevant to the scoring criteria.
Train the machine learning model. Use the established data set and selected features to train your machine learning model. This involves feeding the graded essays into the model, allowing it to learn the relationship between the features and the assigned grades. Through iterative training and validation processes, the model adjusts its algorithms to improve accuracy. Continuous refinement and testing ensure that the model can reliably score new, unseen essays with a high degree of precision.

Here’s an extremely oversimplified example:

You have a set of 100 student essays, which you have scored on a scale of 0 to 5 points.
The essay is on Napoleon Bonaparte, and you want students to know certain facts, so you want to give them “credit” in the model if they use words like: Corsica, Consul, Josephine, Emperor, Waterloo, Austerlitz, St. Helena. You might also add other Features such as Word Count, number of grammar errors, number of spelling errors, etc.
You create a map of which students used each of these words, as 0/1 indicator variables. You can then fit a multiple regression with 7 predictor variables (did they use each of the 7 words) and the 5 point scale as your criterion variable. You can then use this model to predict each student’s score from just their essay text.

Obviously, this example is too simple to be of use, but the same general idea is applied in massive, complex studies. Establishing the core features (predictor variables) can be much more involved, and the models are usually more complex than multiple regression (neural networks, random forests, support vector machines).
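The Napoleon example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the essays and scores are made up, the function names are hypothetical, keywords are matched as simple substrings, and the least-squares regression is fitted with plain gradient descent so that the sketch needs no external packages. In practice you would use a statistics or machine learning library.

```python
# Hypothetical toy data: essay text plus a human score on the 0-5 scale.
KEYWORDS = ["corsica", "consul", "josephine", "emperor",
            "waterloo", "austerlitz", "st. helena"]

GRADED_ESSAYS = [
    ("Napoleon was born in Corsica, became First Consul, then Emperor. "
     "He won at Austerlitz, lost at Waterloo, and died on St. Helena. "
     "He married Josephine.", 5),
    ("Napoleon was born in Corsica and became Emperor. He lost at Waterloo.", 3),
    ("Napoleon was Emperor of France and married Josephine.", 2),
    ("Napoleon was a French general who fought many battles.", 0),
    ("He won at Austerlitz and lost at Waterloo, then was sent to St. Helena.", 3),
    ("Napoleon came from Corsica.", 1),
]


def keyword_indicators(essay):
    """Return one 0/1 indicator per keyword (case-insensitive substring match)."""
    text = essay.lower()
    return [1.0 if keyword in text else 0.0 for keyword in KEYWORDS]


def fit_linear_regression(X, y, learning_rate=0.05, epochs=5000):
    """Fit an intercept and one weight per feature by gradient descent on squared error."""
    n_rows, n_features = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * n_features
    for _ in range(epochs):
        grad_intercept, grad_weights = 0.0, [0.0] * n_features
        for row, target in zip(X, y):
            error = intercept + sum(w * x for w, x in zip(weights, row)) - target
            grad_intercept += error
            for j in range(n_features):
                grad_weights[j] += error * row[j]
        intercept -= learning_rate * grad_intercept / n_rows
        weights = [w - learning_rate * g / n_rows
                   for w, g in zip(weights, grad_weights)]
    return intercept, weights


def predict(model, essay):
    """Predict a score for a new essay from its keyword indicators."""
    intercept, weights = model
    return intercept + sum(w * x for w, x in zip(weights, keyword_indicators(essay)))


X = [keyword_indicators(text) for text, _ in GRADED_ESSAYS]
y = [float(score) for _, score in GRADED_ESSAYS]
model = fit_linear_regression(X, y)

strong = predict(model, "Born in Corsica, he became Emperor and was defeated at Waterloo.")
weak = predict(model, "He was a famous leader.")
print(round(strong, 2), round(weak, 2))
```

With only six graded essays and seven predictors the model is badly overparameterized, which is one more reason real studies need far more scored essays than this toy set.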

Here’s an example of the very start of a data matrix for features, from an actual student essay. Imagine that you also have data on the final scores, 0 to 5 points. You can see how this is then a regression situation.

How do you score the essay?

If the essays are on paper, then automated essay scoring won’t work unless you have extremely good character recognition software that converts them into a digital database of text. Most likely, you have delivered the exam as an online assessment and already have the database. If so, your platform should include functionality to manage the scoring process, including multiple custom rubrics. An example of our FastTest platform is provided below.

Some rubrics you might use:

Supporting arguments
Organization
Vocabulary / word choice

How do you pick the Features?

This is one of the key research problems. In some cases, it might be something similar to the Napoleon example. Suppose you had a complex item on Accounting, where examinees review reports and spreadsheets and need to summarize a few key points. You might pull out a few key terms (mortgage amortization) or numbers (2.375%) and treat them as Features. I saw a presentation at Innovations In Testing 2022 that did exactly this. Think of it as giving the students “points” for using those keywords, though because you are using complex machine learning models, it is not simply a single unit point per keyword. Each keyword contributes towards a regression-like model with a positive slope.

In other cases, you might not know. Maybe it is an item on an English test being delivered to English language learners, and you ask them to write about what country they want to visit someday. You have no idea what they will write about. But what you can do is tell the algorithm to find the words or terms that are used most often, and try to predict the scores with that. Maybe words like “jetlag” or “edification” show up in essays from students that tend to get high scores, while words like “clubbing” or “someday” tend to be used by students with lower scores. The AI might also pick up on spelling errors. I worked as an essay scorer in grad school, and I can’t tell you how many times I saw kids use “ludacris” (name of an American rap artist) instead of “ludicrous” when trying to describe an argument. They had literally never seen the word used or spelled correctly. Maybe the AI model learns to give that a negative weight. That’s the next section!
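As a rough illustration of letting the data suggest the features, the sketch below finds terms that appear in at least two essays and reports the average human score of the essays containing each term. The essays, scores, stopword list, and function names are all hypothetical, and a real system would feed such terms into a model rather than stop at averages.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"i", "to", "the", "of", "for", "is", "it", "in", "its", "with", "will"}


def tokenize(text):
    """Lowercase the text and return its distinct non-stopword terms."""
    return {t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS}


def term_score_profile(essays, scores, min_document_frequency=2):
    """Map each sufficiently common term to the mean score of essays using it."""
    scores_by_term = defaultdict(list)
    for text, score in zip(essays, scores):
        for term in tokenize(text):
            scores_by_term[term].append(score)
    return {
        term: sum(values) / len(values)
        for term, values in scores_by_term.items()
        if len(values) >= min_document_frequency
    }


# Made-up essays and human scores on a 0-5 scale.
essays = [
    "I want to visit Japan someday for the edification of seeing its temples despite the jetlag",
    "The jetlag is worth it because travel brings edification",
    "I want to go clubbing in Ibiza someday",
    "Someday I will go clubbing with friends",
]
scores = [5, 4, 2, 1]

profile = term_score_profile(essays, scores)
# Print terms from highest to lowest mean score, ties broken alphabetically.
for term, mean_score in sorted(profile.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term:12s} {mean_score:.2f}")
```

Terms with a high mean score are candidates for positive weights and terms with a low mean score for negative ones, although the actual weights would come from the trained model.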

How do you train a model?

Well, if you are familiar with data science, you know there are TONS of models, and many of them have a bunch of parameterization options. This is where more research is required. What model works the best on your particular essay, and doesn’t take 5 days to run on your data set? That’s for you to figure out. There is a trade-off between simplicity and accuracy. Complex models might be accurate but take days to run. A simpler model might take 2 hours but with a 5% drop in accuracy. It’s up to you to evaluate.

If you have experience with Python and R, you know that there are many packages which provide this analysis out of the box – it is a matter of selecting a model that works.

How effective is automated essay scoring?

Well, as psychometricians love to say, “it depends.” You need to do the model fitting research for each prompt and rubric. It will work better for some than others. The general consensus in research is that AES algorithms work as well as a second human, and therefore serve very well in that role. But you shouldn’t use them as the only score; of course, that’s impossible in many cases.

Here’s a graph from some research we did on our algorithm, showing the correlation of human to AES. The three lines are for the proportion of sample used in the training set; we saw decent results from only 10% in this case! Some of the models correlated above 0.80 with humans, even though this is a small data set. We found that the Cubist model took a fraction of the time needed by complex models like Neural Net or Random Forest; in this case it might be sufficiently powerful.
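For readers who want to run this kind of check on their own data, here is a minimal sketch of computing a human-to-machine correlation. It assumes the Pearson correlation, since the type of correlation is not stated above, and the two score lists are invented for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical scores for the same eight essays, on a 0-5 scale.
human_scores = [0, 1, 2, 3, 4, 5, 3, 2]
machine_scores = [1, 1, 2, 3, 4, 4, 3, 1]

r = pearson_correlation(human_scores, machine_scores)
print(f"human vs. AES correlation: {r:.3f}")
```

The correlation should be computed on essays that were held out from training; otherwise it overstates how well the model will do on new essays.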

How can I implement automated essay scoring without writing code from scratch?

There are several products on the market. Some are standalone, some are integrated with a human-based essay scoring platform. ASC’s platform for automated essay scoring is SmartMarq; click here to learn more. It is currently offered as a standalone tool like you see below, making it extremely easy to use. It is also in the process of being integrated into our online assessment platform, alongside human scoring, to provide an efficient and easy way of obtaining a second or third rater for QA purposes.

Want to learn more? Contact us to request a demonstration.

Nathan Thompson, PhD

Project Score: Write Like a Historian

Build structured, rubric-based support into your writing instruction. Provide additional support where needed based on Score, the OER Project’s automated essay-scoring service.

Every history teacher deserves a writing assistant that helps identify and focus on areas for student improvement. Project Score includes lessons, tools, prompts, and an automated essay-scoring service proven to help students write better and improve their historical writing and literacy skills.

Why Teach with Project Score?

This course features Score, the free OER Project essay-scoring service. When combined with writing prompts and scaffolded pre- and post-writing activities, this powerful tool provides consistent formative feedback and loads of data to inform instruction and discussion.

Prompts: A variety of writing prompts address enduring historical questions about our planet, life, and humanity.

Prewriting: Scaffolded warm-up activities help prepare students for writing by introducing them to key elements of the Score Writing Rubric and giving them ample opportunity to prepare for successful writing.

Postwriting: Getting students to revise their writing can be a challenge. Scaffolded revision activities, focused on the Score Writing Rubric, give students the guidance and structure they need.

Consistent, instant feedback: Students can independently check their writing and receive immediate feedback to help inform their writing.

Informs instruction: Teachers receive individual and class data to help identify who needs extra help or where a class is struggling.

Sparks discussion: Score is not a grading tool—it’s a conversation starter for you and your students and a way for you to make data-driven instructional decisions.

Flexible. Ready to Use. Fully Supported.

Project Score can be taught as a two-week course, or you can use the activities anywhere within the Big History Project or World History Project courses for additional writing instruction. Find comprehensive support from the OER Project team and our vibrant teacher community.

[Figure: breakdown of three courses: Big History, Origins to the Present, and 1750 to the Present]

Help yourself to everything our free, online history courses have to offer: mind-blowing content balanced with critical skills development and lots of support.

[Figure: four bar graphs titled Language and Style, Analysis and Evidence, Organization, and Claim and Focus]

Get started building student writing capabilities—with confidence.

Not sure how you will use this in the classroom? Connect with fellow teachers and scholars in the OER Project Online Teacher Community to get your questions answered.

Extend Student Learning with our Supplemental Units

Try an OER Project supplemental unit to boost skill development in your classroom. These extension materials are the perfect supplement to an existing history or language arts curriculum.

Climate Project Extension

A supplemental unit that starts with evidence and ends with student-developed plans to reach net carbon zero by 2050.

Help students understand and use data to confront urgent world topics such as poverty, democracy, and climate. This 2-week supplement is designed for high-school students and includes 10 data-exploration exercises that lead up to a final class presentation.

Access this course

Join OER Project to get instant access to the Project Score course for FREE with no hidden catches.

Free Online GRE AWA Essay Grader

Automatic essay rating software for practice.

The GRE analytical writing is a small but important component of the test that troubles many international test takers. We’ve had several candidates asking us:

“How can I rate my GRE AWA essay for practice?”
“Can I download a free GRE essay e-rater?”

Well, there are a few paid options offered by some test prep companies. But not much out there that’s free and a close approximation of the real deal. So we created this free essay grader for GRE essays.

Conceptualized and developed by Sameer Kamat, the software uses Natural Language Processing (NLP) principles and our understanding of how AWA essays are evaluated. We know it’s far from perfect, since no automated essay grader can accurately do (yet) what the trained human brain can. And it’s definitely not the equivalent of a free ScoreItNow report, if that’s what you wanted.

But we hope it’s better than having no feedback at all on your practice AWA essays during the GRE preparation journey.

Grade my GRE Essay

There is no software to download. You can use our free online GRE essay grader immediately. All you need to do is:

Type or paste your GRE essay in the box below. [Wait for the text box to load. If it’s taking too long, refresh the page.]
Click on the ‘Check’ button
Your essay grade, along with the breakdown across 3 dimensions (Structure, Readability and Coherence), will be displayed.

Here’s a brief introduction to the various sub-topics that our online essay evaluator covers:

Organization: This checks the attributes related to the building blocks of GRE essays i.e. attributes related to the words, sentences and paragraphs in the AWA essay.

Readability: This tests (using industry standard metrics) how easy it is for the reader to grasp what you have written. Try to maintain a balance between the over-simplistic and the hard-to-comprehend approach.

Coherence: This goes into the nuances of natural language processing and evaluates how you have connected the building blocks using the appropriate English language constructs.
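To make the Readability dimension more concrete, here is a minimal sketch of one widely used readability metric, the Flesch Reading Ease score. Treat it as an assumption: the tool above does not say which metrics it uses, and the syllable counter here is a rough vowel-group heuristic rather than a dictionary lookup. Higher scores mean easier text.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple_text = "The cat sat on the mat. The dog ran."
dense_text = ("Notwithstanding considerable institutional opposition, the "
              "administration implemented comprehensive organizational restructuring.")

simple_score = flesch_reading_ease(simple_text)
dense_score = flesch_reading_ease(dense_text)
print(round(simple_score, 1), round(dense_score, 1))
```

For a GRE essay the goal is not the highest possible score; as noted above, the aim is a balance between over-simplistic and hard-to-comprehend writing.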

Automatic essay scoring using NLP

M. Sheshikala , Mothe Rajesh , Mahesh Akarapu; Automatic essay scoring using NLP. AIP Conf. Proc. 5 June 2024; 2971 (1): 020053. https://doi.org/10.1063/5.0195909

Advances in the education sector have generally had a positive effect on how people engage with technology. Our project is one such effort, with essay rating as its main focus. Essay evaluation is a systematic way of assigning ratings to written essays, and automatic essay scoring is the process of grading essays without human intervention. The computer systems are trained using artificial intelligence architectures, and this is where natural language processing comes into the picture. The main aim of natural language processing is to make a machine resemble human intelligence and work as a human would. On this basis, we chose one part of the educational domain and built a system capable of rating written work, namely essays. Our project aims to provide a solution that evaluates essays automatically. The basic idea is to develop a software system that can benefit educational institutions, business organizations, researchers, and others. Automatic essay scoring offers strong benefits: it reduces manual work, evaluates every element without bias, and saves time. Past approaches to automated essay scoring have used regression analysis and convolutional neural networks, whereas we worked with a transformer-based model named BERT.

Online ISSN 1551-7616
Print ISSN 0094-243X

Original Research Article

Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value

School of Computing and Information Systems, Faculty of Science and Technology, Athabasca University, Edmonton, AB, Canada

Automated essay scoring (AES) is a compelling topic in Learning Analytics for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores. Consequently, the AES black box has remained impenetrable. Although several algorithms from Explainable Artificial Intelligence have recently been published, no research has yet investigated the role that these explanation models can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing personalized, formative, and fine-grained feedback to students during the writing process. Building on previous studies where models were trained to predict both the holistic and rubric scores of essays, using the Automated Student Assessment Prize’s essay datasets, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions. In doing so, it evaluates the impact of deep learning (multi-layer perceptron neural networks) on the performance of AES. It has been found that the effect of deep learning can be best viewed when assessing the trustworthiness of explanation models. As more hidden layers were added to the neural network, the descriptive accuracy increased by about 10%. This study shows that faster (up to three orders of magnitude) SHAP implementations are as accurate as the slower model-agnostic one. It leverages the state-of-the-art in natural language processing, applying feature selection on a pool of 1592 linguistic indices that measure aspects of text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity. 
In addition to the list of most globally important features, this study reports (a) a list of features that are important for a specific essay (locally), (b) a range of values for each feature that contribute to higher or lower rubric scores, and (c) a model that makes it possible to quantify the impact of the implementation of formative feedback.

Automated essay scoring (AES) is a compelling topic in Learning Analytics (LA) for the primary reason that recent advances in AI find it as a good testbed to explore artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; only a few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores ( Kumar et al., 2017 ; Taghipour, 2017 ; Kumar and Boulanger, 2020 ). None has attempted to explain the whole decision process of AES, from holistic scores to rubric scores and from rubric scores to writing feature modeling. Although several algorithms from XAI (explainable artificial intelligence) ( Adadi and Berrada, 2018 ; Murdoch et al., 2019 ) have recently been published (e.g., LIME, SHAP) ( Ribeiro et al., 2016 ; Lundberg and Lee, 2017 ), no research has yet investigated the role that these explanation models (trained on top of predictive models) can play in: (a) discovering the decision-making process that drives AES, (b) fine-tuning predictive models to improve generalizability and interpretability, and (c) providing teachers and students with personalized, formative, and fine-grained feedback during the writing process.

One of the key anticipated benefits of AES is the elimination of human bias such as rater fatigue, rater’s expertise, severity/leniency, scale shrinkage, stereotyping, Halo effect, rater drift, perception difference, and inconsistency ( Taghipour, 2017 ). In turn, AES may suffer from its own set of biases (e.g., imperfections in training data, spurious correlations, overrepresented minority groups), which has incited the research community to look for ways to make AES more transparent, accountable, fair, unbiased, and consequently trustworthy while remaining accurate. This required changing the perception that AES is merely a machine learning and feature engineering task ( Madnani et al., 2017 ; Madnani and Cahill, 2018 ). Hence, researchers have advocated that AES should be seen as a shared task requiring several methodological design decisions along the way such as curriculum alignment, construction of training corpora, reliable scoring process, and rater performance evaluation, where the goal is to build and deploy fair and unbiased scoring models to be used in large-scale assessments and classroom settings ( Rupp, 2018 ; West-Smith et al., 2018 ; Rupp et al., 2019 ). Unfortunately, although these measures are intended to design reliable and valid AES systems, they may still fail to build trust among users, keeping the AES black box impenetrable for teachers and students.

It has been previously recognized that divergence of opinion among human and machine graders has been only investigated superficially ( Reinertsen, 2018 ). So far, researchers have used qualitative analyses to investigate the characteristics of essays that ended up rejected by AES systems (requiring a human to score them) ( Reinertsen, 2018 ). Others strived to justify predicted scores by identifying essay segments that actually caused the predicted scores. Although these justifications hinted at and quantified the importance of these spatial cues, they did not provide any feedback as to how to improve those suboptimal essay segments ( Mizumoto et al., 2019 ).

Related to this study and the work of Kumar and Boulanger (2020) is Revision Assistant, a commercial AES system developed by Turnitin ( Woods et al., 2017 ; West-Smith et al., 2018 ), which in addition to predicting essays’ holistic scores provides formative, rubric-specific, and sentence-level feedback over multiple drafts of a student’s essay. The implementation of Revision Assistant moved away from the traditional approach to AES, which consists in using a limited set of features engineered by human experts representing only high-level characteristics of essays. Like this study, it rather opted for including a large number of low-level writing features, demonstrating that expert-designed features are not required to produce interpretable predictions. Revision Assistant’s performance was reported on two essay datasets, one of which was the Automated Student Assessment Prize (ASAP) dataset. However, performance on the ASAP dataset was reported in terms of quadratic weighted kappa, and only for holistic scores. Models predicting rubric scores were trained only with the other dataset, which was hosted on and collected through Revision Assistant itself.
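For readers unfamiliar with the metric, quadratic weighted kappa measures agreement between two raters on an ordinal scale, penalizing each disagreement by the squared distance between the two ratings. The sketch below is a minimal standard-library illustration of the usual definition, not the evaluation code used by any of the systems discussed; the function name and sample ratings are hypothetical.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, min_rating, max_rating):
    """Agreement between two integer rating lists; 1.0 means perfect agreement."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n_categories = max_rating - min_rating + 1
    if n_categories < 2:
        raise ValueError("the rating scale needs at least two categories")
    n_items = len(ratings_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n_categories for _ in range(n_categories)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms give the matrix expected under chance agreement.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_categories))
              for j in range(n_categories)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n_items
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical human and machine scores on a 0-3 scale.
human = [0, 1, 2, 3, 2, 1]
machine = [0, 1, 2, 2, 2, 0]
print(round(quadratic_weighted_kappa(human, machine, 0, 3), 3))
```

A value of 0 corresponds to chance-level agreement, and negative values indicate agreement worse than chance.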

In contrast to feature-based approaches like the one adopted by Revision Assistant, other AES systems are implemented using deep neural networks where features are learned during model training. For example, Taghipour (2017) in his doctoral dissertation leverages a recurrent neural network to improve accuracy in predicting holistic scores, implement rubric scoring (i.e., organization and argument strength), and distinguish between human-written and computer-generated essays. Interestingly, Taghipour compared the performance of his AES system against other AES systems using the ASAP corpora, but he did not use the ASAP corpora when it came to training rubric scoring models, although ASAP provides two corpora provisioning rubric scores (#7 and #8). Finally, research was also undertaken to assess the generalizability of rubric-based models by performing experiments across various datasets. It was found that the predictive power of such rubric-based models was related to how much the underlying feature set covered a rubric’s criteria ( Rahimi et al., 2017 ).

Despite their numbers, rubrics (e.g., organization, prompt adherence, argument strength, essay length, conventions, word choices, readability, coherence, sentence fluency, style, audience, ideas) are usually investigated in isolation and not as a whole, with the exception of Revision Assistant which provides feedback at the same time on the following five rubrics: claim, development, audience, cohesion, and conventions. The literature reveals that rubric-specific automated feedback includes numerical rubric scores as well as recommendations on how to improve essay quality and correct errors ( Taghipour, 2017 ). Again, except for Revision Assistant which undertook a holistic approach to AES including holistic and rubric scoring and provision of rubric-specific feedback at the sentence level, AES has generally not been investigated as a whole or as an end-to-end product. Hence, the AES used in this study and developed by Kumar and Boulanger (2020) is unique in that it uses both deep learning (multi-layer perceptron neural network) and a huge pool of linguistic indices (1592), predicts both holistic and rubric scores, explaining holistic scores in terms of rubric scores, and reports which linguistic indices are the most important by rubric. This study, however, goes one step further and showcases how to explain the decision process behind the prediction of a rubric score for a specific essay, one of the main AES limitations identified in the literature ( Taghipour, 2017 ) that this research intends to address, at least partially.

Besides providing explanations of predictions both globally and individually, this study not only goes one step further toward the automated provision of formative feedback but also does so in alignment with the explanation model and the predictive model, which makes it possible to map feedback more closely to the actual characteristics of an essay. Woods et al. (2017) succeeded in associating sentence-level expert-derived feedback with strong/weak sentences having the greatest influence on a rubric score based on the rubric, essay score, and the sentence characteristics. Revision Assistant’s feature space consists of counts and binary occurrence indicators of word unigrams, bigrams and trigrams, character four-grams, and part-of-speech bigrams and trigrams; these are mainly textual and locational indices and, by nature, are not descriptive or self-explanatory. This research fills this gap by proposing feedback based on a set of linguistic indices that can encompass several sentences at a time. However, the proposed approach omits locational hints, leaving the merging of the two approaches as the next step to be addressed by the research community.

Although this paper proposes to extend the automated provision of formative feedback through an interpretable machine learning method, it rather focuses on the feasibility of automating it in the context of AES instead of evaluating the pedagogical quality (such as the informational and communicational value of feedback messages) or impact on students’ writing performance, a topic that will be kept for an upcoming study. Having an AES system that is capable of delivering real-time formative feedback sets the stage to investigate (1) when feedback is effective, (2) the types of feedback that are effective, and (3) whether there exist different kinds of behaviors in terms of seeking and using feedback ( Goldin et al., 2017 ). Finally, this paper omits describing the mapping between the AES model’s linguistic indices and a pedagogical language that is easily understandable by students and teachers, which is beyond its scope.

Methodology

This study showcases the application of the PDR framework (Murdoch et al., 2019), which provides three pillars to describe interpretations in the context of the data science life cycle: Predictive accuracy, Descriptive accuracy, and Relevancy to human audience(s). It is important to note that in a broader sense both terms “explainable artificial intelligence” and “interpretable machine learning” can be used interchangeably with the following meaning (Murdoch et al., 2019): “the use of machine-learning models for the extraction of relevant knowledge about domain relationships contained in data.” Here “predictive accuracy” refers to the measurement of a model’s ability to fit data; “descriptive accuracy” is the degree to which the relationships learned by a machine learning model can be objectively captured; and “relevant knowledge” implies that a particular audience gets insights into a chosen domain problem that guide its communication, actions, and discovery (Murdoch et al., 2019).

In the context of this article, formative feedback that assesses students’ writing skills and prescribes remedial writing strategies is the relevant knowledge sought, whose effectiveness on students’ writing performance will be validated in an upcoming study. The current study, however, puts forward the tools and evaluates the feasibility of offering this real-time formative feedback. It also measures the predictive and descriptive accuracies of the AES and explanation models, two key components for generating trustworthy interpretations (Murdoch et al., 2019). Naturally, the provision of formative feedback depends on the speed of training and evaluating new explanation models every time a new essay is ingested by the AES system. That is why this paper investigates the potential of various SHAP implementations for speed optimization without compromising the predictive and descriptive accuracies. This article will show how the insights generated by the explanation model can serve to debug the predictive model and help enhance the feature selection and/or engineering process (Murdoch et al., 2019), laying the foundation for the provision of actionable and impactful pieces of knowledge to educational audiences, whose relevancy will be judged by the human stakeholders and estimated by the magnitude of resulting changes.

Figure 1 overviews all the elements and steps encompassed by the AES system in this study. The following subsections will address each facet of the overall methodology, from hyperparameter optimization to relevancy to both students and teachers.

Figure 1. A flow chart exhibiting the sequence of activities to develop an end-to-end AES system and how the various elements work together to produce relevant knowledge to the intended stakeholders.

Automated Essay Scoring System, Dataset, and Feature Selection

As previously mentioned, this paper reuses the AES system developed by Kumar and Boulanger (2020) . The AES models were trained using the ASAP’s seventh essay corpus. These narrative essays were written by Grade-7 students in the setting of state-wide assessments in the United States and had an average length of 171 words. Students were asked to write a story about patience. Kumar and Boulanger’s work consisted in training a predictive model for each of the four rubrics according to which essays were graded: ideas, organization, style, and conventions. Each essay was scored by two human raters on a 0−3 scale (integer scale). Rubric scores were resolved by adding the rubric scores assigned by the two human raters, producing a resolved rubric score between 0 and 6. This paper is a continuation of Boulanger and Kumar (2018 , 2019 , 2020) and Kumar and Boulanger (2020) where the objective is to open the AES black box to explain the holistic and rubric scores that it predicts. Essentially, the holistic score ( Boulanger and Kumar, 2018 , 2019 ) is determined and justified through its four rubrics. Rubric scores, in turn, are investigated to highlight the writing features that play an important role within each rubric ( Kumar and Boulanger, 2020 ). Finally, beyond global feature importance, it is not only indispensable to identify which writing indices are important for a particular essay (local), but also to discover how they contribute to increase or decrease the predicted rubric score, and which feature values are more/less desirable ( Boulanger and Kumar, 2020 ). This paper is a continuation of these previous works by adding the following link to the AES chain: holistic score, rubric scores, feature importance, explanations, and formative feedback. 
The objective is to highlight the means for transparent and trustable AES while empowering learning analytics practitioners with the tools to debug these models and equip educational stakeholders with an AI companion that will semi-autonomously generate formative feedback to teachers and students. Specifically, this paper analyzes the AES reasoning underlying its assessment of the “style” rubric, which looks for command of language, including effective and compelling word choice and varied sentence structure, that clearly supports the writer’s purpose and audience.

This research’s approach to AES leverages a feature-based multi-layer perceptron (MLP) deep neural network to predict rubric scores. The AES system is fed by 1592 linguistic indices quantitatively measured by the Suite of Automatic Linguistic Analysis Tools 2 (SALAT), which assess aspects of grammar and mechanics, sentiment analysis and cognition, text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity (Kumar and Boulanger, 2020). The purpose of using such a huge pool of low-level writing features is to let deep learning extract the most important ones; the literature supports this practice since there is evidence that automatically selected features are not less interpretable than engineered ones (Woods et al., 2017). However, to facilitate this process, this study opted for a semi-automatic strategy that combined filter and embedded methods. First, ASAP’s original seventh essay dataset consists of a training set of 1567 essays and validation and testing sets of 894 essays combined. While the texts of all 2461 essays are still available to the public, only the labels (the rubric scores of two human raters) of the training set have been shared with the public. Yet, this paper reused the unlabeled 894 essays of the validation and testing sets for feature selection, a process that must be carried out carefully so that it is not informed by the essays that will train the predictive model. Second, feature data were normalized, and features with variances lower than 0.01 were pruned. Third, for any pair of features having an absolute Pearson correlation coefficient greater than 0.7, the feature that comes last in the column ordering of the datasets was also pruned. After the application of these filter methods, the number of features was reduced from 1592 to 282. 
Finally, the Lasso and Ridge regression regularization methods (whose combination is also called ElasticNet) were applied during the training of the rubric scoring models. Lasso is responsible for pruning further features, while Ridge regression is entrusted with eliminating multicollinearity among features.
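The two filter steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used in the study: the function names, the toy columns, and the choice to compare each feature only against features that have already been kept are all hypothetical, and population variance is assumed.

```python
import math

def variance(col):
    """Population variance of one (already normalized) feature column."""
    mean = sum(col) / len(col)
    return sum((v - mean) ** 2 for v in col) / len(col)

def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def filter_features(columns, min_var=0.01, max_corr=0.7):
    """columns: dict of feature name -> values, in dataset column order.

    Step 1 drops features whose variance is below min_var.
    Step 2 walks the columns in order and drops a feature when its absolute
    correlation with an earlier kept feature exceeds max_corr, so the
    later member of a correlated pair is the one pruned (one possible
    reading of the rule in the text).
    """
    candidates = [n for n, c in columns.items() if variance(c) >= min_var]
    selected = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[s])) <= max_corr
               for s in selected):
            selected.append(name)
    return selected

# Hypothetical toy data: "b" nearly duplicates "a", "c" is constant.
toy_columns = {
    "a": [0.0, 0.5, 1.0, 0.5],
    "b": [0.1, 0.6, 0.9, 0.5],
    "c": [0.5, 0.5, 0.5, 0.5],
    "d": [1.0, 0.0, 1.0, 0.0],
}
kept = filter_features(toy_columns)
```

On the toy data, the constant column is removed by the variance filter and the near-duplicate column by the correlation filter, leaving the first and last columns.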

Hyperparameter Optimization and Training

To ensure a fair evaluation of the potential of deep learning, it is of utmost importance to minimally describe this study’s exploration of the hyperparameter space, a step that is often found to be missing when reporting the outcomes of AES models’ performance ( Kumar and Boulanger, 2020 ). First, a study should list the hyperparameters it is going to investigate by testing for various values of each hyperparameter. For example, Table 1 lists all hyperparameters explored in this study. Note that L 1 and L 2 are two regularization hyperparameters contributing to feature selection. Second, each study should also report the range of values of each hyperparameter. Finally, the strategy to explore the selected hyperparameter subspace should be clearly defined. For instance, given the availability of high-performance computing resources and the time/cost of training AES models, one might favor performing a grid (a systematic testing of all combinations of hyperparameters and hyperparameter values within a subspace) or a random search (randomly selecting a hyperparameter value from a range of values per hyperparameter) or both by first applying random search to identify a good starting candidate and then grid search to test all possible combinations in the vicinity of the starting candidate’s subspace. Of particular interest to this study is the neural network itself, that is, how many hidden layers should a neural network have and how many neurons should compose each hidden layer and the neural network as a whole. These two variables are directly related to the size of the neural network, with the number of hidden layers being a defining trait of deep learning. A vast swath of literature is silent about the application of interpretable machine learning in AES and even more about measuring its descriptive accuracy, the two components of trustworthiness. 
Hence, this study pioneers the comprehensive assessment of deep learning impact on AES’s predictive and descriptive accuracies.

Table 1. Hyperparameter subspace investigated in this article along with best hyperparameter values per neural network architecture.

Consequently, the 1567 labeled essays were divided into a training set (80%) and a testing set (20%). No validation set was put aside; 5-fold cross-validation was rather used for hyperparameter optimization. Table 1 delineates the hyperparameter subspace from which 800 different combinations of hyperparameter values were randomly selected out of a subspace of 86,248,800 possible combinations. Since this research proposes to investigate the potential of deep learning to predict rubric scores, several architectures consisting of 2 to 6 hidden layers and ranging from 9,156 to 119,312 parameters were tested. Table 1 shows the best hyperparameter values per depth of neural networks.
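A random search over a discrete hyperparameter subspace can be sketched as follows. The hyperparameter names and value lists below are hypothetical placeholders (the actual subspace is the one given in Table 1), and the 5-fold cross-validation that would score each sampled combination is left out.

```python
import random

# Hypothetical subspace for illustration only; see Table 1 for the real one.
space = {
    "hidden_layers": [2, 3, 4, 5, 6],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "dropout": [0.0, 0.1, 0.2, 0.3],
    "l1": [0.0, 1e-4, 1e-3],
    "l2": [0.0, 1e-4, 1e-3],
}

def subspace_size(space):
    """Number of distinct combinations in the subspace."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size

def random_search(space, n_trials, seed=0):
    """Draw n_trials distinct combinations uniformly at random."""
    rng = random.Random(seed)
    n_trials = min(n_trials, subspace_size(space))
    seen, trials = set(), []
    while len(trials) < n_trials:
        combo = tuple(rng.choice(values) for values in space.values())
        if combo not in seen:          # keep combinations different
            seen.add(combo)
            trials.append(dict(zip(space, combo)))
    return trials

candidates = random_search(space, n_trials=20)
```

Each dictionary in `candidates` would then be evaluated by cross-validation, and the best one per network depth retained.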

Again, the essays of the testing set were never used during the training and cross-validation processes. In order to retrieve the best predictive models during training, every time the validation loss reached a record low, the model was overwritten. Training stopped when no new record low was reached during 100 epochs. Moreover, to avoid reporting the performance of overfit models, each model was trained five times using the same set of best hyperparameter values. Finally, for each resulting predictive model, a corresponding ensemble model (bagging) was also obtained out of the five models trained during cross-validation.
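The checkpointing and stopping rule described above can be illustrated on a plain list of validation losses. This is a minimal sketch: the function name and the short loss sequence are hypothetical, and a patience of 3 is used in the example instead of the 100 epochs used in the study.

```python
def best_epoch_with_patience(val_losses, patience=100):
    """Return (best_epoch, best_loss, stopped_epoch).

    A new record-low validation loss marks the point where the saved
    model would be overwritten; training stops once `patience` epochs
    pass without a new record low.
    """
    best_loss = float("inf")
    best_epoch = -1
    stopped_epoch = -1
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # overwrite checkpoint here
        elif epoch - best_epoch >= patience:
            break                                  # no record low for `patience` epochs
    return best_epoch, best_loss, stopped_epoch

# Hypothetical validation losses; the last value is never reached.
losses = [1.0, 0.8, 0.9, 0.7, 0.75, 0.72, 0.71, 0.6]
result = best_epoch_with_patience(losses, patience=3)
```

In the example, the record low at epoch 3 is kept and training halts at epoch 6, three epochs later.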

Predictive Models and Predictive Accuracy

Table 2 delineates the performance of predictive models trained previously by Kumar and Boulanger (2020) on the four scoring rubrics. The first row lists the agreement levels between the resolved and predicted rubric scores measured by the quadratic weighted kappa. The second row is the percentage of accurate predictions; the third row reports the percentages of predictions that are either accurate or off by 1; and the fourth row reports the percentages of predictions that are either accurate or at most off by 2. Prediction of holistic scores is done merely by adding up all rubric scores. Since the scale of rubric scores is 0−6 for every rubric, then the scale of holistic scores is 0−24.
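For readers who wish to reproduce the metrics reported in Table 2, the following sketch computes the quadratic weighted kappa and the exact/adjacent agreement rates on the 0−6 scale of resolved scores. It is a plain-Python illustration with hypothetical score lists, not the evaluation code of the study.

```python
def quadratic_weighted_kappa(a, b, min_rating=0, max_rating=6):
    """Quadratic weighted kappa between two lists of integer scores."""
    n = max_rating - min_rating + 1
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    total = len(a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator if denominator else 1.0

def within(a, b, tolerance):
    """Share of predictions that are at most `tolerance` points off."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)

# Hypothetical resolved and predicted rubric scores.
resolved = [4, 3, 5, 2]
predicted = [4, 4, 3, 5]
exact, off_by_1, off_by_2 = (within(resolved, predicted, k) for k in (0, 1, 2))
```

Here `exact`, `off_by_1`, and `off_by_2` correspond to the second, third, and fourth rows of Table 2, respectively.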

Table 2. Rubric scoring models’ performance on testing set.

While each of these rubric scoring models might suffer from its own systemic bias, and these biases might cancel each other out when the rubric scores are added up to derive the holistic score, this study (unlike related works) intends to highlight these biases by exposing the decision-making process underlying the prediction of rubric scores. Although this paper exclusively focuses on the Style rubric, the methodology put forward to analyze the local and global importance of writing indices and their context-specific contributions to predicted rubric scores is applicable to every rubric and makes it possible to control for these biases one rubric at a time. Comparing and contrasting the role that a specific writing index plays within each rubric context deserves its own investigation, which has been partly addressed in the study led by Kumar and Boulanger (2020). Moreover, this paper underscores the necessity to measure the predictive accuracy of rubric-based holistic scoring using additional metrics to account for these rubric-specific biases. For example, there exist several combinations of rubric scores to obtain a holistic score of 16 (e.g., 4-4-4-4 vs. 4-3-4-5 vs. 3-5-2-6). Even though the predicted holistic score might be accurate, the rubric scores could all be inaccurate. Similarity or distance metrics (e.g., Manhattan and Euclidean) should then be used to describe the authenticity of the composition of these holistic scores.
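The point about the composition of holistic scores can be made concrete with two of the rubric score combinations mentioned above. In this small sketch, treating 4-4-4-4 as the human scores and 3-5-2-6 as the predicted scores is an assumption made purely for illustration.

```python
import math

def manhattan(p, q):
    """Sum of absolute rubric-wise differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance between two rubric score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

human_rubrics = (4, 4, 4, 4)      # assumed human scores, holistic score 16
predicted_rubrics = (3, 5, 2, 6)  # assumed predictions, holistic score 16

same_holistic = sum(human_rubrics) == sum(predicted_rubrics)
d_manhattan = manhattan(human_rubrics, predicted_rubrics)
d_euclidean = euclidean(human_rubrics, predicted_rubrics)
```

Both vectors add up to 16, so the holistic prediction would count as accurate, yet the Manhattan distance of 6 shows that every rubric score is off.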

According to what Kumar and Boulanger (2020) report on the performance of several state-of-the-art AES systems trained on ASAP’s seventh essay dataset, the AES system they developed and which will be reused in this paper proved competitive while being fully and deeply interpretable, which no other AES system does. They also supply further information about the study setting, essay datasets, rubrics, features, natural language processing (NLP) tools, model training, and evaluation against human performance. Again, this paper showcases the application of explainable artificial intelligence in automated essay scoring by focusing on the decision process of the Rubric #3 (Style) scoring model. Remember that the same methodology is applicable to each rubric.

Explanation Model: SHAP

SHapley Additive exPlanations (SHAP) is a theoretically justified XAI framework that can provide simultaneously both local and global explanations (Molnar, 2020); that is, SHAP is able to explain individual predictions taking into account the uniqueness of each prediction, while highlighting the global factors influencing the overall performance of a predictive model. SHAP is of keen interest because it unifies all algorithms of the class of additive feature attribution methods, adhering to a set of three properties that are desirable in interpretable machine learning: local accuracy, missingness, and consistency (Lundberg and Lee, 2017). A key advantage of SHAP is that feature contributions are all expressed in terms of the outcome variable (e.g., rubric scores), providing a common scale on which to compare the importance of features against each other. Local accuracy refers to the fact that no matter the explanation model, the sum of all feature contributions is always equal to the prediction explained by these features. The missingness property implies that the prediction is never explained by unmeasured factors, which are always assigned a contribution of zero. However, the converse is not true; a contribution of zero does not imply an unobserved factor, as it can also denote a feature irrelevant to explaining the prediction. The consistency property guarantees that a more important feature will always have a greater magnitude than a less important one, no matter how many other features are included in the explanation model. SHAP proves superior to other additive attribution methods such as LIME (Local Interpretable Model-Agnostic Explanations), Shapley values, and DeepLIFT in that none of them complies with all three properties, while SHAP does (Lundberg and Lee, 2017). Moreover, the way SHAP assesses the importance of a feature differs from permutation importance methods (e.g., ELI5), which measure importance as the decrease in model performance (accuracy) when a feature is perturbed; SHAP instead bases importance on how much a feature contributes to every prediction.
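The local accuracy property can be illustrated with a brute-force computation of classical Shapley attributions on a tiny model. The sketch below is not one of the SHAP estimators used in this study: it assumes a hypothetical three-feature function and a single all-zero reference input in place of a background dataset, and it enumerates every feature subset, which is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley attributions of f(x) relative to f(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical model with one additive term and one interaction."""
    return 2 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)
```

Adding the three attributions to `base_value` recovers `toy_model(x)`, and the interaction term is split evenly between the two features involved.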

Essentially, a SHAP explanation model (linear regression) is trained on top of a predictive model, which in this case is a complex ensemble deep learning model. Table 3 presents a scaled-down explanation model showing how SHAP values (feature contributions) work. In this example, there are five instances and five features describing each instance (in the context of this paper, an instance is an essay). Predictions are listed in the second to last column, and the base value is the mean of all predictions. The base value constitutes the reference point according to which predictions are explained; in other words, reasons are given to justify the discrepancy between the individual prediction and the mean prediction (the base value). Notice that the table does not contain the actual feature values; these are SHAP values that quantify the contribution of each feature to the predicted score. For example, the prediction of Instance 1 is 2.46, while the base value is 3.76. Adding up the feature contributions of Instance 1 to the base value produces the predicted score; that is, the five contributions of Instance 1 sum to 2.46 − 3.76 = −1.30.

Table 3. Array of SHAP values: local and global importance of features and feature coverage per instance.

Hence, the generic equation of the explanation model (Lundberg and Lee, 2017) is:

$$g(x) = \sigma_0 + \sum_{i=1}^{j} \sigma_i x_i$$

where $g(x)$ is the prediction of an individual instance $x$, $\sigma_0$ is the base value, $\sigma_i$ is the feature contribution of feature $x_i$, $x_i \in \{0,1\}$ denotes whether feature $x_i$ is part of the individual explanation, and $j$ is the total number of features. Furthermore, the global importance of a feature is calculated by adding up the absolute values of its corresponding SHAP values over all instances, where $n$ is the total number of instances and $\sigma_i^{(j)}$ is the contribution of the feature (indexed here by $j$) for instance $i$ (Lundberg et al., 2018):

$$I_j = \sum_{i=1}^{n} \left| \sigma_i^{(j)} \right|$$

Therefore, it can be seen that Feature 3 is the most globally important feature, while Feature 2 is the least important one. Similarly, Feature 5 is Instance 3’s most important feature at the local level, while Feature 2 is the least locally important. The reader should also note that a feature is not necessarily assigned a contribution; some features are simply not part of the explanation, such as Feature 2 and Feature 3 in Instance 2. These concepts lay the foundation for the explainable AES system presented in this paper. In the actual study, each instance (essay) is described by 282 features rather than five, and explanations are provided for all 314 essays of the testing set.
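The two computations behind an array of SHAP values such as Table 3 can be sketched as follows. The numbers below are hypothetical and are not those of Table 3; they were chosen so that the base value equals the mean of the predictions, as in the text.

```python
# Hypothetical SHAP values: 3 instances (rows) x 4 features (columns).
shap_rows = [
    [0.50, 0.00, -0.25, 0.25],
    [-0.50, 0.00, 0.75, 0.00],
    [0.25, -0.25, -1.00, 0.25],
]
base_value = 3.0  # assumed mean prediction

# Local accuracy: prediction = base value + sum of feature contributions.
predictions = [base_value + sum(row) for row in shap_rows]

# Global importance: sum of absolute SHAP values per feature (column).
n_features = len(shap_rows[0])
global_importance = [
    sum(abs(row[j]) for row in shap_rows) for j in range(n_features)
]

# Most important feature overall, and per instance (0-based indices).
most_important_global = max(range(n_features), key=global_importance.__getitem__)
most_important_local = [
    max(range(n_features), key=lambda j, r=row: abs(r[j])) for row in shap_rows
]
```

A zero entry, such as the second feature of the first two instances, simply means that the feature takes no part in that instance's explanation.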

Several implementations of SHAP exist: KernelSHAP, DeepSHAP, GradientSHAP, and TreeSHAP, among others. KernelSHAP is model-agnostic and works for any type of predictive model; however, KernelSHAP is very computationally intensive, which makes it undesirable for practical purposes. DeepSHAP and GradientSHAP are two implementations intended for deep learning, which take advantage of the known properties of neural networks (i.e., MLP-NN, CNN, or RNN) to reduce the processing time needed to explain predictions by up to three orders of magnitude (Chen et al., 2019). Finally, TreeSHAP is the most powerful implementation intended for tree-based models. TreeSHAP is not only fast; it is also accurate. While the three former implementations estimate SHAP values, TreeSHAP computes them exactly. Moreover, TreeSHAP not only measures the contribution of individual features, but it also considers interactions between pairs of features and assigns them SHAP values. Since one of the goals of this paper is to assess the potential of deep learning on the performance of both predictive and explanation models, this research tested the former three implementations. TreeSHAP is recommended for future work since the interaction among features is critical information to consider. Moreover, KernelSHAP, DeepSHAP, and GradientSHAP all require access to the whole original dataset to derive the explanation of a new instance, another constraint TreeSHAP is not subject to.

Descriptive Accuracy: Trustworthiness of Explanation Models

This paper reuses and adapts the methodology introduced by Ribeiro et al. (2016). Several explanation models will be trained, using different SHAP implementations and configurations, per deep learning predictive model (for each number of hidden layers). The rationale consists in randomly selecting and ignoring 25% of the 282 features feeding the predictive model (e.g., setting them to zero). If this causes the prediction to change beyond a specific threshold (in this study 0.10 and 0.25 were tested), then the explanation model should also reflect the magnitude of this change while ignoring the contributions of these same features. For example, the original predicted rubric score of an essay might be 5; however, when ignoring the information brought in by a subset of 70 randomly selected features (25% of 282), the prediction may turn to 4. If the explanation model also predicts a 4 while ignoring the contributions of the same subset of features, then the explanation is considered trustworthy. This makes it possible to compute the precision, recall, and F1-score of each explanation model (from the numbers of true and false positives and true and false negatives). The process is repeated 500 times for every essay to determine the average precision and recall of every explanation model.
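One possible reading of this procedure is sketched below for a single essay. Several details are assumptions rather than facts from the study: a "positive" is taken to be a trial in which the prediction shifts by more than the threshold, the explanation is counted as agreeing when its own reconstructed prediction shifts by more than the same threshold, and the toy linear model (whose contributions are exact by construction) stands in for the deep learning model and its SHAP explanation.

```python
import random

def trust_scores(predict, base_value, contributions, x,
                 threshold=0.10, drop_fraction=0.25, trials=500, seed=0):
    """Precision, recall, and F1 of an explanation for one instance."""
    rng = random.Random(seed)
    n = len(x)
    k = int(n * drop_fraction)          # e.g., 70 of 282 features
    original = predict(x)
    tp = fp = fn = tn = 0
    for _ in range(trials):
        dropped = set(rng.sample(range(n), k))
        masked = [0.0 if i in dropped else v for i, v in enumerate(x)]
        model_changed = abs(predict(masked) - original) > threshold
        kept = sum(c for i, c in enumerate(contributions) if i not in dropped)
        explained_changed = abs(base_value + kept - original) > threshold
        if model_changed and explained_changed:
            tp += 1
        elif explained_changed:
            fp += 1
        elif model_changed:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical linear model: contributions w_i * x_i are exact explanations.
weights = [0.5, -0.25, 1.0, 0.75, -0.5, 0.25, 1.5, -1.0]
bias = 2.0
essay = [1.0, 2.0, 0.5, 1.0, 1.0, 0.0, 0.5, 0.25]

def linear_predict(z):
    return bias + sum(w * v for w, v in zip(weights, z))

exact_contributions = [w * v for w, v in zip(weights, essay)]
scores = trust_scores(linear_predict, bias, exact_contributions, essay)
```

Because the toy explanation reproduces the model exactly, it never disagrees with the model; an estimated explanation of a non-linear model would generally score lower.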

Judging Relevancy

So far, the consistency of explanations with predictions has been considered. However, consistent explanations do not imply relevant or meaningful explanations. Put another way, explanations only reflect what predictive models have learned during training. How can the black box of these explanations be opened? Looking directly at the numerical SHAP values of each explanation might seem a daunting task, but there exist tools, mainly visualizations (decision plot, summary plot, and dependence plot), that help make sense of these explanations. However, before visualizing these explanations, another question needs to be addressed: which explanations or essays should be picked for further scrutiny of the AES system? Given the huge number of essays to examine and the tedious task of understanding the underpinnings of a single explanation, a small subset of essays should be carefully picked to represent concisely the state of correctness of the underlying predictive model. Again, this study applies and adapts the methodology in Ribeiro et al. (2016). A greedy algorithm selects essays whose predictions are explained by as many features of global importance as possible to optimize feature coverage. Ribeiro et al. demonstrated in unrelated studies (i.e., sentiment analysis) that the correctness of a predictive model can be assessed with as few as four or five well-picked explanations.

For example, Table 3 reveals the global importance of five features. The square root of each feature’s global importance is also computed and used instead, to limit the influence of a small group of very influential features. The feature coverage of Instance 1 is 100% because all features are engaged in the explanation of the prediction. On the other hand, Instance 2 has a feature coverage of 61.5% because only Features 1, 4, and 5 are part of the prediction’s explanation. The feature coverage is calculated by summing the square roots of the global importance of the features that take part in the explanation and dividing by the sum of the square roots of all features’ global importance.

Additionally, it can be seen that Instance 4 does not have any zero-feature value although its feature coverage is only 84.6%. The algorithm was constrained to discard from the explanation any feature whose contribution (local importance) was too close to zero. In the case of Table 3’s example, any feature whose absolute SHAP value is less than 0.10 is ignored, which is why the feature coverage of Instance 4 is 84.6% rather than 100%.

In this paper’s study, the real threshold was 0.01. This constraint was actually a requirement for the DeepSHAP and GradientSHAP implementations because they only output non-zero SHAP values contrary to KernelSHAP which generates explanations with a fixed number of features: a non-zero SHAP value indicates that the feature is part of the explanation, while a zero value excludes the feature from the explanation. Without this parameter, all 282 features would be part of the explanation although a huge number only has a trivial (very close to zero) SHAP value. Now, a much smaller but variable subset of features makes up each explanation. This is one way in which Ribeiro et al.’s SP-LIME algorithm (SP stands for Submodular Pick) has been adapted to this study’s needs. In conclusion, notice how Instance 4 would be selected in preference to Instance 5 to explain Table 3 ’s underlying predictive model. Even though both instances have four features explaining their prediction, Instance 4’s features are more globally important than Instance 5’s features, and therefore Instance 4 has greater feature coverage than Instance 5.

Whereas Table 3 ’s example exhibits the feature coverage of one instance at a time, this study computes it for a subset of instances, where the absolute SHAP values are aggregated (summed) per candidate subset. When the sum of absolute SHAP values per feature exceeds the set threshold, the feature is then considered as covered by the selected set of instances. The objective in this study was to optimize the feature coverage while minimizing the number of essays to validate the AES model.
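The feature coverage measure and the greedy selection can be sketched as follows. This is an illustrative adaptation in plain Python, not the study's implementation of SP-LIME: the SHAP values are hypothetical, a feature is treated as covered when its summed absolute SHAP value over the picked instances reaches the threshold, and the search stops as soon as no remaining instance improves coverage.

```python
import math

def global_importance(shap_rows):
    """Sum of absolute SHAP values per feature over all instances."""
    return [sum(abs(row[j]) for row in shap_rows)
            for j in range(len(shap_rows[0]))]

def coverage(covered, importance):
    """Share of square-rooted global importance held by covered features."""
    roots = [math.sqrt(v) for v in importance]
    return sum(roots[j] for j in covered) / sum(roots)

def greedy_pick(shap_rows, budget, threshold=0.10):
    """Pick up to `budget` instances that maximize feature coverage."""
    importance = global_importance(shap_rows)
    totals = [0.0] * len(importance)
    picked, current = [], 0.0
    for _ in range(budget):
        best, best_cov, best_totals = None, current, totals
        for i, row in enumerate(shap_rows):
            if i in picked:
                continue
            trial = [t + abs(v) for t, v in zip(totals, row)]
            covered = [j for j, t in enumerate(trial) if t >= threshold]
            cov = coverage(covered, importance)
            if cov > best_cov:
                best, best_cov, best_totals = i, cov, trial
        if best is None:      # no remaining instance improves coverage
            break
        picked.append(best)
        current, totals = best_cov, best_totals
    return picked, current

# Hypothetical SHAP values: 3 instances x 4 features.
rows = [
    [0.50, 0.00, -0.25, 0.25],
    [-0.50, 0.00, 0.75, 0.00],
    [0.25, -0.25, -1.00, 0.25],
]
picked, achieved = greedy_pick(rows, budget=3)
```

In this toy case the third instance alone engages every feature, so a single explanation already reaches full coverage and the search stops after one pick.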

Research Questions

One of this article’s objectives is to assess the potential of deep learning in automated essay scoring. The literature has often claimed (Hussein et al., 2019) that there are two approaches to AES, feature-based and deep learning, as though these two approaches were mutually exclusive. Yet, the literature also puts forward that feature-based AES models may be more interpretable than deep learning ones (Amorim et al., 2018). This paper embraces the viewpoint that these two approaches can also be complementary by leveraging the state-of-the-art in NLP and automatic linguistic analysis, harnessing one of the richest pools of linguistic indices put forward in the research community (Crossley et al., 2016, 2017, 2019; Kyle, 2016; Kyle et al., 2018), and applying a thorough feature selection process powered by deep learning. Moreover, the ability of deep learning to model complex non-linear relationships makes it particularly well-suited for AES given that the importance of a writing feature is highly dependent on its context, that is, its interactions with other writing features. Besides, this study leverages the SHAP interpretation method, which is well-suited to interpreting very complex models. Hence, this study elected to work with deep learning models and ensembles to test SHAP’s ability to explain these complex models. Previously, the literature has revealed the difficulty of obtaining models that are both accurate and interpretable (Ribeiro et al., 2016; Murdoch et al., 2019), where favoring one comes at the expense of the other. However, this research shows how XAI now makes it possible to produce both accurate and interpretable models in the area of AES. 
Since ensembles have been repeatedly shown to boost the accuracy of predictive models, they were included as part of the tested deep learning architectures to maximize generalizability and accuracy, while making these predictive models interpretable and exploring whether deep learning can even enhance their descriptive accuracy further.

This study investigates the trustworthiness of explanation models, and more specifically, those explaining deep learning predictive models. For instance, does the depth, defined as the number of hidden layers, of an MLP neural network increase the trustworthiness of its SHAP explanation model? The answer to this question will help determine whether it is possible to have very accurate AES models while having competitively interpretable/explainable models, the cornerstone for the generation of formative feedback. Remember that formative feedback is defined as “any kind of information provided to students about their actual state of learning or performance in order to modify the learner’s thinking or behavior in the direction of the learning standards” and that formative feedback “conveys where the student is, what are the goals to reach, and how to reach the goals” (Goldin et al., 2017). This notion contrasts with summative feedback, which basically is “a justification of the assessment results” (Hao and Tsikerdekis, 2019).

As pointed out in the previous section, multiple SHAP implementations are evaluated in this study. Hence, this paper examines whether the faster DeepSHAP and GradientSHAP implementations are as reliable as the slower KernelSHAP implementation. The answer to this research question will shed light on the feasibility of providing immediate formative feedback multiple times throughout students’ writing processes.

This study also looks at whether a summary of the data produces explanations as trustworthy as those from the original data. This question will be of interest to AES researchers and practitioners because it could significantly decrease the processing time of the computationally intensive and model-agnostic KernelSHAP implementation and further test the potential of customizable explanations.

KernelSHAP makes it possible to specify the total number of features that will shape the explanation of a prediction; for instance, this study experiments with explanations of 16 and 32 features and observes whether there exists a statistically significant difference in the reliability of these explanation models. Knowing this will hint at whether simpler or more complex explanations are more desirable when it comes to optimizing their trustworthiness. If there is no statistically significant difference, then AES practitioners are given further flexibility in the selection of SHAP implementations to find the sweet spot between complexity of explanations and speed of processing. For instance, the KernelSHAP implementation allows the number of factors making up an explanation to be customized, while the faster DeepSHAP and GradientSHAP do not.

Finally, this paper highlights the means to debug and compare the performance of predictive models through their explanations. Once a model is debugged, the process can be reused to fine-tune feature selection and/or feature engineering to improve predictive models and for the generation of formative feedback to both students and teachers.

The training, validation, and testing sets consist of 1567 essays, each of which has been scored by two human raters, who assigned a score between 0 and 3 per rubric (ideas, organization, style, and conventions). In particular, this article looks at predictive and descriptive accuracy of AES models on the third rubric, style. Note that although each essay has been scored by two human raters, the literature (Shermis, 2014) is not explicit about whether only two or more human raters participated in the scoring of all 1567 essays. Given the huge number of essays, it is likely that more than two human raters were involved, so the amount of noise introduced by the various raters’ biases is unknown, although it is probably balanced to some degree between the two groups of raters. Figure 2 shows the confusion matrices of human raters on the Style rubric. The diagonal elements (dark gray) correspond to exact matches, whereas the light gray squares indicate adjacent matches. Figure 2A delineates the number of essays per pair of ratings, and Figure 2B shows the percentages per pair of ratings. The agreement level between each pair of human raters, measured by the quadratic weighted kappa, is 0.54; the percentage of exact matches is 65.3%; the percentage of adjacent matches is 34.4%; and 0.3% of essays are neither exact nor adjacent matches. Figures 2A,B specify the distributions of 0−3 ratings per group of human raters. Figure 2C exhibits the distribution of resolved scores (a resolved score is the sum of the two human ratings). The mean is 3.99 (with a standard deviation of 1.10), and the median and mode are 4. It is important to note that the levels of predictive accuracy reported in this article are measured on the scale of resolved scores (0−6) and that larger scales tend to slightly inflate quadratic weighted kappa values, which must be taken into account when comparing against the level of agreement between human raters. Comparison of percentages of exact and adjacent matches must also be made with this scoring scale discrepancy in mind.
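
The agreement measures mentioned above can be illustrated with a minimal sketch. It assumes two equal-length lists of integer ratings on a known scale; the function names and the toy ratings are hypothetical and are not taken from the study's code or data.

```python
def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Cohen's kappa with quadratic weights for two lists of integer ratings."""
    n = max_rating - min_rating + 1
    total = len(a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # agreement expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


def match_rates(a, b):
    """Fractions of exact matches and adjacent (off-by-one) matches."""
    total = len(a)
    exact = sum(x == y for x, y in zip(a, b)) / total
    adjacent = sum(abs(x - y) == 1 for x, y in zip(a, b)) / total
    return exact, adjacent


# Toy ratings on the 0-3 scale (illustrative only).
rater_1 = [0, 1, 2, 3, 2, 1]
rater_2 = [0, 2, 2, 3, 1, 1]
print(quadratic_weighted_kappa(rater_1, rater_2, 0, 3))
print(match_rates(rater_1, rater_2))
```

The same functions apply to the 0−6 resolved-score scale by passing `min_rating=0` and `max_rating=6`.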

Figure 2. Summary of the essay dataset (1567 Grade-7 narrative essays) investigated in this study. (A) Number of essays per pair of human ratings; the diagonal (dark gray squares) lists the numbers of exact matches while the light-gray squares list the numbers of adjacent matches; and the bottom row and the rightmost column highlight the distributions of ratings for both groups of human raters. (B) Percentages of essays per pair of human ratings; the diagonal (dark gray squares) lists the percentages of exact matches while the light-gray squares list the percentages of adjacent matches; and the bottom row and the rightmost column highlight the distributions (frequencies) of ratings for both groups of human raters. (C) The distribution of resolved rubric scores; a resolved score is the addition of its two constituent human ratings.

Predictive Accuracy and Descriptive Accuracy

Table 4 compiles the performance outcomes of the 10 predictive models evaluated in this study. The reader should remember that the performance of each model was averaged over five iterations and that two models were trained per number of hidden layers, one non-ensemble and one ensemble. Except for the 6-layer models, there is no clear winner among other models. Even for the 6-layer models, they are superior in terms of exact matches, the primary goal for a reliable AES system, but not according to adjacent matches. Nevertheless, on average ensemble models slightly outperform non-ensemble models. Hence, these ensemble models will be retained for the next analysis step. Moreover, given that five ensemble models were trained per neural network depth, the most accurate model among the five is selected and displayed in Table 4 .

Table 4. Performance of majority classifier and average/maximal performance of trained predictive models.

Next, several explanation models are trained for each selected ensemble predictive model. Every predictive model is explained by the “Deep,” “Grad,” and “Random” explainers, except for the 6-layer model, for which it was not possible to train a “Deep” explainer, apparently due to a bug in the original SHAP code triggered by a condition unique to either this study’s data or its neural network architecture. Investigating and fixing this issue was beyond the scope of this study. As will be demonstrated, no statistically significant difference exists between the accuracy of these explainers.

The “Random” explainer serves as a baseline model for comparison purposes. Remember that to evaluate the reliability of explanation models, the concurrent impact of randomly selecting and ignoring a subset of features on the prediction and explanation of rubric scores is analyzed. If the prediction changes significantly and its corresponding explanation changes (beyond a set threshold) accordingly (a true positive) or if the prediction remains within the threshold as does the explanation (a true negative), then the explanation is deemed trustworthy. The Random explainer simulates random explanations by randomly selecting 32 non-zero features from the original set of 282 features. These random explanations consist only of non-zero features because, according to SHAP’s missingness property, a feature with a zero or a missing value never gets assigned any contribution to the prediction. If at least one of these 32 features is also an element of the subset of the ignored features, then the explanation is considered untrustworthy, no matter the size of a feature’s contribution.
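
The bookkeeping behind this evaluation can be sketched as follows. The sketch takes as given two flags per perturbation (whether the prediction changed and whether the explanation changed beyond the threshold). The paragraph only defines true positives and true negatives; the assignment of the two mismatched cases to false positives and false negatives is an assumption made here, and all names are hypothetical.

```python
def classify_outcome(prediction_changed, explanation_changed):
    """Label one perturbation outcome.

    TP and TN follow the definitions in the text; FP and FN are an assumed
    reading in which the explanation is the detector of prediction changes.
    """
    if prediction_changed and explanation_changed:
        return "TP"
    if not prediction_changed and not explanation_changed:
        return "TN"
    if explanation_changed:
        return "FP"  # explanation changed although the prediction did not
    return "FN"      # prediction changed but the explanation did not


def precision_recall_f1(outcomes):
    """Precision, recall, and F1 score from a list of outcome labels."""
    tp = outcomes.count("TP")
    fp = outcomes.count("FP")
    fn = outcomes.count("FN")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy flags (prediction_changed, explanation_changed), illustrative only.
flags = [(True, True), (True, False), (False, False), (False, True), (True, True)]
outcomes = [classify_outcome(p, e) for p, e in flags]
print(outcomes, precision_recall_f1(outcomes))
```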

As for the 2-layer model, six different explanation models are evaluated. Recall that 2-layer models generated the least mean squared error (MSE) during hyperparameter optimization (see Table 1). Hence, this specific type of architecture was selected to test the reliability of these various explainers. The “Kernel” explainer is the most computing-intensive and took approximately 8 h of processing. It was trained using the full distributions of feature values in the training set and shaped explanations in terms of 32 features; the “Kernel-16” and “Kernel-32” models were trained on a summary (50 k-means centroids) of the training set to accelerate the processing by about one order of magnitude (less than 1 h). Besides, the “Kernel-16” explainer derived explanations in terms of 16 features, while the “Kernel-32” explainer explained predictions through 32 features. Table 5 exhibits the descriptive accuracy of these various explanation models at the 0.10 and 0.25 thresholds; in other words, by ignoring a subset of randomly picked features, it assesses whether or not the prediction and explanation change simultaneously. Note also how each explanation model, no matter the underlying predictive model, outperforms the “Random” model.

Table 5. Precision, recall, and F1 scores of the various explainers tested per type of predictive model.

The first research question addressed in this subsection asks whether there exists a statistically significant difference between the “Kernel” explainer, which generates 32-feature explanations and is trained on the whole training set, and the “Kernel-32” explainer, which also generates 32-feature explanations and is trained on a summary of the training set. To determine this, an independent t-test was conducted using the precision, recall, and F1-score distributions (500 iterations) of both explainers. Table 6 reports the p-values of all the tests for the 0.10 and 0.25 thresholds. It reveals that there is no statistically significant difference between the two explainers.
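
As an illustration of this kind of comparison, the sketch below computes the pooled two-sample t statistic in plain Python. The two-sided p-value uses a normal approximation to the t distribution, which is an assumption made here for simplicity and is reasonable only for large samples such as 500 iterations per explainer; it is not necessarily how the study computed its p-values. The function name and the toy samples are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def independent_t_test(x, y):
    """Pooled-variance two-sample t statistic with an approximate p-value.

    The p-value comes from a standard normal approximation, so it is only
    indicative when both samples are large.
    """
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Toy precision samples for two explainers (illustrative only).
kernel = [0.64, 0.63, 0.65, 0.64, 0.62, 0.66] * 50
kernel_32 = [0.64, 0.65, 0.63, 0.64, 0.66, 0.62] * 50
t_stat, p_value = independent_t_test(kernel, kernel_32)
print(t_stat, p_value)
```

A p-value above the chosen significance level (for example 0.05) would be read as no statistically significant difference between the two means.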

Table 6. p-values of independent t-tests comparing whether there exist statistically significant differences between the mean precisions, recalls, and F1-scores of 2-layer explainers and between those of the 2-layer’s, 4-layer’s, and 6-layer’s Gradient explainers.

The next research question tests whether there exists a difference in the trustworthiness of explainers shaping 16-feature or 32-feature explanations. Again, t-tests were conducted to verify this. Table 6 lists the resulting p-values. Again, there is no statistically significant difference in the average precisions, recalls, and F1-scores of both explainers.

This leads to investigating whether the “Kernel,” “Deep,” and “Grad” explainers are equivalent. Table 6 exhibits the results of the t-tests conducted to verify this and reveals that none of the explainers produce a statistically significantly better performance than the other.

Armed with this evidence, it is now possible to verify whether deeper MLP neural networks produce more trustworthy explanation models. For this purpose, the performance of the “Grad” explainer for each type of predictive model will be compared against each other. The same methodology as previously applied is employed here. Table 6 , again, confirms that the explanation model of the 2-layer predictive model is statistically significantly less trustworthy than the 4-layer’s explanation model; the same can be said of the 4-layer and 6-layer models. The only exception is the difference in average precision between 2-layer and 4-layer models and between 4-layer and 6-layer models; however, there clearly exists a statistically significant difference in terms of precision (and also recall and F1-score) between 2-layer and 6-layer models.

The Best Subset of Essays to Judge AES Relevancy

Table 7 lists the four best essays optimizing feature coverage (93.9%) along with their resolved and predicted scores. Notice how two of the four essays were picked by the adapted SP-LIME algorithm with some strong disagreement between the human and the machine graders, two were picked with short and trivial text, and two were picked exhibiting perfect agreement between the human and machine graders. Interestingly, each pair of longer and shorter essays exposes both strong agreement and strong disagreement between the human and AI agents, offering an opportunity to debug the model and evaluate its ability to detect the presence or absence of more basic (e.g., very small number of words, occurrences of sentence fragments) and more advanced aspects (e.g., cohesion between adjacent sentences, variety of sentence structures) of narrative essay writing and to appropriately reward or penalize them.
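
The idea of picking a small set of essays that maximizes feature coverage can be sketched with the greedy submodular pick of the original SP-LIME algorithm. This is not the adapted version used in this study, whose details are not given in this section. The sketch assumes a matrix of per-essay feature contributions and takes the global importance of a feature to be the square root of its summed absolute contributions; all names and the toy matrix are hypothetical.

```python
from math import sqrt


def feature_importance(weights):
    """Global importance per feature: sqrt of summed absolute contributions."""
    n_features = len(weights[0])
    return [sqrt(sum(abs(row[j]) for row in weights)) for j in range(n_features)]


def greedy_pick(weights, budget):
    """Greedily pick instances whose non-zero features add the most coverage."""
    importance = feature_importance(weights)
    covered = set()
    picked = []
    for _ in range(min(budget, len(weights))):
        best, best_gain = None, -1.0
        for i, row in enumerate(weights):
            if i in picked:
                continue
            gain = sum(importance[j] for j, w in enumerate(row)
                       if w != 0 and j not in covered)
            if gain > best_gain:
                best, best_gain = i, gain
        picked.append(best)
        covered |= {j for j, w in enumerate(weights[best]) if w != 0}
    return picked


def coverage(weights, picked):
    """Share of total feature importance covered by the picked instances."""
    importance = feature_importance(weights)
    covered = {j for i in picked for j, w in enumerate(weights[i]) if w != 0}
    return sum(importance[j] for j in covered) / sum(importance)


# Toy contribution matrix: 3 essays x 3 features (illustrative only).
toy = [[1.0, 0.0, 0.0], [0.0, 2.0, 3.0], [1.0, 1.0, 0.0]]
chosen = greedy_pick(toy, budget=2)
print(chosen, coverage(toy, chosen))
```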

Table 7. Set of best essays to evaluate the correctness of the 6-layer ensemble AES model.

Local Explanation: The Decision Plot

The decision plot lists writing features by order of importance from top to bottom. The line segments display the contribution (SHAP value) of each feature to the predicted rubric score. Note that an actual decision plot consists of all 282 features and that only the top portion of it (20 most important features) can be displayed (see Figure 3). A decision plot is read from bottom to top. The line starts at the base value and ends at the predicted rubric score. Given that the “Grad” explainer is the only explainer common to all predictive models, it has been selected to derive all explanations. The decision plots in Figure 3 show the explanations of the four essays in Table 7; the dashed line in these plots represents the explanation of the most accurate predictive model, that is, the ensemble model with 6 hidden layers, which also produced the most trustworthy explanation model. The predicted rubric score of each explanation model is listed in the bottom-right legend. Explanations of the writing features follow in a later subsection.
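
How the line of a decision plot is traced can be shown with a small sketch: starting from the base value, contributions are accumulated from the least to the most important feature, so the last point equals the predicted score. The ordering by absolute contribution for a single essay, the function name, and the toy values are assumptions made for illustration.

```python
def decision_path(base_value, shap_values, feature_names):
    """Cumulative path of a decision plot, read from bottom to top.

    Returns the feature names from least to most important and the running
    totals, which start at the base value and end at the predicted score.
    """
    order = sorted(range(len(shap_values)), key=lambda i: abs(shap_values[i]))
    names = []
    path = [base_value]
    for i in order:
        names.append(feature_names[i])
        path.append(path[-1] + shap_values[i])
    return names, path


# Toy explanation with three features (illustrative values only).
names, path = decision_path(
    4.0, [0.5, -0.2, 0.1], ["word_count", "grammar", "hdd42_aw"]
)
for name, value in zip(names, path[1:]):
    print(f"{name}: running prediction {value:.2f}")
```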

Figure 3. Comparisons of all models’ explanations of the most representative set of four essays: (A) Essay 228, (B) Essay 68, (C) Essay 219, and (D) Essay 124.

Global Explanation: The Summary Plot

It is advantageous to use SHAP to build explanation models because it provides a single framework to discover the writing features that are important to an individual essay (local) or a set of essays (global). While the decision plots list features of local importance, Figure 4’s summary plot ranks writing features by order of global importance (from top to bottom). All 314 essays of the testing set are represented as dots in the scatterplot of each writing feature. The position of a dot on the horizontal axis corresponds to the importance (SHAP value) of the writing feature for a specific essay and its color indicates the magnitude of the feature value in relation to the range of all 314 feature values. For example, large or small numbers of words within an essay generally contribute to increase or decrease rubric scores by up to 1.5 and 1.0, respectively. Decision plots can also be used to find the most important features for a small subset of essays; Figure 5 demonstrates the new ordering of writing indices when aggregating the feature contributions (summing the absolute values of SHAP values) of the four essays in Table 7. Moreover, Figure 5 makes it possible to compare the contributions of a feature to various essays. Note how the orderings in Figures 3−5 can differ from each other, sharing many features of global importance as well as having their own unique features of local importance.
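
The aggregation described here (summing the absolute SHAP values of each feature over a set of essays) can be sketched as follows, assuming the SHAP values are available as one row per essay; the function name and toy values are hypothetical.

```python
def rank_features(shap_matrix, feature_names):
    """Rank features by the sum of absolute SHAP values over a set of essays."""
    totals = [
        sum(abs(row[j]) for row in shap_matrix)
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, totals), key=lambda pair: pair[1], reverse=True)


# Toy SHAP values: 2 essays x 3 features (illustrative only).
toy_shap = [[0.5, -0.1, 0.0], [-0.4, 0.3, 0.05]]
for name, total in rank_features(toy_shap, ["word_count", "grammar", "Sv_GI"]):
    print(f"{name}: {total:.2f}")
```

Passing all essays of a testing set gives a global ordering, while passing only a few essays gives the kind of subset-specific ordering discussed above.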

Figure 4. Summary plot listing the 32 most important features globally.

Figure 5. Decision plot delineating the best model’s explanations of Essays 228, 68, 219, and 124 (6-layer ensemble).

Definition of Important Writing Indices

The reader should understand that it is beyond the scope of this paper to describe all writing features thoroughly. Nevertheless, the summary and decision plots in Figures 4, 5 make it possible to identify a subset of features that should be examined in order to validate this study’s predictive model. Supplementary Table 1 combines and describes the 38 features in Figures 4, 5.

Dependence Plots

Although the summary plot in Figure 4 is insightful to determine whether small or large feature values are desirable, the dependence plots in Figure 6 prove essential to recommend whether a student should aim at increasing or decreasing the value of a specific writing feature. The dependence plots also reveal whether the student should directly act upon the targeted writing feature or indirectly on other features. The horizontal axis in each of the dependence plots in Figure 6 is the scale of the writing feature and the vertical axis is the scale of the writing feature’s contributions to the predicted rubric scores. Each dot in a dependence plot represents one of the testing set’s 314 essays, that is, the feature value and SHAP value belonging to the essay. The vertical dispersion of the dots on small intervals of the horizontal axis is indicative of interaction with other features (Molnar, 2020). If the vertical dispersion is widespread (e.g., the [50, 100] horizontal-axis interval in the “word_count” dependence plot), then the contribution of the writing feature is most likely dependent to some degree on other writing feature(s).
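
One simple way to quantify the vertical dispersion mentioned above is to bin the feature values and compute the spread of the SHAP values within each bin. This is only a rough sketch of the idea, not a procedure taken from the study; the bin width, the function name, and the toy values are assumptions.

```python
from statistics import pstdev


def dispersion_by_bin(feature_values, shap_values, bin_width):
    """Standard deviation of SHAP values within each interval of feature values."""
    bins = {}
    for x, s in zip(feature_values, shap_values):
        lower = int(x // bin_width) * bin_width  # lower bound of the interval
        bins.setdefault(lower, []).append(s)
    return {lower: pstdev(values) for lower, values in sorted(bins.items())}


# Toy word counts and SHAP values (illustrative only).
word_counts = [10, 20, 60, 70, 80]
contributions = [-0.5, -0.5, 0.0, 0.4, -0.4]
print(dispersion_by_bin(word_counts, contributions, bin_width=50))
```

A larger value for an interval suggests that the contribution of the feature in that range depends more strongly on other features.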

Figure 6. Dependence plots: the horizontal axes represent feature values while vertical axes represent feature contributions (SHAP values). Each dot represents one of the 314 essays of the testing set and is colored according to the value of the feature with which it interacts most strongly. (A) word_count. (B) hdd42_aw. (C) ncomp_stdev. (D) dobj_per_cl. (E) grammar. (F) SENTENCE_FRAGMENT. (G) Sv_GI. (H) adjacent_overlap_verb_sent.

The contributions of this paper can be summarized as follows: (1) it proposes a means (SHAP) to explain individual predictions of AES systems and provides flexible guidelines to build powerful predictive models using more complex algorithms such as ensembles and deep learning neural networks; (2) it applies a methodology to quantitatively assess the trustworthiness of explanation models; (3) it tests whether faster SHAP implementations impact the descriptive accuracy of explanation models, giving insight on the applicability of SHAP in real pedagogical contexts such as AES; (4) it offers a toolkit to debug AES models, highlights linguistic intricacies, and underscores the means to offer formative feedback to novice writers; and more importantly, (5) it empowers learning analytics practitioners to make AI pedagogical agents accountable to the human educator, the ultimate problem holder responsible for the decisions and actions of AI ( Abbass, 2019 ). Basically, learning analytics (which encompasses tools such as AES) is characterized as an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that recurrently measures and proactively advances knowledge boundaries in human learning.

To exemplify this, imagine an AES system that supports instructors in the detection of plagiarism, gaming behaviors, and the marking of writing activities. As previously mentioned, essays are marked according to a grid of scoring rubrics: ideas, organization, style, and conventions. While an abundance of data (e.g., the 1592 writing metrics) can be collected by the AES tool, these data might still be insufficient to automate the scoring process of certain rubrics (e.g., ideas). Nevertheless, some scoring subtasks such as assessing a student’s vocabulary, sentence fluency, and conventions might still be assigned to AI since the data types available through existing automatic linguistic analysis tools prove sufficient to reliably alleviate the human marker’s workload. Interestingly, learning analytics is key for the accountability of AI agents to the human problem holder. As the volume of writing data (through a large student population, high-frequency capture of learning episodes, and variety of big learning data) accumulates in the system, new AI agents (predictive models) may apply for the job of “automarker.” These AI agents can be quite transparent through XAI (Arrieta et al., 2020) explanation models, and a human instructor may assess the suitability of an agent for the job and hire the candidate agent that comes closest to human performance. Explanations derived from these models could serve as formative feedback to the students.

The AI marker can be assigned to assess the writing activities that are similar to those previously scored by the human marker(s) from whom it learns. Dissimilar and unseen essays can be automatically assigned to the human marker for reliable scoring, and the AI agent can learn from this manual scoring. To ensure accountability, students should be allowed to appeal the AI agent’s marking to the human marker. In addition, the human marker should be empowered to monitor and validate the scoring of select writing rubrics scored by the AI marker. If the human marker does not agree with the machine scores, the writing assignments may be flagged as incorrectly scored and re-assigned to a human marker. These flagged assignments may serve to update predictive models. Moreover, among the essays that are assigned to the machine marker, a small subset can be simultaneously assigned to the human marker for continuous quality control; that is, to continue comparing whether the agreement level between human and machine markers remains within an acceptable threshold. The human marker should be at any time able to “fire” an AI marker or “hire” an AI marker from a pool of potential machine markers.

This notion of a human-AI fusion has been observed in previous AES systems where the human marker’s workload has been found to be significantly alleviated, passing from scoring several hundreds of essays to just a few dozen ( Dronen et al., 2015 ; Hellman et al., 2019 ). As the AES technology matures and as the learning analytics tools continue to penetrate the education market, this alliance of semi-autonomous human and AI agents will lead to better evidence-based/informed pedagogy ( Nelson and Campbell, 2017 ). Such a human-AI alliance can also be guided to autonomously self-regulate its own hypothesis-authoring and data-acquisition processes for purposes of measuring and advancing knowledge boundaries in human learning.

Real-Time Formative Pedagogical Feedback

This paper provides the evidence that deep learning and SHAP can be used not only to score essays automatically but also to offer explanations in real-time. More specifically, the processing time to derive the 314 explanations of the testing set’s essays has been benchmarked for several types of explainers. It was found that the faster DeepSHAP and GradientSHAP implementations, which took only a few seconds of processing, did not produce less accurate explanations than the much slower KernelSHAP. KernelSHAP took approximately 8 h of processing to derive the explanation model of a 2-layer MLP neural network predictive model and 16 h for the 6-layer predictive model.

This finding also holds for various configurations of KernelSHAP, where the number of features (16 vs. 32) shaping the explanation (where all other features are assigned zero contributions) did not produce a statistically significant difference in the reliability of the explanation models. On average, the models had a precision between 63.9 and 64.1% and a recall between 41.0 and 42.9%. This means that after perturbation of the predictive and explanation models, on average 64% of the predictions the explanation model identified as changing were accurate. On the other hand, only about 42% of all predictions that changed were detected by the various 2-layer explainers. An explanation was considered untrustworthy if the sum of its feature contributions, when added to the average prediction (base value), was not within 0.1 of the perturbed prediction. Similarly, the average precision and recall of 2-layer explainers for the 0.25-threshold were about 69% and 62%, respectively.
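
The criterion stated above can be written as a one-line check: the base value plus the sum of the feature contributions must fall within the threshold of the perturbed prediction. The sketch below is a minimal illustration with hypothetical names and toy numbers; how the perturbed prediction and contributions are obtained is outside its scope.

```python
def is_trustworthy(base_value, contributions, perturbed_prediction, threshold=0.10):
    """True if base value + summed contributions is within the threshold
    of the perturbed prediction."""
    explained = base_value + sum(contributions)
    return abs(explained - perturbed_prediction) <= threshold


# Toy explanation (illustrative values only): explained score is 4.2.
contribs = [0.3, -0.1]
print(is_trustworthy(4.0, contribs, 4.25))                  # gap of about 0.05
print(is_trustworthy(4.0, contribs, 4.40))                  # gap of about 0.20
print(is_trustworthy(4.0, contribs, 4.40, threshold=0.25))  # looser threshold
```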

Impact of Deep Learning on Descriptive Accuracy of Explanations

By analyzing the performance of the various predictive models in Table 4 , no clear conclusion can be reached as to which model should be deemed as the most desirable. Despite the fact that the 6-layer models slightly outperform the other models in terms of accuracy (percentage of exact matches between the resolved [human] and predicted [machine] scores), they are not the best when it comes to the percentages of adjacent (within 1 and 2) matches. Nevertheless, if the selection of the “best” model is based on the quadratic weighted kappas, the decision remains a nebulous one to make. Moreover, ensuring that machine learning actually learned something meaningful remains paramount, especially in contexts where the performance of a majority classifier is close to the human and machine performance. For example, a majority classifier model would get 46.3% of predictions accurate ( Table 4 ), while trained predictive models at best produce accurate predictions between 51.9 and 55.1%.
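
The majority-classifier baseline mentioned above always predicts the most frequent score seen in training. A minimal sketch, with a hypothetical function name and toy resolved scores:

```python
from collections import Counter


def majority_baseline_accuracy(train_scores, test_scores):
    """Most frequent training score and its exact-match accuracy on a test set."""
    majority = Counter(train_scores).most_common(1)[0][0]
    accuracy = sum(score == majority for score in test_scores) / len(test_scores)
    return majority, accuracy


# Toy resolved scores on the 0-6 scale (illustrative only).
majority, accuracy = majority_baseline_accuracy([4, 4, 3, 5, 4], [4, 3, 4, 2])
print(majority, accuracy)
```

Comparing a trained model's exact-match rate against this figure shows how much it has learned beyond the score distribution.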

Since the interpretability of a machine learning model should be prioritized over accuracy (Ribeiro et al., 2016; Murdoch et al., 2019) for questions of transparency and trust, this paper investigated whether the impact of the depth of an MLP neural network might be more visible when assessing its interpretability, that is, the trustworthiness of its corresponding SHAP explanation model. The data in Tables 1, 5, 6 effectively support the hypothesis that as the depth of the neural network increases, the precision and recall of the corresponding explanation model improve. Besides, this observation is particularly interesting because the 4-layer (Grad) explainer, which has hardly more parameters than the 2-layer model, is also more accurate than the 2-layer model, suggesting that the 6-layer explainer is most likely superior to other explainers not only because of its greater number of parameters, but also because of its number of hidden layers. By increasing the number of hidden layers, it can be seen that the precision and recall of an explanation model can pass on average from approximately 64 to 73% and from 42 to 52%, respectively, for the 0.10-threshold; and for the 0.25-threshold, from 69 to 79% and from 62 to 75%, respectively.

These results imply that the descriptive accuracy of an explanation model is evidence of effective machine learning, which may exceed the level of agreement between the human and machine graders. Moreover, given that the superiority of a trained predictive model over a majority classifier is not always obvious, the consistency of its associated explanation model demonstrates this better. Note that theoretically the SHAP explanation model of the majority classifier should assign a zero contribution to each writing feature since the average prediction of such a model is actually the most frequent rubric score given by the human raters; hence, the base value is the explanation.

An interesting fact emerges from Figure 3 , that is, all explainers (2-layer to 6-layer) are more or less similar. It appears that they do not contradict each other. More specifically, they all agree on the direction of the contributions of the most important features. In other words, they unanimously determine that a feature should increase or decrease the predicted score. However, they differ from each other on the magnitude of the feature contributions.

To conclude, this study highlights the need to train predictive models that consider the descriptive accuracy of explanations. The idea is that, just as explanation models consider predictions to derive explanations, explanations should in turn be considered when training predictive models. This would not only help train interpretable models the very first time but also potentially break the status quo that may exist among similar explainers to possibly produce more powerful models. In addition, this research calls for a mechanism (e.g., causal diagrams) to allow teachers to guide the training process of predictive models. Put another way, as LA practitioners debug predictive models, their insights should be encoded in a language that will be understood by the machine and that will guide the training process to avoid learning the same errors and to accelerate the training time.

Accountable AES

Now that the superiority of the 6-layer predictive and explanation models has been demonstrated, some aspects of the relevancy of explanations should be examined more deeply, knowing that having an explanation model consistent with its underlying predictive model does not guarantee relevant explanations. Table 7 discloses the set of four essays that optimize the coverage of most globally important features to evaluate the correctness of the best AES model. It is quite intriguing to note that two of the four essays are among the 16 essays that have a major disagreement (off by 2) between the resolved and predicted rubric scores (1 vs. 3 and 4 vs. 2). The AES tool clearly overrated Essay 228, while it underrated Essay 219. Naturally, these two essays offer an opportunity to understand what is wrong with the model and ultimately debug the model to improve its accuracy and interpretability.

In particular, Essay 228 raises suspicion on the positive contributions of features such as “Ortho_N,” “lemma_mattr,” “all_logical,” “det_pobj_deps_struct,” and “dobj_per_cl.” Moreover, notice how the remaining 262 less important features (not visible in the decision plot in Figure 5 ) have already inflated the rubric score beyond the base value, more than any other essay. Given the very short length and very low quality of the essay, whose meaning is seriously undermined by spelling and grammatical errors, it is of utmost importance to verify how some of these features are computed. For example, is the average number of orthographic neighbors (Ortho_N) per token computed for unmeaningful tokens such as “R” and “whe”? Similarly, are these tokens considered as types in the type-token ratio over lemmas (lemma_mattr)? Given the absence of a meaningful grammatical structure conveying a complete idea through well-articulated words, it becomes obvious that the quality of NLP (natural language processing) parsing may become a source of (measurement) bias impacting both the way some writing features are computed and the predicted rubric score. To remedy this, two solutions are proposed: (1) enhancing the dataset with the part-of-speech sequence or the structure of dependency relationships along with associated confidence levels, or (2) augmenting the essay dataset with essays enclosing various types of non-sensical content to improve the learning of these feature contributions.

Note that all four essays have a text length smaller than the average: 171 words. Notice also how the “hdd42_aw” and “hdd42_fw” play a significant role to decrease the predicted score of Essays 228 and 68. The reader should note that these metrics require a minimum of 42 tokens in order to compute a non-zero D index, a measure of lexical diversity as explained in Supplementary Table 1 . Figure 6B also shows how zero “hdd42_aw” values are heavily penalized. This is extra evidence that supports the strong role that the number of words plays in determining these rubric scores, especially for very short essays where it is one of the few observations that can be reliably recorded.

Two other issues with the best trained AES model were identified. First, in the eyes of the model, the lower the average number of direct objects per clause (dobj_per_cl), as seen in Figure 6D, the better. This appears to contradict one of the requirements of the “Style” rubric, which looks for a variety of sentence structures. Remember that direct objects imply the presence of transitive verbs (action verbs) and that the balanced usage of linking verbs and action verbs as well as of transitive and intransitive verbs is key to meet the requirement of variety of sentence structures. Moreover, note that the writing feature is about counting the number of direct objects per clause, not per sentence. Only one direct object is therefore possible per clause. On the other hand, a sentence may contain several clauses, which determines whether the sentence is a simple, compound, or complex sentence. This also means that a sentence may have multiple direct objects and that a high ratio of direct objects per clause is indicative of sentence complexity. Too much complexity is also undesirable. Hence, it is fair to conclude that the higher range of feature values has reasonable feature contributions (SHAP values), while the lower range does not capture well the requirements of the rubric. The dependence plot should rather display a positive peak somewhere in the middle. Notice how the poor quality of Essay 228’s single sentence prevented the proper detection of the single direct object, “broke my finger,” and the so-called absence of direct objects was one of the reasons for the wrongful increase in the predicted rubric score.

The model’s second issue discussed here is the presence of sentence fragments, a type of grammatical error. Essentially, a sentence fragment is a clause that misses one of three critical components: a subject, a verb, or a complete idea. Figure 6E shows the contribution model of grammatical errors, all types combined, while Figure 6F shows specifically the contribution model of sentence fragments. It is interesting to see how SHAP further penalizes larger numbers of grammatical errors and that it takes into account the length of the essay (red dots represent essays with larger numbers of words; blue dots represent essays with smaller numbers of words). For example, except for essays with no identified grammatical errors, longer essays are less penalized than shorter ones. This is particularly obvious when there are 2−4 grammatical errors. The model increases the predicted rubric score only when there is no grammatical error. Moreover, the model tolerates longer essays with only one grammatical error, which sounds quite reasonable. On the other hand, the model finds high numbers of sentence fragments, a non-trivial type of grammatical error, desirable. Even worse, the model decreases the rubric score of essays having no sentence fragment. Although grammatical issues are beyond the scope of the “Style” rubric, the model has probably included these features because of their impact on the quality of assessment of vocabulary usage and sentence fluency. The reader should observe how the very poor quality of an essay can even prevent the detection of such fundamental grammatical errors, as in the case of Essay 228, where the AES tool did not find any grammatical error or sentence fragment. Therefore, there should be a way for AES systems to detect a minimum level of text quality before attempting to score an essay. Note that the objective of this section was not to undertake thorough debugging of the model, but rather to underscore the effectiveness of SHAP in doing so.

Formative Feedback

Once an AES model is considered reasonably valid, SHAP can be a suitable formalism to empower the machine to provide formative feedback. For instance, the explanation of Essay 124, which has been assigned a rubric score of 3 by both human and machine markers, indicates that the top two factors contributing to decreasing the predicted rubric score are: (1) the essay length being smaller than average, and (2) the average number of verb lemma types occurring at least once in the next sentence (adjacent_overlap_verb_sent). Figures 6A,H give the overall picture in which the realism of the contributions of these two features can be analyzed. More specifically, Essay 124 is one of very few essays (Figure 6H) that make redundant usage of the same verbs across adjacent sentences. Moreover, the essay displays poor sentence fluency, with everything expressed in only two sentences. To understand more accurately the impact of “adjacent_overlap_verb_sent” on the prediction, a few spelling errors have been corrected and the text has been divided into four sentences instead of two. Revision 1 in Table 8 exhibits the corrections made to the original essay. The decision plot’s dashed line in Figure 3D represents the original explanation of Essay 124, while Figure 7A demonstrates the new explanation of the revised essay. It can be seen that the “adjacent_overlap_verb_sent” feature is still the second most important feature in the new explanation of Essay 124, with a feature value of 0.429, still considered as very poor according to the dependence plot in Figure 6H.
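As an illustration of what a feature such as “adjacent_overlap_verb_sent” measures, the sketch below computes the share of verb lemma types in each sentence that reappear in the following sentence. This is one plausible reading of the description above, not the exact formula of the tool that produced the feature: the normalization, the function name, and the pre-lemmatized input are all assumptions.

```python
def adjacent_verb_overlap(sentences):
    """Share of verb lemma types that also occur in the next sentence.

    `sentences` is a list of lists of verb lemmas, one inner list per
    sentence (hypothetical input; lemmatization is assumed to be done).
    """
    overlapping = 0
    total = 0
    for current, following in zip(sentences, sentences[1:]):
        current_types = set(current)
        total += len(current_types)
        overlapping += len(current_types & set(following))
    return overlapping / total if total else 0.0


redundant = [["go", "see"], ["go", "see"], ["go", "eat"]]
varied = [["go", "see"], ["eat", "drink"], ["walk", "find"]]

print(adjacent_verb_overlap(redundant))  # 0.75
print(adjacent_verb_overlap(varied))     # 0.0
```

Under this reading, a writer lowers the value by choosing different verbs from one sentence to the next, which is the revision strategy explored with Essay 124.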

Table 8. Revisions of Essay 124: improvement of sentence splitting, correction of some spelling errors, and elimination of redundant usage of same verbs (bold for emphasis in Essay 124’s original version; corrections in bold for Revisions 1 and 2).

Figure 7. Explanations of the various versions of Essay 124 and evaluation of feature effect for a range of feature values. (A) Explanation of Essay 124’s first revision. (B) Forecasting the effect of changing the ‘adjacent_overlap_verb_sent’ feature on the rubric score. (C) Explanation of Essay 124’s second revision. (D) Comparison of the explanations of all Essay 124’s versions.

To show how SHAP could be leveraged to offer remedial formative feedback, the revised version of Essay 124 will be explained again for eight different values of “adjacent_overlap_verb_sent” (0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.0), while keeping the values of all other features constant. The set of these eight essays is explained by a newly trained SHAP explainer (Gradient), producing new SHAP values for each feature and each “revised” essay. Notice how the new model, called the feedback model, makes it possible to foresee by how much a novice writer can hope to improve his/her score according to the “Style” rubric. If the student employs different verbs in every sentence, the feedback model estimates that the rubric score could be improved from 3.47 up to 3.65 (Figure 7B). Notice that the dashed line represents Revision 1, while the other lines each simulate one of the seven other altered essays. Moreover, it is important to note how changing the value of a single feature may influence the contributions that other features may have on the predicted score. Again, all explanations look similar in terms of direction, but certain features differ in terms of the magnitude of their contributions. However, the reader should observe how the targeted feature varies not only in terms of magnitude, but also of direction, allowing the student to ponder the relevancy of executing the recommended writing strategy.
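The what-if procedure described above (vary one feature over a grid while holding the others constant, then re-score) can be sketched as follows. The scoring function is a hypothetical linear stand-in, not the trained network of this study, and its coefficients, the feature names other than adjacent_overlap_verb_sent, and the base feature values are invented for illustration; only the grid of eight values comes from the text.

```python
def toy_score(features):
    """Hypothetical stand-in for the trained AES model (illustration only)."""
    return (3.0
            + 0.002 * features["word_count"]
            - 0.25 * features["adjacent_overlap_verb_sent"]
            - 0.05 * features["grammar_errors"])


def sweep_feature(predict, base_features, name, values):
    """Re-score copies of one essay that differ only in feature `name`."""
    results = []
    for value in values:
        altered = dict(base_features)  # copy so the original stays untouched
        altered[name] = value
        results.append((value, predict(altered)))
    return results


# The eight values used in the text are the multiples of 1/7, rounded.
grid = [round(k / 7, 3) for k in range(8)]

# Invented feature values for a revised essay.
revision = {"word_count": 150, "adjacent_overlap_verb_sent": 0.429, "grammar_errors": 2}

for value, score in sweep_feature(toy_score, revision, "adjacent_overlap_verb_sent", grid):
    print(f"{value:.3f} -> {score:.3f}")
```

In the study each altered essay is additionally passed through a SHAP explainer, so that the change in every feature's contribution, not only the change in the score, can be inspected.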

Thus, upon receiving this feedback, assume that a student sets the goal to improve the effectiveness of his/her verb choice by eliminating any redundant verb, producing Revision 2 in Table 8. The student submits the essay again to the AES system, which finally gives a new rubric score of 3.98, a significant improvement from the previous 3.47, allowing the student to get a 4 instead of a 3. Figure 7C exhibits the decision plot of Revision 2. To better observe how the various revisions of the student’s essay changed over time, their respective explanations have been plotted in the same decision plot (Figure 7D). Notice this time that the ordering of the features has changed to list the features of common importance to all of the essay’s versions. The feature ordering in Figures 7A−C complies with the same ordering as in Figure 3D, the decision plot of the original essay. These figures underscore the importance of tracking the interaction between the various features so that the model understands well the impact that changing one feature has on the others. TreeSHAP, an implementation for tree-based models, offers this capability, and its potential for improving the quality of feedback provided to novice writers will be tested in a future version of this AES system.
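To see why interactions matter, the sketch below computes exact Shapley values by brute force for a toy model with an interaction term. It is neither TreeSHAP nor the Gradient explainer used in this study; it simply replaces “absent” features with baseline values and enumerates every coalition, which is only feasible for a handful of features. The model, feature names, and values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


def toy_model(x):
    """Hypothetical score with an interaction between the two features."""
    return 3.0 + 0.5 * x["length"] - 0.4 * x["verb_overlap"] + 0.3 * x["length"] * x["verb_overlap"]


baseline = {"length": 0.0, "verb_overlap": 0.0}
before = shapley_values(toy_model, {"length": 1.0, "verb_overlap": 0.0}, baseline)
after = shapley_values(toy_model, {"length": 1.0, "verb_overlap": 1.0}, baseline)

# "length" is unchanged between the two essays, yet its contribution differs.
print(before["length"], after["length"])
```

The contributions always add up to the difference between the prediction and the baseline prediction, which is the property that lets a decision plot trace a score from the base value to the final output.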

This paper serves as a proof of concept of the applicability of XAI techniques in automated essay scoring, providing learning analytics practitioners and educators with a methodology on how to “hire” AI markers and make them accountable to their human counterparts. In addition to debugging predictive models, SHAP explanation models can serve as a formalism within a broader learning analytics platform, where aspects of prescriptive analytics (provision of remedial formative feedback) can be added on top of the more pervasive predictive analytics.

However, the main weakness of the approach put forward in this paper consists in omitting many types of spatio-temporal data. In other words, it ignores precious information inherent to the writing process, which may prove essential to guess the intent of the student, especially in contexts of poor sentence structures and high grammatical inaccuracy. Hence, this paper calls for adapting current NLP technologies to educational purposes, where the quality of writing may be suboptimal, which is contrary to many utopian scenarios where NLP is used for content analysis, opinion mining, topic modeling, or fact extraction trained on corpora of high-quality texts. By capturing the writing process preceding a submission of an essay to an AES tool, other kinds of explanation models can also be trained to offer feedback not only from a linguistic perspective but also from a behavioral one (e.g., composing vs. revising); that is, the AES system could inform novice writers about suboptimal and optimal writing strategies (e.g., planning a revision phase after bursts of writing).

In addition, associating sections of text with suboptimal writing features, those whose contributions lower the predicted score, would be much more informative. This spatial information would make it possible to point out not only what is wrong but also where it is wrong, answering more efficiently the question of why an essay is wrong. This problem could be simply approached through a multiple-inputs and mixed-data feature-based (MLP) neural network architecture fed by both linguistic indices and textual data (n-grams), where the SHAP explanation model would assign feature contributions to both types of features and any potential interaction between them. A more complex approach could address the problem through special types of recurrent neural networks such as Ordered-Neurons LSTMs (long short-term memory), which are well adapted to the parsing of natural language, and which capture not only the natural sequence of text but also its hierarchy of constituents (Shen et al., 2018). After all, this paper highlights the fact that the potential of deep learning can reach beyond the training of powerful predictive models and be better visible in the higher trustworthiness of explanation models. This paper also calls for optimizing the training of predictive models by considering the descriptive accuracy of explanations and the human expert’s qualitative knowledge (e.g., indicating the direction of feature contributions) during the training process.

Data Availability Statement

The datasets and code of this study can be found in these Open Science Framework’s online repositories: https://osf.io/fxvru/ .

Author Contributions

VK architected the concept of an ethics-bound, semi-autonomous, and trust-enabled human-AI fusion that measures and advances knowledge boundaries in human learning, which essentially defines the key traits of learning analytics. DB was responsible for its implementation in the area of explainable automated essay scoring and for the training and validation of the predictive and explanation models. Together they offer an XAI-based proof of concept of a prescriptive model that can offer real-time formative remedial feedback to novice writers. Both authors contributed to the article and approved its publication.

Research reported in this article was supported by the Academic Research Fund (ARF) publication grant of Athabasca University under award number (24087).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2020.572367/full#supplementary-material

^ https://www.kaggle.com/c/asap-aes
^ https://www.linguisticanalysistools.org/

Abbass, H. A. (2019). Social integration of artificial intelligence: functions, automation allocation logic and human-autonomy trust. Cogn. Comput. 11, 159–171. doi: 10.1007/s12559-018-9619-0

Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi: 10.1109/ACCESS.2018.2870052

Amorim, E., Cançado, M., and Veloso, A. (2018). “Automated essay scoring in the presence of biased ratings,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , New Orleans, LA, 229–237.

Arrieta, A. B., Díaz-Rodríguez, N., Ser, J., Del Bennetot, A., Tabik, S., Barbado, A., et al. (2020). Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inform. Fusion 58, 82–115. doi: 10.1016/j.inffus.2019.12.012

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., et al. (2007). The English lexicon project. Behav. Res. Methods 39, 445–459. doi: 10.3758/BF03193014

Boulanger, D., and Kumar, V. (2018). “Deep learning in automated essay scoring,” in Proceedings of the International Conference of Intelligent Tutoring Systems , eds R. Nkambou, R. Azevedo, and J. Vassileva (Cham: Springer International Publishing), 294–299. doi: 10.1007/978-3-319-91464-0_30

Boulanger, D., and Kumar, V. (2019). “Shedding light on the automated essay scoring process,” in Proceedings of the International Conference on Educational Data Mining , 512–515.

Boulanger, D., and Kumar, V. (2020). “SHAPed automated essay scoring: explaining writing features’ contributions to English writing organization,” in Intelligent Tutoring Systems , eds V. Kumar and C. Troussas (Cham: Springer International Publishing), 68–78. doi: 10.1007/978-3-030-49663-0_10

Chen, H., Lundberg, S., and Lee, S.-I. (2019). Explaining models by propagating Shapley values of local components. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1911.11888 (accessed September 22, 2020).

Crossley, S. A., Bradfield, F., and Bustamante, A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. J. Writ. Res. 11, 251–270. doi: 10.17239/jowr-2019.11.02.01

Crossley, S. A., Kyle, K., and McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): automatic assessment of local, global, and text cohesion. Behav. Res. Methods 48, 1227–1237. doi: 10.3758/s13428-015-0651-7

Crossley, S. A., Kyle, K., and McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): an automatic tool for sentiment, social cognition, and social-order analysis. Behav. Res. Methods 49, 803–821. doi: 10.3758/s13428-016-0743-z

Dronen, N., Foltz, P. W., and Habermehl, K. (2015). “Effective sampling for large-scale automated writing evaluation systems,” in Proceedings of the Second (2015) ACM Conference on Learning @ Scale , 3–10.

Goldin, I., Narciss, S., Foltz, P., and Bauer, M. (2017). New directions in formative feedback in interactive learning environments. Int. J. Artif. Intellig. Educ. 27, 385–392. doi: 10.1007/s40593-016-0135-7

Hao, Q., and Tsikerdekis, M. (2019). “How automated feedback is delivered matters: formative feedback and knowledge transfer,” in Proceedings of the 2019 IEEE Frontiers in Education Conference (FIE) , Covington, KY, 1–6.

Hellman, S., Rosenstein, M., Gorman, A., Murray, W., Becker, L., Baikadi, A., et al. (2019). “Scaling up writing in the curriculum: batch mode active learning for automated essay scoring,” in Proceedings of the Sixth (2019) ACM Conference on Learning @ Scale , (New York, NY: Association for Computing Machinery).

Hussein, M. A., Hassan, H., and Nassef, M. (2019). Automated language essay scoring systems: a literature review. PeerJ Comput. Sci. 5:e208. doi: 10.7717/peerj-cs.208

Kumar, V., and Boulanger, D. (2020). Automated essay scoring and the deep learning black box: how are rubric scores determined? Int. J. Artif. Intellig. Educ. doi: 10.1007/s40593-020-00211-5

Kumar, V., Fraser, S. N., and Boulanger, D. (2017). Discovering the predictive power of five baseline writing competences. J. Writ. Anal. 1, 176–226.

Kyle, K. (2016). Measuring Syntactic Development In L2 Writing: Fine Grained Indices Of Syntactic Complexity And Usage-Based Indices Of Syntactic Sophistication. Dissertation, Georgia State University, Atlanta, GA.

Kyle, K., Crossley, S., and Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): version 2.0. Behav. Res. Methods 50, 1030–1046. doi: 10.3758/s13428-017-0924-4

Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv [Preprint]. Available online at: https://arxiv.org/abs/1802.03888 (accessed September 22, 2020).

Lundberg, S. M., and Lee, S.-I. (2017). “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems , eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, et al. (Red Hook, NY: Curran Associates, Inc), 4765–4774.

Madnani, N., and Cahill, A. (2018). “Automated scoring: beyond natural language processing,” in Proceedings of the 27th International Conference on Computational Linguistics , (Santa Fe: Association for Computational Linguistics), 1099–1109.

Madnani, N., Loukina, A., von Davier, A., Burstein, J., and Cahill, A. (2017). “Building better open-source tools to support fairness in automated scoring,” in Proceedings of the First (ACL) Workshop on Ethics in Natural Language Processing , (Valencia: Association for Computational Linguistics), 41–52.

McCarthy, P. M., and Jarvis, S. (2010). MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 42, 381–392. doi: 10.3758/brm.42.2.381

Mizumoto, T., Ouchi, H., Isobe, Y., Reisert, P., Nagata, R., Sekine, S., et al. (2019). “Analytic score prediction and justification identification in automated short answer scoring,” in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications , Florence, 316–325.

Molnar, C. (2020). Interpretable Machine Learning . Abu Dhabi: Lulu

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. U.S.A. 116, 22071–22080. doi: 10.1073/pnas.1900654116

Nelson, J., and Campbell, C. (2017). Evidence-informed practice in education: meanings and applications. Educ. Res. 59, 127–135. doi: 10.1080/00131881.2017.1314115

Rahimi, Z., Litman, D., Correnti, R., Wang, E., and Matsumura, L. C. (2017). Assessing students’ use of evidence and organization in response-to-text writing: using natural language processing for rubric-based automated scoring. Int. J. Artif. Intellig. Educ. 27, 694–728. doi: 10.1007/s40593-017-0143-2

Reinertsen, N. (2018). Why can’t it mark this one? A qualitative analysis of student writing rejected by an automated essay scoring system. English Austral. 53:52.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should i trust you?”: explaining the predictions of any classifier. CoRR, abs/1602.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1602.04938 (accessed September 22, 2020).

Rupp, A. A. (2018). Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl. Meas. Educ. 31, 191–214. doi: 10.1080/08957347.2018.1464448

Rupp, A. A., Casabianca, J. M., Krüger, M., Keller, S., and Köller, O. (2019). Automated essay scoring at scale: a case study in Switzerland and Germany. ETS Res. Rep. Ser. 2019, 1–23. doi: 10.1002/ets2.12249

Shen, Y., Tan, S., Sordoni, A., and Courville, A. C. (2018). Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. CoRR, abs/1810.0. arXiv [Preprint]. Available online at: http://arxiv.org/abs/1810.09536 (accessed September 22, 2020).

Shermis, M. D. (2014). State-of-the-art automated essay scoring: competition, results, and future directions from a United States demonstration. Assess. Writ. 20, 53–76. doi: 10.1016/j.asw.2013.04.001

Taghipour, K. (2017). Robust Trait-Specific Essay Scoring using Neural Networks and Density Estimators. Dissertation, National University of Singapore, Singapore.

West-Smith, P., Butler, S., and Mayfield, E. (2018). “Trustworthy automated essay scoring without explicit construct validity,” in Proceedings of the 2018 AAAI Spring Symposium Series , (New York, NY: ACM).

Woods, B., Adamson, D., Miel, S., and Mayfield, E. (2017). “Formative essay feedback using predictive scoring models,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , (New York, NY: ACM), 2071–2080.

Keywords : explainable artificial intelligence, SHAP, automated essay scoring, deep learning, trust, learning analytics, feedback, rubric

Citation: Kumar V and Boulanger D (2020) Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value. Front. Educ. 5:572367. doi: 10.3389/feduc.2020.572367

Received: 14 June 2020; Accepted: 09 September 2020; Published: 06 October 2020.

Copyright © 2020 Kumar and Boulanger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: David Boulanger, [email protected]

This article is part of the Research Topic

Learning Analytics for Supporting Individualization: Data-informed Adaptation of Learning

Careers at ETS

Join us in our mission to advance quality and equity in education.

Online Scoring

The Online Network for Evaluation (ONE) is ETS’s proprietary software application that enables Raters to score test-taker responses and/or other submitted work via secure internet access. Raters typically score from their homes or offices, in accordance with ETS guidelines. Online scoring sessions may occasionally be conducted at regional sites.

The process for each testing program includes:

System Training — using online learning modules and quick reference guides. This training is generic and does not need to be repeated for each program or test.
Content Training — using online training sets or possibly scored on paper.
Certification — after completing the required certification test(s) online with successful results.
Scheduling — certified Raters enter their availability for open scoring sessions via the ETS online scheduling system. Schedules are confirmed by ETS based on the overall availability of Raters and scoring volumes. This process is also used to schedule prospective Raters for training and certification.
Calibration — most programs require Raters to take calibration test(s) daily prior to scoring.
Scoring — using online scoring application with access to prompt support materials.
Mentoring — on prescribed application of rubrics and testing program standards is provided by ETS Scoring Leaders throughout each scoring session via online feedback and other means of communication.

To certify and score using the ETS Online Scoring system, prospective Raters must have computer equipment that meets the minimum hardware and software requirements.

Also, please read Frequently Asked Questions About Online Rater/Scoring Professional Programs .

Open access
Published: 03 June 2024

Applying large language models for automated essay scoring for non-native Japanese

Wenchao Li 1 &
Haitao Liu 2

Humanities and Social Sciences Communications volume 11, Article number: 723 (2024)

Language and linguistics

Recent advancements in artificial intelligence (AI) have led to an increased use of large language models (LLMs) for language assessment tasks such as automated essay scoring (AES), automated listening tests, and automated oral proficiency assessments. The application of LLMs for AES in the context of non-native Japanese, however, remains limited. This study explores the potential of LLM-based AES by comparing the efficiency of different models, i.e. two conventional machine learning-based methods (Jess and JWriter), two LLMs (GPT and BERT), and one Japanese local LLM (Open-Calm large model). To conduct the evaluation, a dataset consisting of 1400 story-writing scripts authored by learners with 12 different first languages was used. Statistical analysis revealed that GPT-4 outperforms Jess and JWriter, BERT, and the Japanese language-specific trained Open-Calm large model in terms of annotation accuracy and predicting learning levels. Furthermore, by comparing 18 different models that utilize various prompts, the study emphasized the significance of prompts in achieving accurate and reliable evaluations using LLMs.

Conventional machine learning technology in AES

AES has experienced significant growth with the advancement of machine learning technologies in recent decades. In the earlier stages of AES development, conventional machine learning-based approaches were commonly used. These approaches involved the following procedures: (a) a dataset of essays is fed to the machine learning system; this dataset serves as the basis for training the model and establishing patterns and correlations between linguistic features and human ratings; (b) the machine learning model is trained using linguistic features that best represent human ratings and can effectively discriminate learners’ writing proficiency. These features include lexical richness (Lu, 2012; Kyle and Crossley, 2015; Kyle et al. 2021), syntactic complexity (Lu, 2010; Liu, 2008), and text cohesion (Crossley and McNamara, 2016), among others. Conventional machine learning approaches in AES require human intervention, such as manual correction and annotation of essays. This human involvement is necessary to create a labeled dataset for training the model. Several AES systems have been developed using conventional machine learning technologies. These include the Intelligent Essay Assessor (Landauer et al. 2003), the e-rater engine by Educational Testing Service (Attali and Burstein, 2006; Burstein, 2003), MyAccess with the IntelliMetric scoring engine by Vantage Learning (Elliot, 2003), and the Bayesian Essay Test Scoring system (Rudner and Liang, 2002). These systems have played a significant role in automating the essay scoring process and providing quick and consistent feedback to learners. However, as touched upon earlier, conventional machine learning approaches rely on predetermined linguistic features and often require manual intervention, making them less flexible and potentially limiting their generalizability to different contexts.

In the context of the Japanese language, conventional machine learning-incorporated AES tools include Jess (Ishioka and Kameda, 2006) and JWriter (Lee and Hasebe, 2017). Jess assesses essays by deducting points from the perfect score, utilizing the Mainichi Daily News newspaper as a database. The evaluation criteria employed by Jess encompass various aspects, such as rhetorical elements (e.g., reading comprehension, vocabulary diversity, percentage of complex words, and percentage of passive sentences), organizational structures (e.g., forward and reverse connection structures), and content analysis (e.g., latent semantic indexing). JWriter employs linear regression analysis to assign weights to various measurement indices, such as average sentence length and total number of characters. These weights are then combined to derive the overall score. A pilot study involving the Jess model was conducted on 1320 essays at different proficiency levels, including primary, intermediate, and advanced. However, the results indicated that the Jess model failed to significantly distinguish between these essay levels. Out of the 16 measures used, four measures, namely median sentence length, median clause length, median number of phrases, and maximum number of phrases, did not show statistically significant differences between the levels. Additionally, two measures exhibited between-level differences but lacked linear progression: the number of attributive declined words and the Kanji/kana ratio. On the other hand, the remaining measures, including maximum sentence length, maximum clause length, number of attributive conjugated words, maximum number of consecutive infinitive forms, maximum number of conjunctive-particle clauses, k characteristic value, percentage of big words, and percentage of passive sentences, demonstrated statistically significant between-level differences and displayed linear progression.
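The two scoring schemes described above can be contrasted in a few lines. This is a minimal sketch under stated assumptions: the perfect score of 10, the penalty categories, and every weight below are hypothetical placeholders, not the values used by Jess or JWriter.

```python
def deduction_score(penalties, perfect=10.0):
    """Deduction-style scoring: start from a perfect score and subtract.

    `penalties` maps a criterion to the points deducted for it
    (hypothetical criteria and values).
    """
    return max(0.0, perfect - sum(penalties.values()))


def weighted_score(indices, weights, intercept=0.0):
    """Regression-style scoring: weighted sum of measurement indices."""
    return intercept + sum(weights[name] * indices[name] for name in weights)


# Hypothetical deductions for rhetoric, organization, and content.
print(deduction_score({"rhetoric": 1.5, "organization": 0.5, "content": 1.0}))  # 7.0

# Hypothetical regression weights for two surface indices.
weights = {"avg_sentence_length": 0.05, "total_characters": 0.002}
essay = {"avg_sentence_length": 20, "total_characters": 500}
padded = {"avg_sentence_length": 20, "total_characters": 1000}

print(weighted_score(essay, weights, intercept=1.0))
print(weighted_score(padded, weights, intercept=1.0))
```

The last two calls hint at why a known regression equation is open to manipulation: an essay that only adds characters receives a higher score under a positive weight.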

Both Jess and JWriter exhibit notable limitations, including the manual selection of feature parameters and weights, which can introduce biases into the scoring process. The reliance on human annotators to label non-native language essays also introduces potential noise and variability in the scoring. Furthermore, an important concern is the possibility of system manipulation and cheating by learners who are aware of the regression equation utilized by the models (Hirao et al. 2020 ). These limitations emphasize the need for further advancements in AES systems to address these challenges.

Deep learning technology in AES

Deep learning has emerged as one of the approaches for improving the accuracy and effectiveness of AES. Deep learning-based AES methods utilize artificial neural networks that mimic the human brain’s functioning through layered algorithms and computational units. Unlike conventional machine learning, deep learning autonomously learns from the environment and past errors without human intervention. This enables deep learning models to establish nonlinear correlations, resulting in higher accuracy. Recent advancements in deep learning have led to the development of transformers, which are particularly effective in learning text representations. Noteworthy examples include bidirectional encoder representations from transformers (BERT) (Devlin et al. 2019 ) and the generative pretrained transformer (GPT) (OpenAI).

BERT is a language representation model that utilizes a transformer architecture and is trained on two tasks: masked language modeling and next-sentence prediction (Hirao et al. 2020; Vaswani et al. 2017). In the context of AES, BERT follows specific procedures, as illustrated in Fig. 1: (a) the tokenized prompts and essays are taken as input; (b) special tokens, such as [CLS] and [SEP], are added to mark the beginning and separation of prompts and essays; (c) the transformer encoder processes the prompt and essay sequences, resulting in hidden layer sequences; (d) the hidden layers corresponding to the [CLS] tokens (T[CLS]) represent distributed representations of the prompts and essays; and (e) a multilayer perceptron uses these distributed representations as input to obtain the final score (Hirao et al. 2020).

Fig. 1: AES system with BERT (Hirao et al. 2020).
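Steps (a) and (b) of this procedure can be sketched without any deep learning library. The function below assembles a prompt and essay pair with the [CLS] and [SEP] markers and trims the essay when the pair would exceed the length limit. The function name and the truncation strategy (cutting the end of the essay) are assumptions; 512 is the sequence limit of the original BERT models, and the tokens are assumed to come from a tokenizer that is not shown.

```python
def build_bert_input(prompt_tokens, essay_tokens, max_len=512):
    """Assemble [CLS] prompt [SEP] essay [SEP], truncating the essay if needed."""
    # Three positions are reserved for the special tokens.
    budget = max_len - 3 - len(prompt_tokens)
    if budget < 0:
        raise ValueError("prompt alone exceeds the maximum sequence length")
    essay = essay_tokens[:budget]
    tokens = ["[CLS]"] + prompt_tokens + ["[SEP]"] + essay + ["[SEP]"]
    # Segment ids tell the encoder which tokens belong to the prompt (0) or essay (1).
    segment_ids = [0] * (len(prompt_tokens) + 2) + [1] * (len(essay) + 1)
    return tokens, segment_ids


prompt = ["write", "a", "story"]
essay = ["彼", "に", "本", "を", "あげ"]
tokens, segment_ids = build_bert_input(prompt, essay, max_len=10)
print(tokens)
print(segment_ids)
```

With max_len=10 the last essay token is dropped, a small-scale version of the length limitation that arises when essays are longer than the model's input window.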

The training of BERT using a substantial amount of sentence data through the Masked Language Model (MLM) allows it to capture contextual information within the hidden layers. Consequently, BERT is expected to be capable of identifying artificial essays as invalid and assigning them lower scores (Mizumoto and Eguchi, 2023). In the context of AES for nonnative Japanese learners, Hirao et al. (2020) combined the long short-term memory (LSTM) model proposed by Hochreiter and Schmidhuber (1997) with BERT to develop a tailored automated essay scoring system. The findings of their study revealed that the BERT model outperformed both the conventional machine learning approach utilizing character-type features such as “kanji” and “hiragana”, as well as the standalone LSTM model. Takeuchi et al. (2021) presented an approach to Japanese AES that eliminates the requirement for pre-scored essays by relying solely on reference texts or a model answer for the essay task. They investigated multiple similarity evaluation methods, including frequency of morphemes, idf values calculated on Wikipedia, LSI, LDA, word-embedding vectors, and document vectors produced by BERT. The experimental findings revealed that the method utilizing the frequency of morphemes with idf values exhibited the strongest correlation with human-annotated scores across different essay tasks. The utilization of BERT in AES encounters several limitations. First, essays often exceed the model’s maximum length limit. Second, only score labels are available for training, which restricts access to additional information.
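The reference-based idea attributed to Takeuchi et al. (2021), comparing an essay with a model answer through morpheme frequencies weighted by idf, can be illustrated with a small cosine-similarity sketch. It is not their implementation: the smoothed idf formula, the three-document toy corpus standing in for Wikipedia, and the pre-segmented morphemes are all assumptions.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed inverse document frequency for every token in `corpus`."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_vector(tokens, idf, default_idf=1.0):
    """Term frequency times idf, stored as a sparse dict."""
    return {tok: count * idf.get(tok, default_idf) for tok, count in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Pre-segmented morphemes (segmentation itself is assumed to be done elsewhere).
reference = ["彼", "に", "本", "を", "あげ", "まし", "た"]
essay_close = ["彼", "に", "本", "を", "もらい", "まし", "た"]
essay_far = ["足", "が", "棒", "に", "なり", "まし", "た"]

idf = idf_table([reference, essay_close, essay_far])
ref_vec = tfidf_vector(reference, idf)

print(cosine(ref_vec, tfidf_vector(essay_close, idf)))
print(cosine(ref_vec, tfidf_vector(essay_far, idf)))
```

The essay that shares most content morphemes with the reference obtains the higher similarity, while frequent function morphemes such as に and た contribute less because of their lower idf.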

Mizumoto and Eguchi (2023) were pioneers in employing the GPT model for AES in non-native English writing. Their study focused on evaluating the accuracy and reliability of AES using the GPT-3 text-davinci-003 model, analyzing a dataset of 12,100 essays from the corpus of nonnative written English (TOEFL11). The findings indicated that AES utilizing the GPT-3 model exhibited a certain degree of accuracy and reliability. They suggest that GPT-3-based AES systems hold the potential to provide support for human ratings. However, applying the GPT model to AES presents a unique natural language processing (NLP) task that involves considerations such as nonnative language proficiency, the influence of the learner’s first language on the output in the target language, and identifying linguistic features that best indicate writing quality in a specific language. These linguistic features may differ morphologically or syntactically from those present in the learners’ first language, as observed in (1)–(3).

(1) Isolating

我-送了-他-一本-书

Wǒ-sòngle-tā-yī běn-shū

1sg-give.past-him-one.cl-book

“I gave him a book.”

(2) Agglutinative

彼-に-本-を-あげ-まし-た

Kare-ni-hon-o-age-mashi-ta

3sg-dat-book-acc-give.honorification.past

(3) Inflectional

give, give-s, gave, given, giving

Additionally, the morphological agglutination and subject-object-verb (SOV) order in Japanese, along with its idiomatic expressions, pose additional challenges for applying language models in AES tasks (4).

足-が棒-になり-ました

Ashi-ga bo-ni nari-mashita

leg-nom stick-dat become-past

“My leg became like a stick (I am extremely tired).”

The example sentence demonstrates the morpho-syntactic structure of Japanese and the presence of an idiomatic expression. In this sentence, the verb “なる” (naru), meaning “to become”, appears at the end of the sentence. The verb stem “なり” (nari) is followed by morphemes indicating honorification (“ます”, masu; here in the form “まし”, mashi) and tense (“た”, ta), showcasing agglutination. While the sentence can be literally translated as “my leg became like a stick”, it carries an idiomatic interpretation that implies “I am extremely tired”.

To address these challenges, CyberAgent Inc. (2023) has developed the Open-Calm series of language models specifically designed for Japanese. Open-Calm consists of pre-trained models available in various sizes, such as Small, Medium, Large, and 7b. Figure 2 depicts the fundamental structure of the Open-Calm model. A key feature of this architecture is the incorporation of the LoRA adapter and the GPT-NeoX framework, which can enhance its language processing capabilities.

GPT-NeoX Model Architecture (Okgetheng and Takeuchi 2024 ).

Okgetheng and Takeuchi (2024) recently assessed the efficacy of Open-Calm language models in grading Japanese essays. The research utilized a dataset of approximately 300 essays annotated by native Japanese educators. The findings demonstrate the considerable potential of Open-Calm language models in automated Japanese essay scoring. Among the Open-Calm family, the Open-Calm Large model (referred to as OCLL) exhibited the highest performance. However, as of the current date, the Open-Calm Large model does not offer public access to its server. Consequently, users must deploy and operate the OCLL environment themselves, which requires a PC equipped with an NVIDIA GeForce RTX 3060 (8 or 12 GB VRAM).

In summary, while the potential of LLMs in automated scoring of nonnative Japanese essays has been demonstrated in two studies—BERT-driven AES (Hirao et al. 2020 ) and OCLL-based AES (Okgetheng and Takeuchi, 2024 )—the number of research efforts in this area remains limited.

Another significant challenge in applying LLMs to AES lies in prompt engineering and in ensuring the reliability and effectiveness of prompts (Brown et al. 2020; Rae et al. 2021; Zhang et al. 2021). Various prompting strategies have been proposed, such as the zero-shot chain of thought (CoT) approach (Kojima et al. 2022), which involves manually crafting diverse and effective examples. However, manual efforts can lead to mistakes. To address this, Zhang et al. (2021) introduced an automatic CoT prompting method called Auto-CoT, which demonstrates matching or superior performance compared to the CoT paradigm. Another prompt framework is Tree of Thoughts, which enables a model to self-evaluate its progress at intermediate stages of problem-solving through deliberate reasoning (Yao et al. 2023).

Beyond linguistic studies, there has been a noticeable increase in the number of foreign workers in Japan and Japanese learners worldwide (Ministry of Health, Labor, and Welfare of Japan, 2022 ; Japan Foundation, 2021 ). However, existing assessment methods, such as the Japanese Language Proficiency Test (JLPT), J-CAT, and TTBJ Footnote 1 , primarily focus on reading, listening, vocabulary, and grammar skills, neglecting the evaluation of writing proficiency. As the number of workers and language learners continues to grow, there is a rising demand for an efficient AES system that can reduce costs and time for raters and be utilized for employment, examinations, and self-study purposes.

This study aims to explore the potential of LLM-based AES by comparing the effectiveness of five models: two LLMs (GPT Footnote 2 and BERT), one Japanese local LLM (OCLL), and two conventional machine learning-based methods (linguistic feature-based scoring tools - Jess and JWriter).

The research questions addressed in this study are as follows:

To what extent do the LLM-driven AES and linguistic feature-based AES, when used as automated tools to support human rating, accurately reflect test takers’ actual performance?

What influence does the prompt have on the accuracy and performance of LLM-based AES methods?

The subsequent sections of the manuscript cover the methodology, including the assessment measures for nonnative Japanese writing proficiency, criteria for prompts, and the dataset. The evaluation section focuses on the analysis of annotations and rating scores generated by LLM-driven and linguistic feature-based AES methods.

Methodology

The dataset utilized in this study was obtained from the International Corpus of Japanese as a Second Language (I-JAS) Footnote 3 . This corpus consisted of 1000 participants who represented 12 different first languages. For the study, the participants were given a story-writing task on a personal computer. They were required to write two stories based on the 4-panel illustrations titled “Picnic” and “The key” (see Appendix A). Background information for the participants was provided by the corpus, including their Japanese language proficiency levels assessed through two online tests: J-CAT and SPOT. These tests evaluated their reading, listening, vocabulary, and grammar abilities. The learners’ proficiency levels were categorized into six levels aligned with the Common European Framework of Reference for Languages (CEFR) and the Reference Framework for Japanese Language Education (RFJLE): A1, A2, B1, B2, C1, and C2. According to Lee et al. ( 2015 ), there is a high level of agreement (r = 0.86) between the J-CAT and SPOT assessments, indicating that the proficiency certifications provided by J-CAT are consistent with those of SPOT. However, it is important to note that the scores of J-CAT and SPOT do not have a one-to-one correspondence. In this study, the J-CAT scores were used as a benchmark to differentiate learners of different proficiency levels. A total of 1400 essays were utilized, representing the beginner (aligned with A1), A2, B1, B2, C1, and C2 levels based on the J-CAT scores. Table 1 provides information about the learners’ proficiency levels and their corresponding J-CAT and SPOT scores.

A dataset comprising a total of 1400 essays from the story-writing tasks was collected. Among these, 714 essays were utilized to evaluate the reliability of the LLM-based AES method, while the remaining 686 essays were designated as development data to assess the LLM-based AES’s capability to distinguish participants with varying proficiency levels. The GPT-4 API was used in this study. A detailed explanation of the prompt-assessment criteria is provided in Section Prompt. All essays were sent to the model for measurement and scoring.

Measures of writing proficiency for nonnative Japanese

Japanese exhibits a morphologically agglutinative structure where morphemes are attached to the word stem to convey grammatical functions such as tense, aspect, voice, and honorifics, e.g. (5).

食べ-させ-られ-まし-た-か

tabe-sase-rare-mashi-ta-ka

[eat (stem)-causative-passive voice-honorification-tense. past-question marker]

Japanese employs nine case particles to indicate grammatical functions, including the nominative case particle が (ga), the accusative case particle を (o), the genitive case particle の (no), the dative case particle に (ni), the locative/instrumental case particle で (de), the ablative case particle から (kara), the directional case particle へ (e), and the comitative case particle と (to). The agglutinative nature of the language, combined with the case particle system, provides an efficient means of distinguishing between active and passive voice, either through morphemes or case particles, e.g. 食べる taberu “eat.conclusive” (active voice); 食べられる taberareru “eat.conclusive” (passive voice). In the active voice, “パンを食べる” (pan o taberu) translates to “to eat bread”, whereas in the passive voice it becomes “パンが食べられた” (pan ga taberareta), which means “(the) bread was eaten”. Additionally, different conjugations of the same lemma are counted as one type in order to ensure a comprehensive assessment of the language features. For example, 食べる taberu “eat.conclusive”, 食べている tabeteiru “eat.progressive”, and 食べた tabeta “eat.past” are treated as one type.

To incorporate these features, previous research (Suzuki, 1999 ; Watanabe et al. 1988 ; Ishioka, 2001 ; Ishioka and Kameda, 2006 ; Hirao et al. 2020 ) has identified complexity, fluency, and accuracy as crucial factors for evaluating writing quality. These criteria are assessed through various aspects, including lexical richness (lexical density, diversity, and sophistication), syntactic complexity, and cohesion (Kyle et al. 2021 ; Mizumoto and Eguchi, 2023 ; Ure, 1971 ; Halliday, 1985 ; Barkaoui and Hadidi, 2020 ; Zenker and Kyle, 2021 ; Kim et al. 2018 ; Lu, 2017 ; Ortega, 2015 ). Therefore, this study proposes five scoring categories: lexical richness, syntactic complexity, cohesion, content elaboration, and grammatical accuracy. A total of 16 measures were employed to capture these categories. The calculation process and specific details of these measures can be found in Table 2 .

The T-unit, first introduced by Hunt (1966), is a measure used for evaluating speech and composition. It serves as an indicator of syntactic development and represents the shortest units into which a piece of discourse can be divided without leaving any sentence fragments. In the context of Japanese language assessment, Sakoda and Hosoi (2020) utilized the T-unit as the basic unit to assess the accuracy and complexity of Japanese learners’ speaking and storytelling. The calculation of T-units in Japanese follows these principles:

A single main clause constitutes 1 T-unit, regardless of the presence or absence of dependent clauses, e.g. (6).

ケンとマリはピクニックに行きました (main clause): 1 T-unit.

If a sentence contains a main clause along with subclauses, each subclause is considered part of the same T-unit, e.g. (7).

天気が良かったので (subclause)、ケンとマリはピクニックに行きました (main clause): 1 T-unit.

In the case of coordinate clauses, where multiple clauses are connected, each coordinated clause is counted separately. Thus, a sentence with coordinate clauses may have 2 T-units or more, e.g. (8).

ケンは地図で場所を探して (coordinate clause)、マリはサンドイッチを作りました (coordinate clause): 2 T-units.

Lexical diversity refers to the range of words used within a text (Engber, 1995 ; Kyle et al. 2021 ) and is considered a useful measure of the breadth of vocabulary in L n production (Jarvis, 2013a , 2013b ).

The type/token ratio (TTR) is widely recognized as a straightforward measure for calculating lexical diversity and has been employed in numerous studies. These studies have demonstrated a strong correlation between TTR and other methods of measuring lexical diversity (e.g., Bentz et al. 2016 ; Čech and Miroslav, 2018 ; Çöltekin and Taraka, 2018 ). TTR is computed by considering both the number of unique words (types) and the total number of words (tokens) in a given text. Given that the length of learners’ writing texts can vary, this study employs the moving average type-token ratio (MATTR) to mitigate the influence of text length. MATTR is calculated using a 50-word moving window. Initially, a TTR is determined for words 1–50 in an essay, followed by words 2–51, 3–52, and so on until the end of the essay is reached (Díez-Ortega and Kyle, 2023 ). The final MATTR scores were obtained by averaging the TTR scores for all 50-word windows. The following formula was employed to derive MATTR:

\({\rm{MATTR}}({\rm{W}})=\frac{{\sum }_{{\rm{i}}=1}^{{\rm{N}}-{\rm{W}}+1}{{\rm{F}}}_{{\rm{i}}}}{{\rm{W}}({\rm{N}}-{\rm{W}}+1)}\)

Here, N refers to the number of tokens in the text, and W is the window size in tokens (W < N; 50 in this study). \({F}_{i}\) is the number of types in the i-th window. \({\rm{MATTR}}({\rm{W}})\) is thus the mean of the type-token ratios (TTRs), based on the word form, over all windows. It is expected that individuals with higher language proficiency will produce texts with greater lexical diversity, as indicated by higher MATTR scores.
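The moving-window calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's actual script: the function name `mattr`, the fallback to a plain TTR for texts shorter than the window, and the assumption that the essay has already been tokenized (and lemmatized, so that conjugations of one lemma count as one type) are all ours.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over every window of `window` tokens.

    `tokens` is a list of (lemmatized) word forms; tokenization is assumed
    to have been done upstream, e.g. by a morphological analyzer.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    if n < window:
        # Assumption: fall back to the plain TTR when the text is
        # shorter than one window (the formula requires W < N).
        return len(set(tokens)) / n
    total_types = 0
    for start in range(n - window + 1):
        # F_i: number of distinct types in the i-th window.
        total_types += len(set(tokens[start:start + window]))
    # Sum of F_i divided by W * (N - W + 1), as in the formula.
    return total_types / (window * (n - window + 1))


# Toy example with a window of 2 tokens.
print(mattr(["a", "a", "b", "b"], window=2))
```

Because every window has the same length, the score does not shrink mechanically as essays get longer, which is the reason the text gives for preferring MATTR over the raw TTR.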

Lexical density was captured by the ratio of the number of lexical words to the total number of words (Lu, 2012). Lexical sophistication refers to the utilization of advanced vocabulary, often evaluated through word frequency indices (Crossley et al. 2013; Haberman, 2008; Kyle and Crossley, 2015; Laufer and Nation, 1995; Lu, 2012; Read, 2000). In writing, lexical sophistication can be interpreted as vocabulary breadth, which entails the appropriate usage of vocabulary items across various lexico-grammatical contexts and registers (Garner et al. 2019; Kim et al. 2018; Kyle et al. 2018). In Japanese specifically, words are considered lexically sophisticated if they are not included in the “Japanese Education Vocabulary List Ver 1.0”. Footnote 4 Consequently, lexical sophistication was calculated as the number of sophisticated word types relative to the total number of words per essay. Furthermore, it has been suggested that, in Japanese writing, sentences should ideally have a length of no more than 40 to 50 characters, as this promotes readability. Therefore, the median and maximum sentence length can be considered useful indices for assessment (Ishioka and Kameda, 2006).

Syntactic complexity was assessed based on several measures, including the mean length of clauses, verb phrases per T-unit, clauses per T-unit, dependent clauses per T-unit, complex nominals per clause, adverbial clauses per clause, coordinate phrases per clause, and mean dependency distance (MDD). The MDD reflects the distance between the governor and dependent positions in a sentence. A larger dependency distance indicates a higher cognitive load and greater complexity in syntactic processing (Liu, 2008; Liu et al. 2017). The MDD has been established as an efficient metric for measuring syntactic complexity (Jiang, Quyang, and Liu, 2019; Li and Yan, 2021). To calculate a dependency distance, the position number of the dependent is subtracted from that of the governor, assuming that words in a sentence are assigned positions in linear order, such as W1 … Wi … Wn. In any dependency relationship between words Wa and Wb, Wa is the governor and Wb is the dependent. The MDD of the entire sentence is obtained by averaging the absolute values of these governor – dependent differences:

MDD = \(\frac{1}{n}{\sum }_{i=1}^{n}|{\rm{D}}{{\rm{D}}}_{i}|\)

In this formula, \(n\) represents the number of dependency relations in the sentence, and \({DD}_{i}\) is the dependency distance of the \({i}^{{th}}\) dependency relation. For example, the sentence ‘Mary-ga-John-ni-keshigomu-o-watashita’ is annotated as [Mary-top-John-dat-eraser-acc-give-past], and its MDD would be 2. Table 3 provides the CSV file used as a prompt for GPT-4.
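As a hedged illustration of the MDD formula, the sketch below averages absolute governor-dependent distances. The function name and the input format (a list of position pairs) are our own choices. The example analysis is also an assumption: it treats the sentence at the phrase level, with the three case-marked phrases (Mary-ga, John-ni, keshigomu-o) each depending directly on the verb, which is one reading that yields the MDD of 2 quoted in the text.

```python
def mean_dependency_distance(dependencies):
    """Mean of |governor position - dependent position| over all relations.

    `dependencies` is a list of (governor_position, dependent_position)
    pairs using 1-based linear positions within one sentence.
    """
    if not dependencies:
        return 0.0
    return sum(abs(gov - dep) for gov, dep in dependencies) / len(dependencies)


# Assumed phrase-level positions:
# Mary-ga = 1, John-ni = 2, keshigomu-o = 3, watashita = 4 (the governor).
example = [(4, 1), (4, 2), (4, 3)]
print(mean_dependency_distance(example))  # (3 + 2 + 1) / 3 = 2.0
```

In practice the position pairs would come from a dependency parser; which parser and which unit of analysis (word or phrase) are used will change the resulting values.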

Cohesion (semantic similarity) and content elaboration aim to capture the ideas presented in test takers’ essays. Cohesion was assessed using three measures: Synonym overlap/paragraph (topic), Synonym overlap/paragraph (keywords), and word2vec cosine similarity. Content elaboration and development were measured as the number of metadiscourse markers (type)/number of words. To capture content closely, this study proposed a novel distance-based representation that encodes the cosine distance between the i-vectors of the learner’s essay and of the essay task (topic and keywords). The learner’s essay is decoded into a word sequence and aligned to the essay task’s topic and keywords for log-likelihood measurement. The cosine distance reveals the content elaboration score of the learner’s essay. The cosine similarity between target and reference vectors is shown in (11), where, for a given essay, \(({L}_{1},\ldots ,{L}_{n})\) and \(({N}_{1},\ldots ,{N}_{n})\) are the vectors representing the learner’s essay and the task’s topic and keywords, respectively. The content elaboration distance between L and N was calculated as follows:

\(\cos \left(\theta \right)=\frac{{\rm{L}}\,\cdot\, {\rm{N}}}{\left|{\rm{L}}\right|{\rm{|N|}}}=\frac{\mathop{\sum }\nolimits_{i=1}^{n}{L}_{i}{N}_{i}}{\sqrt{\mathop{\sum }\nolimits_{i=1}^{n}{L}_{i}^{2}}\sqrt{\mathop{\sum }\nolimits_{i=1}^{n}{N}_{i}^{2}}}\)

A high similarity value indicates a low difference between the two recognition outcomes, which in turn suggests a high level of proficiency in content elaboration.
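The cosine similarity in (11) can be computed directly from two equal-length vectors. The following is a minimal sketch in plain Python; how the essay and task vectors are obtained (for instance from word2vec embeddings) is outside its scope, and returning 0.0 for a zero vector is our own convention, not something stated in the study.

```python
import math


def cosine_similarity(l_vec, n_vec):
    """cos(theta) = (L . N) / (|L| |N|) for two equal-length vectors."""
    if len(l_vec) != len(n_vec):
        raise ValueError("vectors must have the same length")
    dot = sum(l * n for l, n in zip(l_vec, n_vec))
    norm_l = math.sqrt(sum(l * l for l in l_vec))
    norm_n = math.sqrt(sum(n * n for n in n_vec))
    if norm_l == 0 or norm_n == 0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_l * norm_n)


# Hypothetical 3-dimensional vectors for an essay and a task description.
essay_vector = [0.2, 0.7, 0.1]
task_vector = [0.3, 0.6, 0.2]
print(round(cosine_similarity(essay_vector, task_vector), 3))
```

Values close to 1 correspond to essays whose vector points in nearly the same direction as the task's topic and keyword vector.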

To evaluate the effectiveness of the proposed measures in distinguishing different proficiency levels among nonnative Japanese speakers’ writing, we conducted a multi-faceted Rasch measurement analysis (Linacre, 1994 ). This approach applies measurement models to thoroughly analyze various factors that can influence test outcomes, including test takers’ proficiency, item difficulty, and rater severity, among others. The underlying principles and functionality of multi-faceted Rasch measurement are illustrated in (12).

\(\log \left(\frac{{P}_{{nijk}}}{{P}_{{nij}(k-1)}}\right)={B}_{n}-{D}_{i}-{C}_{j}-{F}_{k}\)

Equation (12) defines the logarithm of the probability ratio \({P}_{nijk}/{P}_{nij(k-1)}\) as a function of multiple parameters. Here, n represents the test taker, i denotes a writing proficiency measure, j corresponds to the human rater, and k represents the proficiency score. The parameter \({B}_{n}\) signifies the proficiency level of test taker n (where n ranges from 1 to N). \({D}_{i}\) represents the difficulty parameter of test item i (where i ranges from 1 to L), while \({C}_{j}\) represents the severity of rater j (where j ranges from 1 to J). Additionally, \({F}_{k}\) represents the step difficulty for a test taker to move from score k−1 to k. \({P}_{nijk}\) refers to the probability of rater j assigning score k to test taker n for test item i, and \({P}_{nij(k-1)}\) represents the probability of test taker n being assigned score k−1 by rater j for test item i. Each facet within the test is treated as an independent parameter and estimated within the same reference framework. To evaluate the consistency of scores obtained through both human and computer analysis, we utilized the Infit mean-square statistic. This statistic is a chi-square measure divided by the degrees of freedom and is weighted with information. It demonstrates higher sensitivity to unexpected patterns in responses to items near a person’s proficiency level (Linacre, 2002). Fit statistics are assessed based on predefined thresholds for acceptable fit. For the Infit MNSQ, which has a mean of 1.00, different thresholds have been suggested. Some propose stricter thresholds ranging from 0.7 to 1.3 (Bond et al. 2021), while others suggest more lenient thresholds ranging from 0.5 to 1.5 (Eckes, 2009). In this study, we adopted the criterion of 0.70–1.30 for the Infit MNSQ.
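To make the model in (12) more concrete, the sketch below turns a set of facet parameters into the probability of each score category. It only evaluates the model for given parameter values; estimating the parameters from data (as dedicated Rasch software does) is not shown. The function name, the 0-based score categories, and the example parameter values are illustrative assumptions.

```python
import math


def category_probabilities(b_n, d_i, c_j, step_difficulties):
    """Score-category probabilities under the many-facet rating scale model.

    b_n: proficiency of test taker n
    d_i: difficulty of measure (item) i
    c_j: severity of rater j
    step_difficulties: [F_1, ..., F_K], where F_k is the difficulty of
        moving from score k-1 to score k.
    Returns [P(score=0), ..., P(score=K)].
    """
    # Cumulative logits: each step adds (B_n - D_i - C_j - F_k).
    logits = [0.0]
    for f_k in step_difficulties:
        logits.append(logits[-1] + (b_n - d_i - c_j - f_k))
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits)
    weights = [math.exp(x - max_logit) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameter values, in logits.
probs = category_probabilities(b_n=1.0, d_i=0.2, c_j=0.1,
                               step_difficulties=[-1.0, 0.0, 1.0])
# The log-odds of adjacent categories recover B_n - D_i - C_j - F_k.
print(math.log(probs[2] / probs[1]))
```

The printed log-odds for the step from score 1 to score 2 equals 1.0 − 0.2 − 0.1 − 0.0, which is the right-hand side of (12) for that step.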

We next assess the effectiveness of the 16 proposed measures, grouped under five criteria, in accurately distinguishing levels of writing proficiency among nonnative Japanese speakers. To conduct this evaluation, we utilized the development dataset from the I-JAS corpus, as described in Section Dataset. Table 4 provides a measurement report that presents the performance details of the 16 metrics under consideration. The measure separation was found to be 4.02, indicating a clear differentiation among the measures. The reliability index for the measure separation was 0.891, suggesting consistency in the measurement. Similarly, the person separation reliability index was 0.802, indicating the accuracy of the assessment in distinguishing between individuals. All 16 measures demonstrated Infit mean squares within a reasonable range, from 0.76 to 1.28. The Synonym overlap/paragraph (topic) measure exhibited a relatively high outfit mean square of 1.46, although its Infit mean square falls within the acceptable range. The standard error for the measures ranged from 0.13 to 0.28, indicating the precision of the estimates.

Table 5 further illustrates the weights assigned to different linguistic measures for score prediction, with higher weights indicating stronger correlations between those measures and higher scores. Specifically, the following measures exhibited higher weights than the others. Moving average type-token ratio per essay had a weight of 0.0391. Mean dependency distance had a weight of 0.0388. Mean length of clause, calculated by dividing the number of words by the number of clauses, had a weight of 0.0374. Complex nominals per T-unit, calculated by dividing the number of complex nominals by the number of T-units, had a weight of 0.0379. Coordinate phrases rate, calculated by dividing the number of coordinate phrases by the number of clauses, had a weight of 0.0325. Grammatical error rate, representing the number of errors per essay, had a weight of 0.0322.

Criteria (output indicator)

The criteria used to evaluate the writing ability in this study were based on CEFR, which follows a six-point scale ranging from A1 to C2. To assess the quality of Japanese writing, the scoring criteria from Table 6 were utilized. These criteria were derived from the IELTS writing standards and served as assessment guidelines and prompts for the written output.

A prompt is a question or detailed instruction that is provided to the model to obtain a proper response. After several pilot experiments, we decided to provide the measures (Section Measures of writing proficiency for nonnative Japanese) as the input prompt and use the criteria (Section Criteria (output indicator)) as the output indicator. Regarding the prompt language, given that the LLM was tasked with rating Japanese essays, would a prompt in Japanese work better Footnote 5 ? We conducted experiments comparing the performance of GPT-4 using both English and Japanese prompts. Additionally, we utilized the Japanese local model OCLL with Japanese prompts. Multiple trials were conducted using the same sample. Regardless of the prompt language used, we consistently obtained the same grading results with GPT-4, which assigned a grade of B1 to the writing sample. This suggested that GPT-4 is reliable and capable of producing consistent ratings regardless of the prompt language. On the other hand, when we used Japanese prompts with the Japanese local model OCLL, we encountered inconsistent grading results. Out of 10 attempts with OCLL, only 6 yielded the same grade (B1), while the remaining 4 showed different outcomes, including A1 and B2 grades. These findings indicated that the language of the prompt was not the determining factor for reliable AES. Instead, the size of the training data and the number of model parameters played crucial roles in achieving consistent and reliable AES results.

The following is the utilized prompt, which details all measures and requires the LLM to score the essays using holistic and trait scores.

Please evaluate Japanese essays written by Japanese learners and assign a score to each essay on a six-point scale, ranging from A1, A2, B1, B2, C1 to C2. Additionally, please provide trait scores and display the calculation process for each trait score. The scoring should be based on the following criteria:

Moving average type-token ratio.

Number of lexical words (token) divided by the total number of words per essay.

Number of sophisticated word types divided by the total number of words per essay.

Mean length of clause.

Verb phrases per T-unit.

Clauses per T-unit.

Dependent clauses per T-unit.

Complex nominals per clause.

Adverbial clauses per clause.

Coordinate phrases per clause.

Mean dependency distance.

Synonym overlap paragraph (topic and keywords).

Word2vec cosine similarity.

Connectives per essay.

Conjunctions per essay.

Number of metadiscourse markers (types) divided by the total number of words.

Number of errors per essay.

Japanese essay text

出かける前に二人が地図を見ている間に、サンドイッチを入れたバスケットに犬が入ってしまいました。それに気づかずに二人は楽しそうに出かけて行きました。やがて突然犬がバスケットから飛び出し、二人は驚きました。バスケットの中を見ると、食べ物はすべて犬に食べられていて、二人は困ってしまいました。(ID_JJJ01_SW1)

The score of the example above was B1. Figure 3 provides an example of holistic and trait scores provided by GPT-4 (with a prompt indicating all measures) via Bing Footnote 6 .

Example of GPT-4 AES and feedback (with a prompt indicating all measures).

Statistical analysis

The aim of this study is to investigate the potential use of LLM for nonnative Japanese AES. It seeks to compare the scoring outcomes obtained from feature-based AES tools, which rely on conventional machine learning technology (i.e. Jess, JWriter), with those generated by AI-driven AES tools utilizing deep learning technology (BERT, GPT, OCLL). To assess the reliability of a computer-assisted annotation tool, the study initially established human-human agreement as the benchmark measure. Subsequently, the performance of the LLM-based method was evaluated by comparing it to human-human agreement.

To assess annotation agreement, the study employed standard measures such as precision, recall, and F-score (Brants 2000 ; Lu 2010 ), along with the quadratically weighted kappa (QWK) to evaluate the consistency and agreement in the annotation process. Assume A and B represent human annotators. When comparing the annotations of the two annotators, the following results are obtained. The evaluation of precision, recall, and F-score metrics was illustrated in equations (13) to (15).

\({\rm{Recall}}(A,B)=\frac{{\rm{Number}}\,{\rm{of}}\,{\rm{identical}}\,{\rm{nodes}}\,{\rm{in}}\,A\,{\rm{and}}\,B}{{\rm{Number}}\,{\rm{of}}\,{\rm{nodes}}\,{\rm{in}}\,A}\)

\({\rm{Precision}}(A,\,B)=\frac{{\rm{Number}}\,{\rm{of}}\,{\rm{identical}}\,{\rm{nodes}}\,{\rm{in}}\,A\,{\rm{and}}\,B}{{\rm{Number}}\,{\rm{of}}\,{\rm{nodes}}\,{\rm{in}}\,B}\)

The F-score is the harmonic mean of recall and precision:

\({\rm{F}}-{\rm{score}}=\frac{2* ({\rm{Precision}}* {\rm{Recall}})}{{\rm{Precision}}+{\rm{Recall}}}\)

The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, which occurs if either precision or recall is zero.
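Equations (13) to (15) can be sketched as follows. This is an illustrative helper rather than the study's code: it assumes each annotator's output has been reduced to a collection of hashable "nodes" (for example, tuples of span and label), and it returns 0.0 where a ratio would otherwise divide by zero.

```python
def precision_recall_f(nodes_a, nodes_b):
    """Return (precision, recall, f_score) for annotations A and B.

    Recall    = identical nodes / nodes in A
    Precision = identical nodes / nodes in B
    F-score   = harmonic mean of precision and recall
    """
    a, b = set(nodes_a), set(nodes_b)
    identical = len(a & b)
    recall = identical / len(a) if a else 0.0
    precision = identical / len(b) if b else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical node identifiers produced by two annotators.
annotator_a = {"t1", "t2", "t3", "t4"}
annotator_b = {"t3", "t4", "t5"}
print(precision_recall_f(annotator_a, annotator_b))
```

Here the two annotators share two nodes, so recall is 2/4, precision is 2/3, and the F-score is 4/7.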

In accordance with Taghipour and Ng ( 2016 ), the calculation of QWK involves two steps:

Step 1: Construct a weight matrix W as follows:

\({W}_{{ij}}=\frac{{(i-j)}^{2}}{{(N-1)}^{2}}\)

Here, i represents the annotation made by the tool, while j represents the annotation made by a human rater. N denotes the total number of possible annotations. Matrix O is subsequently computed, where \({O}_{i,j}\) represents the count of data annotated as i by the tool and as j by the human annotator. E refers to the expected count matrix, which is normalized so that the sum of elements in E matches the sum of elements in O.

Step 2: With matrices O and E, the QWK is obtained as follows:

K = 1- \(\frac{{\sum }_{i,j}{W}_{i,j}\,{O}_{i,j}}{{\sum }_{i,j}{W}_{i,j}\,{E}_{i,j}}\)
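The two steps above can be combined into one short function. The sketch below is a plain-Python illustration under stated assumptions: scores are encoded as integers from 0 to N−1 (for instance A1 = 0 through C2 = 5, a mapping we introduce here), the expected matrix E is built from the outer product of the two raters' score histograms scaled to the same total as O, and a kappa of 1.0 is returned in the degenerate case where both raters use a single identical category.

```python
def quadratic_weighted_kappa(tool_scores, human_scores, num_categories):
    """Quadratically weighted kappa between tool and human scores.

    Scores must be integers in the range 0 .. num_categories - 1.
    """
    n = num_categories
    if n < 2:
        raise ValueError("at least two score categories are required")
    if len(tool_scores) != len(human_scores) or not tool_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    total = len(tool_scores)

    # Observed matrix O: O[i][j] counts items scored i by the tool, j by the human.
    observed = [[0.0] * n for _ in range(n)]
    for i, j in zip(tool_scores, human_scores):
        observed[i][j] += 1

    tool_hist = [sum(row) for row in observed]
    human_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2            # W[i][j]
            expected = tool_hist[i] * human_hist[j] / total  # E[i][j]
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        # Assumption: both raters used one identical category throughout.
        return 1.0
    return 1.0 - numerator / denominator


# Hypothetical scores on a six-point scale (A1 = 0, ..., C2 = 5).
tool = [2, 3, 3, 1, 4, 5]
human = [2, 3, 2, 1, 4, 4]
print(round(quadratic_weighted_kappa(tool, human, 6), 3))
```

Because the weights grow with the square of the distance between categories, a disagreement of two CEFR levels is penalized four times as heavily as a disagreement of one level.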

The value of the quadratic weighted kappa increases as the level of agreement improves. Further, to assess the accuracy of LLM scoring, the proportional reductive mean square error (PRMSE) was employed. The PRMSE approach takes into account the variability observed in human ratings to estimate the rater error, which is then subtracted from the variance of the human labels. This calculation provides an overall measure of agreement between the automated scores and true scores (Haberman et al. 2015 ; Loukina et al. 2020 ; Taghipour and Ng, 2016 ). The computation of PRMSE involves the following steps:

Step 1: Calculate the mean squared errors (MSEs) for the scoring outcomes of the computer-assisted tool (MSE tool) and the human scoring outcomes (MSE human).

Step 2: Determine the PRMSE by comparing the MSE of the computer-assisted tool (MSE tool) with the MSE from human raters (MSE human), using the following formula:

\({\rm{PRMSE}}=1-\frac{({\rm{MSE}}\,{\rm{tool}})\,}{({\rm{MSE}}\,{\rm{human}})\,}=1-\,\frac{{\sum }_{i=1}^{n}{({{\rm{y}}}_{i}-{\hat{{\rm{y}}}}_{{\rm{i}}})}^{2}}{{\sum }_{i=1}^{n}{({{\rm{y}}}_{i}-\hat{{\rm{y}}})}^{2}}\)

In the numerator, \({\hat{y}}_{i}\) represents the scoring outcome predicted by a specific LLM-driven AES system for a given sample. The term \({y}_{i}-{\hat{y}}_{i}\) represents the difference between this predicted outcome and the mean value of all LLM-driven AES systems’ scoring outcomes. It quantifies the deviation of the specific LLM-driven AES system’s prediction from the average prediction of all LLM-driven AES systems. In the denominator, \({y}_{i}-\hat{y}\) represents the difference between the scoring outcome provided by a specific human rater for a given sample and the mean value of all human raters’ scoring outcomes. It measures the discrepancy between the specific human rater’s score and the average score given by all human raters. The PRMSE is then calculated by subtracting the ratio of the MSE tool to the MSE human from 1. PRMSE falls within the range of 0 to 1, with larger values indicating reduced errors in the LLM’s scoring compared to those of human raters. In other words, a higher PRMSE implies that the LLM’s scoring demonstrates greater accuracy in predicting the true scores (Loukina et al. 2020). The interpretation of kappa values is based on the work of Landis and Koch (1977). Specifically, the following categories are assigned to different ranges of kappa values: −1 indicates complete inconsistency, 0 indicates random agreement, 0.0 ~ 0.20 indicates a slight level of agreement, 0.21 ~ 0.40 a fair level, 0.41 ~ 0.60 a moderate level, 0.61 ~ 0.80 a substantial level, and 0.81 ~ 1 an almost perfect level of agreement. All statistical analyses were executed using Python scripts.
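The PRMSE ratio described above can be sketched as below. This follows one reading of the formula as printed, which is an assumption on our part: \(y_i\) is taken to be the human score for essay i, \(\hat{y}_i\) the tool's score for the same essay, and the denominator uses the mean of the human scores. The fuller PRMSE of Haberman et al. and Loukina et al., which estimates rater error from multiple human ratings, is not reproduced here. The function name and the example scores are hypothetical.

```python
def prmse(human_scores, tool_scores):
    """1 - sum((y_i - yhat_i)^2) / sum((y_i - mean(y))^2).

    human_scores: y_i, the human score for each essay
    tool_scores:  yhat_i, the automated score for the same essay
    """
    if len(human_scores) != len(tool_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    mean_human = sum(human_scores) / len(human_scores)
    # Squared error of the tool against the human scores (numerator).
    sse_tool = sum((y - y_hat) ** 2
                   for y, y_hat in zip(human_scores, tool_scores))
    # Squared deviation of the human scores from their mean (denominator).
    sse_human = sum((y - mean_human) ** 2 for y in human_scores)
    if sse_human == 0:
        raise ValueError("human scores have zero variance")
    return 1.0 - sse_tool / sse_human


# Hypothetical scores for four essays.
human = [1, 2, 3, 4]
tool = [2, 2, 3, 3]
print(prmse(human, tool))
```

With these toy values the tool's squared error is 2 and the human score variation is 5, so the result is 0.6. Note that under this simplified form a tool that predicts worse than the mean human score would produce a negative value.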

Results and discussion

Annotation reliability of the LLM

This section focuses on assessing the reliability of the LLM’s annotation and scoring capabilities. To evaluate the reliability, several tests were conducted simultaneously, aiming to achieve the following objectives:

Assess the LLM’s ability to differentiate between test takers with varying levels of writing proficiency.

Determine the level of agreement between the annotations and scoring performed by the LLM and those done by human raters.

The evaluation of the results encompassed several metrics, including: precision, recall, F-Score, quadratically-weighted kappa, proportional reduction of mean squared error, Pearson correlation, and multi-faceted Rasch measurement.

Inter-annotator agreement (human–human annotator agreement)

We started with an agreement test of the two human annotators. Two trained annotators were recruited to determine the writing task data measures. A total of 714 scripts, as the test data, was utilized. Each analysis lasted 300–360 min. Inter-annotator agreement was evaluated using the standard measures of precision, recall, and F-score and QWK. Table 7 presents the inter-annotator agreement for the various indicators. As shown, the inter-annotator agreement was fairly high, with F-scores ranging from 1.0 for sentence and word number to 0.666 for grammatical errors.

The findings from the QWK analysis provided further confirmation of the inter-annotator agreement. The QWK values covered a range from 0.950 ( p = 0.000) for sentence and word number to 0.695 for synonym overlap number (keyword) and grammatical errors ( p = 0.001).

Agreement of annotation outcomes between human and LLM

To evaluate the consistency between human annotators and the LLM annotators (BERT, GPT, OCLL) across the indices, the same test was conducted. The results of the inter-annotator agreement (F-score) between LLM and human annotation are provided in Appendices B–D. The F-scores ranged from 0.706 for the number of grammatical errors (OCLL–human) to a perfect 1.000 for sentences, clauses, T-units, and words (GPT–human). These findings were further supported by the QWK analysis, which showed agreement levels ranging from 0.807 (p = 0.001) for metadiscourse markers (OCLL–human) to 0.962 (p < 0.001) for words (GPT–human). The findings demonstrated that the LLM annotation achieved a high level of accuracy in identifying measurement units and counts.

Reliability of LLM-driven AES’s scoring and discriminating proficiency levels

This section examines the reliability of the LLM-driven AES scoring through a comparison of the scoring outcomes produced by human raters and the LLM (Section Reliability of LLM-driven AES scoring). It also assesses the effectiveness of the LLM-based AES system in differentiating participants with varying proficiency levels (Section Reliability of LLM-driven AES discriminating proficiency levels).

Reliability of LLM-driven AES scoring

Table 8 summarizes the QWK coefficient analysis between the scores computed by the human raters and GPT-4 for the individual essays from I-JAS (see footnote 7). As shown, the QWK of all measures ranged from k = 0.644 for word2vec cosine similarity to k = 0.819 for lexical density (number of lexical words (tokens)/number of words per essay). Table 9 further presents the Pearson correlations between the 16 writing proficiency measures scored by human raters and GPT-4 for the individual essays. The correlations ranged from 0.672 for syntactic complexity to 0.734 for grammatical accuracy. The correlations between the writing proficiency scores assigned by human raters and the BERT-based AES system ranged from 0.661 for syntactic complexity to 0.713 for grammatical accuracy. The correlations between the writing proficiency scores given by human raters and the OCLL-based AES system ranged from 0.654 for cohesion to 0.721 for grammatical accuracy. These findings indicated an alignment between the assessments made by human raters and both the BERT-based and OCLL-based AES systems in terms of various aspects of writing proficiency.
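The Pearson correlations reported above compare two score vectors for the same essays. A minimal sketch of that computation follows. The function name `pearson_r` and the example scores are illustrative assumptions, not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical scores for five essays on one proficiency measure.
human_scores = [3, 4, 2, 5, 4]
system_scores = [3, 5, 2, 4, 4]
r = pearson_r(human_scores, system_scores)
```

Unlike QWK, this coefficient is insensitive to a constant offset between raters, so the two statistics are best read together.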

Reliability of LLM-driven AES discriminating proficiency levels

After validating the reliability of the LLM’s annotation and scoring, the subsequent objective was to evaluate its ability to distinguish between various proficiency levels. For this analysis, a dataset of 686 individual essays was utilized. Table 10 presents a sample of the results, summarizing the means, standard deviations, and the outcomes of the one-way ANOVAs based on the measures assessed by the GPT-4 model. A post hoc multiple comparison test, specifically the Bonferroni test, was conducted to identify any potential differences between pairs of levels.
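The one-way ANOVA and Bonferroni adjustment mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: the group values are invented, the function names are hypothetical, and the p-value lookup from the F distribution is omitted because it needs a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Invented values of one measure for three proficiency levels.
levels = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(levels)

# Three levels give three pairwise comparisons.
per_test_alpha = bonferroni_alpha(0.05, 3)
```

With three proficiency levels, each pairwise post hoc test is judged against `per_test_alpha` instead of the nominal 0.05.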

As the results reveal, seven measures presented linear upward or downward progress across the three proficiency levels. These are marked in bold in Table 10 and comprise one measure of lexical richness, i.e. MATTR (lexical diversity); four measures of syntactic complexity, i.e. MDD (mean dependency distance), MLC (mean length of clause), CNT (complex nominals per T-unit), and CPC (coordinate phrases rate); one cohesion measure, i.e. word2vec cosine similarity; and one measure of grammatical accuracy, i.e. GER (grammatical error rate). Regarding the ability of the sixteen measures to distinguish adjacent proficiency levels, the Bonferroni tests indicated that statistically significant differences exist between the primary level and the intermediate level for MLC and GER. One measure of lexical richness, namely LD, along with four measures of syntactic complexity (VPT, CT, DCT, ACC), two measures of cohesion (SOPT, SOPK), and one measure of content elaboration (IMM), exhibited statistically significant differences between proficiency levels. However, these differences did not demonstrate a linear progression between adjacent proficiency levels. No significant difference was observed in lexical sophistication between proficiency levels.
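Two of the lexical measures named above, MATTR and lexical density (LD), can be computed from a tokenized essay roughly as follows. This is a sketch under assumptions: the essay is taken to be already segmented and part-of-speech tagged (for Japanese this would need a morphological analyzer), the window size of 50 is a common default and not a value given in the source, and the tag set counted as lexical is hypothetical.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over fixed-size sliding windows."""
    if not tokens:
        raise ValueError("token list must not be empty")
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[start:start + window])) / window
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Assumed set of part-of-speech tags treated as lexical (content) words.
LEXICAL_TAGS = frozenset({"NOUN", "VERB", "ADJ", "ADV"})


def lexical_density(tagged_tokens, lexical_tags=LEXICAL_TAGS):
    """Number of lexical words divided by the total number of words."""
    if not tagged_tokens:
        raise ValueError("token list must not be empty")
    lexical = sum(1 for _, tag in tagged_tokens if tag in lexical_tags)
    return lexical / len(tagged_tokens)


# Hypothetical tagged sentence: "私は本を読む" (I read a book).
sentence = [("私", "PRON"), ("は", "ADP"), ("本", "NOUN"), ("を", "ADP"), ("読む", "VERB")]
density = lexical_density(sentence)
```

Averaging over windows makes MATTR less sensitive to essay length than a plain type-token ratio, which matters when learners at different levels write essays of different lengths.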

To summarize, our study aimed to evaluate the reliability and differentiation capabilities of the LLM-driven AES method. For the first objective, we assessed the LLM’s ability to differentiate between test takers with varying levels of oral proficiency using precision, recall, F-Score, and quadratically-weighted kappa. Regarding the second objective, we compared the scoring outcomes generated by human raters and the LLM to determine the level of agreement. We employed quadratically-weighted kappa and Pearson correlations to compare the 16 writing proficiency measures for the individual essays. The results confirmed the feasibility of using the LLM for annotation and scoring in AES for nonnative Japanese. As a result, Research Question 1 has been addressed.

Comparison of BERT-, GPT-, OCLL-based AES, and linguistic-feature-based computation methods

This section aims to compare the effectiveness of five AES methods for nonnative Japanese writing, i.e. three LLM-driven approaches (BERT, GPT, and OCLL) and two linguistic feature-based approaches (Jess and JWriter). The comparison was conducted by comparing the ratings obtained from each approach with human ratings. All ratings were derived from the dataset introduced in Section Dataset. To facilitate the comparison, the agreement between the automated methods and human ratings was assessed using QWK and PRMSE. The performance of each approach is summarized in Table 11.

The QWK coefficient values indicate that the LLMs (GPT, BERT, OCLL) agreed more closely with human ratings than the feature-based AES methods (Jess and JWriter) did in assessing the writing proficiency criteria, including lexical richness, syntactic complexity, content, and grammatical accuracy. Among the LLMs, the GPT-4-driven AES showed the highest agreement with human ratings on all criteria except syntactic complexity. The PRMSE values suggest that the GPT-based method outperformed the linguistic feature-based methods and the other LLM-based approaches. Moreover, an interesting finding emerged during the study: the agreement coefficient between GPT-4 and human scoring was even higher than the agreement between different human raters themselves. This finding highlights a potential advantage of GPT-based AES over human rating.

Rating involves a series of processes, including reading the learners’ writing, evaluating the content and language, and assigning scores. Within this chain of processes, various biases can be introduced, stemming from factors such as rater biases, test design, and rating scales. These biases can impact the consistency and objectivity of human ratings. GPT-based AES may benefit from its ability to apply consistent and objective evaluation criteria. By prompting the GPT model with detailed writing scoring rubrics and linguistic features, potential biases in human ratings can be mitigated. The model follows a predefined set of guidelines and does not possess the same subjective biases that human raters may exhibit. This standardization in the evaluation process contributes to the higher agreement observed between GPT-4 and human scoring. Section Prompt strategy delves further into the role of prompts in the application of LLMs to AES. It explores how the choice and implementation of prompts can impact the performance and reliability of LLM-based AES methods.

Furthermore, it is important to acknowledge the strengths of the Japanese local model OCLL, which excels in processing certain idiomatic expressions. Nevertheless, our analysis indicated that GPT-4 surpasses local models in AES. This superior performance can be attributed to the larger parameter size of GPT-4, estimated to be between 500 billion and 1 trillion, which exceeds the sizes of both BERT and the local model OCLL.

Prompt strategy

Regarding prompt strategy, Mizumoto and Eguchi (2023) applied the GPT-3 model to automatically score English essays in the TOEFL test. They found that the accuracy of the GPT model alone was moderate to fair. However, when they incorporated linguistic measures such as cohesion, syntactic complexity, and lexical features alongside the GPT model, the accuracy significantly improved. This highlights the importance of prompt engineering and of providing the model with specific instructions to enhance its performance. In this study, a similar approach was taken to optimize the performance of LLMs. GPT-4, which outperformed BERT and OCLL, was selected as the candidate model. Model 1 was used as the baseline, representing GPT-4 without any additional prompting. Model 2, on the other hand, involved GPT-4 prompted with all 16 measures, which included scoring criteria, efficient linguistic features for writing assessment, and detailed measurement units and calculation formulas. The remaining models (Models 3 to 18) utilized GPT-4 prompted with individual measures. The performance of these 18 models was assessed using the output indicators described in Section Criteria (output indicator). By comparing the performances of these models, the study aimed to understand the impact of prompt engineering on the accuracy and effectiveness of GPT-4 in AES tasks.
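The 18-model design described above (one baseline, one prompt with all 16 measures, and 16 single-measure prompts) can be laid out as in the sketch below. The measure abbreviations are taken from the results section, but everything else is an assumption for illustration: the prompt wording, the function names, and the order in which measures are assigned to Models 3 to 18 are hypothetical, and the authors’ actual prompts also included scoring criteria, measurement units, and calculation formulas.

```python
# The 16 measures named in the results section (order here is arbitrary).
MEASURES = [
    "MATTR", "LD", "lexical sophistication",
    "MDD", "MLC", "CNT", "CPC", "VPT", "CT", "DCT", "ACC",
    "word2vec cosine similarity", "SOPT", "SOPK",
    "IMM", "GER",
]


def build_prompt_variants(measures):
    """Map model names to the list of measures included in each prompt."""
    variants = {
        "Model 1": [],              # baseline: no additional prompting
        "Model 2": list(measures),  # all measures at once
    }
    for number, measure in enumerate(measures, start=3):
        variants[f"Model {number}"] = [measure]  # one measure at a time
    return variants


def render_prompt(essay, measures):
    """Assemble a hypothetical scoring prompt for one essay."""
    lines = ["Score the following essay written by a learner of Japanese."]
    if measures:
        lines.append("Take these measures into account: " + ", ".join(measures) + ".")
    lines.append("Essay:")
    lines.append(essay)
    return "\n".join(lines)


variants = build_prompt_variants(MEASURES)
baseline_prompt = render_prompt("(essay text)", variants["Model 1"])
full_prompt = render_prompt("(essay text)", variants["Model 2"])
```

Keeping the variants in one mapping makes it straightforward to score the same essays under every configuration and compare the resulting PRMSE values.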

Based on the PRMSE scores presented in Fig. 4 , it was observed that Model 1, representing GPT-4 without any additional prompting, achieved a fair level of performance. However, Model 2, which utilized GPT-4 prompted with all measures, outperformed all other models in terms of PRMSE score, achieving a score of 0.681. These results indicate that the inclusion of specific measures and prompts significantly enhanced the performance of GPT-4 in AES. Among the measures, syntactic complexity was found to play a particularly significant role in improving the accuracy of GPT-4 in assessing writing quality. Following that, lexical diversity emerged as another important factor contributing to the model’s effectiveness. The study suggests that a well-prompted GPT-4 can serve as a valuable tool to support human assessors in evaluating writing quality. By utilizing GPT-4 as an automated scoring tool, the evaluation biases associated with human raters can be minimized. This has the potential to empower teachers by allowing them to focus on designing writing tasks and guiding writing strategies, while leveraging the capabilities of GPT-4 for efficient and reliable scoring.

Fig. 4: PRMSE scores of the 18 AES models.

This study aimed to investigate two main research questions: the feasibility of utilizing LLMs for AES and the impact of prompt engineering on the application of LLMs in AES.

To address the first objective, the study compared the effectiveness of five different models: GPT, BERT, the Japanese local LLM (OCLL), and two conventional machine learning-based AES tools (Jess and JWriter). The PRMSE values indicated that the GPT-4-based method outperformed other LLMs (BERT, OCLL) and linguistic feature-based computational methods (Jess and JWriter) across various writing proficiency criteria. Furthermore, the agreement coefficient between GPT-4 and human scoring surpassed the agreement among human raters themselves, highlighting the potential of using the GPT-4 tool to enhance AES by reducing biases and subjectivity, saving time, labor, and cost, and providing valuable feedback for self-study. Regarding the second goal, the role of prompt design was investigated by comparing 18 models, including a baseline model, a model prompted with all measures, and 16 models prompted with one measure at a time. GPT-4, which outperformed BERT and OCLL, was selected as the candidate model. The PRMSE scores of the models showed that GPT-4 prompted with all measures achieved the best performance, surpassing the baseline and other models.

In conclusion, this study has demonstrated the potential of LLMs in supporting human rating in assessments. By incorporating automation, we can save time and resources while reducing biases and subjectivity inherent in human rating processes. Automated language assessments offer the advantage of accessibility, providing equal opportunities and economic feasibility for individuals who lack access to traditional assessment centers or necessary resources. LLM-based language assessments provide valuable feedback and support to learners, aiding in the enhancement of their language proficiency and the achievement of their goals. This personalized feedback can cater to individual learner needs, facilitating a more tailored and effective language-learning experience.

There are three important areas that merit further exploration. First, prompt engineering requires attention to ensure optimal performance of LLM-based AES across different language types. This study revealed that GPT-4, when prompted with all measures, outperformed models prompted with fewer measures. Therefore, investigating and refining prompt strategies can enhance the effectiveness of LLMs in automated language assessments. Second, it is crucial to explore the application of LLMs in second-language assessment and learning for oral proficiency, as well as their potential in under-resourced languages. Recent advancements in self-supervised machine learning techniques have significantly improved automatic speech recognition (ASR) systems, opening up new possibilities for creating reliable ASR systems, particularly for under-resourced languages with limited data. However, several challenges persist in the field of ASR. One is that automatic pronunciation evaluation assumes correct word pronunciation, which is difficult for learners in the early stages of language acquisition, whose accents are influenced by their native languages; accurately segmenting short words becomes problematic in such cases. Another is that developing precise audio-text transcriptions for non-native accented speech is a formidable task. Finally, assessing oral proficiency levels involves capturing various linguistic features, including fluency, pronunciation, accuracy, and complexity, which are not easily captured by current NLP technology.

Data availability

The dataset utilized was obtained from the International Corpus of Japanese as a Second Language (I-JAS), available at: https://www2.ninjal.ac.jp/jll/lsaj/ihome2.html

1. J-CAT and TTBJ are two computerized adaptive tests used to assess Japanese language proficiency. SPOT is a specific component of the TTBJ test. J-CAT: https://www.j-cat2.org/html/ja/pages/interpret.html ; SPOT: https://ttbj.cegloc.tsukuba.ac.jp/p1.html#SPOT

2. The study utilized a prompt-based GPT-4 model, developed by OpenAI, which reportedly has 1.8 trillion parameters across 120 layers. GPT-4 was reportedly trained on a dataset of 13 trillion tokens in two stages: initial training on internet text datasets to predict the next token, and subsequent fine-tuning through reinforcement learning from human feedback.

3. https://www2.ninjal.ac.jp/jll/lsaj/ihome2-en.html

4. http://jhlee.sakura.ne.jp/JEV/ by Japanese Learning Dictionary Support Group 2015.

5. We express our sincere gratitude to the reviewer for bringing this matter to our attention.

6. On February 7, 2023, Microsoft began rolling out a major overhaul to Bing that included a new chatbot feature based on OpenAI’s GPT-4 (Bing.com).

7. Appendices E–F present the analysis results of the QWK coefficient between the scores computed by the human raters and the BERT and OCLL models.

Attali Y, Burstein J (2006) Automated essay scoring with e-rater® V.2. J. Technol., Learn. Assess., 4

Barkaoui K, Hadidi A (2020) Assessing Change in English Second Language Writing Performance (1st ed.). Routledge, New York. https://doi.org/10.4324/9781003092346

Bentz C, Tatyana R, Koplenig A, Tanja S (2016) A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the workshop on computational linguistics for linguistic complexity (CL4LC), 142–153. Osaka, Japan: The COLING 2016 Organizing Committee

Bond TG, Yan Z, Heene M (2021) Applying the Rasch model: Fundamental measurement in the human sciences (4th ed). Routledge

Brants T (2000) Inter-annotator agreement for a German newspaper corpus. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00), Athens, Greece, 31 May-2 June, European Language Resources Association

Brown TB, Mann B, Ryder N, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems, Online, 6–12 December, Curran Associates, Inc., Red Hook, NY

Burstein J (2003) The E-rater scoring engine: Automated essay scoring with natural language processing. In Shermis MD and Burstein JC (ed) Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ

Čech R, Miroslav K (2018) Morphological richness of text. In Masako F, Václav C (ed) Taming the corpus: From inflection and lexis to interpretation, 63–77. Cham, Switzerland: Springer Nature

Çöltekin Ç, Taraka, R (2018) Exploiting Universal Dependencies treebanks for measuring morphosyntactic complexity. In Aleksandrs B, Christian B (ed), Proceedings of first workshop on measuring language complexity, 1–7. Torun, Poland

Crossley SA, Cobb T, McNamara DS (2013) Comparing count-based and band-based indices of word frequency: Implications for active vocabulary research and pedagogical applications. System 41:965–981. https://doi.org/10.1016/j.system.2013.08.002

Crossley SA, McNamara DS (2016) Say more and be more coherent: How text elaboration and cohesion can increase writing quality. J. Writ. Res. 7:351–370

CyberAgent Inc (2023) Open-Calm series of Japanese language models. Retrieved from: https://www.cyberagent.co.jp/news/detail/id=28817

Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, 2–7 June, pp. 4171–4186. Association for Computational Linguistics

Diez-Ortega M, Kyle K (2023) Measuring the development of lexical richness of L2 Spanish: a longitudinal learner corpus study. Studies in Second Language Acquisition 1-31

Eckes T (2009) On common ground? How raters perceive scoring criteria in oral proficiency testing. In Brown A, Hill K (ed) Language testing and evaluation 13: Tasks and criteria in performance assessment (pp. 43–73). Peter Lang Publishing

Elliot S (2003) IntelliMetric: from here to validity. In: Shermis MD, Burstein JC (ed) Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ

Engber CA (1995) The relationship of lexical proficiency to the quality of ESL compositions. J. Second Lang. Writ. 4:139–155

Garner J, Crossley SA, Kyle K (2019) N-gram measures and L2 writing proficiency. System 80:176–187. https://doi.org/10.1016/j.system.2018.12.001

Haberman SJ (2008) When can subscores have value? J. Educat. Behav. Stat., 33:204–229

Haberman SJ, Yao L, Sinharay S (2015) Prediction of true test scores from observed item scores and ancillary data. Brit. J. Math. Stat. Psychol. 68:363–385

Halliday MAK (1985) Spoken and Written Language. Deakin University Press, Melbourne, Australia

Hirao R, Arai M, Shimanaka H et al. (2020) Automated essay scoring system for nonnative Japanese learners. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 1250–1257. European Language Resources Association

Hunt KW (1966) Recent Measures in Syntactic Development. Elementary English, 43(7), 732–739. http://www.jstor.org/stable/41386067

Ishioka T (2001) About e-rater, a computer-based automatic scoring system for essays [Konpyūta ni yoru essei no jidō saiten shisutemu e − rater ni tsuite]. University Entrance Examination. Forum [Daigaku nyūshi fōramu] 24:71–76

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput. 9(8):1735–1780

Ishioka T, Kameda M (2006) Automated Japanese essay scoring system based on articles written by experts. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17–18 July 2006, pp. 233-240. Association for Computational Linguistics, USA

Japan Foundation (2021) Retrieved from: https://www.jpf.gp.jp/j/project/japanese/survey/result/dl/survey2021/all.pdf

Jarvis S (2013a) Defining and measuring lexical diversity. In Jarvis S, Daller M (ed) Vocabulary knowledge: Human ratings and automated measures (Vol. 47, pp. 13–44). John Benjamins. https://doi.org/10.1075/sibil.47.03ch1

Jarvis S (2013b) Capturing the diversity in lexical diversity. Lang. Learn. 63:87–106. https://doi.org/10.1111/j.1467-9922.2012.00739.x

Jiang J, Ouyang J, Liu H (2019) Interlanguage: A perspective of quantitative linguistic typology. Lang. Sci. 74:85–97

Kim M, Crossley SA, Kyle K (2018) Lexical sophistication as a multidimensional phenomenon: Relations to second language lexical proficiency, development, and writing quality. Mod. Lang. J. 102(1):120–141. https://doi.org/10.1111/modl.12447

Kojima T, Gu S, Reid M et al. (2022) Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, New Orleans, LA, 29 November-1 December, Curran Associates, Inc., Red Hook, NY

Kyle K, Crossley SA (2015) Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Q 49:757–786

Kyle K, Crossley SA, Berger CM (2018) The tool for the automatic analysis of lexical sophistication (TAALES): Version 2.0. Behav. Res. Methods 50:1030–1046. https://doi.org/10.3758/s13428-017-0924-4

Kyle K, Crossley SA, Jarvis S (2021) Assessing the validity of lexical diversity using direct judgements. Lang. Assess. Q. 18:154–170. https://doi.org/10.1080/15434303.2020.1844205

Landauer TK, Laham D, Foltz PW (2003) Automated essay scoring and annotation of essays with the Intelligent Essay Assessor. In Shermis MD, Burstein JC (ed), Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ

Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 159–174

Laufer B, Nation P (1995) Vocabulary size and use: Lexical richness in L2 written production. Appl. Linguist. 16:307–322. https://doi.org/10.1093/applin/16.3.307

Lee J, Hasebe Y (2017) jWriter Learner Text Evaluator, URL: https://jreadability.net/jwriter/

Lee J, Kobayashi N, Sakai T, Sakota K (2015) A Comparison of SPOT and J-CAT Based on Test Analysis [Tesuto bunseki ni motozuku ‘SPOT’ to ‘J-CAT’ no hikaku]. Research on the Acquisition of Second Language Japanese [Dainigengo to shite no nihongo no shūtoku kenkyū] (18) 53–69

Li W, Yan J (2021) Probability distribution of dependency distance based on a Treebank of Japanese EFL Learners’ Interlanguage. J. Quant. Linguist. 28(2):172–186. https://doi.org/10.1080/09296174.2020.1754611

Linacre JM (2002) Optimizing rating scale category effectiveness. J. Appl. Meas. 3(1):85–106

Linacre JM (1994) Constructing measurement with a Many-Facet Rasch Model. In Wilson M (ed) Objective measurement: Theory into practice, Volume 2 (pp. 129–144). Norwood, NJ: Ablex

Liu H (2008) Dependency distance as a metric of language comprehension difficulty. J. Cognitive Sci. 9:159–191

Liu H, Xu C, Liang J (2017) Dependency distance: A new perspective on syntactic patterns in natural languages. Phys. Life Rev. 21. https://doi.org/10.1016/j.plrev.2017.03.002

Loukina A, Madnani N, Cahill A, et al. (2020) Using PRMSE to evaluate automated scoring systems in the presence of label noise. Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, Seattle, WA, USA → Online, 10 July, pp. 18–29. Association for Computational Linguistics

Lu X (2010) Automatic analysis of syntactic complexity in second language writing. Int. J. Corpus Linguist. 15:474–496

Lu X (2012) The relationship of lexical richness to the quality of ESL learners’ oral narratives. Mod. Lang. J. 96:190–208

Lu X (2017) Automated measurement of syntactic complexity in corpus-based L2 writing research and implications for writing assessment. Lang. Test. 34:493–511

Lu X, Hu R (2022) Sense-aware lexical sophistication indices and their relationship to second language writing quality. Behav. Res. Method. 54:1444–1460. https://doi.org/10.3758/s13428-021-01675-6

Ministry of Health, Labor, and Welfare of Japan (2022) Retrieved from: https://www.mhlw.go.jp/stf/newpage_30367.html

Mizumoto A, Eguchi M (2023) Exploring the potential of using an AI language model for automated essay scoring. Res. Methods Appl. Linguist. 3:100050

Okgetheng B, Takeuchi K (2024) Estimating Japanese Essay Grading Scores with Large Language Models. Proceedings of 30th Annual Conference of the Language Processing Society in Japan, March 2024

Ortega L (2015) Second language learning explained? SLA across 10 contemporary theories. In VanPatten B, Williams J (ed) Theories in Second Language Acquisition: An Introduction

Rae JW, Borgeaud S, Cai T, et al. (2021) Scaling Language Models: Methods, Analysis & Insights from Training Gopher. ArXiv, abs/2112.11446

Read J (2000) Assessing vocabulary. Cambridge University Press. https://doi.org/10.1017/CBO9780511732942

Rudner LM, Liang T (2002) Automated Essay Scoring Using Bayes’ Theorem. J. Technol., Learning and Assessment, 1 (2)

Sakoda K, Hosoi Y (2020) Accuracy and complexity of Japanese Language usage by SLA learners in different learning environments based on the analysis of I-JAS, a learners’ corpus of Japanese as L2. Math. Linguist. 32(7):403–418. https://doi.org/10.24701/mathling.32.7_403

Suzuki N (1999) Summary of survey results regarding comprehensive essay questions. Final report of “Joint Research on Comprehensive Examinations for the Aim of Evaluating Applicability to Each Specialized Field of Universities” for 1996-2000 [shōronbun sōgō mondai ni kansuru chōsa kekka no gaiyō. Heisei 8 - Heisei 12-nendo daigaku no kaku senmon bun’ya e no tekisei no hyōka o mokuteki to suru sōgō shiken no arikata ni kansuru kyōdō kenkyū’ saishū hōkoku-sho]. University Entrance Examination Section Center Research and Development Department [Daigaku nyūshi sentā kenkyū kaihatsubu], 21–32

Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, 1–5 November, pp. 1882–1891. Association for Computational Linguistics

Takeuchi K, Ohno M, Motojin K, Taguchi M, Inada Y, Iizuka M, Abo T, Ueda H (2021) Development of essay scoring methods based on reference texts with construction of research-available Japanese essay data. In IPSJ J 62(9):1586–1604

Ure J (1971) Lexical density: A computational technique and some findings. In Coulthard M (ed) Talking about Text. English Language Research, University of Birmingham, Birmingham, England

Vaswani A, Shazeer N, Parmar N, et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Long Beach, CA, 4–7 December, pp. 5998–6008, Curran Associates, Inc., Red Hook, NY

Watanabe H, Taira Y, Inoue Y (1988) Analysis of essay evaluation data [Shōronbun hyōka dēta no kaiseki]. Bulletin of the Faculty of Education, University of Tokyo [Tōkyōdaigaku kyōiku gakubu kiyō], Vol. 28, 143–164

Yao S, Yu D, Zhao J, et al. (2023) Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36

Zenker F, Kyle K (2021) Investigating minimum text lengths for lexical diversity indices. Assess. Writ. 47:100505. https://doi.org/10.1016/j.asw.2020.100505

Zhang Y, Warstadt A, Li X, et al. (2021) When do you need billions of words of pretraining data? Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, pp. 1112-1125. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.90

This research was funded by the National Foundation of Social Sciences (22BYY186), awarded to Wenchao Li.

Author information

Authors and affiliations.

Department of Japanese Studies, Zhejiang University, Hangzhou, China

Department of Linguistics and Applied Linguistics, Zhejiang University, Hangzhou, China

Contributions

Wenchao Li is in charge of conceptualization, validation, formal analysis, investigation, data curation, visualization and writing the draft. Haitao Liu is in charge of supervision.

Corresponding author

Correspondence to Wenchao Li.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Ethical approval

Ethical approval was not required as the study did not involve human participants.

Informed consent

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental material file #1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Li, W., Liu, H. Applying large language models for automated essay scoring for non-native Japanese. Humanit Soc Sci Commun 11, 723 (2024). https://doi.org/10.1057/s41599-024-03209-9

Received: 02 February 2024

Accepted: 16 May 2024

Published: 03 June 2024

DOI: https://doi.org/10.1057/s41599-024-03209-9


COMMENTS

Free Online Paper and Essay Checker
PaperRater's online essay checker is built for easy access and straightforward use. Get quick results and reports to turn in assignments and essays on time. 2. Advanced Checks. Experience in-depth analysis and detect even the most subtle errors with PaperRater's comprehensive essay checker and grader. 3.
Free Paper Grader: Improve Your Writing With Essay Rater
02. Get results. The free online essay revision runs automatically in the background. After evaluation, results and grades will appear on the screen. 03. Make corrections. Evaluate your mistakes, correct them, and improve your writing skills! Feel free to edit your essay right in the input window.
Online Essay Grading App that Scores Essays and Papers
Scoring with rubrics. Rubrics establish a guide for evaluating the quality of student work. Whether scoring an essay or research paper, a live performance or art project, or other student-constructed responses, rubrics clearly delineate the various components of the assignment to be graded and the degree of success achieved within each of those areas.
Free Paper Grader
Most high school or college-level essays, research papers, term papers, and similar documents are eligible for Kibin's free grading service. Your paper should have between 225 and 3,000 words, include a single essay or piece of writing, and have a single author (you!).
Free Essay and Paper Checker
Scribbr is committed to protecting academic integrity. Our plagiarism checker, AI Detector, Citation Generator, proofreading services, paraphrasing tool, grammar checker, summarizer, and free Knowledge Base content are designed to help students produce quality academic papers. We make every effort to prevent our software from being used for ...
Grammar Check
The Virtual Writing Tutor can provide automated essay evaluation with a score and formative feedback on a variety of writing assignments. Try the Virtual Writing Tutor's essay checker scoring system. This website is an opinion essay checker, film-analysis essay checker, argument essay checker, cover letter checker, IELTS essay checker, and self-scoring pen pal exchange system.
AI Grader
Our AI grader matches human scores 82% of the time.* AI scores are 100% consistent.** Graph (Standard AI vs. Advanced AI; axes: deviation from real grade on a 10-point scale, real grade): A dataset of essays was graded by professional graders on a scale of 1-10 and cross-referenced against the detailed criteria within the rubric to determine their real scores.
Automated Essay Scoring
Automated Essay Scoring is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by the following four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.
About the e-rater Scoring Engine
The e-rater automated scoring engine uses AI technology and Natural Language Processing (NLP) to evaluate the writing proficiency of student essays by providing automatic scoring and feedback. The engine provides descriptive feedback on the writer's grammar, mechanics, word use and complexity, style, organization and more.
SmartMarq: Essay marking with rubrics and AI
SmartMarq will streamline your essay marking process. SmartMarq makes it easy to implement large-scale, professional essay scoring. Reduce timelines for marking. Increase convenience by managing fully online. Implement business rules to ensure quality. Once raters are done, run the results through our AI to train a custom machine learning model ...
Free IELTS Essay Checker
To practice writing under timed conditions: The IELTS writing checker can simulate the conditions of the actual exam by giving you a prompt and a time limit to complete your essay. You should complete section 1 in 20 minutes, and write at least 150 words. Section 2 should take 40 minutes, with at least 250 words.
Scribbr
Help you achieve your academic goals. Whether we're proofreading and editing, checking for plagiarism or AI content, generating citations, or writing useful Knowledge Base articles, our aim is to support students on their journey to become better academic writers. We believe that every student should have the right tools for academic success.
The e-rater Scoring Engine
ETS is a global leader in educational assessment, measurement and learning science. Our AI technology, such as the e-rater® scoring engine, informs decisions and creates opportunities for learners around the world. The e-rater engine automatically: assesses and nurtures key writing skills; scores essays and provides feedback on writing using a ...
Check your IELTS essay online. Correction and Evaluation Service
Check your IELTS essay in a couple of minutes. Get the clear and detailed report on your IELTS essay that you need to discover your weak areas and fix them. ... Our service is designed to provide a score estimate based on IELTS standards, but the final score you get on the actual IELTS exam might be slightly different due to ...
What is Automated Essay Scoring, Marking, Grading?
Nathan Thompson, PhD. April 25, 2023. Automated essay scoring (AES) is an important application of machine learning and artificial intelligence to the field of psychometrics and assessment. In fact, it's been around far longer than "machine learning" and "artificial intelligence" have been buzzwords in the general public!
Project Score
This course features Score, the free OER Project essay-scoring service. When combined with writing prompts and scaffolded pre- and post-writing activities, this powerful tool provides consistent formative feedback and loads of data to inform instruction and discussion. Prompts: A variety of writing prompts address enduring historical questions ...
An automated essay scoring systems: a systematic literature review
Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades the student responses by considering appropriate features. The AES research started in 1966 with the Project Essay Grader (PEG) by Ajay et al. PEG evaluates the writing characteristics such as grammar, diction, construction, etc., to grade ...
Automated essay scoring
Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades, for example, the numbers 1 to 6.
Free Online GRE AWA Essay Grader
Grade my GRE Essay. There is no software to download. You can use our free online GRE essay grader immediately. All you need to do is: Type or paste your GRE essay in the box below. [Wait for the text box to load. If it's taking too long, refresh the page.] Click on the 'Check' button. Your essay grade along with the breakup across 3 dimensions ...
10 Online Test-Grading Jobs (Plus Definition and Benefits)
Here's a list of companies that offer remote test-grading jobs for you to consider: 1. Pearson. Company size: More than 10,000 employees. Description: Founded in 1844, Pearson is a publishing and learning company. It offers its customers a place to purchase and rent textbooks while also providing a learning platform for users to explore.
Automatic essay scoring using NLP
Essay evaluation is a systematic way of rating written essays. Automatic essay scoring is the process of grading essays without human intervention. The computer systems are trained using artificial intelligence architectures, which is where natural language processing comes into the picture.
Explainable Automated Essay Scoring: Deep Learning Really Has
Automated essay scoring (AES) is a compelling topic in Learning Analytics, primarily because recent advances in AI find it a good testbed for exploring artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few have even developed AES models at the rubric level, the very first layer of explanation underlying the prediction of ...
Online Scoring
The Online Network for Evaluation (ONE) is ETS's proprietary software application that enables Raters to score test-taker responses and/or other submitted work via secure internet access. Raters typically score from their homes or offices, in accordance with ETS guidelines. Online scoring sessions may occasionally be conducted at regional sites.
Applying large language models for automated essay scoring for non
Recent advancements in artificial intelligence (AI) have led to an increased use of large language models (LLMs) for language assessment tasks such as automated essay scoring (AES), automated ...
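
Several of the snippets above describe automated scores in terms of how often they match human scores on a discrete scale (for example, grades from 1 to 6, or a tool that reports matching human graders 82% of the time). As a rough illustration only, the Python sketch below computes two simple agreement figures between a human rater and a scoring engine: exact agreement and agreement within one point. The function name `agreement_rates` and the sample scores are hypothetical and made up for this example; none of the tools listed above is known to compute its figures this way.

```python
from typing import Sequence, Tuple


def agreement_rates(human: Sequence[int], machine: Sequence[int]) -> Tuple[float, float]:
    """Return (exact, adjacent) agreement between two lists of integer scores.

    exact    = share of essays where both scores are identical
    adjacent = share of essays where the scores differ by at most one point
    """
    if len(human) != len(machine):
        raise ValueError("score lists must have the same length")
    if not human:
        raise ValueError("score lists must not be empty")
    n = len(human)
    exact = sum(h == m for h, m in zip(human, machine)) / n
    adjacent = sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / n
    return exact, adjacent


# Made-up scores on a 1-6 scale, purely for illustration.
human_scores = [4, 3, 5, 2, 6, 4, 3, 5, 4, 2]
machine_scores = [4, 3, 4, 2, 6, 4, 1, 5, 4, 3]

exact, adjacent = agreement_rates(human_scores, machine_scores)
print(f"exact agreement: {exact:.0%}, within one point: {adjacent:.0%}")
# prints "exact agreement: 70%, within one point: 90%"
```

Exact agreement is the strictest of the two figures. Reporting it alongside within-one-point agreement gives a quick sense of whether disagreements are near misses or large gaps.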