By Didi Sukyadi
English Education Department
Indonesia University of Education
A practical test:
- Is not excessively expensive
- Stays within appropriate time constraints
- Is relatively easy to administer
- Has a scoring/evaluation procedure that is specific and time-efficient
- Has items that can be replicated in terms of the resources needed (e.g. time, materials, people)
- Can be administered
- Can be graded
- Yields results that can be interpreted
A reliable test is consistent and dependable.
- Reliability is related to accuracy, dependability and consistency, e.g. 20°C here today and 20°C in North Italy: are they the same?
- According to Henning, reliability is a measure of the accuracy, consistency, dependability, or fairness of scores resulting from the administration of a particular examination, e.g. 75% on a test today and 83% tomorrow points to a problem with reliability.
- Student-related reliability: the deviation of an observed score from one’s true score because of temporary illness, fatigue, anxiety, a bad day, etc.
- Rater reliability: two or more scorers yield inconsistent scores on the same test because of lack of attention to scoring criteria, inexperience, inattention, or preconceived bias.
- Administration reliability: unreliable results caused by the testing environment, such as noise or a poor-quality cassette tape.
- Test reliability: measurement errors caused by the test itself, for example because it is too long.
To Make a Test More Reliable
- Take enough samples of behaviour
- Exclude items which do not discriminate well between weaker and stronger students
- Do not allow candidates too much freedom
- Provide clear and explicit instructions
- Make sure that the tests are perfectly laid out and legible
- Make candidates familiar with the format and testing techniques
- Provide uniform and undistracted conditions of administration
- Use items that permit objective scoring
- Provide a detailed scoring key
- Train scorers
- Identify candidates by number, not by name
- Employ multiple, independent scoring
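The advice above to exclude items that do not discriminate well can be made concrete with a small Python sketch. It uses the upper-lower group discrimination index (proportion correct in the top-scoring group minus proportion correct in the bottom-scoring group). This is one common approach and is an assumption here, not something the slides specify. The function name, the one-third group size, and the response data are all hypothetical.

```python
def discrimination_index(responses, item, group_fraction=1 / 3):
    """Upper-lower discrimination index for one item.

    responses: one list of 0/1 item scores per student (hypothetical layout).
    item: index of the item to examine.
    group_fraction: share of students in each comparison group (an assumption).
    """
    # Rank students from highest to lowest total score.
    ranked = sorted(responses, key=sum, reverse=True)
    n = max(1, round(len(ranked) * group_fraction))
    upper, lower = ranked[:n], ranked[-n:]
    # Proportion answering the item correctly in each group.
    p_upper = sum(student[item] for student in upper) / n
    p_lower = sum(student[item] for student in lower) / n
    return p_upper - p_lower


# Hypothetical responses: 6 students x 3 items (1 = correct, 0 = wrong).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]

for item in range(3):
    print(item, discrimination_index(responses, item))
```

An index near zero (or below zero) suggests that an item does not separate stronger from weaker students, so it is a candidate for exclusion. In this made-up data the third item is answered equally well by both groups.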
- Test-retest reliability: administer the same test twice to the same group and compare the two sets of scores.
- Equivalent-forms (parallel-forms) reliability: administer two different but equivalent tests (e.g. Form A and Form B) to a single group of students.
- Internal consistency reliability: estimate the consistency of a test using only information internal to the test, available from one administration of a single test. A common procedure for this is the split-half method.
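A minimal Python sketch of the split-half method mentioned above. It assumes an odd/even split of the items and applies the Spearman-Brown correction to step the half-test correlation up to the full test length. Neither choice is stated in the slides, and the function names and the data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one list of item scores per student, from one administration."""
    odd_half = [sum(student[0::2]) for student in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(student[1::2]) for student in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd_half, even_half))


# Hypothetical data: 5 students x 6 items (1 = correct, 0 = wrong).
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 2))
```

The same `pearson` helper could be applied to two administrations of one test (test-retest) or to Form A and Form B scores (equivalent forms); only the source of the two score lists changes.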
- Criterion-related validity: the degree to which results on the test agree with those provided by some independent and highly dependable assessment of the candidates’ ability.
- Construct validity: a construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions. Proficiency and communicative competence are linguistic constructs; self-esteem and motivation are psychological constructs. Construct validity concerns whether the test actually taps the construct it claims to measure.
- Reliability coefficient: used to compare the reliability of different tests.
- Lado: vocabulary, structure, reading (0.90-0.99); auditory comprehension (0.80-0.89); oral production (0.70-0.79)
- Standard error of measurement: how far an individual test taker’s actual score is likely to diverge from their true score
- Classical analysis: gives a single estimate for all test takers
- Item response theory: gives an estimate for each individual, based on that individual’s performance
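As an illustration of the standard error described above, the sketch below uses the usual classical formula, SEM = SD × √(1 − reliability). The formula is standard in classical test theory but is not given in the slides, and the SD, reliability, and observed score are hypothetical values.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Classical estimate: SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


def score_band(observed, sem, width=1):
    """Range of +/- `width` SEMs around an observed score."""
    return observed - width * sem, observed + width * sem


# Hypothetical values: score SD of 10 and a reliability coefficient of 0.91.
sem = standard_error_of_measurement(sd=10, reliability=0.91)
low, high = score_band(observed=75, sem=sem)
print(round(sem, 2), round(low, 2), round(high, 2))
```

Because the reliability coefficient describes the test as a whole, this classical estimate is the same for every test taker, which is the contrast with item response theory drawn in the list above.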
Validity: the extent to which the inferences made from assessment results are appropriate, meaningful and useful in terms of the purpose of the assessment.
- Content validity: the test requires the test taker to perform the behaviour that is being measured.
- Content validity: the test’s content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned.
- Consequential validity: all the consequences of a test, including its accuracy in measuring intended criteria, its impact on the preparation of test takers, its effect on the learner, and the social consequences of test interpretation and use.
- Face validity: the degree to which the test looks right and appears to measure the knowledge and abilities it claims to measure, based on the subjective judgement of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers.
- Response validity [internal]: the extent to which test takers respond in the way expected by the test developers
- Concurrent validity [external]: the extent to which test takers' scores on one test relate to those on another externally recognised test or measure
- Predictive validity [external]: the extent to which scores on test Y predict test takers' ability to do X, e.g. IELTS scores and success in academic studies at university
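Concurrent and predictive validity are commonly reported as a correlation between scores on the test and scores on the external measure. The Python sketch below assumes a Pearson correlation for this purpose; the slides do not name a statistic, and the function name and both score lists are hypothetical.

```python
from math import sqrt


def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and scores on a criterion measure."""
    n = len(test_scores)
    mean_t = sum(test_scores) / n
    mean_c = sum(criterion_scores) / n
    cov = sum((t - mean_t) * (c - mean_c)
              for t, c in zip(test_scores, criterion_scores))
    var_t = sum((t - mean_t) ** 2 for t in test_scores)
    var_c = sum((c - mean_c) ** 2 for c in criterion_scores)
    return cov / sqrt(var_t * var_c)


# Hypothetical data for six test takers.
new_test = [55, 62, 70, 74, 81, 90]            # scores on the test being validated
established = [5.0, 5.5, 6.0, 6.5, 7.0, 8.0]   # scores on an externally recognised measure

print(round(validity_coefficient(new_test, established), 2))
```

For concurrent validity the criterion scores would be collected at about the same time as the test; for predictive validity they would be a later outcome, such as grades in subsequent university study.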
'Validity is not a characteristic of a test, but a feature of the inferences made on the basis of test scores and the uses to which a test is put.'
To make a test more valid:
- Write explicit test specifications
- Use direct testing
- Relate the scoring of responses directly to what is being tested
- Make the test reliable
Washback: the quality of the relationship between a test and associated teaching.
- Washback can be positive or negative.
- A test is valid when it has good washback.
- Students have ready access to you to discuss the feedback and evaluation you have given.
- The effect of testing on teaching and learning
- The effect of a test on instruction in terms of how students prepare for the test
- A formative test provides washback in the form of information to the learner on progress toward goals, while a summative test is always the beginning of further pursuits, more learning, and more goals.
- To improve washback: use direct testing, use criterion-referenced testing, base achievement tests on objectives, and make sure that the tests are understood by students and teachers.
Evaluation of Classroom Tests
- Are the test procedures practical?
- Is the test reliable?
- Does the procedure demonstrate content validity?
- Is the procedure face valid and "biased for best"?
- Are the test tasks as authentic as possible?
- Does the test give beneficial washback?
NRT and CRT
- A norm-referenced test (NRT) is designed to measure global language abilities such as overall English proficiency, academic listening ability, reading comprehension, and so on.
- Each student’s score on such a test is interpreted relative to the scores of all other students who took the test, with reference to the normal distribution.
- A criterion-referenced test (CRT) is usually produced to measure well-defined and fairly specific instructional objectives.
- The interpretation of a CRT is considered absolute in the sense that each student’s score is meaningful without reference to the other students’ scores.
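The difference between the two interpretations can be sketched in Python. The norm-referenced function places a score relative to the group (z-score and percentile rank), while the criterion-referenced function compares it with a fixed standard. The function names, the 80% cut score, and the class scores are all hypothetical.

```python
from statistics import mean, pstdev


def norm_referenced(score, all_scores):
    """Interpret a score relative to the group: z-score and percentile rank."""
    z = (score - mean(all_scores)) / pstdev(all_scores)
    below = sum(1 for s in all_scores if s < score)
    percentile = 100 * below / len(all_scores)
    return z, percentile


def criterion_referenced(score, max_score, cut=0.8):
    """Interpret a score against a fixed standard; the 80% cut is an assumption."""
    return score / max_score >= cut


# Hypothetical class scores out of 100.
scores = [40, 50, 60, 70, 80]
z, percentile = norm_referenced(80, scores)
print(round(z, 2), percentile)
print(criterion_referenced(80, 100), criterion_referenced(70, 100))
```

The norm-referenced result changes if the other students' scores change; the criterion-referenced result depends only on the student's own score and the standard.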
Test and Decision Purposes
TYPES OF DECISION
Characteristics of communicative tests
Communicative test setting requirements:
- Meaningful communication
- Authentic situation
- Unpredictable language input
- Creative language output
- All language skills
Bases for ratings:
- Success in getting meaning across
- Use focus rather than usage
- New components to be rated
Components of Communicative competence
- Grammatical competence (phonology, orthography, vocabulary, word formation, sentence formation)
- Sociolinguistic competence (social meanings, grammatical forms in different sociolinguistic contexts)
- Discourse competence (cohesion in different genres, coherence in different genres)
- Strategic competence (grammatical difficulties, sociolinguistic difficulties, discourse difficulties, performance factors)
- Discrete-point test: measures the small bits and pieces of a language, as in a multiple-choice test made up of questions constructed to measure students’ knowledge of different structures
- Integrative test: measures several skills at one time, as in dictation
- Fairness issue: a test treats every student the same
- The cost issue
- Ease of test construction
- Ease of test administration
- Ease of test scoring
- Interactions of theoretical issues
General Guidelines for Item Formats
- The item format is correctly matched to the purpose and content of the item
- There is only one correct answer
- The item is written at the students’ level of proficiency
- Ambiguous terms and statements are avoided
- Negatives and double negatives are avoided
- The item gives no clues that could be used in answering other items
- All parts of the item are on the same page
- Only relevant information is presented
- Bias of race, gender and nationality is avoided
- Another person has looked over the item
More than one correct answer
The apple is located on or around
A) a table     C) the table
B) an table    D) table
- Two correct answers (A and C); wordy (somewhere around); repeats the word "table" inefficiently
Do you see the chair and table? The apple is on _____ table.
a) a     c) the
b) an    d) (no article)
- Option d (no article) will be easily detected as a wrong option, so it is not a good distracter.
According to the passage, antidisestablishmentarianism diverges fundamentally from the conventional proceedings and traditions of the Church of England
- Contains vocabulary that is too difficult.
Why are statistical studies inaccessible to language teachers in Brazil according to the reading passage?
- The term "inaccessible" is ambiguous; it can be read in two ways:
- Inaccessible: language teachers get very little training in mathematics and/or such teachers are averse to numbers
- Inaccessible: the libraries may be far away
One theory that is not unassociated with Noam Chomsky is:
A. Transformational generative grammar
B. Case grammar
C. Non-universal phonology
D. Acoustic phonology
- Use one negative only
- Emphasize it by underline, upper case, or bold-face, for example: not, NEVER, inconsistent
Receptive response items
True-False
- The statement is worded carefully enough that it can be judged without ambiguity
- Absoluteness clues are avoided
Multiple Choice
- Unintentional clues are avoided
- The distracters are plausible
- Needless redundancy in the options is avoided
- The ordering of the options is carefully considered
- The correct answers are randomly assigned
Matching
- More options than premises
- Options shorter than premises to reduce reading
- Option and premise lists related to one central theme
- Items should be worded carefully enough that they can be judged without ambiguity
- Avoid absoluteness
This book is always crystal clear in all its explanations: T F
- Absoluteness clues allow the students to answer correctly without knowing the correct response.
- Absoluteness clues: all, always, absolutely, never, rarely, most often
Avoid unintentional clues
The fruit that Adam ate in the Bible was an ____
A. Pear      C. Apple
B. Banana    D. Papaya
- Unintentional clues: grammatical, phonological, morphological, etc. Here the article "an" points to the only option that begins with a vowel.
Are all distracters plausible?
Adam ate _______
A. an apple     C. an apricot
B. a banana     D. a tire
Avoid needless redundancy
The boy on his way to the store, walking down the street, when he stepped on a piece of cold wet ice and
A. fell flat on his face
B. fall flat on his face
C. felled flat on his face
D. falled flat on his face
More effective:
The boy stepped on a piece of ice and ______ flat on his face.
A. fell
B. fall
C. felled
D. falled
- Correct answers should be randomly assigned
- Distracters like “none of the above”, “A and B only”, and “all of the above” should be avoided
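One way to assign the position of the correct answer at random is to shuffle the options and record where the correct one lands. The sketch below is a minimal Python illustration; the function name and the seed value are hypothetical, and the options are borrowed from the ice example above.

```python
import random
import string


def shuffle_options(correct, distracters, rng):
    """Return the options in random order plus the letter of the correct answer."""
    options = [correct] + list(distracters)
    rng.shuffle(options)
    key = string.ascii_uppercase[options.index(correct)]
    return options, key


# A seeded generator makes the layout reproducible; the seed value is arbitrary.
rng = random.Random(2024)
options, key = shuffle_options("fell", ["fall", "felled", "falled"], rng)
for letter, option in zip(string.ascii_uppercase, options):
    print(f"{letter}. {option}")
print("Key:", key)
```

Reusing one generator across all items on a test spreads the correct answers over the option letters without the writer choosing positions by hand.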
Matching items present the students with two columns of information; the students must then find and identify matches between the two sets of information.
- The information in the left-hand column is called the matching-item premises
- The information in the right-hand column is called the options
- More options should be supplied than premises, so that the students cannot narrow down the choices as they progress through the test simply by keeping track of the options they have used.
- Options should be shorter than premises, because most students will read a premise and then search through the options.
- The options and premises should relate to one central theme that is obvious to students.
Fill in Items
The required response should be concise
- Bad item: John walked down the street ________ (slowly, quickly, angrily, carefully, etc.)
- Good item: John stepped onto the ice and immediately ____ down hard (fell)
- There should be sufficient context to convey the intent of the question to the students
- The blanks should be standard in length
- The main body of the question should precede the blank
- Develop a list of acceptable responses
Short-response items are items that the students can answer in a few phrases or sentences.
- The item should be formatted so that only one relatively concise answer is possible
- The item should be framed as a clear and direct question
- E.g. According to the reading passage, what are the three steps in doing research?
A task item is any of a group of fairly open-ended item types that require students to perform a task in the language that is being tested.
- The task should be clearly defined
- The task should be sufficiently narrow for the time available
- A scoring procedure should be worked out in advance in regard to the approach that will be used
- A scoring procedure should be worked out in advance in regard to the categories of language that will be rated
- The scoring procedure should be clearly defined in terms of what each score within each category means
- The scoring should be anonymous
Analytic Score for Rating Composition Tasks
Holistic Version of the Scale for Rating Composition Tasks
Personal Response Items
- The response allows the students to communicate in ways and about things that are interesting to them personally
- Personal responses include: self-assessment, conferences, portfolios
- Decide on a scoring type
- Decide what aspect of the students’ language performance they will be assessing
- Develop a written rating scale for the learners
- The rating scale should describe concrete language and behaviours in simple terms
- Plan the logistics of how the students will assess themselves
- The students should understand the self-scoring procedures
- Have another student/teacher do the same scoring
- Introduce and explain conferences to the students
- Give the students the sense that they are in control of the conference
- Focus the discussion on the students’ views concerning the learning process
- Work with the students on self-image issues
- Elicit performances on specific skills that need to be reviewed
- Schedule the conferences regularly
- Explain the portfolios to the students
- Decide who will take responsibility for what
- Select and collect meaningful work
- Have the students periodically reflect in writing on their portfolios
- Have other students, teachers, and outsiders periodically examine the portfolios