CHARACTERISTICS OF MODIFIED MULTIPLE-CHOICE INSTRUMENT TO MEASURE HIGH ORDER THINKING SKILLS FOR ECOSYSTEM SUBJECT

The demands of the 21st century require students to possess higher-order thinking skills. One alternative assessment that can be used to practice higher-order thinking skills is the modified multiple-choice instrument. Modified multiple-choice items consist of two levels: the first level resembles a traditional multiple-choice item, while the second level asks students to give the reasons for their first-level answer, aiming to encourage higher-order thinking skills. This study aims to describe the characteristics of a modified multiple-choice instrument for the Ecosystem subject. The development of the assessment instrument followed the ADDIE model, which consists of five stages: analysis, design, development, implementation, and evaluation. The developed items were validated by education and material experts. The items were then tested on 32 10th-grade students at SMAN 19 Surabaya. The data were analyzed descriptively, covering the validity results and the test results used to determine the reliability value and level of difficulty. The results show that the modified multiple-choice assessment instrument was declared valid with a mode of 4, which falls in the “highly valid” category. The reliability value calculated with Cronbach’s Alpha formula was 0.63. The difficulty index was distributed as 20% easy, 53% medium, and 27% difficult items. The modified multiple-choice questions for the Ecosystem subject are therefore declared valid and reliable for measuring students’ abilities.


INTRODUCTION
Education has a critical role in creating a flexible, creative, and proactive young generation to face the nation-building challenges of Indonesia in the 21st century. The learning process should reflect the four learning objectives (4C), i.e., critical thinking, creativity, communication, and collaboration (Susilo, 2015). The young generation needs to be skilful in resolving problems, making the right decisions, thinking creatively, and expressing ideas effectively (Warsono and Hariyanto, 2012).
The demands of the 21st century require students to possess higher-order thinking skills, and these skills need to be taught. Higher-order thinking skills include the ability to analyze, evaluate, and create. This definition is in accordance with the revised version of Bloom's Taxonomy, which uses the terms analysis, evaluation, and creation (Anderson & Krathwohl, 2001).
Higher-order thinking skills are important in 21st-century education, particularly in Indonesia, because they accustom Indonesian people to thinking critically and to having a strong work ethic so that they can compete positively at the international level (Poedjiadi, 2010). The importance of mastering higher-order thinking skills is also reflected in the Graduate Competency Standards (SKL) for High School. The Regulation of the Minister of Education and Culture Number 20 of 2016 concerning Graduate Competency Standards for elementary and secondary education states, among other things, that Senior High School (SMA/MA) graduates must have creative, productive, critical, independent, collaborative, and communicative thinking and acting skills acquired through a scientific approach. To implement the assigned SKL, assessment instruments should be aligned with higher-order thinking skills so that they motivate students to develop their thinking ability and to keep up with developments in knowledge and technology.

Berkala Ilmiah Pendidikan Biologi
In general, however, the higher-order thinking skills of students in Indonesia still rank far below those of students in other countries. The PISA and TIMSS survey results indicate that Indonesian students only reach the second of the six cognitive levels on the tested questions. This finding indicates that the logical and rational thinking skills of Indonesian students are still low, so Indonesia is ranked low among the participating countries (Sani, 2016). The 2018 PISA study showed that Indonesia ranked 70th out of 78 participating countries (OECD, 2018). In addition, based on the 2015 TIMSS survey, Indonesia ranked 44th out of 49 countries with an average score of 397 for science (IPA) achievement (IEA, 2015).
To improve its quality, the National Exam (UN) in Indonesia started to include HOTS-based questions in 2018, although these made up only around 10-15% of the total questions. According to Susetyo (2019), the introduction of HOTS questions affected the UN results: the students' average score decreased. The average National Exam score was 53.47 in 2017 and 51.76 in 2018. The average score in 2019 increased to 53.00; however, it was still lower than the 2017 average (Puspendik, 2018).
The facts mentioned above indicate that students' thinking skills in Indonesia remain at a low level. According to Suprayitno (2019), the Head of the Research and Development Agency of the Ministry of Education and Culture, the National Examination results are expected to serve as feedback for improving the quality of classroom learning, for example through the evaluation of teaching and learning activities and the development of assessment instruments that can train students to attain higher-order thinking skills.
Based on observations and interviews conducted in schools, the assessment instrument most frequently used by teachers was a written test with multiple-choice items. Multiple-choice questions are used because they ease the scoring process and yield more objective scores. In this type of question, there are only two possible outcomes: a correct answer earns a score of 1, and an incorrect answer earns a score of 0. However, there is a considerable chance that students choose the correct answer by guessing. Hence, multiple-choice questions are considered less effective for measuring higher-order thinking skills (Purwanto, 2010).
One assessment alternative that can be used to practice higher-order thinking skills is the two-tier multiple-choice question, a format developed by Treagust (2006). According to Adodo (2013), a two-tier multiple-choice question is more sophisticated than an ordinary multiple-choice question. The first level resembles a traditional multiple-choice question and is commonly associated with knowledge statements. The second level also takes the form of a conventional multiple-choice question, but it aims to encourage higher-order thinking skills.
According to Chandrasegaran (2007), a multiple-choice instrument only assesses content knowledge without considering the reason behind the selected answer. To improve on this, a multiple-choice instrument was developed that includes students' responses and alternative conceptions. Students are required to justify their answers by providing reasons. Asking for reasons when answering Modified Multiple-Choice items is a sensitive and effective method for assessing meaningful learning. Cullinane (2011) suggested that providing reasons at the second level of two-tier multiple-choice questions can improve higher-order thinking skills and reveal students' abilities in giving reasons. It can also reduce the habit of answering by chance, which is a common weakness of multiple-choice questions. Compared with other tests of higher-order thinking skills, such as essays, the two-tier multiple-choice question offers objective, easy, and quick scoring.
Teachers can identify students' skills well when they use an appropriate measurement instrument. According to Arikunto (2008), a test in the "good" category must fulfil five criteria: validity, reliability, objectivity, practicality, and economy. Of these five criteria, an assessment instrument can be categorized as "good" when it is, at least, valid and reliable. Validity is the accuracy of the assessment instrument with respect to what is being assessed, so that the instrument assesses precisely what it should assess (Sudjana, 2010). Validity can be interpreted as the extent to which a measuring instrument is accurate and precise in carrying out its measurement function. A valid instrument also produces valid data (Widoyoko, 2009).
In addition to validity, reliability is another factor to determine that an instrument is in the "good" category.

Reliability is related to the consistency of test results (Arikunto, 2015). A reliable assessment instrument generates relatively equal or consistent results, even when used repeatedly. The difficulty level of the questions in the developed assessment instrument also needs to be identified in order to determine whether the questions are too easy or too difficult for students. In general, a good question has a medium level of difficulty. According to Widoyoko (2014), a test should comprise 25% difficult questions, 50% medium questions, and 25% easy questions; in other words, the ratio of easy : medium : difficult questions is 1:2:1. Therefore, if an assessment instrument can be answered correctly by all students, it is considered an easy test with improper questions, and vice versa (Bagiyono, 2017).
Based on this explanation, this study aims to describe a valid and reliable modified multiple-choice assessment instrument for measuring higher-order thinking skills in the Ecosystem subject for 10th-grade Senior High School students, including the difficulty levels of its questions.

METHOD
This research was development research, i.e., developing modified multiple-choice questions to measure students' higher-order thinking skills. The development of the assessment instrument followed the ADDIE model, which consists of five stages: analysis, design, development, implementation, and evaluation. The research was conducted from November 2019 to March 2020. The research object was 15 modified multiple-choice questions that had been validated by experts and declared valid. The instrument was trialed on 32 10th-grade students of SMA Negeri 19 Surabaya, who were considered to represent the population, in order to measure the reliability of the modified multiple-choice assessment instrument.
The data were collected through validation and testing. The validity of the assessment instrument was determined from the validation carried out by education and material experts. The test method consisted of administering the modified multiple-choice items to students; the test results were used to determine the reliability and difficulty level of the developed items.
The data were analyzed descriptively. Validity was analyzed from the expert ratings given on a Likert scale; the assessment instrument was categorized as valid if the average score obtained was > 3. Reliability was determined from the test results using Cronbach's Alpha formula, and the resulting value was interpreted using the test reliability level criteria. The instrument was categorized as reliable if the value obtained was > 0.60. The difficulty level of each item was obtained by dividing the number of students who answered correctly by the number of students taking the test, then multiplying by 100%. The result was interpreted using the item difficulty level criteria of Arikunto (2015).
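As an illustrative sketch of the reliability analysis described above, the following Python snippet computes Cronbach's Alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), from a students-by-items score matrix and applies the > 0.60 criterion. The function names, the 0/1 scoring of each item, and the toy data are assumptions made for illustration only; they are not the study's data or its actual scoring scheme.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's Alpha for a score matrix (one row per student, one column per item)."""
    k = len(scores[0])  # number of items
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two students")
    # Sample variance of each item across students.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the students' total scores.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def is_reliable(alpha, threshold=0.60):
    """Apply the criterion stated in the text: reliable if alpha > 0.60."""
    return alpha > threshold


# Hypothetical toy data: 4 students x 3 items, scored 1 (correct) or 0 (incorrect).
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 2), is_reliable(alpha))
```

The sketch uses the sample variance for both the item variances and the total-score variance. Because only their ratio enters the formula, using the population variance for both would give the same alpha.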
The specifications of the modified multiple-choice questions to measure higher-order thinking skills for the Ecosystem subject are presented in

RESULTS AND DISCUSSION
The questions developed are of the Modified Multiple-choice type, which consists of two levels. The first level takes the form of a stimulus, and each question is accompanied by answer choices. The second level consists of the reasons underlying the answer chosen at the first level. The stimuli contain data from research results, images, graphics, and cases that are often encountered in daily life, thus accustoming students to developing critical and creative thinking skills.
The preparation of the Modified Multiple-choice assessment instrument began by elaborating the basic competencies into indicators. Questions were then developed by exploring various sources for stimuli that match the indicators. Subsequently, the questions were written along with answer choices that fit the context, and students were asked to explain briefly the reasons for choosing their answer. The questions that are successfully developed in the
Sample question stimulus: Look carefully at the following research results, then choose the right answer! A student majoring in Biology conducted a study on "The Antagonism Ability of Pseudomonas sp. and Penicillium sp. against Cercospora nicotianae in Vitro". Based on the results of the antagonism test and zone of inhibition measurement, the results obtained are as follows:
The stimuli used in these questions were research results from Biology-majoring college students. The developed Modified Multiple-choice assessment instrument required the students to think further, since they not only had to choose one correct answer but also, in the reason section, had to provide reasons for the answer chosen at the first level. This is in accordance with Cullinane (2011), who suggests that including reasons at the second level of the two-tier multiple-choice format can be used to improve thinking skills and to identify students' capability in providing reasons.
Items that have been compiled cannot immediately be declared good; therefore, a review of the items is needed (Rahmani et al., 2015). A good question should be valid and reliable. Validity is the accuracy of the assessment instrument with respect to what is assessed, so that it actually assesses what should be assessed (Sudjana, 2010). Validity was obtained from the assessments given by both education and material experts using pre-designed validation instruments. Validity implies that the evaluation results must be in line or consistent with what has been evaluated (Agustini, 2016). A test is considered invalid when it fails to provide accurate information regarding the attributes it measures (Azwar, 2016).
The validity of the developed modified multiple-choice assessment instrument was obtained from the results of validation by the education and material experts. The validation focused on three aspects, i.e., the material aspect, the construction aspect, and the language aspect. The validation resulted in an overall mode of 4. This mode indicates that the instrument is valid and, on the Likert scale used, falls in the "highly valid" category. A recapitulation of the validation results is presented in Table 2.
The first aspect was the material aspect, consisting of four criteria validated by material experts and education experts, i.e., 1) the suitability of the questions with the developed indicators, 2) the relevance of the questions to the truth of the concept, 3) the suitability of the questions with daily life, and 4) the existence of only one correct answer in each question. This aspect obtained a mode in the "highly valid" category. The material aspect is related to science and students' cognitive level (Retnawati, 2016).
The second aspect was the construction aspect, consisting of 8 categories, i.e., 1) easy-to-understand instructions, 2) two-level questions, 3) the stimuli used in questions attracted students to read, 4) the stimuli used were contextual in the form of images/graphics/text/etc., 5) questions measured the cognitive levels of C4 (analyze), C5 (evaluate), and C6 (create), 6) questions did not lead to multiple interpretations, 7) questions did not depend on previous questions, and 8) answer choices did not use the statement "all answers are true/false" and the like. The construction aspect is related to the technique of writing questions (Mardapi, 2017). This aspect obtained a mode in the "highly valid" category. However, for point no. 3 (questions employ stimuli that attract the learners to read), several questions were less appropriate, i.e., questions number 1, 11, and 13.
The third aspect was the language aspect, consisting of 5 categories, i.e., 1) using proper Indonesian, 2) using communicative and easy-to-understand language, 3) not containing words/expressions that lead to double interpretation, 4) not containing ambiguous language, and 5) answer choices that do not repeat a word/group of words unless it forms a unity of meaning. This aspect obtained a mode in the "highly valid" category. However, on the point of using proper Indonesian, several questions were not entirely appropriate, i.e., questions number 2, 4, 5, 6, and 10. The language aspect is associated with the clarity of every element supporting the question preparation (Mardapi, 2017). An assessment instrument that is theoretically stated as valid has fulfilled all three aspects (Rachma and Ratnasari, 2015).
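The validity summary above rests on two simple statistics of the expert ratings: the mode reported in the results and the mean used in the "> 3" criterion of the Method section. A minimal sketch of how such a summary could be computed is given below. The function name, the assumed 1-4 rating scale, and the example ratings are hypothetical and are not taken from the study's validation sheets.

```python
from statistics import mean, multimode


def summarize_ratings(ratings, valid_above=3):
    """Summarize Likert ratings (assumed to range from 1 to 4) given by validators."""
    avg = mean(ratings)
    return {
        "modes": sorted(multimode(ratings)),  # all most-frequent ratings (ties kept)
        "mean": avg,
        "valid": avg > valid_above,  # criterion from the Method section: mean > 3
    }


# Hypothetical ratings for one aspect (not the study's data).
example = summarize_ratings([4, 4, 3, 4, 4, 3, 4, 4])
print(example)
```

Returning every mode rather than a single value keeps ties visible, which matters when only a few validators rate each criterion.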
After being validated, the 15 modified questions were given a limited trial, and the results were analyzed for reliability and difficulty level. Reliability is related to the stability or consistency of the test results (Arikunto, 2015). The data obtained from the test results were used to determine the reliability value using Cronbach's Alpha formula. The reliability value was then interpreted using the test reliability level criteria. The calculation yielded a reliability value of 0.63, which falls in the high category. Thus, the resulting multiple-choice instrument can be considered valid and reliable. A measuring instrument is declared to have a high coefficient of reliability if it provides an equal or almost equal value when it is used to assess the same object at different times.
The reliability of the questions is related to how consistently they measure students' learning abilities (Masruroh, 2012). Several factors affect the reliability value, both directly and indirectly. The direct factors include the test implementation time, the questions' difficulty levels, the questions' length, the scoring objectivity, and the spread of answers and scores, whereas the indirect factors include the clarity of the implementation instructions, the environmental conditions, and the supervision (Retnawati, 2016).
The item difficulty index can be used as one of the parameters for analyzing a test since it helps determine students' ability (Retnawati, 2016). The difficulty level calculation aims to discover whether the questions are too difficult or too easy for students. The difficulty level of a question is the number of students who answered it correctly divided by the number of students who took the test, multiplied by 100% (Maenani and Oktova, 2015). The following table and diagram present the recapitulation of the difficulty levels of the modified questions.
The difficulty level obtained for each item was interpreted using the difficulty level criteria developed by Arikunto (2015). The percentages of difficulty levels in the developed instrument were 20% easy, 53% medium, and 27% difficult. According to Widoyoko (2014), a test given to students should balance easy, medium, and difficult questions in a ratio of 25%, 50%, and 25%. Sample questions categorized as difficult are presented in Figure 3. The questions listed in Figure 3 were answered correctly by only 8 of the 32 students taking the test. Thus, the questions obtained a difficulty level of 0.25, which is interpreted as difficult.
Because the difficulty index is the proportion of students who answered correctly, a high index places a question in the "easy" category. In contrast, an index that is low and close to zero places the question in the "difficult" category.
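The difficulty calculation can be illustrated with a short Python sketch. It reproduces the worked case from the text (8 correct answers out of 32 students gives 0.25) and shows how category percentages could be tallied. The cut-offs of 0.30 and 0.70 are an assumption based on criteria commonly attributed to Arikunto, since the text does not state them. The item counts of 3 easy, 8 medium, and 4 difficult are inferred from the reported percentages and the 15 items, not reported directly. All function names are hypothetical.

```python
from collections import Counter


def difficulty_index(n_correct, n_students):
    """Proportion of students who answered an item correctly."""
    if n_students <= 0:
        raise ValueError("n_students must be positive")
    return n_correct / n_students


def classify_difficulty(p):
    """Map a difficulty index to a category.

    Assumed cut-offs (commonly attributed to Arikunto); not stated in the text.
    """
    if p <= 0.30:
        return "difficult"
    if p <= 0.70:
        return "medium"
    return "easy"


def category_percentages(categories):
    """Percentage of items in each category, rounded to whole percent."""
    counts = Counter(categories)
    total = len(categories)
    return {name: round(100 * n / total) for name, n in counts.items()}


# Worked case from the text: 8 of 32 students answered correctly.
p = difficulty_index(8, 32)  # 0.25
print(p, classify_difficulty(p))

# Item counts inferred from the reported 20% / 53% / 27% split over 15 items.
inferred = ["easy"] * 3 + ["medium"] * 8 + ["difficult"] * 4
print(category_percentages(inferred))
```

Under these assumptions, the inferred counts reproduce the reported 20%, 53%, and 27%, which is close to the 25%, 50%, and 25% balance recommended by Widoyoko (2014).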

Sample question:
1. Look at the following picture, then choose the right answer! Biophysically, the sea area can be divided according to vertical and horizontal dimensions, physical factors, and the distribution of biota communities. Each zone has unique physical, chemical, and biological characteristics. In addition, ocean zoning can be divided into the surface zone (pelagic zone) and the bottom zone (benthic zone). Horizontally, the pelagic zone is divided into two zones, i.e., the neritic zone and the oceanic zone. Vertically, it is divided into the photic zone and the aphotic zone. The photic zone is also called the epipelagic zone, while the aphotic zone is divided into four zones, i.e., the mesopelagic, bathypelagic, abyssal, and hadal pelagic zones.
Based on these zones, in which part of the zone are the most abundant producer-level components found?

CONCLUSION
Based on the results of developing the modified multiple-choice assessment instrument to measure students' higher-order thinking skills in the Ecosystem subject, it can be concluded that the items of the instrument are valid and reliable. Validity was obtained from the validation by material experts and education experts, who considered the material, construction, and language aspects; the resulting mode of 4 falls in the "highly valid" category. The reliability value, obtained by analyzing the student test results using Cronbach's Alpha formula, was 0.63. The difficulty index obtained was heterogeneous, i.e., 20% (easy), 53% (medium), and 27% (difficult).