Another approach to agreement (useful when there are only two raters and the scale is continuous) is to calculate the differences between each pair of observations from the two raters. The mean of these differences is called the “bias,” and the reference interval (mean ± 1.96 × standard deviation) is called the limits of agreement. The limits of agreement make it possible to estimate the extent to which random variation can influence the ratings (Schmidt, A. M., and DeShon, R. P. (2003, April). “Problems in the use of interrater agreement assessment,” paper presented at the 18th Annual Conference of the Society for Industrial and Organizational Psychology, Orlando, FL). Inter-rater reliability (IRR) is the degree of agreement among raters or judges. If everyone agrees, IRR is 1 (or 100%); if everyone disagrees, IRR is 0 (0%).
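A minimal sketch of the bias and limits-of-agreement calculation described above, using hypothetical continuous ratings from two raters (the numbers are illustrative, not real data):

```python
import statistics

# Hypothetical continuous ratings from two raters on the same ten subjects.
rater_a = [12.0, 15.5, 14.2, 10.8, 13.0, 16.1, 11.5, 14.8, 12.9, 15.0]
rater_b = [11.5, 16.0, 13.8, 11.2, 12.5, 15.7, 12.0, 14.1, 13.4, 15.6]

# Paired differences between the two raters' observations.
diffs = [a - b for a, b in zip(rater_a, rater_b)]

# The mean difference is the bias; mean +/- 1.96 * SD gives the limits of agreement.
bias = statistics.mean(diffs)
sd = statistics.stdev(diffs)
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd

print(f"bias = {bias:.3f}")
print(f"limits of agreement: [{lower:.3f}, {upper:.3f}]")
```

If most paired differences fall within these limits, the disagreement between the two raters is consistent with random variation rather than systematic divergence.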
There are several methods of calculating IRR, from the simple (e.g., percent agreement) to the more complex (e.g., Cohen's kappa). Which one you choose depends largely on the type of data you have and the number of raters in your design. The mathematical basis of Cohen's (1960) kappa makes the statistic suitable for only two coders, so kappa-like statistics for nominal data with three or more coders are usually formalized as extensions of Scott's (1955) pi statistic (e.g., Fleiss 1971) or are calculated as the arithmetic mean of the pairwise kappa or P(s) values (e.g., Light 1971; Davies and Fleiss 1982). Kappa can be interpreted as expressing the extent to which the observed agreement between raters exceeds what would be expected if all raters made their ratings purely by chance (Dunlap, W. P., Burke, M. J., and Smith-Crowe, K. (2003). Accurate tests of statistical significance for rWG and average deviation interrater agreement indexes. J. Appl. Psychol. 88, 356-362. doi: 10.1037/0021-9010.88.2.356). Several formulas can be used to calculate the limits of agreement. The simplest, given in the previous paragraph (mean ± 1.96 × standard deviation), works well for samples larger than 60 (Pasisz, D. J., and Hurtz, G. M. (2009). Testing for between-group differences in within-group interrater agreement. Organ. Res. Methods 12, 590-613. doi: 10.1177/1094428108319128). In contrast, intra-rater reliability is an assessment of the consistency of ratings given by the same person across several instances (Kozlowski, S. W. J., and Hattrup, K. (1992). A disagreement about within-group agreement: disentangling issues of consistency versus consensus. J. Appl. Psychol. 77, 161-167. doi: 10.1037/0021-9010.77.2.161). Inter-rater and intra-rater reliability are both aspects of test validation. Assessments of the latter are useful for refining the tools given to human judges, for example by determining whether a particular scale is suitable for measuring a given variable. If different raters disagree, either the scale is defective or the raters need to be retrained.
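For two coders with nominal data, the two measures mentioned above (percent agreement and Cohen's kappa) can be computed directly. A minimal sketch, using hypothetical codes:

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's (1960) kappa for two raters assigning nominal labels."""
    n = len(ratings1)
    # Observed agreement: proportion of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    c1, c2 = Counter(ratings1), Counter(ratings2)
    p_e = sum((c1[label] / n) * (c2[label] / n) for label in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical nominal codes from two raters.
r1 = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
r2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

percent_agreement = sum(a == b for a, b in zip(r1, r2)) / len(r1)
print(f"percent agreement = {percent_agreement:.2f}")
print(f"kappa = {cohens_kappa(r1, r2):.3f}")
```

Note how kappa is lower than raw percent agreement: the (p_o - p_e) / (1 - p_e) form discounts the agreement the two raters would reach by chance given their marginal label frequencies.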
The resulting Cohen's kappa estimate, averaged across pairs of coders, is 0.68 (pairwise kappa estimates = 0.62 [coders 1 and 2], 0.61 [coders 2 and 3], and 0.80 [coders 1 and 3]), indicating substantial agreement according to Landis and Koch (1977).
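A minimal sketch of this averaged-pairwise-kappa procedure (Light's 1971 approach) for three coders, with the Landis and Koch (1977) verbal benchmarks; the codes below are hypothetical, not the data behind the 0.68 estimate:

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(a, b):
    # Cohen's (1960) kappa for two raters assigning nominal codes.
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

def landis_koch(kappa):
    # Landis and Koch (1977) verbal benchmarks for kappa values.
    label = "poor"
    for lo, name in [(0.0, "slight"), (0.21, "fair"), (0.41, "moderate"),
                     (0.61, "substantial"), (0.81, "almost perfect")]:
        if kappa >= lo:
            label = name
    return label

# Hypothetical nominal codes from three coders.
coder1 = ["a", "b", "a", "c", "b", "a", "c", "b", "a", "a"]
coder2 = ["a", "b", "a", "c", "a", "a", "c", "b", "b", "a"]
coder3 = ["a", "b", "b", "c", "b", "a", "c", "b", "a", "a"]

# Light's approach: average kappa over all coder pairs.
kappas = [cohens_kappa(x, y) for x, y in combinations([coder1, coder2, coder3], 2)]
mean_kappa = sum(kappas) / len(kappas)
print(f"pairwise kappas = {[round(k, 3) for k in kappas]}")
print(f"mean kappa = {mean_kappa:.3f} ({landis_koch(mean_kappa)})")
```

Reporting the individual pairwise kappas alongside the mean, as the passage above does, also reveals whether one coder pair diverges markedly from the others.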