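Hello.
I have a question about the Score function. The Oracle documentation below says
that Score is calculated according to Salton's formula and walks through an example,
but I don't understand exactly what "document set" means.
Is it the condition value passed to Contains?
Or is it the field values stored in the table?
Or does it mean something else?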
To calculate a relevance score for a returned document in a word query, Oracle Text uses an inverse frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
The following table illustrates Oracle Text's inverse frequency scoring. The first column shows the number of documents in the document set, and the second column shows the number of occurrences of the term in the document necessary to score 100.
This table assumes that only one document in the set contains the query term.
Number of Documents in Document Set | Occurrences of Term in Document Needed to Score 100 |
---|---|
1 | 34 |
5 | 20 |
10 | 17 |
50 | 13 |
100 | 12 |
500 | 10 |
1,000 | 9 |
10,000 | 7 |
100,000 | 6 |
1,000,000 | 5 |
Note that the score varies, depending on the set size. For example, if only one document in the set contains the query term, and there are five documents in the set, then the term must occur 20 times in the document to score 100. If 1,000,000 documents are in the set, then the term needs to occur only 5 times in the document to score 100.
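The table can be reproduced with a short sketch. It assumes the formula usually cited for Oracle Text's default scoring, `3f(1 + log10(N/n))`, where `f` is the number of occurrences of the term in the document, `N` is the number of documents in the set, and `n` is the number of documents that contain the term. Neither that formula nor the reading of "document set" as all rows of the indexed table is stated in the passage quoted here, and the function names are hypothetical.

```python
import math


def salton_score(f, N, n):
    """Approximate relevance score, capped at 100.

    f: occurrences of the query term in the document
    N: number of documents in the document set
    n: number of documents that contain the term
    Assumed formula: 3 * f * (1 + log10(N / n)); not taken from the quoted text.
    """
    return min(100.0, 3 * f * (1 + math.log10(N / n)))


def occurrences_needed(N, n=1, target=100):
    """Smallest f whose uncapped score reaches the target."""
    per_occurrence = 3 * (1 + math.log10(N / n))
    f = 1
    while f * per_occurrence < target:
        f += 1
    return f


# Same set sizes as the table, with exactly one document containing the term.
for size in (1, 5, 10, 50, 100, 500, 1000, 10_000, 100_000, 1_000_000):
    print(size, occurrences_needed(size))
```

Under these assumptions the loop yields the same second column as the table, which suggests that the set size is a count of indexed documents rather than anything taken from the query string.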
You have 5000 documents dealing with chemistry in which the term chemical occurs at least once in every document. The term chemical thus occurs frequently in the document set.
You have a document that contains 5 occurrences of chemical and 5 occurrences of the term hydrogen. No other document contains the term hydrogen. The term hydrogen thus occurs infrequently in the document set.
Because chemical occurs so frequently in the document set, its score for the document is lower with respect to hydrogen, which is infrequent in the document set as a whole. The score for hydrogen is therefore higher than that of chemical. This is so even though both terms occur 5 times in the document.
Note:
Even if the relatively infrequent term hydrogen occurred 4 times in the document, and chemical occurred 5 times in the document, the score for hydrogen might still be higher, because chemical occurs so frequently in the document set (at least 5000 times).
Inverse frequency scoring also means that adding documents that contain hydrogen lowers the score for that term in the document, and adding more documents that do not contain hydrogen raises the score.
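The chemical/hydrogen example can be worked through numerically. This is a rough sketch that assumes the commonly cited formula `3f(1 + log10(N/n))` (`f` occurrences in the document, `N` documents in the set, `n` documents containing the term); the formula, the helper name `approx_score`, and the numbers of added documents are illustrative assumptions, and the scores Oracle actually returns may differ.

```python
import math


def approx_score(f, N, n):
    # Assumed formula 3 * f * (1 + log10(N / n)), capped at 100.
    return min(100.0, 3 * f * (1 + math.log10(N / n)))


N = 5000  # documents in the set

chemical = approx_score(f=5, N=N, n=5000)  # term appears in every document
hydrogen = approx_score(f=5, N=N, n=1)     # term appears in one document only
hydrogen_4 = approx_score(f=4, N=N, n=1)   # same, but only 4 occurrences

# Add 9 documents that contain hydrogen: the term becomes less rare.
more_with = approx_score(f=5, N=N + 9, n=10)
# Add 5000 documents that do not contain hydrogen: the term becomes rarer.
more_without = approx_score(f=5, N=N + 5000, n=1)

print(f"chemical={chemical:.1f} hydrogen={hydrogen:.1f} hydrogen_4={hydrogen_4:.1f}")
print(f"more_with={more_with:.1f} more_without={more_without:.1f}")
```

With these inputs hydrogen outscores chemical even at 4 occurrences, the score drops when documents containing hydrogen are added, and it rises when documents without it are added, matching the behavior the note describes.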