Thank you for making the code and paper publicly available.
I have recently been reproducing the anomaly detection results from the paper. I can run the code successfully on the EWJ and KR datasets, but the Aff-F I obtain is still lower than the results reported in the paper, so I would like to ask two questions.
-
About the evaluation protocol for Aff-F
I noticed that in the code, Aff-F is computed under multiple anomaly ratios and then averaged to produce the final reported result.
However, in my experiments, when the anomaly ratio is set to relatively low values such as 1, 2, or 3, the corresponding Aff-F is usually quite low.
So I would like to ask: if the performance under these low anomaly ratios is poor, why are these cases still included in the final average?
What was the main motivation for designing the final evaluation this way?
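For concreteness, here is a minimal sketch of how I currently understand the averaging protocol. This is my own reconstruction, not code from your repository: I use a simple point-wise F1 as a placeholder for the actual affiliation-based Aff-F, and I assume the ratios are percentages used to pick a score-percentile threshold. Please correct me if either assumption is wrong:

```python
import numpy as np

def f1_score(pred, gt):
    # Simple point-wise F1, used here only as a stand-in for the
    # affiliation-based Aff-F computed in the actual code.
    tp = np.sum(pred & gt)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def averaged_aff_f(scores, labels, ratios=(1, 2, 3, 4, 5)):
    # For each assumed anomaly ratio r (in percent), threshold the anomaly
    # scores at the (100 - r)-th percentile, evaluate, and average the
    # per-ratio results into the final reported metric.
    f1s = []
    for r in ratios:
        thr = np.percentile(scores, 100 - r)
        pred = scores > thr
        f1s.append(f1_score(pred, labels.astype(bool)))
    return float(np.mean(f1s))
```

Under this reading, a low ratio like 1 forces a very high threshold, so few points are predicted anomalous and the per-ratio score drops, which is exactly the behavior I observe; that is why I am asking whether including those ratios in the average is intentional.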
-
About reproduction details on EWJ and KR
I would also like to ask whether there are any settings that are particularly important for reproduction on the EWJ and KR datasets but not described in detail in the paper, such as the number of training epochs, threshold-related settings, auxiliary loss weights, or other implementation details.
To make the question clearer, I have attached some of my current experimental results and figures.
For example, my reproduced final Aff-F on EWJ is 0.7034, V-ROC is 0.8018, and V-PR is 0.4495; the attached figure shows the reproduced Aff-F under different anomaly ratios on the EWJ dataset.
If you have time, I would greatly appreciate any clarification or suggestions. Thank you very much for your help.
