Individual Identification via Track-wise Aggregation: A Scalable Approach for Re-identifying Animals in the Wild
This repository contains the code implementation for the individual identification chapter of a PhD thesis, focusing on re-identifying lemurs in social groups using full-body crops and track-wise aggregation strategies.
The identification of individuals in a group of animals is a foundational task in ethology, with applications ranging from animal welfare monitoring to the analysis of social groups and individual behavior profiling, and ultimately to conservation efforts.
In the computer vision literature, this task is known as re-identification. Most benchmarks and methods are developed for people, but the requirements for animal re-identification differ considerably from the human setting. While person re-identification has been shown to rely mostly on body and clothing, the majority of primate datasets work directly on the animals' faces.
Our experiments with red-fronted lemurs have the following characteristics:
- Collar-based identification: All lemurs wear colored collars around their neck, which are used by field biologists for identification.
- Frame-level uncertainty: Lemurs are not identifiable on all frames—they may be too far away from the camera, occluded, or with their backs to the camera, making identification impossible.
- Track-level aggregation: Rather than requiring correct predictions on every frame, we combine the model with tracks generated by PriMAT. Confident predictions on a few key moments suffice, and the labels can be propagated through the whole track.
- Variable viewing conditions: The model must be robust to diverse lighting conditions and distances. Lemurs move freely and can come arbitrarily close to cameras.
- Full-body crops: We work on full-body crops of lemurs, excluding their long tails, rather than the face crops used in most primate identification methods.
The above figure shows the motivation for track-wise predictions via aggregation. The lemur Genovesa is not correctly identified on all frames; however, when taking the evidence of the whole track into consideration, the model reaches the correct conclusion.
The above figure displays examples showing the diversity of our individual identification data. Each row shows 10 equidistantly sampled frames from one track. The ground truth label and track length are annotated on each track.
- Systematic evaluation of temporal aggregation strategies: We demonstrate that entropy-based filtering, where frames with low predictive uncertainty are prioritized, outperforms other aggregation techniques for track-level re-identification.
- Scalable data curation framework: We present a novel framework for animal re-identification on videos, where full-track annotation is followed by automated filtering using a lightweight scoring model. Our best-performing model achieves 84.2% accuracy and 72.1% class-balanced accuracy on the validation set, and 81.7% accuracy and 74.1% balanced accuracy on the hold-out test set.
Our model is image-based, but we are interested in performance on videos. To produce exactly one identity label per track, the information from individual frame predictions must be aggregated. We compare several aggregation techniques in the table below; the choice substantially impacts performance.
| Aggregation Technique | Accuracy | Balanced Accuracy |
|---|---|---|
| Frame-wise baseline | 0.404 ± 0.021 | 0.358 ± 0.009 |
| Threshold filtering | 0.500 ± 0.034 | 0.417 ± 0.024 |
| Mean logits | 0.534 ± 0.040 | 0.442 ± 0.027 |
| Mean probabilities | 0.545 ± 0.040 | 0.451 ± 0.035 |
| Double threshold filtering | 0.595 ± 0.048 | 0.494 ± 0.046 |
| Top k | 0.591 ± 0.049 | 0.491 ± 0.040 |
| Exponential weighting | 0.601 ± 0.057 | 0.522 ± 0.084 |
| Entropy weighting | 0.601 ± 0.057 | 0.522 ± 0.084 |
| Max probability | 0.611 ± 0.025 | 0.535 ± 0.071 |
| Entropy filtering | 0.612 ± 0.042 | 0.532 ± 0.071 |
Simple approaches such as thresholding and counting achieve accuracies of 50–55%. Averaging logits or probability vectors yields similar results. In contrast, methods that select a well-chosen subset of frames reach 60–62% accuracy. The same holds for methods that weight frames based on entropy or maximum probability. All aggregation methods outperform a frame-wise prediction baseline (40.4% accuracy).
We use entropy filtering with a threshold of 0.3 as the aggregation technique for the remaining experiments.
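As an illustration, here is a minimal sketch of entropy filtering, assuming per-frame class probabilities as a NumPy array and entropy normalized by the log of the number of classes (so the 0.3 threshold is scale-independent). The function name and the fallback behavior are our assumptions; the repository's implementation in scripts/eval_closedset_id.py may differ in detail.

```python
import numpy as np

def aggregate_entropy_filtering(probs: np.ndarray, threshold: float = 0.3) -> int:
    """Return one identity for a track from per-frame class probabilities.

    probs: array of shape (n_frames, n_classes), rows summing to 1.
    """
    eps = 1e-12
    # Normalized Shannon entropy per frame, in [0, 1].
    entropy = -(probs * np.log(probs + eps)).sum(axis=1) / np.log(probs.shape[1])
    # Keep only frames with confident (low-entropy) predictions.
    confident = probs[entropy < threshold]
    if len(confident) == 0:
        # No frame passes the threshold: fall back to the lowest-entropy frame.
        confident = probs[[entropy.argmin()]]
    # Average the retained probability vectors, then take the most likely class.
    return int(confident.mean(axis=0).argmax())
```

Filtering before averaging means uncertain frames (too far away, occluded, back turned) cast no vote at all rather than being merely down-weighted, which matches the comparison above, where filtering edges out entropy weighting.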
We train the model with different datasets, measuring both performance and human annotation time. The manual annotation model serves as a baseline (61.2% accuracy). We use it to generate pseudo-labels from unseen videos, and the accuracy increases slightly to 64.0%. However, many pseudo-labels are incorrect due to imprecise tracking or wrong predicted identities. Revising those pseudo-labels takes 30 minutes and increases performance to 74.8%.
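A sketch of how such pseudo-labels could be propagated, reusing the aggregation function above; the function and variable names here are illustrative, not the repository's API.

```python
# Illustrative pseudo-labeling step: aggregate a track's frame predictions
# into one identity and propagate it to every frame of that track.
# `aggregate_entropy_filtering` is the sketch from above; `frame_ids` are
# arbitrary frame identifiers (hypothetical names).
def pseudo_label_track(frame_probs, frame_ids, threshold=0.3):
    label = aggregate_entropy_filtering(frame_probs, threshold)
    # Every frame inherits the track-level identity, including frames where
    # the per-frame prediction alone would have been wrong or uncertain.
    return [(frame_id, label) for frame_id in frame_ids]
```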
Annotating identities on complete tracks makes it possible to generate much larger amounts of data. Filtering the most promising tracks and keeping at most 10 frames per track to avoid near duplicates yields 80.8% accuracy, while keeping 100 frames per track produces the strongest model at 84.2% accuracy, slightly higher than training on the full unfiltered dataset (see the sketch below the table).
| Dataset | Accuracy | Balanced Acc. | Annot. Time | #Labels |
|---|---|---|---|---|
| Manual annotations | 0.612 ± 0.042 | 0.532 ± 0.071 | 4.5h | 274 |
| + pseudo-labels | 0.640 ± 0.027 | 0.525 ± 0.026 | +0h | +1,133 |
| + Revised pseudo-labels | 0.748 ± 0.016 | 0.608 ± 0.031 | +0.5h | +647 |
| Unfiltered | 0.833 ± 0.022 | 0.721 ± 0.014 | 9h | 260,057 |
| Score filter (10 per track) | 0.808 ± 0.026 | 0.661 ± 0.024 | 9.5h | 3,619 |
| Score filter (100 per track) | 0.842 ± 0.031 | 0.721 ± 0.035 | 9.5h | 35,314 |
The manual annotations take on average one minute per annotated individual (including video opening, frame selection, bounding box annotation and saving). The top-down approaches with the scoring model require approximately twice the annotation time, as all tracks across all videos need to be labeled. However, they yield substantially larger amounts of data.
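For illustration, a sketch of the per-track score filter described above, assuming one DataFrame row per frame and a precomputed "score" column from the lightweight scoring model; the column names and DataFrame layout are our assumptions.

```python
import pandas as pd

def score_filter(frames: pd.DataFrame, per_track: int = 100) -> pd.DataFrame:
    """Keep the `per_track` highest-scoring frames of each track."""
    return (
        frames.sort_values("score", ascending=False)
              .groupby("track_id", sort=False)
              .head(per_track)
    )

# e.g. the two filtered datasets from the table above:
# small = score_filter(frames, per_track=10)    # ~3.6k labels
# large = score_filter(frames, per_track=100)   # ~35k labels
```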
The confusion matrices above show the performance of the best performing model on validation videos (left) and test videos (right). The model achieves high accuracy across most individuals. Some individuals with limited training data show lower accuracy; notably, the individual Floreana appeared in only 3 tracks during training, resulting in increased confusion with other individuals in both validation and test sets. Final test accuracy: 81.7%, balanced accuracy: 74.1%.
Main training script: scripts/train_id.py
Train an identification model with specified hyperparameters and dataset configuration.
Data preparation: scripts/datasets.py
Handles dataset loading and preprocessing, including:
- PrecomputedCropDataset: Loads precomputed crops from the images/ directory with corresponding labels from CSV files
- Image augmentation and normalization
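For orientation, a minimal sketch of what a class like PrecomputedCropDataset might look like, assuming PyTorch and the label CSV format documented below; this illustrates the data flow, not the repository's actual implementation.

```python
from pathlib import Path

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class PrecomputedCropDataset(Dataset):
    def __init__(self, root: str, labels_csv: str, transform=None):
        self.root = Path(root)                 # directory containing images/
        self.labels = pd.read_csv(labels_csv)  # columns: filename, label, ...
        self.transform = transform             # augmentation and normalization

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int):
        row = self.labels.iloc[idx]
        image = Image.open(self.root / "images" / row["filename"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, int(row["label"])
```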
Inference script: scripts/eval_closedset_id.py
Run inference on individual crops or full tracks using trained models, with support for various aggregation techniques.
Evaluation tools:
- Track-wise aggregation and evaluation
- Confusion matrix visualization
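A minimal sketch of the confusion-matrix visualization using standard scikit-learn and matplotlib calls, assuming track-level ground-truth and predicted identities as parallel sequences; the repository's plotting code may differ.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion(y_true, y_pred, class_names, out_path="confusion.png"):
    # Row-normalized matrix: each row shows per-individual recall.
    disp = ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred,
        display_labels=class_names,   # must match the labels present
        normalize="true",
        xticks_rotation="vertical",
    )
    disp.figure_.tight_layout()
    disp.figure_.savefig(out_path, dpi=150)
```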
The code expects the following directory structure:
data/
├── images/ # Precomputed crop images
│ ├── A_e1_c1_1718_0.png
│ ├── A_e1_c1_1749_0.png
│ └── ...
└── *labels.csv # Label files (columns described below)
Label CSV format (filename and label are required for training; track_id is additionally needed for video validation; the remaining columns help with precise image retrieval):
filename,experiment,label,track_id,frame_num
A_e1_c1_1718_0.png,A_e1_c1,5,0,1718
A_e1_c1_1749_0.png,A_e1_c1,5,0,1749
A_e1_c1_1819_0.png,A_e1_c1,1,0,1819
...
sh experiments/train_closedset_id.sh
sh experiments/evaluate_closedset_id.sh


