ecker-lab/efficient-individual-identification

Individual Identification via Track-wise Aggregation: A Scalable Approach for Re-identifying Animals in the Wild

This repository contains the code implementation for the individual identification chapter of a PhD thesis, focusing on re-identifying lemurs in social groups using full-body crops and track-wise aggregation strategies.

Motivation and Problem Statement

The identification of individuals in a group of animals is a foundational task in ethology, with applications ranging from animal welfare monitoring to the analysis of social groups and individual behavior profiling, and ultimately to conservation efforts.

In the computer vision literature, this task is known as re-identification. Most benchmarks and methods are developed for people; however, the requirements for animal re-identification differ considerably from those for humans. While person re-identification has been shown to rely mostly on body shape and clothing, most primate re-identification datasets work directly on face crops.

Unique Challenges in Lemur Re-identification

Our setting, re-identifying redfronted lemurs in the wild, has the following characteristics:

  • Collar-based identification: All lemurs wear colored collars around their necks, which field biologists use for identification.
  • Frame-level uncertainty: Lemurs are not identifiable in every frame: they may be too far from the camera, occluded, or facing away from it, making identification impossible.
  • Track-level aggregation: Rather than requiring correct predictions on every frame, we combine the model with tracks generated by PriMAT. Confident predictions at a few key moments suffice, and the resulting label can be propagated through the whole track.
  • Variable viewing conditions: The model must be robust to diverse lighting conditions and distances, as lemurs move freely and can come arbitrarily close to the cameras.
  • Full-body crops: We work on full-body crops of lemurs, excluding their long tails, rather than the face crops used by most primate re-identification methods.

Task Overview

Example illustrating track-wise identification via aggregation

The above figure shows the motivation for track-wise predictions via aggregation. The lemur Genovesa is not correctly identified in every frame; however, when the evidence of the whole track is taken into consideration, the model reaches the correct conclusion.

Visual examples of the diversity of the individual identification data

The above figure displays examples showing the diversity of our individual identification data. Each row shows 10 equidistantly sampled frames from one track. The ground truth label and track length are annotated on each track.

Main Contributions

  1. Systematic evaluation of temporal aggregation strategies: We demonstrate that entropy-based filtering, where frames with low predictive uncertainty are prioritized, outperforms other aggregation techniques for track-level re-identification.

  2. Scalable data curation framework: We present a novel framework for animal re-identification on videos, where full-track annotation is followed by automated filtering using a lightweight scoring model. Our best-performing model achieves 84.2% accuracy and 72.1% class-balanced accuracy on the training set, and 81.7% accuracy and 74.1% balanced accuracy on the hold-out test set.

Results

Aggregation Techniques

Our model is image-based, but we are interested in performance on videos. To produce exactly one identity label per track, the information from individual frame predictions must be aggregated. We compare several aggregation techniques, and our results show that the choice of aggregation technique substantially impacts performance.
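As a concrete illustration of frame-to-track aggregation, here is a minimal sketch of the "mean probabilities" variant. The function name and array layout are our own choices for this example; the repository's implementation may differ.

```python
import numpy as np

def aggregate_mean_probs(frame_logits):
    """Produce one track label from per-frame logits of shape
    (n_frames, n_classes): softmax each frame, average the probability
    vectors over the track, then take the argmax."""
    logits = np.asarray(frame_logits, dtype=float)
    # numerically stabilised softmax per frame
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return int(probs.mean(axis=0).argmax())
```

Other techniques in the comparison below differ only in how the per-frame vectors are weighted or filtered before this final averaging step.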

| Aggregation Technique      | Accuracy      | Balanced Accuracy |
|----------------------------|---------------|-------------------|
| Frame-wise baseline        | 0.404 ± 0.021 | 0.358 ± 0.009     |
| Threshold filtering        | 0.500 ± 0.034 | 0.417 ± 0.024     |
| Mean logits                | 0.534 ± 0.040 | 0.442 ± 0.027     |
| Mean probabilities         | 0.545 ± 0.040 | 0.451 ± 0.035     |
| Double threshold filtering | 0.595 ± 0.048 | 0.494 ± 0.046     |
| Top k                      | 0.591 ± 0.049 | 0.491 ± 0.040     |
| Exponential weighting      | 0.601 ± 0.057 | 0.522 ± 0.084     |
| Entropy weighting          | 0.601 ± 0.057 | 0.522 ± 0.084     |
| Max probability            | 0.611 ± 0.025 | 0.535 ± 0.071     |
| Entropy filtering          | 0.612 ± 0.042 | 0.532 ± 0.071     |

Simple approaches such as thresholding and counting achieve accuracies of 50–55%. Averaging logits or probability vectors yields similar results. In contrast, methods that select a well-chosen subset of frames reach 60–62% accuracy. The same holds for methods that weight frames based on entropy or maximum probability. All aggregation methods outperform a frame-wise prediction baseline (40.4% accuracy).

We use entropy filtering with a threshold of 0.3 as the aggregation technique for the remaining experiments.
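A minimal sketch of entropy filtering under stated assumptions: entropy is computed in nats over each frame's probability vector, frames below the threshold are averaged, and we fall back to all frames if none pass. The exact normalization and fallback behavior in the repository may differ.

```python
import numpy as np

def entropy_filter_aggregate(frame_probs, threshold=0.3):
    """Keep only frames whose predictive entropy (in nats) is below
    `threshold`, then average the surviving probability vectors and
    take the argmax. Falls back to all frames if none are confident."""
    probs = np.asarray(frame_probs, dtype=float)
    ent = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=1)
    keep = ent < threshold
    if not keep.any():
        keep[:] = True  # no confident frame: use every frame
    return int(probs[keep].mean(axis=0).argmax())
```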

Data Quality and Quantity

We train the model with different datasets, measuring both performance and human annotation time. The model trained on manual annotations serves as the baseline (61.2% accuracy). We use it to generate pseudo-labels for unseen videos, which increases accuracy slightly to 64.0%. However, many pseudo-labels are incorrect due to imprecise tracking or wrongly predicted identities. Revising these pseudo-labels takes 30 minutes and increases performance to 74.8%.

Annotating identities on complete tracks helps generate larger amounts of data. Filtering the most promising tracks and limiting to at most 10 frames per track to avoid near duplicates yields 80.8% accuracy, while allowing 100 frames per track provides the strongest model at 84.2% accuracy. This is even slightly higher than training on the full unfiltered dataset.
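The per-track cap can be sketched as below. Random sampling here is only a stand-in for the actual selection, which uses the lightweight scoring model to pick the most promising frames; the function name is ours.

```python
import numpy as np

def cap_frames_per_track(track_ids, max_per_track=10, seed=0):
    """Return the indices of frames to keep, at most `max_per_track`
    per track, to avoid near-duplicate crops from consecutive frames."""
    rng = np.random.default_rng(seed)
    track_ids = np.asarray(track_ids)
    keep = []
    for t in np.unique(track_ids):
        idx = np.flatnonzero(track_ids == t)
        if len(idx) > max_per_track:
            idx = rng.choice(idx, size=max_per_track, replace=False)
        keep.extend(idx.tolist())
    return sorted(keep)
```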

| Dataset                      | Accuracy      | Balanced Acc. | Annot. Time | #Labels |
|------------------------------|---------------|---------------|-------------|---------|
| Manual annotations           | 0.612 ± 0.042 | 0.532 ± 0.071 | 4.5h        | 274     |
| + pseudo-labels              | 0.640 ± 0.027 | 0.525 ± 0.026 | +0h         | +1,133  |
| + revised pseudo-labels      | 0.748 ± 0.016 | 0.608 ± 0.031 | +0.5h       | +647    |
| Unfiltered                   | 0.833 ± 0.022 | 0.721 ± 0.014 | 9h          | 260,057 |
| Score filter (10 per track)  | 0.808 ± 0.026 | 0.661 ± 0.024 | 9.5h        | 3,619   |
| Score filter (100 per track) | 0.842 ± 0.031 | 0.721 ± 0.035 | 9.5h        | 35,314  |

The manual annotations take on average one minute per annotated individual (including video opening, frame selection, bounding box annotation and saving). The top-down approaches with the scoring model require approximately twice the annotation time, as all tracks across all videos need to be labeled. However, they yield substantially larger amounts of data.

Confusion Matrices

Confusion matrices for the best-performing model

The confusion matrices above show the performance of the best-performing model on validation videos (left) and test videos (right). The model achieves high accuracy across most individuals. Some individuals with limited training data show lower accuracy; notably, the individual Floreana appeared in only 3 tracks during training, resulting in increased confusion with other individuals in both validation and test sets. Final test accuracy: 81.7%, balanced accuracy: 74.1%.
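Balanced accuracy is conventionally the mean of per-class recalls, which can be read off the confusion matrix directly. A small sketch (the helper name is ours):

```python
import numpy as np

def confusion_and_balanced_acc(y_true, y_pred, n_classes):
    """Build a confusion matrix (rows: true class, columns: predicted
    class) and compute balanced accuracy as the mean per-class recall
    over classes that actually occur."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    support = cm.sum(axis=1)
    recalls = np.divide(cm.diagonal(), support,
                        out=np.zeros(n_classes), where=support > 0)
    balanced_acc = recalls[support > 0].mean()
    return cm, balanced_acc
```

This is why rare individuals such as Floreana pull balanced accuracy below plain accuracy: every class contributes equally to the mean, regardless of how many tracks it has.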

Code Structure

Training

Main training script: scripts/train_id.py

Train an identification model with specified hyperparameters and dataset configuration.

Data preparation: scripts/datasets.py

Handles dataset loading and preprocessing, including:

  • PrecomputedCropDataset: Loads precomputed crops from the images/ directory with corresponding labels from CSV files
  • Image augmentation and normalization
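A minimal sketch of what such a crop dataset might look like. Column names are taken from the CSV example in the Dataset Format section; the actual scripts/datasets.py may structure this differently.

```python
import csv
import os

class PrecomputedCropDataset:
    """Reads (filename, label) pairs from a labels CSV and loads the
    corresponding crop images from `images_dir`. Implements
    __len__/__getitem__, so it can be used with a torch DataLoader."""

    def __init__(self, images_dir, labels_csv, transform=None):
        self.images_dir = images_dir
        self.transform = transform
        with open(labels_csv, newline="") as f:
            self.samples = [(row["filename"], int(row["label"]))
                            for row in csv.DictReader(f)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        from PIL import Image  # imported lazily so the class loads without Pillow
        fname, label = self.samples[i]
        img = Image.open(os.path.join(self.images_dir, fname)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, label
```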

Inference and Evaluation

Inference script: scripts/eval_closedset_id.py

Run inference on individual crops or full tracks using trained models, with support for various aggregation techniques.

Evaluation tools:

  • Track-wise aggregation and evaluation
  • Confusion matrix visualization

Dataset Format

The code expects the following directory structure:

```
data/
├── images/              # Precomputed crop images
│   ├── A_e1_c1_1718_0.png
│   ├── A_e1_c1_1749_0.png
│   └── ...
└── *labels.csv          # Label files (column format described below)
```

Label CSV format (`filename` and `label` are required for training; `track_id` is additionally needed for video validation; the remaining columns help with precise image retrieval):

```
filename,experiment,label,track_id,frame_num
A_e1_c1_1718_0.png,A_e1_c1,5,0,1718
A_e1_c1_1749_0.png,A_e1_c1,5,0,1749
A_e1_c1_1819_0.png,A_e1_c1,1,0,1819
...
```
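For track-wise evaluation, rows can be grouped by track before aggregating frame predictions. A hypothetical helper (function name and grouping key are our assumptions; grouping by experiment as well as track_id guards against track IDs restarting per video):

```python
import csv
from collections import defaultdict

def tracks_from_labels(labels_csv):
    """Group label rows by (experiment, track_id) and sort each track's
    frames chronologically, so per-frame predictions can be aggregated
    into one label per track."""
    tracks = defaultdict(list)
    with open(labels_csv, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["experiment"], row["track_id"])
            tracks[key].append(
                (int(row["frame_num"]), row["filename"], int(row["label"])))
    for frames in tracks.values():
        frames.sort()  # chronological order within the track
    return dict(tracks)
```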

Usage

Train an identification model

```sh
sh experiments/train_closedset_id.sh
```

Run inference on a track

```sh
sh experiments/evaluate_closedset_id.sh
```
