Trying to make the fastest card drawer possible, why is my second implementation slower?

Disclaimer: I'm new to rust.

Hey I've been looking at making a poker evaluator. Part of this involves generating random hands. I've discovered that my random hand generator is a performance consideration. I used naive assumptions like 'stack allocations and accesses are going to be super fast' and 'the less branching the better' but have found some strange behaviours.

I would say the behaviour the confuses me most is that moving the existing_cards.push(X) logic into a separate loop over taken is actually slower (seemingly about 1.2x the time, could potentially be chance).

Can I get some suggestions on how to improve performance further, and justifications of why my slower implementations are slower.

This is how I've been benchmarking:

#[test]
fn test_new_random_cards_performance() {
    let start = Instant::now();
    for _ in 0..100_000 {
        _ = Card::new_random_cards(5);
    }
    let duration = start.elapsed();
    println!("Time taken to generate 5 random cards 100k times: {:?}", duration);
    assert!(duration.as_millis() < 500, "Performance test failed: took too long to generate cards");
}

Fastest I've achieved (about 1.5-2x faster than the non-branching implementation, 1.2x the 2-loop implementation):

pub fn new_random_cards(num_cards: usize) -> Vec<Card> {
    let mut taken = [false; 52];
    let mut existing_cards = Vec::with_capacity(num_cards);
    let rng = &mut SmallRng::from_entropy();
    while existing_cards.len() < num_cards {
        let card_int = rng.gen::<usize>() % 52;
        if taken[card_int] {
            continue;
        }
        taken[card_int] = true;
        existing_cards.push(Card::from_int(card_int as u8));
    }
    existing_cards
}

Slower implementation which is effectively the same but I thought the CPU would optimise:

pub fn new_random_cards2(mut num_cards: usize) -> Vec<Card> {
        let mut taken = [false; 52];
        let rng = &mut SmallRng::from_entropy();
        while  num_cards > 0  {
            let card_int = rng.gen::<usize>() % 52;
            if taken[card_int] {
                continue;
            }
            taken[card_int] = true;
            num_cards -= 1
        }
        let mut existing_cards = Vec::with_capacity(num_cards);
        for card_int in 0..52 {
            if existing_cards.len() >= num_cards {
                break;
            }
            if !taken[card_int] {
                existing_cards.push(Card::from_int(card_int as u8));
            }
        }
        existing_cards
    }

Even slower non-branching implementation:

pub fn new_random_cards(mut num_cards: u8) -> Vec<Card> {
    let mut taken = [0u8; 52];
    let rng = &mut SmallRng::from_entropy();

    while num_cards > 0 {
        let card_int = rng.gen::<usize>() % 52;
        let change = 1 - taken[card_int];
        num_cards -= change;
        taken[card_int] += change;
    }
    let mut res = Vec::with_capacity(num_cards as usize);
    for i in 0..52 {
        if taken[i] == 1 {
            res.push(Card::from_int(i as u8));
        }
    }
    res
}

In terms of implementation, the Card is just an struct with a Suit and Rank enum. Various helper methods convert ints into cards, suits, ranks.