Underwhelming LLMs
Some people, when confronted with a problem, think “I know, I’ll use an LLM.” Now they have two problems.
Today I was writing a relatively simple PowerShell script to send an alert before secrets expire. This was my first attempt:
# $configFile and $TeamsWebhookUrl are expected to be defined before this point.
$config = Get-Content -Raw $configFile | ConvertFrom-Json -Depth 100 -AsHashtable
$config.GetEnumerator() | ForEach-Object {
    $alert = $_.Name
    $ExpiryDate = [datetime]::ParseExact($_.Value.ExpiryDate, "yyyy-MM-dd", $null)
    $description = $_.Value.Description
    $NoticePeriodInDays = $_.Value.NoticePeriodInDays
    $daysToExpire = ($ExpiryDate - (Get-Date)).Days
    # Notify once the expiry date falls inside the notice period.
    $shouldNotify = (Get-Date).AddDays($NoticePeriodInDays) -gt $ExpiryDate
    $message = "$alert is expiring on $($ExpiryDate.ToString("yyyy-MM-dd")). This is $daysToExpire days away."
    if ($shouldNotify) {
        # ConvertTo-Json escapes quotes and backslashes in the description.
        $body = @{ text = "$message See further details here: $description" } | ConvertTo-Json
        Invoke-RestMethod -Method Post -ContentType 'application/json' -Uri $TeamsWebhookUrl -Body $body
    }
    else {
        Write-Host $message
    }
}
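The notification rule in the script boils down to one date comparison. As a rough sketch (not what any of the LLMs produced), that rule can be pulled into a function that takes the current date as a parameter, so it can be checked against fixed dates. The function name Get-ExpiryStatus and its -Now parameter are hypothetical, made up for this illustration.

```powershell
# Hypothetical helper, not part of the original script: isolates the date logic
# so it can be exercised with a fixed "now" instead of the real clock.
function Get-ExpiryStatus {
    param(
        [Parameter(Mandatory)] [datetime] $ExpiryDate,
        [Parameter(Mandatory)] [int] $NoticePeriodInDays,
        [datetime] $Now = (Get-Date)
    )

    [pscustomobject]@{
        # Whole days remaining (TimeSpan.Days drops any partial day).
        DaysToExpire = ($ExpiryDate - $Now).Days
        # Notify once the expiry date falls inside the notice period.
        ShouldNotify = $Now.AddDays($NoticePeriodInDays) -gt $ExpiryDate
    }
}

# Fixed dates make the outcome deterministic.
$now     = [datetime]'2025-01-01'
$expiry  = [datetime]'2025-01-20'
$inside  = Get-ExpiryStatus -ExpiryDate $expiry -NoticePeriodInDays 30 -Now $now
$outside = Get-ExpiryStatus -ExpiryDate $expiry -NoticePeriodInDays 7 -Now $now
```

With a 30-day notice period the expiry falls inside the window, so $inside.ShouldNotify is true; with 7 days it does not, so $outside.ShouldNotify is false.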
I then asked various LLMs (Gemini, DeepSeek, OpenAI's ChatGPT and Claude) to refactor it and write tests for it.
The refactors went well, but not a single one managed to provide working tests the first time around.
I then tried feeding each LLM its own refactored code and simply asked it to write tests for the script. The results weren't great:
- DeepSeek required minor changes (dot sourcing required the script parameters) to get to 3 failures out of 14 tests.
- ChatGPT worked out of the box but all 11 tests failed.
- Gemini required minor changes (dot sourcing required the script parameters) to get to 16 failed tests out of 16.
- Claude worked out of the box and got 4 failures out of 11 tests.
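For context on the dot-sourcing fix mentioned above, here is a minimal sketch of the kind of change involved, assuming the refactored script declares mandatory parameters. The script file name, the Get-AlertMessage function and the dummy argument values are all hypothetical stand-ins, not the code any LLM generated.

```powershell
# Hypothetical stand-in for a refactored script: mandatory parameters plus a function.
$scriptPath = Join-Path ([System.IO.Path]::GetTempPath()) 'Send-ExpiryAlert.Demo.ps1'
Set-Content -Path $scriptPath -Value @'
param(
    [Parameter(Mandatory)] [string] $ConfigFile,
    [Parameter(Mandatory)] [string] $TeamsWebhookUrl
)

function Get-AlertMessage {
    param([string] $Name, [datetime] $ExpiryDate)
    "$Name is expiring on $($ExpiryDate.ToString('yyyy-MM-dd'))."
}
'@

# Dot-sourcing runs the script in the current scope, so its mandatory
# parameters still have to be supplied; dummy values are enough here.
. $scriptPath -ConfigFile 'unused.json' -TeamsWebhookUrl 'https://example.invalid/webhook'

# The function defined in the script is now available to the caller.
$message = Get-AlertMessage -Name 'api-key' -ExpiryDate ([datetime]'2025-01-20')
Remove-Item $scriptPath
```

A test file would typically do the dot-sourcing step once during setup and then call the script's functions from individual tests.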
I didn’t engage in a deep analysis of the tests, but some were a bit questionable, essentially testing basic PowerShell functionality. But hey, if you aim for 100% coverage, then this is the way, I suppose.
I never cease to be both amazed and immensely frustrated by these LLMs.