Welcome, Developers! 👋
CSS flexibility, bloated web pages, and the brutal reality of shipping production apps are all on the table this week. We're also looking at why AI-generated pull requests fail human review and what a database design interview reveals about real engineering thinking.
How to Test AI Agents That Never Produce the Same Output Twice
Same input. Same prompt. Different output. That's the reality of testing AI agents that write code, and most teams are shipping without solving it.
Nick Nisi from WorkOS tackled this by building eval systems for two AI tools: npx workos, a CLI agent that installs AuthKit into your project, and WorkOS's agent skills that power LLM responses about SSO, directory sync, and RBAC.
The post covers how to test against real project structures, score output that's different every time, and catch when your agent makes up methods that don't exist.
Learn more about evals →
🔖 The Reading Room
Articles we have hand-picked for you:
There Is No "Wrong" in CSS
CSS is more flexible than many developers realize, and labeling code as "wrong" often misses the bigger picture. If your CSS works without causing real UX or DX problems, there is no meaningful reason to change it. The web platform prioritizes backward compatibility, so even older code stays safe and functional.
By Jens Oliver Meiert →
The 49MB Web Page
Modern news websites have become hostile to readers. The New York Times homepage loads 49MB of data across 422 network requests, equivalent to the entire Windows 95 operating system or 10-12 MP3 songs. Before users can read a headline, their browsers process dozens of ad auctions, tracking scripts, and intrusive modals that prioritize short-term ad revenue over user experience.
By Shubham →
The 100 Hour Gap Between a Vibecoded Prototype and a Working Product
Building a working prototype took one hour, but deploying a production-ready app took 100 times longer. The author navigated AWS infrastructure setup, smart contract security with Safe wallets, rate limiting for concurrent users, and Farcaster Mini App integration. Despite using advanced LLMs like Claude 4.6 and Codex 5.3, critical issues like nonce handling were missed, causing launch day crashes. Mac's experience shows that infrastructure complexity and edge cases remain the hidden time sinks of modern app development.
By Mac Budkowski →
Many SWE-Bench-Passing PRs Would Not Be Merged Into Main
METR research reveals that roughly half of AI-generated pull requests passing SWE-bench Verified tests would be rejected by actual repository maintainers. Maintainer merge rates averaged 24 percentage points lower than automated grader scores across five AI models tested on 296 PRs. While this doesn't indicate a fundamental capability limitation, it suggests that benchmark scores may mislead estimates of real-world agent usefulness without additional iteration or human feedback.
By Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush →
“Design Me a Highly Resilient Database”
A seasoned engineer failed a job interview not for lacking knowledge, but for having too much of it. Asked to design a "highly resilient database" without any context, they asked critical questions about data types, query patterns, and failure modes before proposing a solution. The interviewer wanted a simple answer (Cassandra), but the candidate understood that database selection is a product decision that depends entirely on the system it serves, not a one-size-fits-all choice.
By Nik Ogura →
🔗 The Link Lounge
Unordered finds from around the web:
Find something cool? You can send us links to feature here via email.
🧰 The Toolbox
Tools and products we're excited about today:
Vite 8
Vite 8 ships Rolldown, a Rust-based bundler replacing both esbuild and Rollup, delivering 10-30x faster builds. It also adds integrated Devtools, native tsconfig paths support, and browser console forwarding. Node.js 20.19+ is required. Learn more →
Neko
Neko is a self-hosted virtual browser running in Docker, streamed via WebRTC to multiple simultaneous users. It enables collaborative browsing, watch parties, and interactive presentations with real-time screen control sharing. Learn more →
Aegis
Aegis is an open-source, local-first EDR (Endpoint Detection & Response) for AI agents. It monitors processes, file access, network activity, and behavior of 107 known AI agents in real time, with no telemetry or cloud dependency. Learn more →
Can I Run AI locally?
This site estimates which local models your device can run based on its hardware. Learn more →
🎤 Your Voice
Your feedback shapes what comes next! We read every email, so simply hit reply and tell us what's on your mind.