Fixing Flaky Android UI Tests With Claude

I use Claude Code regularly at work, and one thing I like to do is ask Claude to analyze videos to help me debug issues. The trick is that you have to instruct Claude to convert the video into individual frames first, because Claude cannot analyze video directly. ffmpeg is already installed on my Mac laptop, so here is a sample prompt:

Using ffmpeg, can you convert this video into frames at 1 frame per second and give me a summary of what is in the video.

Once the video is converted into individual frames, Claude can analyze those images to understand the content of the video. I have used this technique to fix flaky tests in our codebase.
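For reference, the frame extraction that the prompt asks for can be sketched in Python roughly as follows. This is only an illustration, not what Claude actually runs: the function names, the output directory, and the `frame_%04d.png` naming pattern are my own assumptions. The one real piece is ffmpeg's `fps` video filter, which samples the video at the given frame rate.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video: Path, out_dir: Path, fps: int = 1) -> list[str]:
    """Build an ffmpeg command that writes `fps` frames per second as PNG files."""
    return [
        "ffmpeg",
        "-i", str(video),                 # input video
        "-vf", f"fps={fps}",              # video filter: sample `fps` frames per second
        str(out_dir / "frame_%04d.png"),  # numbered output images (hypothetical naming)
    ]


def extract_frames(video: Path, out_dir: Path, fps: int = 1) -> list[Path]:
    """Run ffmpeg (must be on PATH) and return the extracted frames in order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video, out_dir, fps), check=True)
    return sorted(out_dir.glob("frame_*.png"))
```

Calling `extract_frames(Path("recording.mp4"), Path("frames"))` would leave one image per second of video in `frames/`, which is the form Claude can actually look at.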

At work we run thousands of Android UI tests for every PR, and inevitably some of them are flaky. When a UI test starts to flake more than 5% of the time, an automated process ignores the test and creates a JIRA ticket. The problem is that fixing flaky UI tests is tedious and time-consuming. Everyone agrees that fixing them is important, but the work is hard to prioritize.

That is where Claude video analysis comes in. For the past couple of weeks, I’ve been experimenting with a Claude skill to fix flaky UI tests.

Here is an overview of the steps in the skill:

1. Run the test 100 times on the CI server.
2. Analyze the build logs for failed and successful tests. Download the logs and videos for failed tests. Download the videos for successful tests.
3. Convert the videos to individual frames for analysis.
4. Analyze the failed and successful videos to understand the failure and the difference.
5. Analyze the UI test and the code to understand the test failure.
6. Based on the video and code analysis, fix the flaky UI test.
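
To make the first two steps a bit more concrete, here is a minimal Python sketch of the bookkeeping involved: measuring how often the test failed across the runs and separating the failed videos from the successful ones. The `RunResult` type and its fields are hypothetical; the real skill reads this information from our CI build logs, whose format I am not showing here.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One CI run of the test. Hypothetical shape, for illustration only."""
    run_id: int
    passed: bool
    video_path: str


def flake_rate(runs: list[RunResult]) -> float:
    """Fraction of runs that failed (0.0 when there are no runs)."""
    if not runs:
        return 0.0
    failures = sum(1 for run in runs if not run.passed)
    return failures / len(runs)


def split_videos(runs: list[RunResult]) -> tuple[list[str], list[str]]:
    """Return (failed_videos, passed_videos) so the two groups can be compared."""
    failed = [run.video_path for run in runs if not run.passed]
    passed = [run.video_path for run in runs if run.passed]
    return failed, passed
```

With 100 runs, a `flake_rate` above 0.05 corresponds to the 5% threshold mentioned earlier, and the two lists from `split_videos` are the inputs to the frame conversion and comparison steps.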

The results have been surprisingly good: Claude has fixed UI tests that would have taken me hours to solve. The fix varies for each test. For some tests, Claude replaced an unreliable assertion with a different, more reliable one. In other cases, Claude added assertions that check the state before the flaky assertion runs. Overall, I have been happy with the results, and the skill has been helpful in resolving our flaky UI tests.

I am constantly iterating on the skill, and it can now process a list of JIRA tickets, open PRs to fix the flaky tests, and update the JIRA tickets. At the end of the day, I like to gather a list of flaky UI test tickets and let the skill fix the tests overnight. The next morning, I double-check the fixes and open the PRs for review.
