Apollo 11's control panel | Fall 2023 @ National Air and Space Museum, Washington D.C.

AI can generate code. Most of it's terrible, and here's how to "vibe code" properly.


(spoiler: AI is a bricklayer. You're still the architect; actual code and engineering knowledge required)
 
As AI adoption and usage grow, so does the doomer-speak in various circles about how the days of developers are numbered. It is easy to draw that conclusion when you watch a machine blurt out 80%-plausible, good-looking code from a few simple instructions. This, however, overlooks the intricate dance of logic, creativity, and problem-solving that lies at the heart of programming. As a developer myself, I would like to show how best to use AI to teach yourself development for a platform you have never laid your hands on. I do compilers and low-level systems; in this article, I will be building a minimal prototype Android app that uses an LLM-like interface to help you navigate your phone. For the sake of brevity, I have only documented the first couple of days of this project, as I continue to refine and polish the app for a proper OSS release.

Oh, and I have never touched Android development before, barring one app in 2017 that was actually just a website iframe masquerading as one. Moreover, this app is still in progress, but this article lays out my thoughts on how best to use AI and how it has helped me, given the domain-specific knowledge and context I already have.

Before we begin: Some Context and Thoughts

See, the thing about programming is that no matter what end product you're aiming for, the fundamentals largely remain the same. At its core, programming is breaking a complex task into small, doable chunks, completing each of them, and then bringing it all together. This isn't just about spitting out lines of code; it's about architectural thinking, foresight, and a deep understanding of the "why" behind the "what". It requires a game plan, solid knowledge of algorithms to optimize your app, and clarity on every small step you need to take to reach your desired end product.

Herein lies a crucial distinction: AI, in its current form, doesn't understand in the human sense. It excels at pattern recognition and generation based on the vast datasets it's trained on, but it lacks genuine comprehension, intuition, and the ability to navigate ambiguity or novel situations that weren't in its training data. It can generate a snippet, but it can't truly innovate or strategize the overarching solution to a problem it has never encountered.

If you pretend that AI is a "free developer" who will do whatever you ask of them, things do not work out well. What you should do instead is use it as the most excellent search engine that has ever existed, one that filters out the noise traditional search engines bring with them and gets straight to the point. Ask AI to "please make me a website with an animated background, search functionality, a blog as the main page, and a dark/light theme toggle", and you will get something very rudimentary. Add some logic to the mix, especially complex or novel business rules, and you will get a plethora of errors, or worse, code that looks plausible but is fundamentally flawed in its approach or riddled with subtle bugs that only surface under specific, unforeseen conditions. AI doesn't possess common sense, nor can it critically evaluate its own output in the context of a larger, evolving system without explicit, detailed human guidance.

Human developers are needed to translate vague ideas into concrete, actionable plans, to make judgment calls when there are multiple viable paths, to debug issues that require a deep understanding of the system's behavior, and to ensure the final product is not just functional but also ethical, secure, and user-friendly. AI might generate the bricks, but the human developer is the architect, the engineer, and the quality assurance, all rolled into one. Developers who can effectively prompt, guide, and critically evaluate AI-generated code will be even more valuable.

In fact, when I gave AI full freedom, the way a novice might, to build one aspect of the app (a chat interface that uses Gemini as a backend to talk to the user), and then spent an hour feeding it all the errors its own code produced and even hinting at how it might solve them, this is what I ended up with:

Do you see how elementary some of these errors are?

This is, in fact, the later version, after I hinted AI toward clearing most of the bugs, which it did inefficiently. Rudimentary errors: using wrong variable names, ignoring Kotlin conventions, making up variables, hallucinating, using bad algorithms (an O(n^2) text detection routine that could have been a simple regex, followed by every bad regex solution imaginable after countless hints), and so on. The first build had over 95 errors for just a chat UI and a send button that issues a POST to the Gemini API and renders the reply in a separate text bubble, all using system components and UI elements (barely 100 lines of code). And on the algorithm point: the overhead of an unoptimized search algorithm will not be felt as much in your 100-item test case as it will be across billions of real-life search queries.
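
To make the algorithm point concrete, here is a hypothetical reconstruction (not the actual generated code) of the pattern: a nested-loop substring scan where a single regex would do. The "call" keyword is purely illustrative:

    // Hypothetical illustration: detecting a "call" command in user text.
    // The AI's approach: check every substring, roughly O(n^2) comparisons.
    fun containsCallCommandNaive(text: String): Boolean {
        for (start in text.indices) {
            for (end in start + 1..text.length) {
                if (text.substring(start, end).equals("call", ignoreCase = true)) return true
            }
        }
        return false
    }

    // The straightforward alternative: one regex, one linear pass over the text.
    private val callCommand = Regex("""\bcall\b""", RegexOption.IGNORE_CASE)
    fun containsCallCommand(text: String): Boolean = callCommand.containsMatchIn(text)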

Which is exactly why developers will always be needed, and why you should learn to code. Anyone can whip out an unoptimized, API-blasted-out-for-free, heats-up-your-phone, crashes-if-more-than-200-people-use-it app. You, on the other hand, create art.

The App

I am trying to create a proof of concept app that helps people with accessibility needs use their phones through LLMs. The concept is simple:

  • The user interfaces with their phone through an LLM (called the "Automator"); the interaction can be voice-based
  • The "Automator" breaks down what the user wants to do into simple, custom-defined YAML steps, and passes them to the "Actor" (see the example payload after this list)
  • The "Actor" takes the YAML, gets system-level accessibility privileges, and executes those actions one by one.
  • The "Automator" checks if the action is complete, and tries again if it is not yet complete.

(And yes, the concept is incredibly similar to TestDriver, which I checked out and LOVED. However, that product is meant for QA testing of UI-based apps, while this is an accessibility tool for people who do not know computers and face a language barrier. I have reached out to the people behind TestDriver and await their response; we simply came up with the concept independently. Their product is amazing for its use case, check it out [Footnote 1])

This project, for me, is fundamentally an exploration in applied AI for human-computer interaction, specifically targeting accessibility for my elderly, Hindi-speaking family members. The aim is to create a more intuitive bridge to smartphone functionality by allowing them to use natural voice commands in their native language, which an LLM (the "Automator") then translates into actionable operations.

My approach to building this proof-of-concept accessibility app wasn't to ask AI to write the entire application. Instead, I focused on decomposing the overarching challenge into its constituent computer science problems: finite state machines, domain-specific languages, NLP pipeline stages, inter-process communication, and UI event handling.

For each sub-problem, I drilled down to the smallest logical "brick" or function I needed. Then, armed with the specific CS concept I was trying to implement, I effectively prompted AI to generate these individual bricks. This methodology allowed AI to handle much of the boilerplate Kotlin syntax and Android API calls, significantly boosting my productivity and helping me avoid elementary coding errors. 

While complex, system-level logic errors are an inherent part of any development process, AI helped ensure the foundational components were syntactically sound and aligned with common patterns, allowing me to focus on the architectural integration and the core CS challenges.

1. The Core Interaction Loop: Automator & Actor as a Coordinated Pair

From my developer's vantage point, the Automator and Actor components operate as a tightly coupled system, best modeled as a finite state machine (FSM). This isn't just a conceptual FSM; it will have formally defined states (e.g., AWAITING_INPUT, AUTOMATOR_LLM_QUERY, ACTOR_YAML_PARSE, ACTOR_STEP_EXECUTE, AUTOMATOR_VERIFY_RESULT), explicit transition conditions triggered by events or data, and defined outputs for each state. The interaction between them is inherently asynchronous; the Automator might dispatch a YAML payload and then enter a state awaiting completion or update from the Actor, potentially involving polling or a callback mechanism. The integrity of this loop relies on robust state transition logic to prevent issues like deadlocks or race conditions, especially if the Automator needs to handle new user input while the Actor is still processing a previous command. The "contract" or protocol for data exchange between them (the YAML structure) is critical for this distributed system behavior.

How it works for me: I'm essentially designing a distributed control flow. The FSM ensures predictable behavior. My task is to ensure that all state transitions are handled, including error states (e.g., LLM_TIMEOUT, ACTOR_PERMISSION_DENIED, YAML_PARSE_ERROR), and that the system can gracefully recover or guide the user. This orchestration is key to the app's reliability. For instance, when designing the state transitions, I'd identify a specific need, like "create a Kotlin enum for FSM states" or "implement a function to transition from AWAITING_INPUT to AUTOMATOR_LLM_QUERY based on user input event." AI was instrumental in generating the Kotlin syntax for these enums, data classes for state payloads, or even basic function skeletons for state handlers, allowing me to rapidly prototype the FSM structure based on automata theory principles I already understood.

Example prompts I gave to AI:

  • "Generate a Kotlin sealed class named AppState representing the possible states of my application FSM. Include the following states as objects or data classes: Idle (initial state), AwaitingUserInput, ProcessingAutomatorQuery(val query: String) which holds the user's query, ExecutingActorPlan(val planId: String) which holds an identifier for the current YAML plan, AwaitingActorConfirmation(val planId: String), and ErrorOccurred(val errorMessage: String, val previousState: AppState?). Ensure ErrorOccurred can optionally hold the state before the error."
  • "Write a Kotlin function determineNextState(currentState: AppState, event: AppEvent): AppState. AppEvent is a sealed class with subtypes like UserInputReceived(val text: String), AutomatorProcessingComplete(val success: Boolean, val planId: String?), ActorExecutionComplete(val success: Boolean, val planId: String). Implement basic logic: if currentState is AwaitingUserInput and event is UserInputReceived with non-empty text, return ProcessingAutomatorQuery. If currentState is ProcessingAutomatorQuery and event is AutomatorProcessingComplete with success, return ExecutingActorPlan. Handle other sensible transitions and a default case."
  • "Show me a robust Kotlin implementation of a generic state machine class using a Map to define transitions (e.g., Map<Pair<S, E>, S> where S is state type and E is event type). The class should have methods to register states, register transitions, process an event, and get the current state. Include error handling for invalid transitions."

2. The "Automator": My Intelligent Front-End

The Automator serves as the primary interface, managing the dialogue with the user and orchestrating the LLM interaction. This involves more than just a chat UI; it's an NLP pipeline. User input (voice or text) undergoes initial processing, potentially including speech-to-text conversion which itself involves acoustic modeling and language models. This natural language input is then passed to my custom-hosted LLM. The LLM's task is a sophisticated form of semantic parsing: it must perform intent recognition (what does the user want to do?) and entity extraction (what are the key parameters, like "daughter" in "Call my daughter"?), and then map this understanding to a sequence of operations defined in my custom YAML format. This YAML generation is akin to template-based code generation, where the LLM fills slots in predefined YAML structures based on the parsed intent and entities. The API calls to this custom LLM will need careful design (likely RESTful, with structured JSON request/response bodies) to handle the conversational context and manage the LLM's state, if any.

How it works for me: My core challenge here is the "intelligence" part – training the LLM. This involves not just providing examples (few-shot or fine-tuning on a base model) but also designing the output constraints for the LLM to adhere strictly to my YAML schema. I need to ensure the LLM handles linguistic ambiguity, synonyms, and the nuances of Hindi. The Automator must also manage conversation history to resolve anaphora ("call her back") and maintain context for multi-turn interactions. In building the Automator's scaffolding, I prompted AI for things like "Kotlin function to make an HTTP POST request with a JSON body and handle the response asynchronously" for LLM communication, or "Android code to capture microphone input and send it to a speech-to-text API." These well-defined, smaller problems were perfect for AI to generate functional Kotlin code, which I then integrated into the larger NLP pipeline logic I was architecting.

Example prompts I gave to AI:

  • "Provide a complete Kotlin function using Ktor client for Android to make an asynchronous POST request to a specified URL. The function should take the URL, a Map<String, String> for headers, and a serializable Kotlin data class instance as the JSON body. It should handle potential exceptions (e.g., network errors, timeouts) and return a Result<ResponseType> where ResponseType is another serializable data class. Include setting a connection timeout of 15 seconds and a request timeout of 30 seconds. Show how to use Kotlinx Serialization for the request and response bodies."
  • "Android Kotlin example: Implement a class that uses android.speech.SpeechRecognizer. It should have methods to startListening(callback: (String) -> Unit) and stopListening(). The callback should be invoked with the recognized text. Ensure necessary permissions (RECORD_AUDIO, INTERNET) are mentioned and provide basic error handling for recognizer errors (e.g., onError listener)."
  • "In Jetpack Compose, create a reusable Composable function for a chat message input bar. It should include a TextField for typing the message, and an IconButton with a 'Send' icon. The 'Send' button should be enabled only when the TextField is not empty. The function should take a lambda (String) -> Unit as a parameter, which is invoked when the send button is clicked, passing the current text. The TextField should clear after sending."

3. The "Actor": My Behind-the-Scenes Worker

The Actor component is the system's effector arm. Upon receiving a YAML payload from the Automator, its first task is YAML parsing. This requires a robust parser capable of validating the input against my defined YAML schema (lexical analysis to tokenize the input, followed by syntax analysis to build an internal representation, like an Abstract Syntax Tree or a simpler object model). It then iterates through the parsed actions, executing them sequentially. This execution involves translating each abstract YAML action into concrete calls to the Android Accessibility Service APIs. This is an intricate process, as accessibility services provide powerful but low-level hooks into the OS UI tree, requiring traversal, element identification (often based on text, content-description, or resource IDs), and programmatic event dispatching (clicks, swipes, text input). Requesting and verifying accessibility privileges involves interacting with the Android permission model, an OS-level security feature.

How it works for me: I'm essentially building an interpreter for my YAML-based DSL. Each YAML action is an opcode that the Actor decodes and executes. Error handling is paramount here: an action might fail because an expected UI element isn't found, or the screen has changed unexpectedly. The Actor needs to detect such failures, potentially log diagnostic information, and report the status (success, failure, specific error code) back to the Automator for higher-level decision-making (e.g., retry, abort, ask user for help). Efficiently querying the UI tree via accessibility APIs without causing performance lag is also a consideration. For the Actor, I leveraged AI for tasks like "Kotlin code to parse a simple YAML string into a list of action objects" (perhaps using a lightweight library suggested by AI if I didn't want to write a full parser), or more specifically, "Android Accessibility Service code to find a UI element by its text content and perform a click action." These prompts, based on my understanding of parsing algorithms and OS interaction, yielded specific, usable Kotlin code snippets for Android, forming the building blocks of my action interpreter.

Example prompts I gave to AI:

  • "Write a Kotlin function that takes a YAML string as input. The YAML represents a list of actions, where each action is a map with an 'action_type' key and a 'parameters' key (which is another map of key-value string pairs). Parse this YAML string into a List<ActionStep> where ActionStep is a data class with actionType: String and parameters: Map<String, String>. Implement this using only standard Kotlin string manipulation and collection functions, without relying on external YAML parsing libraries. Handle basic malformed entries gracefully by skipping them and perhaps logging a warning."
  • "Within an Android AccessibilityService class, provide a detailed Kotlin function performTapOnText(targetText: String): Boolean. This function should get the root node in the active window, recursively search for an AccessibilityNodeInfo whose text exactly matches targetText and is clickable. If found, it should perform AccessibilityNodeInfo.ACTION_CLICK and return true. If not found or not clickable, return false. Include necessary null checks and error logging."
  • "Demonstrate how to perform a programmatic swipe gesture in an Android AccessibilityService using dispatchGesture. The function should take start coordinates (x1, y1), end coordinates (x2, y2), and a duration in milliseconds. Show the creation of Path and GestureDescription. Ensure the gesture is dispatched correctly and handle potential exceptions."
  • "Provide a Kotlin utility function for an Android app that checks if a specific Accessibility Service (identified by its class name string, e.g., 'com.example.MyAccessibilityService') is currently enabled. If not enabled, it should create and return an Intent that navigates the user directly to the Accessibility settings screen where they can enable it. Handle different Android API levels if the intent action has changed."

4. The YAML Definition File: My Custom "Language"

This file is more than a simple dictionary; it formally defines a Domain-Specific Language (DSL) tailored for phone automation via accessibility services. It specifies the grammar of this language – the valid action types, their required and optional parameters, and the data types for those parameters. For example, action.pull_down_notification_bar is a zero-argument command, while action.type_text would require parameters like text_content: string and perhaps an optional target_element_query: string. The design of this DSL involves balancing expressiveness (can it represent all the actions I need?) with simplicity (is it easy for the LLM to generate and for the Actor to parse reliably and unambiguously?).

How it works for me: This DSL design is a foundational architectural decision. It dictates the capabilities of the entire system. Changes here ripple through LLM training (to teach it new "vocabulary" or "syntax") and the Actor's implementation (to add new action handlers). I need to think about versioning this DSL if I plan to expand functionality significantly over time, ensuring backward compatibility or clear migration paths. The clarity and precision of this DSL are crucial for minimizing misinterpretations by either the LLM or the Actor. While AI didn't design the DSL's semantics (that was my architectural task based on app requirements), once I defined the structure, I could ask AI to, for instance, "generate Kotlin data classes to represent these YAML action types: open_app(appName: String), type_text(text: String, elementId: String)." This helped create the object model for my DSL quickly and accurately within the Kotlin environment, forming the data structures the Actor would work with.

Example prompts I gave to AI:

  • "Generate a Kotlin sealed class hierarchy to represent various UI automation actions for my DSL. The base sealed class should be UiAction. Include the following concrete action classes: OpenApplication(val applicationName: String), TypeTextAction(val textToType: String, val targetElementDescription: String?) where targetElementDescription is optional, TapElementAction(val targetElementDescription: String, val tapType: TapType = TapType.SINGLE) where TapType is an enum (SINGLE, LONG_PRESS), and SwipeGesture(val startXPercent: Float, val startYPercent: Float, val endXPercent: Float, val endYPercent: Float, val durationMs: Int) using screen percentage for coordinates. Ensure all properties are vals and classes are data classes where appropriate."
  • "Given a Kotlin data class RawActionStep(val actionType: String, val parameters: Map<String, Any>), write a factory function createUiAction(rawStep: RawActionStep): UiAction? that attempts to convert rawStep into one of the concrete UiAction types defined previously. Perform type checking and casting for parameters (e.g., ensure 'durationMs' is an Int for SwipeGesture). Return null if conversion fails or parameters are invalid/missing. For example, if actionType is 'open_app', it expects 'applicationName' in parameters."
  • "How would I structure a YAML file that represents a sequence of these UiAction types? Provide an example YAML snippet for opening 'Settings', then tapping an element described as 'Network & internet', then typing 'WiFi password' into an element described as 'Search settings'."

5. The Main View: My Control Center and User Interface

The main UI layer doesn't just present information; it's an event-driven system reacting to state changes within the underlying FSM. When the Automator transitions to LISTENING, the UI might update an icon or text prompt. When the Actor begins executing a complex sequence, the UI could display a progress indicator or the current step. This requires careful binding of UI elements to the FSM's state variables. The logging mechanisms I'll build for debugging (e.g., displaying the generated YAML, Actor execution traces) involve structured data presentation and potentially temporary data persistence. The manual override capability in the menu bar is a critical aspect of testability and fault isolation, allowing me to inject test YAML directly into the Actor or observe the Automator's raw LLM output. Architecturally, even for this PoC, I'll likely use a pattern like MVVM (Model-View-ViewModel) or a simplified version to keep UI logic separate from the business logic of the Automator/Actor FSM, promoting modularity and maintainability.

How it works for me: This view is my window into the app's complex internal operations. I need to ensure the UI provides clear feedback not just to the end-user, but also to myself during development for diagnostics. Implementing the UI updates in response to asynchronous events from the Automator/Actor FSM without creating UI freezes or race conditions requires careful use of threading or asynchronous programming constructs appropriate for Android (e.g., Coroutines, LiveData). For the UI, my CS understanding of event handling and UI patterns guided my requests to AI. For example: "Android Jetpack Compose code for a simple chat bubble UI," "Kotlin function to update a Text composable based on a LiveData stream," or "XML layout for an Android options menu with three items." AI excelled at generating these UI "bricks" according to standard Android practices, which I then wired into the FSM's state management logic to reflect the application's current operations.

Example prompts I gave to AI:

  • "Create a Jetpack Compose screen for a chat interface. It should have a TopAppBar with the title 'Accessibility Automator'. Below that, a LazyColumn should display a list of Message data class instances (Message(text: String, isUserMessage: Boolean)), styled differently for user vs. AI messages (e.g., alignment, background color). At the bottom, include the message input bar Composable (from a previous prompt). The LazyColumn should automatically scroll to the bottom when a new message is added. Show how to manage the list of messages in a ViewModel with MutableStateList."
  • "How to implement an Android Activity options menu (using onCreateOptionsMenu and onOptionsItemSelected) with three items: 'Manual YAML Input' (opens a dialog to paste YAML), 'View Actor Execution Log' (navigates to a new screen/fragment showing a list of log strings), and 'App Settings' (navigates to a settings screen). Provide the XML for the menu resource and the Kotlin code for handling selections."
  • "Demonstrate using Kotlin Coroutines and StateFlow in an Android ViewModel to expose the current AppState (from a previous FSM prompt). Then, show how a Jetpack Compose Composable function can collect this StateFlow and conditionally display different UI elements (e.g., a 'Listening...' Text, a 'Processing...' spinner, or an error message Text) based on the current AppState value."
  • "Provide a Jetpack Compose Composable function StatusIndicator(appState: AppState) that displays different icons and/or text messages based on the type of AppState. For example, a microphone icon for AwaitingUserInput, a spinning progress indicator for ProcessingAutomatorQuery or ExecutingActorPlan, and a red error icon with the message for ErrorOccurred."

And guess what? We have a largely working Android client for this whole thing, for which I am still ironing out some edge cases: a custom-defined YAML syntax, an Automator that talks to an LLM instance through POST requests, an Actor that takes my YAML syntax, gets accessibility permissions, and executes those actions, and an FSM architecture tying it all together. The core of this project, before it gets released, now lies solely in how well I train the LLM and in introducing guardrails so that unintended actions are never executed.

Know your stuff, and use AI as the best search engine + brick forge in existence. It excels at syntax; you need to excel at semantics.

 

Coming Soon

----

Cover image: Olympia Drive, Amherst, MA | 11PM on a random fall evening in 2023 

Footnotes:

1: For full transparency, here's the message I sent a week ago when I stumbled across TestDriver in a Reddit ad and tried it out:


