AI can generate code. Most of it's terrible, and here's how to "vibe code" properly.
Oh, and I had never touched Android development before this, barring one app in 2017 that was really just a website iframe masquerading as an app. The current app is still in progress, but this article lays out my thoughts on how best to use AI, and how it has helped me given the domain-specific knowledge and context I have.
Before we begin: Some Context and Thoughts
See, the thing about programming is that no matter what end product you're aiming for, the fundamentals largely remain the same. At its core, programming is breaking a complex task into small, doable chunks, completing each of them, and then bringing it all together. This isn't just about spitting out lines of code; it's about architectural thinking, foresight, and a deep understanding of the "why" behind the "what". It requires a game plan, solid knowledge of algorithms to optimize your app, and clarity on every small step needed to reach your desired end product.
Herein lies a crucial distinction: AI, in its current form, doesn't understand in the human sense. It excels at pattern recognition and generation based on the vast datasets it's trained on, but it lacks genuine comprehension, intuition, and the ability to navigate ambiguity or novel situations that weren't in its training data. It can generate a snippet, but it can't truly innovate or strategize the overarching solution to a problem it has never encountered.
Human developers are needed to translate vague ideas into concrete, actionable plans, to make judgment calls when there are multiple viable paths, to debug issues that require a deep understanding of the system's behavior, and to ensure the final product is not just functional but also ethical, secure, and user-friendly. AI might generate the bricks, but the human developer is the architect, the engineer, and the quality assurance, all rolled into one. Developers who can effectively prompt, guide, and critically evaluate AI-generated code will be even more valuable.
In fact, when I gave AI full freedom, the way a novice might, to build one aspect of the app (a chat interface that uses Gemini as a backend to talk to the user), and then spent an hour feeding it every error its own code produced and even hinting at how it might solve them, this is what I ended up with:
Do you see how elementary some of these errors are?
This is, in fact, the later version, after I hinted at most of the bugs, which the AI cleared inefficiently. The errors were rudimentary: wrong variable names, ignorance of Kotlin conventions, made-up variables, hallucinations, bad algorithms (an O(n^2) text detection routine that could have been a simple regex, followed by every bad regex solution imaginable after countless hints), and so on. The first build had over 95 errors for just a chat UI and a send button that sends a POST request to the Gemini API and renders its response in a separate text bubble, all using system components and UI elements (this is barely 100 lines of code). Even the algorithm part matters: the overhead of an unoptimized search algorithm won't be felt in your 100-item test case the way it would be across billions of real-life queries.
Which is all the more reason why developers will always be needed, and why you should learn to code. Anyone can whip out an unoptimized, API-blasted-out-for-free, heats-up-your-phone, crashes-if-over-200-people-use-it app. You, on the other hand, create art.
The App
I am trying to create a proof of concept app that helps people with accessibility needs use their phones through LLMs. The concept is simple:
- The user interfaces with their phone through an LLM (called the "Automator"), which can be voice-based.
- The "Automator" breaks down what the user wants to do into simple, custom-defined YAML steps and passes them to the "Actor".
- The "Actor" takes the YAML, gets system-level accessibility privileges, and executes those actions one by one.
- The "Automator" checks whether the action is complete, and tries again if it is not.
(And yes, the concept is incredibly similar to TestDriver, which I checked out and LOVED. However, that product is meant for QA testing of UI-based apps, while this is an accessibility tool for people who do not know computers and have a language barrier. I have reached out to the people behind TestDriver and await their response; we both just came up with the concept independently. Their product is amazing for its use case, check it out [Footnote 1])
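To make the loop concrete, here's a minimal Kotlin sketch of how the four steps above could hang together. Everything in it (the Automator and Actor interfaces, ExecutionResult, MAX_RETRIES) is a hypothetical placeholder for illustration, not the app's actual code:

```kotlin
// Hypothetical sketch of the Automator/Actor loop described above; all names are placeholders.
interface Automator {
    suspend fun planFor(utterance: String): String                           // LLM turns the request into YAML steps
    suspend fun verify(utterance: String, result: ExecutionResult): Boolean  // LLM checks whether the goal was met
}

interface Actor {
    suspend fun execute(yamlPlan: String): ExecutionResult                   // replays the steps via accessibility APIs
}

data class ExecutionResult(val success: Boolean, val failedStepIndex: Int? = null)

const val MAX_RETRIES = 3

suspend fun runRequest(automator: Automator, actor: Actor, utterance: String): Boolean {
    repeat(MAX_RETRIES) {
        val plan = automator.planFor(utterance)
        val result = actor.execute(plan)
        if (automator.verify(utterance, result)) return true                 // done; otherwise re-plan and retry
    }
    return false
}
```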
This project, for me, is fundamentally an exploration in applied AI for human-computer interaction, specifically targeting accessibility for my elderly, Hindi-speaking family members. The aim is to create a more intuitive bridge to smartphone functionality by allowing them to use natural voice commands in their native language, which an LLM (the "Automator") then translates into actionable operations.
My approach to building this proof-of-concept accessibility app wasn't to ask AI to write the entire application. Instead, I focused on decomposing the overarching challenge into its constituent computer science problems—finite state machines, domain-specific languages, NLP pipeline stages, inter-process communication, and UI event handling.
For each sub-problem, I drilled down to the smallest logical "brick" or function I needed. Then, armed with the specific CS concept I was trying to implement, I effectively prompted AI to generate these individual bricks. This methodology allowed AI to handle much of the boilerplate Kotlin syntax and Android API calls, significantly boosting my productivity and helping me avoid elementary coding errors.
While complex, system-level logic errors are an inherent part of any development process, AI helped ensure the foundational components were syntactically sound and aligned with common patterns, allowing me to focus on the architectural integration and the core CS challenges.
1. The Core Interaction Loop: Automator & Actor as a Coordinated Pair
From my developer's vantage point, the Automator and Actor components operate as a tightly coupled system, best modeled as a finite state machine (FSM). This isn't just a conceptual FSM; it will have formally defined states (e.g., AWAITING_INPUT, AUTOMATOR_LLM_QUERY, ACTOR_YAML_PARSE, ACTOR_STEP_EXECUTE, AUTOMATOR_VERIFY_RESULT), explicit transition conditions triggered by events or data, and defined outputs for each state. The interaction between them is inherently asynchronous; the Automator might dispatch a YAML payload and then enter a state awaiting completion or update from the Actor, potentially involving polling or a callback mechanism. The integrity of this loop relies on robust state transition logic to prevent issues like deadlocks or race conditions, especially if the Automator needs to handle new user input while the Actor is still processing a previous command. The "contract" or protocol for data exchange between them (the YAML structure) is critical for this distributed system behavior.
How it works for me: I'm essentially designing a distributed control flow. The FSM ensures predictable behavior. My task is to ensure that all state transitions are handled, including error states (e.g., LLM_TIMEOUT, ACTOR_PERMISSION_DENIED, YAML_PARSE_ERROR), and that the system can gracefully recover or guide the user. This orchestration is key to the app's reliability. For instance, when designing the state transitions, I'd identify a specific need, like "create a Kotlin enum for FSM states" or "implement a function to transition from AWAITING_INPUT to AUTOMATOR_LLM_QUERY based on a user input event." AI was instrumental in generating the Kotlin syntax for these enums, data classes for state payloads, or even basic function skeletons for state handlers, allowing me to rapidly prototype the FSM structure based on automata theory principles I already understood.
Example prompts I gave to AI:
- "Generate a Kotlin sealed class named
AppState
representing the possible states of my application FSM. Include the following states as objects or data classes:Idle
(initial state),AwaitingUserInput
,ProcessingAutomatorQuery(val query: String)
which holds the user's query,ExecutingActorPlan(val planId: String)
which holds an identifier for the current YAML plan,AwaitingActorConfirmation(val planId: String)
, andErrorOccurred(val errorMessage: String, val previousState: AppState?)
. EnsureErrorOccurred
can optionally hold the state before the error." - "Write a Kotlin function
determineNextState(currentState: AppState, event: AppEvent): AppState
.AppEvent
is a sealed class with subtypes likeUserInputReceived(val text: String)
,AutomatorProcessingComplete(val success: Boolean, val planId: String?)
,ActorExecutionComplete(val success: Boolean, val planId: String)
. Implement basic logic: ifcurrentState
isAwaitingUserInput
and event isUserInputReceived
with non-empty text, returnProcessingAutomatorQuery
. IfcurrentState
isProcessingAutomatorQuery
and event isAutomatorProcessingComplete
with success, returnExecutingActorPlan
. Handle other sensible transitions and a default case." - "Show me a robust Kotlin implementation of a generic state machine class using a
Map
to define transitions (e.g.,Map<Pair<S, E>, S>
where S is state type and E is event type). The class should have methods to register states, register transitions, process an event, and get the current state. Include error handling for invalid transitions."
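For illustration, here's a trimmed-down version of the kind of scaffolding those prompts yield. The states and events mirror the prompts above, but treat this as a sketch of the approach rather than the app's real FSM; the transition function only covers the happy path plus a catch-all:

```kotlin
// Sketch of the FSM scaffolding described above; states and events follow the prompts, not the real app.
sealed class AppState {
    object Idle : AppState()
    object AwaitingUserInput : AppState()
    data class ProcessingAutomatorQuery(val query: String) : AppState()
    data class ExecutingActorPlan(val planId: String) : AppState()
    data class AwaitingActorConfirmation(val planId: String) : AppState()
    data class ErrorOccurred(val errorMessage: String, val previousState: AppState? = null) : AppState()
}

sealed class AppEvent {
    data class UserInputReceived(val text: String) : AppEvent()
    data class AutomatorProcessingComplete(val success: Boolean, val planId: String?) : AppEvent()
    data class ActorExecutionComplete(val success: Boolean, val planId: String) : AppEvent()
}

fun determineNextState(currentState: AppState, event: AppEvent): AppState = when {
    currentState is AppState.AwaitingUserInput && event is AppEvent.UserInputReceived && event.text.isNotBlank() ->
        AppState.ProcessingAutomatorQuery(event.text)

    currentState is AppState.ProcessingAutomatorQuery && event is AppEvent.AutomatorProcessingComplete ->
        if (event.success && event.planId != null) AppState.ExecutingActorPlan(event.planId)
        else AppState.ErrorOccurred("Automator failed to produce a plan", currentState)

    currentState is AppState.ExecutingActorPlan && event is AppEvent.ActorExecutionComplete ->
        if (event.success) AppState.AwaitingUserInput
        else AppState.ErrorOccurred("Actor failed on plan ${event.planId}", currentState)

    else -> currentState // ignore events that don't apply to the current state
}
```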
2. The "Automator": My Intelligent Front-End
The Automator serves as the primary interface, managing the dialogue with the user and orchestrating the LLM interaction. This involves more than just a chat UI; it's an NLP pipeline. User input (voice or text) undergoes initial processing, potentially including speech-to-text conversion which itself involves acoustic modeling and language models. This natural language input is then passed to my custom-hosted LLM. The LLM's task is a sophisticated form of semantic parsing: it must perform intent recognition (what does the user want to do?) and entity extraction (what are the key parameters, like "daughter" in "Call my daughter"?), and then map this understanding to a sequence of operations defined in my custom YAML format. This YAML generation is akin to template-based code generation, where the LLM fills slots in predefined YAML structures based on the parsed intent and entities. The API calls to this custom LLM will need careful design (likely RESTful, with structured JSON request/response bodies) to handle the conversational context and manage the LLM's state, if any.
How it works for me: My core challenge here is the "intelligence" part – training the LLM. This involves not just providing examples (few-shot or fine-tuning on a base model) but also designing the output constraints for the LLM to adhere strictly to my YAML schema. I need to ensure the LLM handles linguistic ambiguity, synonyms, and the nuances of Hindi. The Automator must also manage conversation history to resolve anaphora ("call her back") and maintain context for multi-turn interactions. In building the Automator's scaffolding, I prompted AI for things like "Kotlin function to make an HTTP POST request with a JSON body and handle the response asynchronously" for LLM communication, or "Android code to capture microphone input and send it to a speech-to-text API." These well-defined, smaller problems were perfect for AI to generate functional Kotlin code, which I then integrated into the larger NLP pipeline logic I was architecting.
Example prompts I gave to AI:
- "Provide a complete Kotlin function using Ktor client for Android to make an asynchronous POST request to a specified URL. The function should take the URL, a
Map<String, String>
for headers, and a serializable Kotlin data class instance as the JSON body. It should handle potential exceptions (e.g., network errors, timeouts) and return a Result<ResponseType> whereResponseType
is another serializable data class. Include setting a connection timeout of 15 seconds and a request timeout of 30 seconds. Show how to use Kotlinx Serialization for the request and response bodies." - "Android Kotlin example: Implement a class that uses
android.speech.SpeechRecognizer
. It should have methods tostartListening(callback: (String) -> Unit)
andstopListening()
. The callback should be invoked with the recognized text. Ensure necessary permissions (RECORD_AUDIO, INTERNET) are mentioned and provide basic error handling for recognizer errors (e.g.,onError
listener)." - "In Jetpack Compose, create a reusable Composable function for a chat message input bar. It should include a
TextField
for typing the message, and anIconButton
with a 'Send' icon. The 'Send' button should be enabled only when theTextField
is not empty. The function should take a lambda(String) -> Unit
as a parameter, which is invoked when the send button is clicked, passing the current text. The TextField should clear after sending."
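As an example of the kind of "brick" the first prompt produces, here's a hedged sketch of an asynchronous POST helper using the Ktor 2.x client with Kotlinx Serialization. The request/response data classes, the endpoint, and the choice of the CIO engine are placeholders I've picked for illustration, not the app's actual API contract:

```kotlin
import io.ktor.client.*
import io.ktor.client.call.*
import io.ktor.client.engine.cio.*
import io.ktor.client.plugins.*
import io.ktor.client.plugins.contentnegotiation.*
import io.ktor.client.request.*
import io.ktor.http.*
import io.ktor.serialization.kotlinx.json.*
import kotlinx.serialization.Serializable

// Placeholder request/response shapes for the custom LLM endpoint.
@Serializable
data class AutomatorRequest(val utterance: String, val history: List<String> = emptyList())

@Serializable
data class AutomatorResponse(val yamlPlan: String)

// Shared client with the timeouts from the prompt (15 s connect, 30 s request).
val httpClient = HttpClient(CIO) {
    install(ContentNegotiation) { json() }
    install(HttpTimeout) {
        connectTimeoutMillis = 15_000
        requestTimeoutMillis = 30_000
    }
}

// Posts the JSON body and wraps network/serialization failures in a Result.
suspend fun queryAutomatorLlm(url: String, request: AutomatorRequest): Result<AutomatorResponse> =
    runCatching {
        httpClient.post(url) {
            contentType(ContentType.Application.Json)
            setBody(request)
        }.body<AutomatorResponse>()
    }
```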
3. The "Actor": My Behind-the-Scenes Worker
The Actor component is the system's effector arm. Upon receiving a YAML payload from the Automator, its first task is YAML parsing. This requires a robust parser capable of validating the input against my defined YAML schema (lexical analysis to tokenize the input, followed by syntax analysis to build an internal representation, like an Abstract Syntax Tree or a simpler object model). It then iterates through the parsed actions, executing them sequentially. This execution involves translating each abstract YAML action into concrete calls to the Android Accessibility Service APIs. This is an intricate process, as accessibility services provide powerful but low-level hooks into the OS UI tree, requiring traversal, element identification (often based on text, content-description, or resource IDs), and programmatic event dispatching (clicks, swipes, text input). Requesting and verifying accessibility privileges involves interacting with the Android permission model, an OS-level security feature.
How it works for me: I'm essentially building an interpreter for my YAML-based DSL. Each YAML action is an opcode that the Actor decodes and executes. Error handling is paramount here: an action might fail because an expected UI element isn't found, or the screen has changed unexpectedly. The Actor needs to detect such failures, potentially log diagnostic information, and report the status (success, failure, specific error code) back to the Automator for higher-level decision-making (e.g., retry, abort, ask user for help). Efficiently querying the UI tree via accessibility APIs without causing performance lag is also a consideration. For the Actor, I leveraged AI for tasks like "Kotlin code to parse a simple YAML string into a list of action objects" (perhaps using a lightweight library suggested by AI if I didn't want to write a full parser), or more specifically, "Android Accessibility Service code to find a UI element by its text content and perform a click action." These prompts, based on my understanding of parsing algorithms and OS interaction, yielded specific, usable Kotlin code snippets for Android, forming the building blocks of my action interpreter.
Example prompts I gave to AI:
- "Write a Kotlin function that takes a YAML string as input. The YAML represents a list of actions, where each action is a map with an 'action_type' key and a 'parameters' key (which is another map of key-value string pairs). Parse this YAML string into a
List<ActionStep>
whereActionStep
is a data class withactionType: String
andparameters: Map<String, String>
. Implement this using only standard Kotlin string manipulation and collection functions, without relying on external YAML parsing libraries. Handle basic malformed entries gracefully by skipping them and perhaps logging a warning." - "Within an Android
AccessibilityService
class, provide a detailed Kotlin functionperformTapOnText(targetText: String): Boolean
. This function should get the root node in the active window, recursively search for anAccessibilityNodeInfo
whose text exactly matchestargetText
and is clickable. If found, it should performAccessibilityNodeInfo.ACTION_CLICK
and return true. If not found or not clickable, return false. Include necessary null checks and error logging." - "Demonstrate how to perform a programmatic swipe gesture in an Android AccessibilityService using
dispatchGesture
. The function should take start coordinates (x1, y1), end coordinates (x2, y2), and a duration in milliseconds. Show the creation ofPath
andGestureDescription
. Ensure the gesture is dispatched correctly and handle potential exceptions." - "Provide a Kotlin utility function for an Android app that checks if a specific Accessibility Service (identified by its class name string, e.g., 'com.example.MyAccessibilityService') is currently enabled. If not enabled, it should create and return an Intent that navigates the user directly to the Accessibility settings screen where they can enable it. Handle different Android API levels if the intent action has changed."
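Here's a condensed, hedged version of what the performTapOnText prompt tends to yield. The service class name is a placeholder, and one caveat is worth knowing: findAccessibilityNodeInfosByText does substring matching, so this sketch filters for an exact match afterwards and walks up to the nearest clickable ancestor, because the text often sits on a non-clickable child view:

```kotlin
import android.accessibilityservice.AccessibilityService
import android.util.Log
import android.view.accessibility.AccessibilityEvent
import android.view.accessibility.AccessibilityNodeInfo

// Sketch of an Actor-side accessibility service; only the tap helper is shown.
class ActorAccessibilityService : AccessibilityService() {

    override fun onAccessibilityEvent(event: AccessibilityEvent?) { /* not needed for dispatching actions */ }
    override fun onInterrupt() { /* no-op */ }

    // Finds a node whose text exactly matches targetText and clicks it (or its nearest clickable ancestor).
    fun performTapOnText(targetText: String): Boolean {
        val root = rootInActiveWindow ?: return false
        val candidates = root.findAccessibilityNodeInfosByText(targetText) ?: return false
        for (node in candidates) {
            if (node.text?.toString() != targetText) continue   // the API matches substrings; enforce exact match
            var clickable: AccessibilityNodeInfo? = node
            while (clickable != null && !clickable.isClickable) clickable = clickable.parent
            if (clickable?.performAction(AccessibilityNodeInfo.ACTION_CLICK) == true) return true
        }
        Log.w("ActorService", "No clickable node found for text: $targetText")
        return false
    }
}
```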
4. The YAML Definition File: My Custom "Language"
This file is more than a simple dictionary; it formally defines a Domain-Specific Language (DSL) tailored for phone automation via accessibility services. It specifies the grammar of this language – the valid action types, their required and optional parameters, and the data types for those parameters. For example, action.pull_down_notification_bar is a zero-argument command, while action.type_text would require parameters like text_content: string and perhaps an optional target_element_query: string. The design of this DSL involves balancing expressiveness (can it represent all the actions I need?) with simplicity (is it easy for the LLM to generate and for the Actor to parse reliably and unambiguously?).
How it works for me: This DSL design is a foundational architectural decision. It dictates the capabilities of the entire system. Changes here ripple through LLM training (to teach it new "vocabulary" or "syntax") and the Actor's implementation (to add new action handlers). I need to think about versioning this DSL if I plan to expand functionality significantly over time, ensuring backward compatibility or clear migration paths. The clarity and precision of this DSL are crucial for minimizing misinterpretations by either the LLM or the Actor. While AI didn't design the DSL's semantics (that was my architectural task based on app requirements), once I defined the structure, I could ask AI to, for instance, "generate Kotlin data classes to represent these YAML action types: open_app(appName: String), type_text(text: String, elementId: String)." This helped create the object model for my DSL quickly and accurately within the Kotlin environment, forming the data structures the Actor would work with.
Example prompts I gave to AI:
- "Generate a Kotlin sealed class hierarchy to represent various UI automation actions for my DSL. The base sealed class should be
UiAction
. Include the following concrete action classes:OpenApplication(val applicationName: String)
,TypeTextAction(val textToType: String, val targetElementDescription: String?)
where targetElementDescription is optional,TapElementAction(val targetElementDescription: String, val tapType: TapType = TapType.SINGLE)
where TapType is an enum (SINGLE, LONG_PRESS), andSwipeGesture(val startXPercent: Float, val startYPercent: Float, val endXPercent: Float, val endYPercent: Float, val durationMs: Int)
using screen percentage for coordinates. Ensure all properties are vals and classes are data classes where appropriate." - "Given a Kotlin data class
RawActionStep(val actionType: String, val parameters: Map<String, Any>)
, write a factory functioncreateUiAction(rawStep: RawActionStep): UiAction?
that attempts to convertrawStep
into one of the concreteUiAction
types defined previously. Perform type checking and casting for parameters (e.g., ensure 'durationMs' is an Int for SwipeGesture). Return null if conversion fails or parameters are invalid/missing. For example, ifactionType
is 'open_app', it expects 'applicationName' in parameters." - "How would I structure a YAML file that represents a sequence of these
UiAction
types? Provide an example YAML snippet for opening 'Settings', then tapping an element described as 'Network & internet', then typing 'WiFi password' into an element described as 'Search settings'."
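And here's roughly what the resulting object model looks like in Kotlin, i.e., the data structures the Actor iterates over. This is a sketch following the prompts above, not my final schema, and only two action types are wired through the factory for brevity:

```kotlin
// Sketch of the DSL's object model; the action vocabulary follows the prompts, not the final schema.
enum class TapType { SINGLE, LONG_PRESS }

sealed class UiAction {
    data class OpenApplication(val applicationName: String) : UiAction()
    data class TypeTextAction(val textToType: String, val targetElementDescription: String? = null) : UiAction()
    data class TapElementAction(val targetElementDescription: String, val tapType: TapType = TapType.SINGLE) : UiAction()
    data class SwipeGesture(
        val startXPercent: Float,
        val startYPercent: Float,
        val endXPercent: Float,
        val endYPercent: Float,
        val durationMs: Int
    ) : UiAction()
}

// Loosely typed step as it comes out of the YAML parser.
data class RawActionStep(val actionType: String, val parameters: Map<String, Any>)

// Factory: validate and convert a raw step into a typed action, or return null if it doesn't fit the schema.
fun createUiAction(rawStep: RawActionStep): UiAction? = when (rawStep.actionType) {
    "open_app" -> (rawStep.parameters["applicationName"] as? String)
        ?.let { UiAction.OpenApplication(it) }
    "tap_element" -> (rawStep.parameters["targetElementDescription"] as? String)
        ?.let { UiAction.TapElementAction(it) }
    else -> null // unknown action type: let the Actor report it back to the Automator
}
```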
5. The Main View: My Control Center and User Interface
The main UI layer doesn't just present information; it's an event-driven system reacting to state changes within the underlying FSM. When the Automator transitions to LISTENING, the UI might update an icon or text prompt. When the Actor begins executing a complex sequence, the UI could display a progress indicator or the current step. This requires careful binding of UI elements to the FSM's state variables. The logging mechanisms I'll build for debugging (e.g., displaying the generated YAML, Actor execution traces) involve structured data presentation and potentially temporary data persistence. The manual override capability in the menu bar is a critical aspect of testability and fault isolation, allowing me to inject test YAML directly into the Actor or observe the Automator's raw LLM output. Architecturally, even for this PoC, I'll likely use a pattern like MVVM (Model-View-ViewModel) or a simplified version to keep UI logic separate from the business logic of the Automator/Actor FSM, promoting modularity and maintainability.
How it works for me: This view is my window into the app's complex internal operations. I need to ensure the UI provides clear feedback not just to the end-user, but also to myself during development for diagnostics. Implementing the UI updates in response to asynchronous events from the Automator/Actor FSM without creating UI freezes or race conditions requires careful use of threading or asynchronous programming constructs appropriate for Android (e.g., Coroutines, LiveData). For the UI, my CS understanding of event handling and UI patterns guided my requests to AI. For example: "Android Jetpack Compose code for a simple chat bubble UI," "Kotlin function to update a Text composable based on a LiveData stream," or "XML layout for an Android options menu with three items." AI excelled at generating these UI "bricks" according to standard Android practices, which I then wired into the FSM's state management logic to reflect the application's current operations.
Example prompts I gave to AI:
- "Create a Jetpack Compose screen for a chat interface. It should have a
TopAppBar
with the title 'Accessibility Automator'. Below that, aLazyColumn
should display a list ofMessage
data class instances (Message(text: String, isUserMessage: Boolean)
), styled differently for user vs. AI messages (e.g., alignment, background color). At the bottom, include the message input bar Composable (from a previous prompt). TheLazyColumn
should automatically scroll to the bottom when a new message is added. Show how to manage the list of messages in aViewModel
withMutableStateList
." - "How to implement an Android
Activity
options menu (usingonCreateOptionsMenu
andonOptionsItemSelected
) with three items: 'Manual YAML Input' (opens a dialog to paste YAML), 'View Actor Execution Log' (navigates to a new screen/fragment showing a list of log strings), and 'App Settings' (navigates to a settings screen). Provide the XML for the menu resource and the Kotlin code for handling selections." - "Demonstrate using Kotlin Coroutines and
StateFlow
in an AndroidViewModel
to expose the currentAppState
(from a previous FSM prompt). Then, show how a Jetpack Compose Composable function can collect thisStateFlow
and conditionally display different UI elements (e.g., a 'Listening...' Text, a 'Processing...' spinner, or an error message Text) based on the currentAppState
value." - "Provide a Jetpack Compose Composable function
StatusIndicator(appState: AppState)
that displays different icons and/or text messages based on the type ofAppState
. For example, a microphone icon forAwaitingUserInput
, a spinning progress indicator forProcessingAutomatorQuery
orExecutingActorPlan
, and a red error icon with the message forErrorOccurred
."
And guess what? We have a largely working Android client for this whole thing, though I am still working through some edge cases: a custom-defined YAML syntax, an Automator that talks to an LLM instance through POST requests, an Actor that takes my YAML syntax, gets accessibility permissions, and executes those actions, and an FSM architecture tying it all together. The core of this project, before it gets released, now lies solely in how well I train the LLM and in introducing guardrails so that unintended actions are never executed.
Know your stuff, and use AI as the best search engine + brick forge in existence. It excels at syntax; you need to excel at the semantics.
Coming Soon
----
Cover image: Olympia Drive, Amherst, MA | 11PM on a random fall evening in 2023
Footnotes:
1: For full transparency, here's the message I sent a week ago when I stumbled across TestDriver in a Reddit ad and tried it out: