STATSML 500: Generative SQL with Microsoft GitHub Copilot, Visual Studio Code and Data Distiller
Streamline your development workflow with Visual Studio Code and GitHub Copilot—fast, lightweight, and customizable for all your coding needs, from generating SQL queries to managing projects.
Overview
Visual Studio Code (VS Code) is one of the most popular, lightweight, and versatile code editors available today. Developed by Microsoft, it’s designed to work with a wide range of programming languages and tools, offering support for extensions that make it even more powerful. When paired with SQL, VS Code becomes a productive environment for querying databases, managing data, and writing efficient SQL scripts.
Why Visual Studio Code for SQL?
Download JSON: DBVisualizer and many other tools charge for downloading JSON data as part of their premium offerings, which can be a significant limitation, especially when working with Data Distiller. VS Code, however, provides that flexibility without such restrictions.
Lightweight and Fast: VS Code is designed to be lightweight, allowing you to run it efficiently on various systems, including machines with limited resources. Despite its small footprint, it packs powerful features.
Run Python Notebooks: Visual Studio Code supports Jupyter-style Python notebooks through the Jupyter extension. This extension allows you to create, edit, and run Jupyter notebooks (.ipynb files) directly within the VS Code interface, offering an experience similar to JupyterLab.
IntelliSense for SQL: With SQL extensions installed, you get features like IntelliSense (auto-completion, syntax highlighting, and error checking), making writing complex SQL queries faster and less error-prone.
Integration with GitHub Copilot: With GitHub Copilot integrated into VS Code, SQL developers can benefit from AI-powered code suggestions while writing queries. Copilot can assist by auto-completing SQL statements, suggesting optimized queries, or even generating complex SQL code based on comments or partial queries, dramatically speeding up query development and reducing errors. This is particularly useful for beginners or when working with complex SQL joins, aggregations, and subqueries; a short illustration follows this list.
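To make this concrete, here is the kind of completion you can expect: you type the first couple of lines of a join and pause, and Copilot proposes the rest. The table and column names below are illustrative assumptions, not part of this guide's dataset:

-- You type the first two lines and pause:
SELECT o.order_id, c.customer_name
FROM orders o
-- Copilot will often propose completing the join along these lines:
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE '2024-01-01'
ORDER BY o.order_id;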
Enhance Secure Access to Visual Studio Code with IP Whitelisting in Data Distiller
Since this client is installed on your local machine, if your administrator has concerns about accessing data in AEP from a different machine, you should request that they enable the Data Distiller whitelisting feature. This ensures that only IP addresses from the corporate network are allowed access. Even if someone attempts to spoof an IP and send a query, the responses would still be restricted to the corporate network, preventing unauthorized access.
Download Visual Studio Code
You should download the version that matches your operating system.
Prerequisites
This is not required, but if you have not done so already, download the JSON data.
Connect to Data Distiller
Launch the Visual Studio Code editor
Click on Extensions to the left:
First you need to install the Postgres driver. To enable the Microsoft Visual Studio Code editor to communicate with the Data Distiller engine, you need a PostgreSQL driver. This driver allows the editor to interact with Data Distiller using PostgreSQL's protocol and syntax, which Data Distiller supports. Search for "postgres cockroach" in the Extensions marketplace; the extension is also listed on the VS Code Marketplace website.
It will also automatically install an extension called SQLTools, which gives us the tools to connect to databases and to write and manage queries. It, too, is listed on the VS Code Marketplace website.
Keep in mind that Data Distiller adheres to PostgreSQL syntax, a widely used standard due to its robustness, compatibility, and support for advanced features, making it a popular choice for many database applications.
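For instance, familiar PostgreSQL constructs such as GROUP BY, ORDER BY, and LIMIT behave as you would expect. Here is a small sketch against the luma_web_data table used later in this guide (the column name is an assumption for illustration):

SELECT web_page_name, COUNT(*) AS page_views
FROM luma_web_data
GROUP BY web_page_name
ORDER BY page_views DESC
LIMIT 5;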
Click on the SQLTools icon in the UI, then click Add Connection and choose PostgreSQL
You will need the Data Distiller credential parameters for the above, which you can find by logging into the AEP UI and navigating to Queries->Credentials. You will need to copy the following:
The PSQL command box will look something like this:
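For reference, that command takes the following shape, shown here with the placeholder values used in this example (your host, database name, and user will differ):

psql "sslmode=require host=demo.platform-query.adobe.io port=80 dbname=prod:all user=Demo@AdobeOrg password=<your-password>"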
host: The host is the endpoint where you send the requests, representing the server or service (in this case, Data Distiller) that you're trying to connect to. It's actually a URL/IP address.
dbname: The dbname refers to the specific database within Data Distiller that you're connecting to. Each database holds different sets of data, and the dbname helps you point to the correct one.
port: The port specifies the communication channel used to connect to the host. For Data Distiller, this can be port 80 (typically used for HTTP) or 5432 (the default port for PostgreSQL). Depending on your environment or security setup, one of these ports will be used to facilitate the connection.
sslmode: This parameter ensures that the connection between your machine and the Data Distiller database is secured using SSL (Secure Sockets Layer), protecting the data in transit from unauthorized access.
Fill out the options on the connection screen with the following parameters:
Connection name: <any name you choose>
Connection group: Leave blank for now as this is for a logical group of multiple connections.
Connect using: Server and port
Server address: Copy host from Data Distiller Credentials UI shown above i.e. demo.platform-query.adobe.io in the example above
Port: You can use 5432 or 80.
Database: Copy dbname or Database from Data Distiller Credentials UI shown above i.e. prod:all in the example above
Username: Copy user or Username from Data Distiller Credentials UI shown above i.e. Demo@AdobeOrg in the example above
Use password: Select Save as plaintext in settings, though other options are available depending on your intended use.
Password: Copy the password from the Data Distiller Credentials UI as shown above.
SSL: Change from Disabled to Enabled
If you scroll down, you will see an option Show Records Default Limit. By default, it shows 50 records, but you can raise it as high as the memory on your machine allows. I chose 10,000.
Scroll to the bottom and click on Connect Now
Remember that Data Distiller passwords expire in 24 hours. Frequent password expiration reduces the likelihood of unauthorized access. Even if a password is compromised, the short validity window minimizes the risk by ensuring that the credentials cannot be used for long. Many industries and organizations follow strict regulatory requirements (e.g., HIPAA) that enforce regular password changes to protect sensitive data. Short password expiration cycles help meet these compliance standards.
Test Queries
Navigate to the Connections pane and click on the icon for Create New SQL File
Type the following queries. Make sure you separate each query with a semicolon; the editor will execute all the queries and show the results in multiple tabs. You can also select a query, right-click, and choose the Run selected query option.
SHOW TABLES;
SELECT * FROM luma_web_data;
Let us now export this result as JSON. Click on Export and choose Save results as JSON
Once you save the file to your desktop or any other location, you will see the JSON file appear in Preview:
GitHub Copilot
GitHub has been part of Microsoft since its acquisition in 2018. While GitHub operates as a subsidiary and maintains its own identity and platform, the acquisition has led to closer integration between GitHub and other Microsoft tools and services, such as Azure and Visual Studio Code. This integration is particularly evident in the development of tools like GitHub Copilot, which leverages OpenAI's technology (another Microsoft partner) and works seamlessly with Microsoft products like Visual Studio Code. Through the acquisition, Microsoft has been able to strengthen its offerings for developers while supporting GitHub’s open-source community.
GitHub Copilot, an AI-powered coding assistant, has revolutionized the way developers write code by providing intelligent code suggestions directly within the development environment. When integrated with Visual Studio Code, Copilot becomes an invaluable tool for generating code snippets, speeding up development workflows, and improving productivity.
For SQL developers, GitHub Copilot offers the ability to generate SQL queries based on the context of the project, making it easier to write and refine complex queries with just a few keystrokes.
If you're someone who isn't interested in understanding the SQL being generated, there are clear limitations to what coding assistants—or any assistants—can do for you. Without grasping the nuances, you'll likely struggle with suggestions that may be incorrect or overly complex, making the process more difficult. This is why conversations about developer productivity assume that developers have the expertise to use these assistants to automate parts of the code they already understand. These tools are not a substitute for a skilled SQL developer.
The same principle applies to any AI assistants or task agents. If you don't understand what's happening and why, there's a high risk that a simple mistake could lead to system-wide issues. AI assistants can be very convincing, but they can also provide incorrect answers without you realizing it. This highlights a broader challenge in communication with AI—if you're not fact-checking or fully understanding the context, you could find yourself in serious trouble.
When working with coding assistants like GitHub Copilot, it's crucial to manage expectations, especially when dealing with complex coding tasks. While Copilot excels at speeding up coding workflows and providing useful suggestions for standard operations, it encounters limitations in handling deeply nested data structures, such as arrays and maps. These structures, which are common in Data Distiller queries and data models, often require a nuanced understanding of context and relationships that current language models struggle to fully grasp. In our experiments with SQL, especially in scenarios involving complex subqueries, joins, and deeply nested data, we've found that Copilot may struggle to generate accurate or optimal code.
Retrieval-Augmented Generation (RAG) is a technique that can help alleviate some of these limitations. In a RAG-based system, the model augments its generation by retrieving relevant information from a knowledge base or external documents. This approach can improve accuracy in tasks like querying complex datasets because it combines generated content with factual information retrieval, making the model more context-aware.
However, there are trade-offs with using RAG-based approaches:
Restrictiveness: While RAG can help improve the precision of the generated SQL queries by providing more context, it also makes the query generation process more restrictive. The model's output becomes heavily dependent on the retrieved information, meaning it is less likely to generate creative or flexible queries. This can be beneficial when accuracy is critical, but it can also limit the flexibility needed for more exploratory or dynamic queries.
Dependency on Knowledge Base: RAG systems rely on a well-curated knowledge base or set of documents for retrieval. If the knowledge base is incomplete, outdated, or lacks detail on specific database schemas, the quality of the suggestions can be limited. This can still result in gaps when dealing with custom or less-documented data models.
Performance: RAG-based models require additional steps to retrieve and process information, which increases the latency of generating suggestions unless you invest in more resources. This might not be an issue for smaller tasks, but for larger, complex queries, it could impact the overall efficiency and cost.
Copilot's training data, sourced from existing code repositories, and the context gathered by its large language model may introduce biases and errors that can be reflected in its suggestions. Additionally, Copilot Chat may favor particular SQL coding styles, potentially leading to suboptimal or incomplete code recommendations.
How GitHub Copilot Works in Visual Studio Code
When you install the GitHub Copilot extension in Visual Studio Code, it seamlessly integrates into your development environment, offering code suggestions as you type. Here's how it works:
Context-Aware Code Generation: Copilot analyzes the code you're writing, the comments you provide, and the overall project context. For SQL, this means that it can generate queries based on your data structures, table names, and existing code patterns. For instance, if you begin writing a query or even a comment like -- Fetch top 10 sales records, Copilot can suggest an appropriate SQL statement to achieve that goal (see the sketch after this list).
Query Auto-Completion: Copilot helps autocomplete SQL syntax, suggesting relevant SQL commands, such as SELECT, JOIN, WHERE, and GROUP BY, based on the schema you're working with. This reduces the need to constantly reference documentation or remember complex syntax.
Adaptive Learning: Over time, Copilot adapts to your coding style and project context, improving the relevance and accuracy of its suggestions. Whether you're managing a simple query or a complex data operation, Copilot aims to reduce errors and save you time by offering optimized code solutions.
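Starting from the comment mentioned above, Copilot will typically propose a complete statement along these lines. The sales table and its columns are assumptions for illustration, not part of this guide's dataset:

-- Fetch top 10 sales records
SELECT order_id, customer_id, total_amount
FROM sales
ORDER BY total_amount DESC
LIMIT 10;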
GitHub Copilot isn't free, but it's highly affordable considering the significant time it can save during development. Conveniently, you can get a one-month free trial to see if it meets your needs.
GitHub Copilot takes user privacy and security seriously. Here's how your data is handled when using the tool, based on the information available:
Contextual Suggestions Only: Copilot uses the code in your current file and project to make context-aware suggestions. However, it does not access or share the broader content of your private repositories or any code outside the active session to provide its suggestions.
No Use of Private Code for Training: While GitHub Copilot was trained on public code repositories, it does not use your private repository data for training its underlying model. Private code remains private, and Copilot only leverages the context within your active session for generating recommendations.
Data Collection for Feedback: Copilot collects limited data, such as whether you accept or reject a suggestion, to help improve its performance. However, it does not collect or store the specific code you're working on unless you explicitly share it through feedback mechanisms.
Install the GitHub Copilot Extension
Go back to Extensions and search for GitHub Copilot. Install it.
After the extension is installed, it will prompt you to sign in to verify that you have the right license.
It will also ask you to link GitHub Copilot to your Visual Studio Code environment.
Go back to the editor and you will see the GitHub icon in the editor itself. Start typing what you want it to do. In my case, I am trying to get to the Profile snapshot data that contains valuable information about the Profile Store. Also, observe the chat icon that has now appeared in the left-hand panel.
Open up the chat and start writing queries. You can see that I copy and paste the values of maps and arrays from the results to give additional context:
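Here is a sketch of the kind of query such a session can produce, assuming a Profile snapshot table with a nested identity map (the table and field names are assumptions, not taken from the actual walkthrough):

-- Hypothetical sketch: Data Distiller supports dot notation for structs and key lookups for maps
SELECT
  identityMap['email'][0].id AS email_id,  -- first identity under the 'email' namespace
  person.name.lastName AS last_name        -- dot notation into a nested struct
FROM profile_snapshot_dataset
LIMIT 10;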
You can also right-click on a query and have Copilot explain it, fix syntax issues, generate docs, or write tests.
Now copy and paste the following cryptic Data Distiller error (every engine has a set of cryptic errors :))
ErrorCode: 08P01 queryId: c4556695-e655-49ac-be22-bd8adf6f60b8 Unknown error encountered. Reason: [Table not provisioned for dataset: 64109683dca32d1bd12960a9 at abfss://gen1@datalakel0mng6ia7umvu12d.dfs.core.windows.net/platform/64109683dca32d1bd12960a9. To prevent this error either add data to the dataset or specify withAutoProvisionTableEnabled(true) when calling this operation]
The error occurs because there are no rows of data in the dataset. This dataset was not generated in Data Distiller but was created using a different service within Adobe Experience Platform.
GitHub Copilot has no knowledge of AEP or its documentation, so it tries to get creative. The heart of the answer is right at the center.
90% of the text above is incorrect, but the correct answer is still present. If you're a Data Distiller user familiar with how the system works, you'd be able to identify the mistake. This highlights one of the biggest challenges with AI assistants: they tend to benefit power users who can discern what the AI is saying correctly and where it’s making errors.