Me vs ChatGPT in SQL: Analyzing Fraud transactions in SQL
đź’»

Me vs ChatGPT in SQL: Analyzing Fraud transactions in SQL

Byline
Description
Analyzing ChatGPT SQL code against my SQL code
Tags
SQL
Projects/Hacakthons
Demo
 
notion image
I was creating an SQL snippet to identify patterns of fraud in transaction data, but a few minutes in I thought, “is this something that actually requires human knowledge or is it something that requires ChatGPT knowledge”. So i decided to compare my solution to ChatGPT’s.
Languages like SQL that are relatively low level have infinite paths to a solution for basic problems. For instance, if a group of people were to write code to replace a letter within a string, most people would converge to the same intuitive solution using Python (replace() function) but would probably have variation in their solution if using something like Assembly . Therefore, reconciling ChatGPT’s solution to a workable solution may be more intuitive with high level languages like Python, but would require more assessment with something like SQL. Think about when you look at someone else’s code. Sometimes it’s clear what they were trying to do, but often time it takes a bit of inquisition to understand why they made certain decisions in their approach. ChatGPT presents the same issues.
 
The TLDR:
đź’ˇ
Overall though, the solution ChatGPT provided could be updated with minimal lift to be accurate but my solution was much more concise, simpler and probably required lower computation. The question to answer in deciding if to use ChatGPT versus make a new solution i how complex is this problem and how long would it take to create a new solution versus assess and fix ChatGPT’s solution.
 
 
This notion is an overview of:
  1. The problem
  1. My solution
  1. ChatGPT’s solution
  1. What’s wrong with ChatGPT’s solution
 
The problem: Find Suspicious Transfers
This use case, data & definition of suspicious was influence by an ORACLE talk, however the solution is completely made by me (:
 
For the purpose of this analysis, I will define suspicious transfers as when
  1. 3 or more transfers with less than $2000 in volume occurs within 30 days
  1. A large transfer ≥1M occurs within 10 days of a small transfer
 
The pattern matching process needs to identify the following data points:
  1. Account
  1. Date of first small transfer
  1. Date of last small transfer
  1. Date of large transfer
  1. Amount of large transfer
 
My Solution (Google’s BigQuery)
WITH transactions as( SELECT *, -- Modifying date data type to allow for time period analysis UNIX_DATE(PARSE_DATE("%d-%b-%y", time_id)) AS parsed, -- defining small and large transactions as a variable (trans_amount < 2000 and `event_id` = 'Transfer') as is_small_transaction, (trans_amount >= 1000000000 and `event_id` = 'Transfer') as is_lg_transaction FROM `transaction_data.transactions` ), tx2 as( SELECT * , -- counting the large transactions in the last 30 days countif(is_small_transaction) over (partition by user_id ORDER BY Parsed Range between 30 PRECEDING AND 1 PRECEDING) AS num_sm_tx, -- counting the large transactions in the last 10 days countif(is_small_transaction) over (partition by user_id ORDER BY Parsed Range between 10 PRECEDING AND 1 PRECEDING) AS num_sm_tx_10 from transactions order by parsed ) -- checking the criteria for Fraus select * from tx2 WHERE (((is_lg_transaction and (num_sm_tx_10 >= 1)) or (num_sm_tx >= 3) )and (event_id = 'Transfer'))
 
ChatGPT’s Solution (Google’s BigQuery)
 
Prompt:
“Write a SQL query to identify fraud transactions. Fraud transactions are defined as 1) 3 or more small (<2K) money transfers within 30 days or 2) a large transfer (>=1M) within 10 days of last small transfer, in both these cases small transfers do not go to the same account twice in a row. Columns available are transaction amount, date of transaction, who sent the transaction, type of transaction (it can be transfer or withdrawal) and who received the transfer.”
SELECT t1.*, CASE WHEN t1.total_small_transfers >= 3 THEN 'Fraud: 3 or more small transfers in 30 days' WHEN t1.large_transfer_within_10_days = 1 THEN 'Fraud: Large transfer within 10 days of small transfer' ELSE 'Not Fraud' END AS fraud_status FROM ( SELECT transaction_amount, transaction_date, sender, transaction_type, receiver, COUNT(DISTINCT IF(transaction_amount < 2000, receiver, NULL)) AS small_transfers_count, COUNT(DISTINCT IF(transaction_amount >= 1000000, receiver, NULL)) AS large_transfers_count, MAX(IF(transaction_amount < 2000, transaction_date, NULL)) AS max_small_transfer_date, MAX(IF(transaction_amount >= 1000000, transaction_date, NULL)) AS max_large_transfer_date, DATE_DIFF(MAX(IF(transaction_amount < 2000, transaction_date, NULL)), MAX(IF(transaction_amount >= 1000000, transaction_date, NULL)), DAY) <= 10 AS large_transfer_within_10_days, COUNT(DISTINCT IF(transaction_amount < 2000, CONCAT(receiver, '-', sender), NULL)) AS small_transfers_to_different_accounts_count, COUNT(DISTINCT IF(transaction_amount < 2000, CONCAT(sender, '-', receiver), NULL)) AS small_transfers_from_different_accounts_count, COUNT(DISTINCT IF(transaction_amount < 2000, CONCAT(receiver, '-', sender), NULL)) + COUNT(DISTINCT IF(transaction_amount >= 1000000, receiver, NULL)) AS total_small_transfers FROM `mydataset.mytable` AS transactions GROUP BY transaction_amount, transaction_date, sender, transaction_type, receiver ) AS t1 WHERE (t1.total_small_transfers >= 3 OR t1.large_transfer_within_10_days = 1) AND t1.small_transfers_from_different_accounts_count > 1
 
Issues with ChatGPT’s solution
1, It does not account for the 30 day window for small transfers anywhere. There’s a statement that counts the 10 day window when benchmarking against large transfers but this logic is missing for looking solely at small transfers in the past 30 days
DATE_DIFF(MAX(IF(transaction_amount < 2000, transaction_date, NULL)), MAX(IF(transaction_amount >= 1000000, transaction_date, NULL)), DAY) <= 10 AS large_transfer_within_10_days,
2, it’s treating t1.large_transfer_within_10_days as an INT even though the way it is defined makes it a BOOL in the above snippet
WHERE (t1.total_small_transfers >= 3 OR t1.large_transfer_within_10_days = 1) AND t1.small_transfers_from_different_accounts_count > 1
 
3, when assessing for if the transaction is sent to the same account twice in a row, it just checks if an account has been sent to twice as opposed to specifically, twice in a row.
4, total_small_transfers is summing the count of small transfers AND large transfers
COUNT(DISTINCT IF(transaction_amount < 2000, CONCAT(receiver, '-', sender), NULL)) AS small_transfers_to_different_accounts_count, COUNT(DISTINCT IF(transaction_amount < 2000, CONCAT(sender, '-', receiver), NULL)) AS small_transfers_from_different_accounts_count,
 
5, It creates a lot of calculations it doesn’t need. In theory these calculations would be used as intermediate steps to build out the final query but instead chat gpt just makes several iterations which attempts to calculate each requirement independently (attempt because the individual calculations themselves are incorrect in many instances).
 
đź’ˇ
To restate my TLDR from above, I found that ChatGPT had a more complex approach and had a decent amount of inaccuracies. However, I did find it’s solution a good starting point and if I did not have a strong knowledge of SQL this would give me ideas of how to approach the problem.