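How do you write a generic procedure to find and delete duplicates for any table and columns in Oracle?
Problem statement:
You need to write a generic procedure that finds and deletes duplicate rows for any table and any set of columns in Oracle.
Solution:
We can use Oracle's internal ROWID value to uniquely identify each row in a table, combined with the analytic (OLAP) function ROW_NUMBER and its PARTITION BY clause. The general form of the statement is shown below.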
DELETE FROM table_name WHERE rowid IN ( ... query here ... )
To demonstrate the approach, we will first create some sample data.
Example
-- table with tennis player rankings
-- (on the first run DROP TABLE fails with ORA-00942 because the table does not exist yet; that error can be ignored)
DROP TABLE atp_stats;

CREATE TABLE atp_stats (
  player_rank NUMBER        NOT NULL,
  player_name VARCHAR2(100) NOT NULL,
  time_range  TIMESTAMP(6)
);

-- sample records
INSERT INTO atp_stats VALUES (1, 'ROGER FEDERER',  CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (2, 'RAFAEL NADAL',   CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (3, 'NOVAK DJOKOVIC', CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (4, 'ANDY MURRAY',    CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (1, 'ROGER FEDERER',  CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (2, 'RAFAEL NADAL',   CURRENT_TIMESTAMP);
INSERT INTO atp_stats VALUES (3, 'NOVAK DJOKOVIC', CURRENT_TIMESTAMP);
COMMIT;
Look at the data we just created.
Example
SELECT player_rank, player_name FROM atp_stats ORDER BY player_name;
player_rank | player_name |
4 | ANDY MURRAY |
3 | NOVAK DJOKOVIC |
3 | NOVAK DJOKOVIC |
2 | RAFAEL NADAL |
2 | RAFAEL NADAL |
1 | ROGER FEDERER |
1 | ROGER FEDERER |
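Before deleting anything, it can help to confirm which rows are duplicated. The query below is a minimal sketch, not part of the original walkthrough, that groups on the two columns treated as the duplicate key here (player_rank and player_name) and keeps only the groups with more than one row. The alias copies is an illustrative name.

```sql
-- Count how many copies exist for each (player_rank, player_name) pair
-- and list only the pairs that occur more than once.
SELECT player_rank,
       player_name,
       COUNT(*) AS copies
FROM   atp_stats
GROUP  BY player_rank, player_name
HAVING COUNT(*) > 1
ORDER  BY player_rank;
```

With the sample data above, this should list ranks 1, 2, and 3 with two copies each, while ANDY MURRAY does not appear.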
So we have inserted 3 duplicate rows that we want to delete. Before writing the DELETE statement, let us look at the inner query that uses ROWID.
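Example
SELECT rid
FROM (
  SELECT player_rank,
         player_name,
         rowid AS rid,  -- alias ROWID so the outer query can select it like an ordinary column
         row_number() OVER (PARTITION BY player_rank, player_name
                            ORDER BY player_rank, player_name) AS rnk
  FROM atp_stats
)
WHERE rnk > 1;
I deliberately included the player_rank and player_name columns in the innermost subquery to make the logic clearer. Strictly speaking, the innermost subquery can omit these two columns with the same effect. If we run only the innermost query, with the extra columns selected for clarity, we get the following results.
player_rank | player_name | rid | rnk |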
4 | ANDY MURRAY | AAAPHcAAAAAB/4TAAD | 1 |
3 | NOVAK DJOKOVIC | AAAPHcAAAAAB/4TAAC | 1 |
3 | NOVAK DJOKOVIC | AAAPHcAAAAAB/4TAAG | 2 |
2 | RAFAEL NADAL | AAAPHcAAAAAB/4TAAB | 1 |
2 | RAFAEL NADAL | AAAPHcAAAAAB/4TAAF | 2 |
1 | ROGER FEDERER | AAAPHcAAAAAB/4TAAE | 1 |
1 | ROGER FEDERER | AAAPHcAAAAAB/4TAAA | 2 |
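The inner query returns the rowid of every row in the table. ROW_NUMBER() then operates on the sets of rows defined by the PARTITION BY clause, here player_rank and player_name. That is, for each distinct combination of player_rank and player_name, ROW_NUMBER starts a running count of rows, which we have aliased as rnk. When a new combination of player_rank and player_name is encountered, the rnk counter resets to 1. Because the ORDER BY uses the same columns as the partition, which of the identical rows receives rnk = 1 is arbitrary; every row with rnk > 1 is a surplus copy.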
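The reset behaviour described above can be seen in isolation with a few in-line rows. This is a small sketch that does not touch atp_stats; the names sample_rows and grp are made up for the illustration.

```sql
-- Three in-line rows: two identical 'A' rows and one 'B' row.
WITH sample_rows AS (
  SELECT 'A' AS grp FROM dual UNION ALL
  SELECT 'A' AS grp FROM dual UNION ALL
  SELECT 'B' AS grp FROM dual
)
SELECT grp,
       ROW_NUMBER() OVER (PARTITION BY grp ORDER BY grp) AS rnk
FROM   sample_rows
ORDER  BY grp, rnk;
-- The 'A' partition is numbered 1 and 2; the 'B' partition starts again at 1.
```

Filtering this result on rnk > 1 would leave only the second 'A' row, which is the same idea the DELETE relies on.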
Now we can apply the DELETE statement to remove the duplicate rows, as shown below.
SQL: Delete duplicates
Example
DELETE FROM atp_stats
WHERE rowid IN (
  SELECT rid
  FROM (
    SELECT player_rank,
           player_name,
           rowid AS rid,  -- alias ROWID so the outer query can select it like an ordinary column
           row_number() OVER (PARTITION BY player_rank, player_name
                              ORDER BY player_rank, player_name) AS rnk
    FROM atp_stats
  )
  WHERE rnk > 1
);
Output
3 rows deleted.
player_rank | player_name |
4 | ANDY MURRAY |
3 | NOVAK DJOKOVIC |
2 | RAFAEL NADAL |
1 | ROGER FEDERER |
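A commonly used alternative, shown here only as a sketch for comparison and not as the method this article builds on, keeps the row with the smallest ROWID in each group and deletes the rest. It assumes the table is still in the duplicated state created by the sample inserts.

```sql
-- Keep the row with the smallest ROWID in each (player_rank, player_name)
-- group and delete every other row in that group.
DELETE FROM atp_stats
WHERE  rowid NOT IN (
         SELECT MIN(rowid)
         FROM   atp_stats
         GROUP  BY player_rank, player_name
       );

-- Inspect the result, then make it permanent with COMMIT or undo it with ROLLBACK.
SELECT player_rank, player_name
FROM   atp_stats
ORDER  BY player_name;
```

Both forms leave one row per combination. Which one performs better depends on the table and is not claimed here.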
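Removing duplicates is a task programmers perform frequently, so it is worth creating a reusable procedure. The following procedure accepts the name of the table to deduplicate and the names of the columns that define a duplicate.
First, we create a table type so that a variable number of grouping columns can be passed in. Then we create a procedure that builds and runs the delete dynamically.
Code: Generic procedure to delete duplicates
Example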
CREATE OR REPLACE TYPE tmp_args AS TABLE OF VARCHAR2(30);
/

CREATE OR REPLACE PROCEDURE remove_duplicates (p_table IN VARCHAR2, p_cols IN tmp_args)
AS
  l_remove_dupl CLOB;
  l_columns     VARCHAR2(4000);  -- room for a comma-separated list of several column names
  l_sql_count   NUMBER;
BEGIN
  -- get the columns and combine them as a comma-separated value
  SELECT LISTAGG(COLUMN_VALUE, ',') WITHIN GROUP (ORDER BY COLUMN_VALUE)
  INTO   l_columns
  FROM   TABLE(p_cols);

  -- generate the dynamic delete statement
  l_remove_dupl :=
       'DELETE FROM ' || p_table
    || ' WHERE rowid IN ( SELECT rid FROM ( SELECT rowid AS rid, row_number() OVER (PARTITION BY '
    || l_columns || ' ORDER BY ' || l_columns || ') AS rnk FROM ' || p_table
    || ' ) WHERE rnk > 1 )';

  EXECUTE IMMEDIATE l_remove_dupl;
  l_sql_count := SQL%ROWCOUNT;  -- number of duplicate rows removed
  DBMS_OUTPUT.PUT_LINE(l_sql_count || ' rows deleted.');  -- visible with SET SERVEROUTPUT ON
  COMMIT;
END;
/
Usage:
BEGIN
  remove_duplicates('ATP_STATS', tmp_args('PLAYER_RANK','PLAYER_NAME'));
END;
/
Output
player_rank | player_name |
4 | ANDY MURRAY |
3 | NOVAK DJOKOVIC |
2 | RAFAEL NADAL |
1 | ROGER FEDERER |
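Because remove_duplicates concatenates its arguments straight into dynamic SQL, whatever the caller passes as a table or column name is executed as written. The sketch below is one possible hardening, under the assumption that the table exists and the column names are simple, unquoted identifiers. The name remove_duplicates_checked is hypothetical, and the procedure reuses the tmp_args type defined above. It relies on Oracle's DBMS_ASSERT package to reject anything that is not a valid name, and unlike the original it leaves the COMMIT to the caller.

```sql
CREATE OR REPLACE PROCEDURE remove_duplicates_checked (
  p_table IN VARCHAR2,
  p_cols  IN tmp_args
)
AS
  l_table   VARCHAR2(200);
  l_columns VARCHAR2(4000);
  l_sql     VARCHAR2(32767);
BEGIN
  IF p_cols IS NULL OR p_cols.COUNT = 0 THEN
    RAISE_APPLICATION_ERROR(-20001, 'At least one column name is required');
  END IF;

  -- Raises an error unless p_table names an existing SQL object.
  l_table := DBMS_ASSERT.SQL_OBJECT_NAME(p_table);

  -- Raises an error unless each entry is a simple SQL name.
  FOR i IN 1 .. p_cols.COUNT LOOP
    l_columns := l_columns
              || CASE WHEN i > 1 THEN ',' END
              || DBMS_ASSERT.SIMPLE_SQL_NAME(p_cols(i));
  END LOOP;

  l_sql := 'DELETE FROM ' || l_table
        || ' WHERE rowid IN ( SELECT rid FROM ( SELECT rowid AS rid,'
        || ' row_number() OVER (PARTITION BY ' || l_columns
        || ' ORDER BY ' || l_columns || ') AS rnk FROM ' || l_table
        || ' ) WHERE rnk > 1 )';

  EXECUTE IMMEDIATE l_sql;
  DBMS_OUTPUT.PUT_LINE(SQL%ROWCOUNT || ' rows deleted.');
  -- No COMMIT here (a deliberate difference in this sketch): the caller decides.
END;
/

-- Usage: the caller commits after checking the result.
BEGIN
  remove_duplicates_checked('ATP_STATS', tmp_args('PLAYER_RANK','PLAYER_NAME'));
  COMMIT;
END;
/
```

Leaving the transaction open lets the caller inspect the table and issue ROLLBACK if the wrong columns were supplied.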