How can TensorFlow and Python be used to split Unicode strings and specify byte offsets?


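A Unicode string can be split into characters with the 'unicode_split' method, and the byte offset of each character can be obtained with the 'unicode_decode_with_offsets' method. Both methods live in the 'strings' module of 'tensorflow' (tf.strings).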

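First, Unicode strings are represented in Python and manipulated with the Unicode equivalents of the standard string operations. With the help of these equivalents, Unicode strings can also be separated into tokens based on script detection.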

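We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires no configuration, and provides free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.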

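import tensorflow as tf

# Assumption: `thanks` was used but never defined in this snippet; this value follows the TensorFlow Unicode tutorial
thanks = u'Thanks 😊'.encode('UTF-8')
print("Splitting the Unicode string")
# The result is shown only if it is printed or is the last expression in a notebook cell
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')
print("Printing the byte offsets of the characters")
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

Splitting the Unicode string
Printing the byte offsets of the characters
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882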
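The offsets 0, 4, and 8 follow from each of these emoji taking four bytes in UTF-8. The sketch below is a plain-Python cross-check of that arithmetic, not the TensorFlow implementation; the helper name decode_with_offsets is hypothetical and was chosen only for this illustration.

```python
def decode_with_offsets(text):
    """Return (codepoints, byte_offsets) for the UTF-8 encoding of `text`.

    Hypothetical helper for illustration; it mimics the shape of the result
    of tf.strings.unicode_decode_with_offsets for a single string.
    """
    codepoints, offsets = [], []
    position = 0  # running byte position within the UTF-8 encoding
    for char in text:
        codepoints.append(ord(char))
        offsets.append(position)
        # Advance by the number of bytes this character occupies in UTF-8
        position += len(char.encode("utf-8"))
    return codepoints, offsets


codepoints, offsets = decode_with_offsets("🎈🎉🎊")
for codepoint, offset in zip(codepoints, offsets):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))
```

Running this on the same three emoji prints the same three lines as the TensorFlow output above, which is a quick way to sanity-check offsets on small inputs.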

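Explanation

The tf.strings.unicode_split operation splits a Unicode string into substrings of individual characters.
The character tensor generated by tf.strings.unicode_decode often needs to be aligned with the original string.
To do that, the byte offset at which each character begins must be known.
The tf.strings.unicode_decode_with_offsets method is similar to unicode_decode, except that it also returns a second tensor containing the start offset of each character.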
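The alignment described above can be sketched as follows. This is a minimal illustration, assuming TensorFlow 2.x is installed; the sample string and the variable names are assumptions made for this example and are not part of the original article. It checks that slicing the original bytes at each returned offset recovers the corresponding character produced by unicode_split.

```python
import tensorflow as tf

# Sample input chosen for illustration (an assumption, not from the article's output)
text = "Thanks 😊".encode("UTF-8")

# One substring per character
chars = tf.strings.unicode_split(text, "UTF-8")
# Code points plus the byte offset at which each character starts
codepoints, offsets = tf.strings.unicode_decode_with_offsets(text, "UTF-8")
# Byte length of each character substring
byte_lengths = tf.strings.length(chars, unit="BYTE")

for char, offset, length in zip(chars.numpy(), offsets.numpy(), byte_lengths.numpy()):
    # Slicing the original bytes at [offset, offset + length) recovers the character
    assert text[offset:offset + length] == char
    print("offset {}, {} byte(s): {!r}".format(offset, length, char))
```

The ASCII characters each occupy one byte, while the emoji occupies four, so character positions and byte offsets diverge as soon as a multi-byte character appears. That gap is why the offsets tensor is needed for alignment.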
