Class UnicodeTranscode
The errors attribute sets the policy for handling invalid encoding positions
in the input. With the default error-handling policy (replace), invalid
formatting is substituted in the output by the replacement_char. If the errors
policy is ignore, any invalid encoding positions in the input are skipped and
not included in the output. If it is set to strict, any invalid formatting
results in an InvalidArgument error.
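The three policies can be illustrated with the JDK's own charset decoder. This is only an analogy sketched under assumptions: the class name ErrorPolicySketch and the helper decodeUtf8 are hypothetical, the op itself relies on ICU converters rather than java.nio.charset, and the unchecked IllegalArgumentException merely stands in for the op's InvalidArgument error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ErrorPolicySketch {

    /**
     * Decodes UTF-8 bytes using a policy named like the op's errors attribute.
     * Hypothetical helper: it mirrors the policy names only, not the op itself.
     */
    static String decodeUtf8(byte[] input, String errors) {
        CodingErrorAction action;
        switch (errors) {
            case "strict":  action = CodingErrorAction.REPORT;  break;
            case "replace": action = CodingErrorAction.REPLACE; break;
            case "ignore":  action = CodingErrorAction.IGNORE;  break;
            default: throw new IllegalArgumentException("unknown errors policy: " + errors);
        }
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(action)
                    .onUnmappableCharacter(action)
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        } catch (CharacterCodingException e) {
            // Only reachable with the strict policy (REPORT).
            throw new IllegalArgumentException("invalid UTF-8 input", e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] input = {'a', (byte) 0xFF, 'b'};

        // replace: the invalid byte becomes the JDK default replacement, U+FFFD.
        System.out.println(decodeUtf8(input, "replace").length()); // 3
        // ignore: the invalid byte is dropped.
        System.out.println(decodeUtf8(input, "ignore"));           // ab
        // strict: the invalid byte raises an error.
        try {
            decodeUtf8(input, "strict");
        } catch (IllegalArgumentException e) {
            System.out.println("strict rejected the input");
        }
    }
}
```

The JDK decoder and ICU may group runs of invalid bytes differently, so the sketch uses a single invalid byte where the two are unlikely to disagree.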
This operation can be used with output_encoding = input_encoding to enforce
correct formatting for inputs even if they are already in the desired encoding.
If the input is prefixed by a Byte Order Mark (BOM) needed to determine the encoding (e.g. if the encoding is UTF-16 and the BOM indicates big-endian), then that BOM will be consumed and not emitted into the output. If the input encoding is marked with an explicit endianness (e.g. UTF-16-BE), then the BOM is interpreted as a zero-width non-breaking space and is preserved in the output. For UTF-8 the BOM is always preserved.
The end result is that if the input is marked with an explicit endianness, the transcoding is faithful to all codepoints in the source. If it is not marked with an explicit endianness, the BOM is treated as metadata rather than as part of the string itself, and so is not preserved in the output.
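The JDK's UTF-16 charsets happen to follow the same convention, which makes the distinction easy to observe. The sketch below is an analogy only: BomSketch and its helpers are hypothetical names, and it exercises java.nio.charset rather than the op's ICU-based converters.

```java
import java.nio.charset.StandardCharsets;

public class BomSketch {

    // Big-endian BOM (FE FF) followed by the letter 'H' (00 48).
    static final byte[] BOM_THEN_H = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x48};

    /** Encoding without explicit endianness: the BOM selects the byte order and is consumed. */
    static String decodeUnmarked(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    /** Encoding with explicit endianness: the BOM is kept as the character U+FEFF. */
    static String decodeBigEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(decodeUnmarked(BOM_THEN_H).length());  // 1, just "H"
        System.out.println(decodeBigEndian(BOM_THEN_H).length()); // 2, U+FEFF then "H"
    }
}
```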
Examples:
tf.strings.unicode_transcode(["Hello", "TensorFlow", "2.x"], "UTF-8", "UTF-16-BE")
<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'\x00H\x00e\x00l\x00l\x00o',
       b'\x00T\x00e\x00n\x00s\x00o\x00r\x00F\x00l\x00o\x00w',
       b'\x002\x00.\x00x'], dtype=object)>
tf.strings.unicode_transcode(["A", "B", "C"], "US ASCII", "UTF-8").numpy()
array([b'A', b'B', b'C'], dtype=object)
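For readers who want to check the byte layout of the first example by hand, the same UTF-8 to UTF-16-BE conversion can be sketched with plain JDK classes. TranscodeSketch, utf8ToUtf16Be and hex are hypothetical names, and the code does not call the TensorFlow op.

```java
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {

    /** Re-encodes UTF-8 bytes as big-endian UTF-16; the JDK encoder writes no BOM. */
    static byte[] utf8ToUtf16Be(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    /** Formats bytes as lowercase hex, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "2.x".getBytes(StandardCharsets.UTF_8);
        // Each ASCII character gains a leading zero byte: 00 32, 00 2e, 00 78.
        System.out.println(hex(utf8ToUtf16Be(utf8))); // 0032002e0078
    }
}
```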
Nested Class Summary
static class UnicodeTranscode.Options
    Optional attributes for UnicodeTranscode
Field Summary
static final String OP_NAME
    The name of this op, as known by TensorFlow core engine
Constructor Summary
Method Summary
asOutput()
    Returns the symbolic handle of the tensor.
static UnicodeTranscode create(Scope scope, Operand<TString> input, String inputEncoding, String outputEncoding, UnicodeTranscode.Options... options)
    Factory method to create a class wrapping a new UnicodeTranscode operation.
static UnicodeTranscode.Options errors
    Sets the errors option.
output()
    Gets output.
static UnicodeTranscode.Options replaceControlCharacters(Boolean replaceControlCharacters)
    Sets the replaceControlCharacters option.
static UnicodeTranscode.Options replacementChar(Long replacementChar)
    Sets the replacementChar option.
Field Details
OP_NAME
The name of this op, as known by TensorFlow core engine
Constructor Details
UnicodeTranscode
Method Details
create
@Endpoint(describeByClass=true) public static UnicodeTranscode create(Scope scope, Operand<TString> input, String inputEncoding, String outputEncoding, UnicodeTranscode.Options... options)
Factory method to create a class wrapping a new UnicodeTranscode operation.
Parameters:
scope - current scope
input - The text to be processed. Can have any shape.
inputEncoding - Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: "UTF-16", "US ASCII", "UTF-8".
outputEncoding - The unicode encoding to use in the output. Must be one of "UTF-8", "UTF-16-BE", "UTF-32-BE". Multi-byte encodings will be big-endian.
options - carries optional attribute values
Returns:
a new instance of UnicodeTranscode
errors
Sets the errors option.
Parameters:
errors - Error handling policy when invalid formatting is found in the input. A value of 'strict' causes the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) causes the operation to replace any invalid formatting in the input with the replacement_char codepoint. A value of 'ignore' causes the operation to skip any invalid formatting in the input and produce no corresponding output character.
Returns:
this Options instance.
replacementChar
Sets the replacementChar option.
Parameters:
replacementChar - The replacement character codepoint to be used in place of any invalid formatting in the input when errors='replace'. Any valid unicode codepoint may be used. The default value is the default unicode replacement character, 0xFFFD (U+FFFD, decimal 65533). Note that for UTF-8, passing a replacement character expressible in 1 byte, such as ' ', will preserve string alignment to the source since invalid bytes will be replaced with a 1-byte replacement. For UTF-16-BE and UTF-16-LE, any 1 or 2 byte replacement character will preserve byte alignment to the source.
Returns:
this Options instance.
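The alignment remark for UTF-8 can be checked with a small JDK-only sketch. ReplacementAlignmentSketch and replaceInvalidUtf8 are hypothetical names, the JDK decoder is used as a stand-in for the op's ICU converter, and the helper only accepts a single char because the JDK's UTF-8 decoder rejects longer replacement strings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReplacementAlignmentSketch {

    /**
     * Decodes UTF-8, substituting the given character for invalid input,
     * then re-encodes the result as UTF-8.
     */
    static byte[] replaceInvalidUtf8(byte[] input, char replacement) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .replaceWith(String.valueOf(replacement))
                    .decode(ByteBuffer.wrap(input))
                    .toString()
                    .getBytes(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            // Not expected: REPLACE never reports malformed input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = {'a', (byte) 0xFF, 'b'}; // 3 bytes, one of them invalid

        // A 1-byte replacement keeps the output the same length as the input.
        System.out.println(replaceInvalidUtf8(input, ' ').length);      // 3
        // U+FFFD takes 3 bytes in UTF-8, so later offsets shift by 2.
        System.out.println(replaceInvalidUtf8(input, '\uFFFD').length); // 5
    }
}
```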
replaceControlCharacters
Sets the replaceControlCharacters option.
Parameters:
replaceControlCharacters - Whether to replace the C0 control characters (00-1F) with the replacement_char. Default is false.
Returns:
this Options instance.
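As a rough picture of what this option describes, the sketch below replaces every codepoint in the range 00-1F with a chosen replacement codepoint. ControlCharSketch and replaceC0Controls are hypothetical names, and the assumption that all of 00-1F is replaced (including tab and newline) is taken from the range stated above, not from inspecting the op.

```java
public class ControlCharSketch {

    /** Replaces C0 control characters (U+0000 to U+001F) with the given codepoint. */
    static String replaceC0Controls(String text, int replacementCodePoint) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp ->
                out.appendCodePoint(cp <= 0x1F ? replacementCodePoint : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        // Tab (U+0009) and newline (U+000A) both fall in the C0 range.
        String cleaned = replaceC0Controls("a\tb\nc", '?');
        System.out.println(cleaned); // a?b?c
    }
}
```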
output
asOutput
Description copied from interface: Operand
Returns the symbolic handle of the tensor.
Inputs to TensorFlow operations are outputs of another TensorFlow operation. This method is used to obtain a symbolic handle that represents the computation of the input.