Module rspamd_mimepart
Module rspamd_textpart
This module provides methods to manipulate text part data. Text parts
can be obtained from an rspamd_task by using the method task:get_text_parts()
Example:
rspamd_config.R_EMPTY_IMAGE = function (task)
  local parts = task:get_text_parts()
  if parts then
    for _, part in ipairs(parts) do
      -- The first empty text part decides the result
      if part:is_empty() then
        local texts = task:get_texts()
        if texts then
          return true
        end
        return false
      end
    end
  end
  return false
end
Brief content:
Methods:
| Method | Description |
|---|---|
| text_part:is_utf() | Return TRUE if part is a valid utf text. |
| text_part:has_8bit_raw() | Return TRUE if a part has raw 8bit characters. |
| text_part:has_8bit() | Return TRUE if a part has encoded 8bit characters. |
| text_part:get_content([type]) | Get the text of the part (html tags stripped). |
| text_part:get_raw_content() | Get the original text of the part. |
| text_part:get_content_oneline() | Get the text of the part (html tags and newlines stripped). |
| text_part:get_length() | Get length of the text of the part. |
| mime_part:get_raw_length() | Get length of the raw content of the part (e.g. HTML with tags unstripped). |
| mime_part:get_urls_length() | Get length of the urls within the part. |
| mime_part:get_lines_count() | Get the number of lines in the part. |
| mime_part:get_stats() | Returns a table with statistics for the part. |
| text_part:get_cta_urls([max_urls]) | Get CTA (call-to-action) URLs from HTML part sorted by button weight. |
| mime_part:get_words_count() | Get the number of words in the part. |
| mime_part:get_words([how]) | Get words in the part. |
| mime_part:filter_words(regexp, [how][, max]) | Filter words using some regexp. |
| text_part:is_empty() | Returns true if the specified part is empty. |
| text_part:is_html() | Returns true if the specified part has HTML content. |
| text_part:get_html() | Returns html content of the specified part. |
| text_part:get_language() | Returns the code of the most used unicode script in the text part. |
| text_part:get_charset() | Returns the real charset of the part. |
| text_part:get_languages() | Returns array of tables of all languages detected for a part. |
| text_part:get_fuzzy_hashes(mempool) | Returns the direct hash of the text part as a string and an array [1..32] of shingles. |
| text_part:get_mimepart() | Returns the mime part object corresponding to this text part. |
| text_part:get_html_fuzzy_hashes(mempool) | Generate fuzzy hashes for HTML structure (if part is HTML). |
Methods
The module rspamd_textpart defines the following methods.
Method text_part:is_utf()
Return TRUE if part is a valid utf text
Parameters:
No parameters
Returns:
{boolean}: true if part is a valid UTF8 part
Method text_part:has_8bit_raw()
Return TRUE if a part has raw 8bit characters
Parameters:
No parameters
Returns:
{boolean}: true if a part has raw 8bit characters
Method text_part:has_8bit()
Return TRUE if a part has encoded 8bit characters
Parameters:
No parameters
Returns:
{boolean}: true if a part has encoded 8bit characters
Method text_part:get_content([type])
Get the text of the part (html tags stripped). The optional type argument defines the type of content to get:
- content (default): utf8 content with HTML tags stripped and newlines preserved
- content_oneline: utf8 content with HTML tags and newlines stripped
- raw: raw content, not mime decoded nor utf8 converted
- raw_parsed: raw content, mime decoded, not utf8 converted
- raw_utf: raw content, mime decoded, utf8 converted (but with HTML tags and newlines)
Parameters:
type {string}: optional type of content to get
Returns:
{text}: UTF8 encoded content of the part (zero-copy if not converted to a lua string)
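The type argument can be illustrated with a minimal sketch that returns the content of a part as a plain Lua string and rejects type names that are not documented for this method. The helper name `part_content_string` and the `valid_types` table are hypothetical and not part of the rspamd API; note that calling `tostring` gives up the zero-copy behaviour mentioned in the return description.

```lua
-- Content types documented for text_part:get_content()
local valid_types = {
  content = true,
  content_oneline = true,
  raw = true,
  raw_parsed = true,
  raw_utf = true,
}

-- Hypothetical helper: returns the content of `part` as a Lua string,
-- or nil plus an error message when `kind` is not a documented type.
function part_content_string(part, kind)
  kind = kind or 'content'
  if not valid_types[kind] then
    return nil, 'unknown content type: ' .. tostring(kind)
  end
  local text = part:get_content(kind)
  if not text then
    return nil, 'no content'
  end
  -- tostring() copies the data out of the text object
  return tostring(text)
end
```

Inside a rule, `part` would typically be one of the objects returned by task:get_text_parts().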
Method text_part:get_raw_content()
Get the original text of the part
Parameters:
No parameters
Returns:
{text}: UTF8 encoded content of the part (zero-copy if not converted to a lua string)
Method text_part:get_content_oneline()
Get the text of the part (html tags and newlines stripped)
Parameters:
No parameters
Returns:
{text}: UTF8 encoded content of the part (zero-copy if not converted to a lua string)
Method text_part:get_length()
Get length of the text of the part
Parameters:
No parameters
Returns:
{integer}: length of part in bytes
Method mime_part:get_raw_length()
Get length of the raw content of the part (e.g. HTML with tags unstripped)
Parameters:
No parameters
Returns:
{integer}: length of part in bytes
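Comparing the raw length with the stripped text length gives a rough idea of how much of a part is markup. The following is a minimal sketch, assuming get_length() and get_raw_length() are both called on the same text part object; the helper name `markup_ratio` is hypothetical.

```lua
-- Hypothetical helper: fraction of the raw part that is not visible text.
-- Returns nil when the raw length is zero or unavailable.
function markup_ratio(part)
  local raw_len = part:get_raw_length()
  if not raw_len or raw_len == 0 then
    return nil
  end
  local text_len = part:get_length() or 0
  -- The converted text may be longer than the raw bytes, so clamp at 0
  return math.max(0, 1 - text_len / raw_len)
end
```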
Method mime_part:get_urls_length()
Get length of the urls within the part
Parameters:
No parameters
Returns:
{integer}: length of urls in bytes
Method mime_part:get_lines_count()
Get the number of lines in the part
Parameters:
No parameters
Returns:
{integer}: number of lines in the part
Method mime_part:get_stats()
Returns a table with the following data:
- lines: number of lines
- spaces: number of spaces
- double_spaces: double spaces
- empty_lines: number of empty lines
- non_ascii_characters: number of non ascii characters
- ascii_characters: number of ascii characters
Parameters:
No parameters
Returns:
{table}: table of stats
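As an illustration of how the stats table might be used, this sketch computes the share of non-ASCII characters in a part. The helper name `non_ascii_ratio` is hypothetical; only the field names come from the description of the returned table.

```lua
-- Hypothetical helper: share of non-ASCII characters in a part (0..1),
-- or nil when no stats are available.
function non_ascii_ratio(part)
  local stats = part:get_stats()
  if not stats then
    return nil
  end
  local non_ascii = stats.non_ascii_characters or 0
  local total = non_ascii + (stats.ascii_characters or 0)
  if total == 0 then
    return 0
  end
  return non_ascii / total
end
```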
Method text_part:get_cta_urls([max_urls])
Get CTA (call-to-action) URLs from HTML part sorted by button weight
Parameters:
max_urls {number}: optional maximum number of URLs to return
Returns:
{table}: array of URL objects sorted by importance (descending)
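A possible use of the returned array is to collect the distinct hosts of the most prominent links. The sketch below assumes each URL object offers a get_host() method (from the rspamd_url module, which is not described in this section); the helper name `cta_hosts` is hypothetical.

```lua
-- Hypothetical helper: distinct hosts of CTA URLs, most important first.
function cta_hosts(part, max_urls)
  local hosts, seen = {}, {}
  for _, url in ipairs(part:get_cta_urls(max_urls) or {}) do
    local host = url:get_host() -- assumed rspamd_url method
    if host and not seen[host] then
      seen[host] = true
      hosts[#hosts + 1] = host
    end
  end
  return hosts
end
```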
Method mime_part:get_words_count()
Get the number of words in the part
Parameters:
No parameters
Returns:
{integer}: number of words in the part
Method mime_part:get_words([how])
Get words in the part. The optional how argument defines the type of words returned:
- stem: stemmed words (default)
- norm: normalised words (utf normalised + lowercased)
- raw: raw words in utf (if possible)
- full: list of tables, each table has the following fields:
  - [1] - stemmed word
  - [2] - normalised word
  - [3] - raw word
  - [4] - flags (table of strings)
Parameters:
how {string}: optional type of words to return
Returns:
{table/strings}: words in the part
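The full mode can be sketched as follows: the helper counts how often each normalised word occurs by reading field [2] of every word table. The name `word_frequencies` is hypothetical, and the sketch assumes the fields are plain strings.

```lua
-- Hypothetical helper: maps each normalised word to its number of occurrences.
function word_frequencies(part)
  local freq = {}
  for _, w in ipairs(part:get_words('full') or {}) do
    local norm = w[2] -- [2] is the normalised word
    if norm then
      freq[norm] = (freq[norm] or 0) + 1
    end
  end
  return freq
end
```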
Method mime_part:filter_words(regexp, [how][, max])
Filter words using some regexp. The optional how argument defines the type of words matched and returned:
- stem: stemmed words (default)
- norm: normalised words (utf normalised + lowercased)
- raw: raw words in utf (if possible)
- full: list of tables, each table has the following fields:
  - [1] - stemmed word
  - [2] - normalised word
  - [3] - raw word
  - [4] - flags (table of strings)
Parameters:
regexp {rspamd_regexp}: regexp to match
how {string}: what words to extract
max {number}: maximum number of hits returned (all hits if <= 0 or nil)
Returns:
{table/strings}: words matching regexp
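The max argument makes it cheap to test whether any word matches at all. This is a minimal sketch, assuming `re` is an already compiled rspamd_regexp object created elsewhere (for example with rspamd_regexp.create, which is outside this section); the helper name `has_matching_word` is hypothetical.

```lua
-- Hypothetical helper: true if at least one raw word matches `re`.
function has_matching_word(part, re)
  -- max = 1 asks for at most one hit, which is enough for a yes/no answer
  local hits = part:filter_words(re, 'raw', 1)
  return hits ~= nil and #hits > 0
end
```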
Method text_part:is_empty()
Returns true if the specified part is empty
Parameters:
No parameters
Returns:
{bool}: whether a part is empty
Method text_part:is_html()
Returns true if the specified part has HTML content
Parameters:
No parameters
Returns:
{bool}: whether a part is HTML part
Method text_part:get_html()
Returns html content of the specified part
Parameters:
No parameters
Returns:
{html}: html content
Method text_part:get_language()
Returns the code of the most used unicode script in the text part. Does not work with raw parts
Parameters:
No parameters
Returns:
{string}: short abbreviation (such as ru) for the script's language
Method text_part:get_charset()
Returns the real charset of the part
Parameters:
No parameters
Returns:
{string}: charset of the part
Method text_part:get_languages()
Returns array of tables of all languages detected for a part:
- 'code': language code (short string)
- 'prob': logarithm of probability
Parameters:
No parameters
Returns:
{array|tables}: all languages detected for the part
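Because prob is a logarithm of probability, the entry with the largest prob value is the most likely one. The following sketch picks it without assuming that the array is sorted; the helper name `most_likely_language` is hypothetical.

```lua
-- Hypothetical helper: code of the most probable language, or nil if none.
function most_likely_language(part)
  local best
  for _, lang in ipairs(part:get_languages() or {}) do
    -- log is monotonic, so a larger log-probability means a larger probability
    if not best or lang.prob > best.prob then
      best = lang
    end
  end
  return best and best.code or nil
end
```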
Method text_part:get_fuzzy_hashes(mempool)
Returns the direct hash of the text part as a string and an array [1..32] of shingles, each represented as the following table:
- [1] - 64 bit fuzzy hash represented as a string
- [2..4] - strings used to generate this hash
Parameters:
mempool {rspamd_mempool}: memory pool (usually task pool)
Returns:
{string,array|tables}: fuzzy hashes calculated
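To show how the two return values fit together, this sketch compares the shingles of two parts position by position and counts equal hashes. It is a simplified illustration and not necessarily how rspamd itself scores fuzzy matches; the helper name `count_matching_shingles` is hypothetical.

```lua
-- Hypothetical helper: number of shingle positions with equal hashes.
function count_matching_shingles(part_a, part_b, pool)
  -- The first return value (direct hash) is ignored here
  local _, sh_a = part_a:get_fuzzy_hashes(pool)
  local _, sh_b = part_b:get_fuzzy_hashes(pool)
  if not sh_a or not sh_b then
    return 0
  end
  local matches = 0
  for i = 1, math.min(#sh_a, #sh_b) do
    -- [1] is the 64 bit fuzzy hash represented as a string
    if sh_a[i][1] == sh_b[i][1] then
      matches = matches + 1
    end
  end
  return matches
end
```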
Method text_part:get_mimepart()
Returns the mime part object corresponding to this text part
Parameters:
No parameters
Returns:
{mimepart}: mimepart object
Method text_part:get_html_fuzzy_hashes(mempool)
Generate fuzzy hashes for HTML structure (if part is HTML)
Parameters:
mempool {rspamd_mempool}: memory pool to use
Returns:
{digest, shingles}: hex digest and shingles table with metadata
Module rspamd_mimepart
This module provides access to mime parts found in a message
Example:
rspamd_config.MISSING_CONTENT_TYPE = function(task)
  local parts = task:get_parts()
  if parts and #parts > 1 then
    -- We have more than one part
    for _, p in ipairs(parts) do
      local ct = p:get_header('Content-Type')
      -- And some parts have no Content-Type header
      if not ct then
        return true
      end
    end
  end
  return false
end
Brief content:
Methods:
| Method | Description |
|---|---|
| mime_part:get_header(name[, case_sensitive]) | Get decoded value of a header specified with optional case_sensitive flag. |
| mime_part:get_header_raw(name[, case_sensitive]) | Get raw value of a header specified with optional case_sensitive flag. |
| mime_part:get_header_full(name[, case_sensitive]) | Get raw value of a header specified with optional case_sensitive flag. |
| mimepart:get_header_count(name[, case_sensitive]) | Lightweight version if you need just a header's count. |
| mimepart:get_raw_headers() | Get all undecoded headers of a mime part as a string. |
| mimepart:get_headers() | Get all undecoded headers of a mime part as a string. |
| mime_part:get_content() | Get the parsed content of part. |
| mime_part:get_raw_content() | Get the raw content of part. |
| mime_part:get_length() | Get length of the content of the part. |
| mime_part:get_type() | Extract content-type string of the mime part. |
| mime_part:get_type_full() | Extract content-type string of the mime part with all attributes. |
| mime_part:get_detected_type() | Extract content-type string of the mime part. |
| mime_part:get_detected_type_full() | Extract content-type string of the mime part with all attributes. |
| mime_part:get_detected_ext() | Returns a msdos extension name according to lua_magic detection. |
| mime_part:get_cte() | Extract content-transfer-encoding for a part. |
| mime_part:get_filename() | Extract filename associated with mime part if it is an attachment. |
| mime_part:is_image() | Returns true if mime part is an image. |
| mime_part:get_image() | Returns rspamd_image structure associated with this part. |
| mime_part:is_archive() | Returns true if mime part is an archive. |
| mime_part:is_attachment() | Returns true if mime part looks like an attachment. |
| mime_part:get_archive() | Returns rspamd_archive structure associated with this part. |
| mime_part:is_multipart() | Returns true if mime part is a multipart part. |
| mime_part:is_message() | Returns true if mime part is a message part (message/rfc822). |
| mime_part:get_boundary() | Returns boundary for a part (extracted from parent multipart for normal parts and from the part itself for multipart). |
| mime_part:get_enclosing_boundary() | Returns an enclosing boundary for a part even for multiparts. |
| mime_part:get_children() | Returns rspamd_mimepart table of part's children. |
| mime_part:is_text() | Returns true if mime part is a text part. |
| mime_part:get_text() | Returns rspamd_textpart structure associated with this part. |
| mime_part:get_digest() | Returns the unique digest for this mime part. |
| mime_part:get_id() | Returns the order of the part in parts list. |
| mime_part:is_broken() | Returns true if mime part has incorrectly specified content type. |
| mime_part:headers_foreach(callback, [params]) | This method calls callback for each header that satisfies some condition. |
| mime_part:get_parent() | Returns parent part for this part. |
| mime_part:get_specific() | Returns specific lua content for this part. |
| mime_part:set_specific(<any>) | Sets a specific content for this part. |
| mime_part:is_specific(<any>) | Returns true if part has specific lua content. |
| mime_part:get_urls([need_emails\|list_protos][, need_images]) | Get all URLs found in a mime part. |
| text_part:get_html_fuzzy_hashes(mempool) | Generate fuzzy hashes for HTML content (if text part is HTML). |
| text_part:get_cta_urls([max_urls]) | Get CTA (call-to-action) URLs from HTML part sorted by button weight. |
| mime_part:get_stats() | Returns a table with statistics for the part. |
Methods
The module rspamd_mimepart defines the following methods.
Method mime_part:get_header(name[, case_sensitive])
Get decoded value of a header specified with optional case_sensitive flag. By default headers are searched in a case-insensitive manner.
Parameters:
name {string}: name of header to get
case_sensitive {boolean}: case sensitiveness flag to search for a header
Returns:
{string}: decoded value of a header
Method mime_part:get_header_raw(name[, case_sensitive])
Get raw value of a header specified with optional case_sensitive flag. By default headers are searched in a case-insensitive manner.
Parameters:
name {string}: name of header to get
case_sensitive {boolean}: case sensitiveness flag to search for a header
Returns:
{string}: raw value of a header
Method mime_part:get_header_full(name[, case_sensitive])
Get raw value of a header specified with optional case_sensitive flag. By default headers are searched in a case-insensitive manner. This method returns more information about the header as a list of tables with the following structure:
- name - name of a header
- value - raw value of a header
- decoded - decoded value of a header
- tab_separated - true if a header and a value are separated by a tab character
- empty_separator - true if there is no separator between a header and a value
Parameters:
name {string}: name of header to get
case_sensitive {boolean}: case sensitiveness flag to search for a header
Returns:
{list of tables}: all values of a header as specified above
Example:
function check_header_delimiter_tab(task, header_name)
  -- Guard against a nil result when the header is absent
  for _, rh in ipairs(task:get_header_full(header_name) or {}) do
    if rh['tab_separated'] then return true end
  end
  return false
end
Method mimepart:get_header_count(name[, case_sensitive])
Lightweight version for when you need just a header's count. By default headers are searched in a case-insensitive manner.
Parameters:
name {string}: name of header to get
case_sensitive {boolean}: case sensitiveness flag to search for a header
Returns:
{number}: number of header's occurrences or 0 if not found
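One way this might be used is to flag headers that occur more than once in a part. The sketch below is an illustration only; the helper name `repeated_headers` and its `names` argument are hypothetical.

```lua
-- Hypothetical helper: returns the names from `names` that occur
-- more than once in the headers of `part`.
function repeated_headers(part, names)
  local repeated = {}
  for _, name in ipairs(names) do
    if part:get_header_count(name) > 1 then
      repeated[#repeated + 1] = name
    end
  end
  return repeated
end
```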
Method mimepart:get_raw_headers()
Get all undecoded headers of a mime part as a string
Parameters:
No parameters
Returns:
{rspamd_text}: all raw headers for a message as opaque text
Method mimepart:get_headers()
Get all undecoded headers of a mime part as a string
Parameters:
No parameters
Returns:
{rspamd_text}: all raw headers for a message as opaque text
Method mime_part:get_content()
Get the parsed content of part
Parameters:
No parameters
Returns:
{text}: opaque text object (zero-copy if not cast to a lua string)
Method mime_part:get_raw_content()
Get the raw content of part
Parameters:
No parameters
Returns:
{text}: opaque text object (zero-copy if not cast to a lua string)
Method mime_part:get_length()
Get length of the content of the part
Parameters:
No parameters
Returns:
{integer}: length of part in bytes
Method mime_part:get_type()
Extract content-type string of the mime part
Parameters:
No parameters
Returns:
{string,string}: content type in form 'type','subtype'
Method mime_part:get_type_full()
Extract content-type string of the mime part with all attributes
Parameters:
No parameters
Returns:
{string,string,table}: content type in form 'type','subtype', {attrs}
Method mime_part:get_detected_type()
Extract content-type string of the mime part using lua_magic detection
Parameters:
No parameters
Returns:
{string,string}: content type in form 'type','subtype'
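Comparing the declared type with the detected one is a natural use of this method. The sketch below treats a missing value on either side as "no mismatch" and compares case-insensitively; both choices are assumptions of this example, and the helper name `content_type_mismatch` is hypothetical.

```lua
-- Hypothetical helper: true if the declared content type differs
-- from the type detected from the content.
function content_type_mismatch(part)
  local d_type, d_subtype = part:get_type()
  local m_type, m_subtype = part:get_detected_type()
  if not d_type or not m_type then
    -- Nothing to compare against
    return false
  end
  return d_type:lower() ~= m_type:lower()
      or (d_subtype or ''):lower() ~= (m_subtype or ''):lower()
end
```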
Method mime_part:get_detected_type_full()
Extract content-type string of the mime part with all attributes using lua_magic detection
Parameters:
No parameters
Returns:
{string,string,table}: content type in form 'type','subtype', {attrs}
Method mime_part:get_detected_ext()
Returns a msdos extension name according to lua_magic detection
Parameters:
No parameters
Returns:
{string}: detected extension (see lua_magic.types)
Method mime_part:get_cte()
Extract content-transfer-encoding for a part
Parameters:
No parameters
Returns:
{string}: content transfer encoding (e.g. base64 or 7bit)
Method mime_part:get_filename()
Extract filename associated with mime part if it is an attachment
Parameters:
No parameters
Returns:
{string}: filename or nil if no file is associated with this part
Method mime_part:is_image()
Returns true if mime part is an image
Parameters:
No parameters
Returns:
{bool}: true if a part is an image
Method mime_part:get_image()
Returns rspamd_image structure associated with this part. This structure has the following methods:
- get_width - return width of an image in pixels
- get_height - return height of an image in pixels
- get_type - return string representation of image's type (e.g. 'jpeg')
- get_filename - return string with image's file name
- get_size - return size in bytes
Parameters:
No parameters
Returns:
{rspamd_image}: image structure or nil if a part is not an image
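The image methods can be combined, for example, to spot very small images. This is a minimal sketch; the helper name `is_tiny_image` and the default threshold of 1 pixel are assumptions made for the illustration.

```lua
-- Hypothetical helper: true if the part is an image whose width and
-- height are both at most `max_side` pixels (default 1).
function is_tiny_image(part, max_side)
  max_side = max_side or 1
  if not part:is_image() then
    return false
  end
  local img = part:get_image()
  if not img then
    return false
  end
  return img:get_width() <= max_side and img:get_height() <= max_side
end
```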
Method mime_part:is_archive()
Returns true if mime part is an archive
Parameters:
No parameters
Returns:
{bool}: true if a part is an archive
Method mime_part:is_attachment()
Returns true if mime part looks like an attachment
Parameters:
No parameters
Returns:
{bool}: true if a part looks like an attachment
Method mime_part:get_archive()
Returns rspamd_archive structure associated with this part. This structure has the following methods:
- get_files - return list of strings with filenames inside archive
- get_files_full - return list of tables with all information about files
- is_encrypted - return true if an archive is encrypted
- get_type - return string representation of archive's type (e.g. 'zip')
- get_filename - return string with archive's file name
- get_size - return size in bytes
Parameters:
No parameters
Returns:
{rspamd_archive}: archive structure or nil if a part is not an archive
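As an illustration of the archive structure, the following sketch checks whether an archive contains a file with a given extension. The helper name `archive_has_extension` is hypothetical, and the sketch assumes get_files returns plain Lua strings.

```lua
-- Hypothetical helper: true if the part is an archive that contains
-- at least one file name ending in `.ext` (case-insensitive).
function archive_has_extension(part, ext)
  if not part:is_archive() then
    return false
  end
  local arch = part:get_archive()
  if not arch then
    return false
  end
  local suffix = '.' .. ext:lower()
  for _, fname in ipairs(arch:get_files() or {}) do
    -- Compare the tail of the file name with the wanted suffix
    if fname:lower():sub(-#suffix) == suffix then
      return true
    end
  end
  return false
end
```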
Method mime_part:is_multipart()
Returns true if mime part is a multipart part
Parameters:
No parameters
Returns:
{bool}: true if a part is a multipart part
Method mime_part:is_message()
Returns true if mime part is a message part (message/rfc822)
Parameters:
No parameters
Returns:
{bool}: true if a part is a message part
Method mime_part:get_boundary()
Returns boundary for a part (extracted from parent multipart for normal parts and from the part itself for multipart)
Parameters:
No parameters
Returns:
{string}: boundary value or nil
Method mime_part:get_enclosing_boundary()
Returns an enclosing boundary for a part even for multiparts. For normal parts
this method is identical to get_boundary
Parameters:
No parameters
Returns:
{string}: boundary value or nil
Method mime_part:get_children()
Returns rspamd_mimepart table of part's children. Returns nil if mime part is not multipart or a message part.
Parameters:
No parameters
Returns:
{rspamd_mimepart}: table of children
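Since children may themselves have children, a recursive walk is the usual way to visit a whole subtree. The sketch below counts the leaf parts under a given part; the helper name `count_leaf_parts` is hypothetical, and a container with no children is counted as a single leaf.

```lua
-- Hypothetical helper: number of leaf parts in the subtree rooted at `part`.
function count_leaf_parts(part)
  local children = part:get_children()
  -- nil means the part is neither multipart nor a message part
  if not children or #children == 0 then
    return 1
  end
  local total = 0
  for _, child in ipairs(children) do
    total = total + count_leaf_parts(child)
  end
  return total
end
```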
Method mime_part:is_text()
Returns true if mime part is a text part
Parameters:
No parameters
Returns:
{bool}: true if a part is a text part
Method mime_part:get_text()
Returns rspamd_textpart structure associated with this part.
Parameters:
No parameters
Returns:
{rspamd_textpart}: textpart structure or nil if a part is not a text part
Method mime_part:get_digest()
Returns the unique digest for this mime part
Parameters:
No parameters
Returns:
{string}: 128 characters hex string with digest of the part
Method mime_part:get_id()
Returns the order of the part in parts list
Parameters:
No parameters
Returns:
{number}: index of the part (starting from 1 as it is Lua API)
Method mime_part:is_broken()
Returns true if mime part has incorrectly specified content type
Parameters:
No parameters
Returns:
{bool}: true if a part has bad content type
Method mime_part:headers_foreach(callback, [params])
This method calls callback for each header that satisfies some condition.
By default, all headers are iterated unless callback returns true. Returning nil or
false continues the iteration.
Params can be the following:
- full: header value is a full table of all attributes (see task:get_header_full for details)
- regexp: return headers that satisfy the specified regexp
Parameters:
callback {function}: function from header name and header value
params {table}: optional parameters
Returns:
No return
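The stop-on-true behaviour of the callback can be sketched as follows: the helper returns the name of the first header whose value is longer than a limit and stops iterating once it is found. The name `first_long_header` is hypothetical, and the sketch assumes the default mode where the header value can be converted with tostring.

```lua
-- Hypothetical helper: name of the first header whose value is longer
-- than `limit` bytes, or nil if there is none.
function first_long_header(part, limit)
  local found
  part:headers_foreach(function(name, value)
    if #tostring(value) > limit then
      found = name
      return true -- returning true stops the iteration
    end
    return false -- nil or false continues with the next header
  end)
  return found
end
```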
Method mime_part:get_parent()
Returns parent part for this part
Parameters:
No parameters
Returns:
{rspamd_mimepart}: parent part or nil
Method mime_part:get_specific()
Returns specific lua content for this part
Parameters:
No parameters
Returns:
{any}: specific lua content
Method mime_part:set_specific(<any>)
Sets a specific content for this part
Parameters:
<any>: specific lua content to set
Returns:
{any}: previous specific lua content (or nil)
Method mime_part:is_specific(<any>)
Returns true if part has specific lua content
Parameters:
No parameters
Returns:
{boolean}: flag
Method mime_part:get_urls([need_emails|list_protos][, need_images])
Get all URLs found in a mime part. Telephone urls and emails are not included unless explicitly asked in list_protos
Parameters:
need_emails {boolean}: if true then also return email urls; this can also be a comma separated string of desired protocols or a table (e.g. mailto or telephone)
need_images {boolean}: return urls from images as well
Returns:
{table rspamd_url}: list of all urls found
Method text_part:get_html_fuzzy_hashes(mempool)
Generate fuzzy hashes for HTML content (if text part is HTML). Returns digest and shingles table similar to get_fuzzy_hashes, but for HTML structure instead of text content.
HTML shingles include:
- Structure shingles (DOM tag sequence with domains)
- CTA domains hash (critical for phishing detection)
- All domains hash
- Statistical features hash
Parameters:
mempool {rspamd_mempool}: memory pool to use
Returns:
{digest, shingles}: digest is hex string, shingles is array of hashes + metadata
Method text_part:get_cta_urls([max_urls])
Get CTA (call-to-action) URLs from HTML part sorted by button weight
Parameters:
max_urls {number}: optional maximum number of URLs to return
Returns:
{table}: array of URL objects sorted by importance (descending)
Method mime_part:get_stats()
Returns a table with the following data:
- lines: number of lines
- spaces: number of spaces
- double_spaces: double spaces
- empty_lines: number of empty lines
- non_ascii_characters: number of non ascii characters
- ascii_characters: number of ascii characters
Parameters:
No parameters
Returns:
{table}: table of stats